Introduction Random Forest
The steps to predict using Random Forest:
* Step 1
* Step 2
Problem Definition
Determine whether a given email is spam or not.
Data Description
1. Title: SPAM E-mail Database
Relevant Information: The “spam” concept is diverse: advertisements for products/web sites, make money fast schemes, chain letters, pornography… Our collection of spam e-mails came from our postmaster and individuals who had filed spam. Our collection of non-spam e-mails came from filed work and personal e-mails, and hence the word ‘george’ and the area code ‘650’ are indicators of non-spam. These are useful when constructing a personalized spam filter. One would either have to blind such non-spam indicators or get a very wide collection of non-spam to generate a general purpose spam filter.
For background on spam:
Cranor, Lorrie F., LaMacchia, Brian A. Spam!
Communications of the ACM, 41(8):74-83, 1998.
Number of Instances: 4601 (1813 Spam = 39.4%)
Number of Attributes: 58 (57 continuous, 1 nominal class label)
Attribute Information: The last column of ‘spambase.data’ denotes whether the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial e-mail.
Most of the attributes indicate whether a particular word or character was frequently occurring in the e-mail. The run-length attributes (55-57) measure the length of sequences of consecutive capital letters. For the statistical measures of each attribute, see the end of this file. Here are the definitions of the attributes:
48 continuous real [0,100] attributes of type word_freq_WORD = percentage of words in the e-mail that match WORD, i.e. 100 * (number of times the WORD appears in the e-mail) / total number of words in e-mail. A “word” in this case is any string of alphanumeric characters bounded by non-alphanumeric characters or end-of-string.
6 continuous real [0,100] attributes of type char_freq_CHAR = percentage of characters in the e-mail that match CHAR, i.e. 100 * (number of CHAR occurrences) / total characters in e-mail
1 continuous real [1,…] attribute of type capital_run_length_average = average length of uninterrupted sequences of capital letters
1 continuous integer [1,…] attribute of type capital_run_length_longest = length of longest uninterrupted sequence of capital letters
1 continuous integer [1,…] attribute of type capital_run_length_total = sum of length of uninterrupted sequences of capital letters = total number of capital letters in the e-mail
1 nominal {0,1} class attribute of type spam = denotes whether the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial e-mail.
Missing Attribute Values: None
Class Distribution: Spam 1813 (39.4%) Non-Spam 2788 (60.6%)
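The attribute definitions above can be illustrated with a small sketch in base R. The helper names (`word_freq`, `char_freq`, `capital_runs`) and the sample e-mail text are hypothetical and made up for illustration. Case-insensitive word matching is an assumption; the data description does not say how case was handled.

```r
# Hypothetical helpers illustrating the attribute definitions; not the code
# that was used to build spambase.

word_freq <- function(text, word) {
  # A "word" is any string of alphanumeric characters bounded by
  # non-alphanumeric characters or end-of-string.
  tokens <- strsplit(text, "[^[:alnum:]]+")[[1]]
  tokens <- tokens[tokens != ""]
  # Assumption: matching ignores case.
  100 * sum(tolower(tokens) == tolower(word)) / length(tokens)
}

char_freq <- function(text, char) {
  # Percentage of all characters in the text that equal `char`.
  chars <- strsplit(text, "")[[1]]
  100 * sum(chars == char) / length(chars)
}

capital_runs <- function(text) {
  # Lengths of uninterrupted sequences of capital letters.
  runs <- regmatches(text, gregexpr("[[:upper:]]+", text))[[1]]
  nchar(runs)
}

email <- "FREE money! Claim your FREE prize NOW"  # made-up sample text

word_freq(email, "free")   # word_freq_free
char_freq(email, "!")      # char_freq_!
runs <- capital_runs(email)
c(average = mean(runs), longest = max(runs), total = sum(runs))
```

The last line corresponds to capital_run_length_average, capital_run_length_longest and capital_run_length_total for the sample text.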
Setup
Load Libs
library(plyr)
library(tidyr)
library(dplyr)
library(ggplot2)
#install.packages("caret")
library(caret)
#install.packages("randomForest")
library(randomForest)
#install.packages("gbm")
library(gbm)
#install.packages("corrgram")
library(corrgram)
Functions
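# Count the missing values in a vector (one column of the data frame)
detectNA <- function(inp) {
  sum(is.na(inp))
}
# Spearman correlation between column x of dfrDataset and the class label (status)
detectCor <- function(x) {
  cor(as.numeric(dfrDataset[, x]),
      as.numeric(dfrDataset$status),
      method="spearman")
}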
Load Dataset
dfrDataset <- read.csv("/Users/Charu/Desktop/Machine Learning/spambase.csv", header=TRUE, stringsAsFactors=TRUE)
head(dfrDataset)
## word_freq_make word_freq_address word_freq_all word_freq_3d
## 1 0.00 0.64 0.64 0
## 2 0.21 0.28 0.50 0
## 3 0.06 0.00 0.71 0
## 4 0.00 0.00 0.00 0
## 5 0.00 0.00 0.00 0
## 6 0.00 0.00 0.00 0
## word_freq_our word_freq_over word_freq_remove word_freq_internet
## 1 0.32 0.00 0.00 0.00
## 2 0.14 0.28 0.21 0.07
## 3 1.23 0.19 0.19 0.12
## 4 0.63 0.00 0.31 0.63
## 5 0.63 0.00 0.31 0.63
## 6 1.85 0.00 0.00 1.85
## word_freq_order word_freq_mail word_freq_receive word_freq_will
## 1 0.00 0.00 0.00 0.64
## 2 0.00 0.94 0.21 0.79
## 3 0.64 0.25 0.38 0.45
## 4 0.31 0.63 0.31 0.31
## 5 0.31 0.63 0.31 0.31
## 6 0.00 0.00 0.00 0.00
## word_freq_people word_freq_report word_freq_addresses word_freq_free
## 1 0.00 0.00 0.00 0.32
## 2 0.65 0.21 0.14 0.14
## 3 0.12 0.00 1.75 0.06
## 4 0.31 0.00 0.00 0.31
## 5 0.31 0.00 0.00 0.31
## 6 0.00 0.00 0.00 0.00
## word_freq_business word_freq_email word_freq_you word_freq_credit
## 1 0.00 1.29 1.93 0.00
## 2 0.07 0.28 3.47 0.00
## 3 0.06 1.03 1.36 0.32
## 4 0.00 0.00 3.18 0.00
## 5 0.00 0.00 3.18 0.00
## 6 0.00 0.00 0.00 0.00
## word_freq_your word_freq_font word_freq_000 word_freq_money word_freq_hp
## 1 0.96 0 0.00 0.00 0
## 2 1.59 0 0.43 0.43 0
## 3 0.51 0 1.16 0.06 0
## 4 0.31 0 0.00 0.00 0
## 5 0.31 0 0.00 0.00 0
## 6 0.00 0 0.00 0.00 0
## word_freq_hpl word_freq_george word_freq_650 word_freq_lab
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## word_freq_labs word_freq_telnet word_freq_857 word_freq_data
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## word_freq_415 word_freq_85 word_freq_technology word_freq_1999
## 1 0 0 0 0.00
## 2 0 0 0 0.07
## 3 0 0 0 0.00
## 4 0 0 0 0.00
## 5 0 0 0 0.00
## 6 0 0 0 0.00
## word_freq_parts word_freq_pm word_freq_direct word_freq_cs
## 1 0 0 0.00 0
## 2 0 0 0.00 0
## 3 0 0 0.06 0
## 4 0 0 0.00 0
## 5 0 0 0.00 0
## 6 0 0 0.00 0
## word_freq_meeting word_freq_original word_freq_project word_freq_re
## 1 0 0.00 0 0.00
## 2 0 0.00 0 0.00
## 3 0 0.12 0 0.06
## 4 0 0.00 0 0.00
## 5 0 0.00 0 0.00
## 6 0 0.00 0 0.00
## word_freq_edu word_freq_table word_freq_conference char_freq_.
## 1 0.00 0 0 0.00
## 2 0.00 0 0 0.00
## 3 0.06 0 0 0.01
## 4 0.00 0 0 0.00
## 5 0.00 0 0 0.00
## 6 0.00 0 0 0.00
## char_freq_..1 char_freq_..2 char_freq_..3 char_freq_..4 char_freq_..5
## 1 0.000 0 0.778 0.000 0.000
## 2 0.132 0 0.372 0.180 0.048
## 3 0.143 0 0.276 0.184 0.010
## 4 0.137 0 0.137 0.000 0.000
## 5 0.135 0 0.135 0.000 0.000
## 6 0.223 0 0.000 0.000 0.000
## capital_run_length_average capital_run_length_longest
## 1 3.756 61
## 2 5.114 101
## 3 9.821 485
## 4 3.537 40
## 5 3.537 40
## 6 3.000 15
## capital_run_length_total status
## 1 278 1
## 2 1028 1
## 3 2259 1
## 4 191 1
## 5 191 1
## 6 54 1
Dataframe Structure
str(dfrDataset)
## 'data.frame': 4601 obs. of 58 variables:
## $ word_freq_make : num 0 0.21 0.06 0 0 0 0 0 0.15 0.06 ...
## $ word_freq_address : num 0.64 0.28 0 0 0 0 0 0 0 0.12 ...
## $ word_freq_all : num 0.64 0.5 0.71 0 0 0 0 0 0.46 0.77 ...
## $ word_freq_3d : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_our : num 0.32 0.14 1.23 0.63 0.63 1.85 1.92 1.88 0.61 0.19 ...
## $ word_freq_over : num 0 0.28 0.19 0 0 0 0 0 0 0.32 ...
## $ word_freq_remove : num 0 0.21 0.19 0.31 0.31 0 0 0 0.3 0.38 ...
## $ word_freq_internet : num 0 0.07 0.12 0.63 0.63 1.85 0 1.88 0 0 ...
## $ word_freq_order : num 0 0 0.64 0.31 0.31 0 0 0 0.92 0.06 ...
## $ word_freq_mail : num 0 0.94 0.25 0.63 0.63 0 0.64 0 0.76 0 ...
## $ word_freq_receive : num 0 0.21 0.38 0.31 0.31 0 0.96 0 0.76 0 ...
## $ word_freq_will : num 0.64 0.79 0.45 0.31 0.31 0 1.28 0 0.92 0.64 ...
## $ word_freq_people : num 0 0.65 0.12 0.31 0.31 0 0 0 0 0.25 ...
## $ word_freq_report : num 0 0.21 0 0 0 0 0 0 0 0 ...
## $ word_freq_addresses : num 0 0.14 1.75 0 0 0 0 0 0 0.12 ...
## $ word_freq_free : num 0.32 0.14 0.06 0.31 0.31 0 0.96 0 0 0 ...
## $ word_freq_business : num 0 0.07 0.06 0 0 0 0 0 0 0 ...
## $ word_freq_email : num 1.29 0.28 1.03 0 0 0 0.32 0 0.15 0.12 ...
## $ word_freq_you : num 1.93 3.47 1.36 3.18 3.18 0 3.85 0 1.23 1.67 ...
## $ word_freq_credit : num 0 0 0.32 0 0 0 0 0 3.53 0.06 ...
## $ word_freq_your : num 0.96 1.59 0.51 0.31 0.31 0 0.64 0 2 0.71 ...
## $ word_freq_font : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_000 : num 0 0.43 1.16 0 0 0 0 0 0 0.19 ...
## $ word_freq_money : num 0 0.43 0.06 0 0 0 0 0 0.15 0 ...
## $ word_freq_hp : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_hpl : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_george : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_650 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_lab : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_labs : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_telnet : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_857 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_data : num 0 0 0 0 0 0 0 0 0.15 0 ...
## $ word_freq_415 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_85 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_technology : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_1999 : num 0 0.07 0 0 0 0 0 0 0 0 ...
## $ word_freq_parts : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_pm : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_direct : num 0 0 0.06 0 0 0 0 0 0 0 ...
## $ word_freq_cs : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_meeting : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_original : num 0 0 0.12 0 0 0 0 0 0.3 0 ...
## $ word_freq_project : num 0 0 0 0 0 0 0 0 0 0.06 ...
## $ word_freq_re : num 0 0 0.06 0 0 0 0 0 0 0 ...
## $ word_freq_edu : num 0 0 0.06 0 0 0 0 0 0 0 ...
## $ word_freq_table : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_conference : num 0 0 0 0 0 0 0 0 0 0 ...
## $ char_freq_. : num 0 0 0.01 0 0 0 0 0 0 0.04 ...
## $ char_freq_..1 : num 0 0.132 0.143 0.137 0.135 0.223 0.054 0.206 0.271 0.03 ...
## $ char_freq_..2 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ char_freq_..3 : num 0.778 0.372 0.276 0.137 0.135 0 0.164 0 0.181 0.244 ...
## $ char_freq_..4 : num 0 0.18 0.184 0 0 0 0.054 0 0.203 0.081 ...
## $ char_freq_..5 : num 0 0.048 0.01 0 0 0 0 0 0.022 0 ...
## $ capital_run_length_average: num 3.76 5.11 9.82 3.54 3.54 ...
## $ capital_run_length_longest: int 61 101 485 40 40 15 4 11 445 43 ...
## $ capital_run_length_total : int 278 1028 2259 191 191 54 112 49 1257 749 ...
## $ status : int 1 1 1 1 1 1 1 1 1 1 ...
Dataframe Summary
lapply(dfrDataset, FUN=summary)
## $word_freq_make
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1046 0.0000 4.5400
##
## $word_freq_address
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.213 0.000 14.280
##
## $word_freq_all
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2807 0.4200 5.1000
##
## $word_freq_3d
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.06542 0.00000 42.81000
##
## $word_freq_our
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3122 0.3800 10.0000
##
## $word_freq_over
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.0959 0.0000 5.8800
##
## $word_freq_remove
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1142 0.0000 7.2700
##
## $word_freq_internet
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1053 0.0000 11.1100
##
## $word_freq_order
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.09007 0.00000 5.26000
##
## $word_freq_mail
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2394 0.1600 18.1800
##
## $word_freq_receive
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.05982 0.00000 2.61000
##
## $word_freq_will
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.1000 0.5417 0.8000 9.6700
##
## $word_freq_people
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.09393 0.00000 5.55000
##
## $word_freq_report
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.05863 0.00000 10.00000
##
## $word_freq_addresses
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.0492 0.0000 4.4100
##
## $word_freq_free
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2488 0.1000 20.0000
##
## $word_freq_business
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1426 0.0000 7.1400
##
## $word_freq_email
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1847 0.0000 9.0900
##
## $word_freq_you
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 1.310 1.662 2.640 18.750
##
## $word_freq_credit
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.08558 0.00000 18.18000
##
## $word_freq_your
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.2200 0.8098 1.2700 11.1100
##
## $word_freq_font
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1212 0.0000 17.1000
##
## $word_freq_000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1016 0.0000 5.4500
##
## $word_freq_money
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.09427 0.00000 12.50000
##
## $word_freq_hp
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.5495 0.0000 20.8300
##
## $word_freq_hpl
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2654 0.0000 16.6600
##
## $word_freq_george
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.7673 0.0000 33.3300
##
## $word_freq_650
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1248 0.0000 9.0900
##
## $word_freq_lab
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.09892 0.00000 14.28000
##
## $word_freq_labs
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1029 0.0000 5.8800
##
## $word_freq_telnet
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.06475 0.00000 12.50000
##
## $word_freq_857
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.04705 0.00000 4.76000
##
## $word_freq_data
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.09723 0.00000 18.18000
##
## $word_freq_415
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.04784 0.00000 4.76000
##
## $word_freq_85
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1054 0.0000 20.0000
##
## $word_freq_technology
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.09748 0.00000 7.69000
##
## $word_freq_1999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.137 0.000 6.890
##
## $word_freq_parts
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.0132 0.0000 8.3300
##
## $word_freq_pm
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.07863 0.00000 11.11000
##
## $word_freq_direct
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.06483 0.00000 4.76000
##
## $word_freq_cs
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.04367 0.00000 7.14000
##
## $word_freq_meeting
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1323 0.0000 14.2800
##
## $word_freq_original
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.0461 0.0000 3.5700
##
## $word_freq_project
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.0792 0.0000 20.0000
##
## $word_freq_re
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3012 0.1100 21.4200
##
## $word_freq_edu
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1798 0.0000 22.0500
##
## $word_freq_table
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000000 0.000000 0.000000 0.005444 0.000000 2.170000
##
## $word_freq_conference
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.03187 0.00000 10.00000
##
## $char_freq_.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.03857 0.00000 4.38500
##
## $char_freq_..1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.065 0.139 0.188 9.752
##
## $char_freq_..2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.01698 0.00000 4.08100
##
## $char_freq_..3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2691 0.3150 32.4800
##
## $char_freq_..4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.07581 0.05200 6.00300
##
## $char_freq_..5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.04424 0.00000 19.83000
##
## $capital_run_length_average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.588 2.276 5.192 3.706 1102.000
##
## $capital_run_length_longest
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 6.00 15.00 52.17 43.00 9989.00
##
## $capital_run_length_total
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 35.0 95.0 283.3 266.0 15840.0
##
## $status
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.394 1.000 1.000
Missing Data
lapply(dfrDataset, FUN=detectNA)
## $word_freq_make
## [1] 0
##
## $word_freq_address
## [1] 0
##
## $word_freq_all
## [1] 0
##
## $word_freq_3d
## [1] 0
##
## $word_freq_our
## [1] 0
##
## $word_freq_over
## [1] 0
##
## $word_freq_remove
## [1] 0
##
## $word_freq_internet
## [1] 0
##
## $word_freq_order
## [1] 0
##
## $word_freq_mail
## [1] 0
##
## $word_freq_receive
## [1] 0
##
## $word_freq_will
## [1] 0
##
## $word_freq_people
## [1] 0
##
## $word_freq_report
## [1] 0
##
## $word_freq_addresses
## [1] 0
##
## $word_freq_free
## [1] 0
##
## $word_freq_business
## [1] 0
##
## $word_freq_email
## [1] 0
##
## $word_freq_you
## [1] 0
##
## $word_freq_credit
## [1] 0
##
## $word_freq_your
## [1] 0
##
## $word_freq_font
## [1] 0
##
## $word_freq_000
## [1] 0
##
## $word_freq_money
## [1] 0
##
## $word_freq_hp
## [1] 0
##
## $word_freq_hpl
## [1] 0
##
## $word_freq_george
## [1] 0
##
## $word_freq_650
## [1] 0
##
## $word_freq_lab
## [1] 0
##
## $word_freq_labs
## [1] 0
##
## $word_freq_telnet
## [1] 0
##
## $word_freq_857
## [1] 0
##
## $word_freq_data
## [1] 0
##
## $word_freq_415
## [1] 0
##
## $word_freq_85
## [1] 0
##
## $word_freq_technology
## [1] 0
##
## $word_freq_1999
## [1] 0
##
## $word_freq_parts
## [1] 0
##
## $word_freq_pm
## [1] 0
##
## $word_freq_direct
## [1] 0
##
## $word_freq_cs
## [1] 0
##
## $word_freq_meeting
## [1] 0
##
## $word_freq_original
## [1] 0
##
## $word_freq_project
## [1] 0
##
## $word_freq_re
## [1] 0
##
## $word_freq_edu
## [1] 0
##
## $word_freq_table
## [1] 0
##
## $word_freq_conference
## [1] 0
##
## $char_freq_.
## [1] 0
##
## $char_freq_..1
## [1] 0
##
## $char_freq_..2
## [1] 0
##
## $char_freq_..3
## [1] 0
##
## $char_freq_..4
## [1] 0
##
## $char_freq_..5
## [1] 0
##
## $capital_run_length_average
## [1] 0
##
## $capital_run_length_longest
## [1] 0
##
## $capital_run_length_total
## [1] 0
##
## $status
## [1] 0
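With 58 columns, the per-column list above is long to scan. A more compact check could look like the sketch below. It uses a toy data frame (`dfrToy`, made-up values) in place of `dfrDataset` so that it runs on its own.

```r
# Toy data frame standing in for dfrDataset; the values are made up.
dfrToy <- data.frame(a = c(1, NA, 3),
                     b = c("x", "y", NA),
                     c = 1:3)

# Per-column NA counts as one named vector instead of a long list.
vcnNaCounts <- colSums(is.na(dfrToy))
vcnNaCounts

# Show only the columns that have at least one missing value.
vcnNaCounts[vcnNaCounts > 0]

# Single TRUE/FALSE answer for the whole data frame.
any(is.na(dfrToy))
```

On the spam data the second expression would be expected to return an empty vector, since the data description states there are no missing attribute values.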
Check output
dfrStatusFreq <- summarise(group_by(dfrDataset, status), count=n())
# bar chart of e-mail counts by status (0 = non-spam, 1 = spam)
ggplot(dfrStatusFreq, aes(x=status, y=count)) +
  geom_bar(stat="identity", aes(fill=count)) +
  labs(title="spam Frequency Distribution") +
  labs(x="status") +
  labs(y="Counts")
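As a numeric companion to the bar chart, the class shares can be worked out directly from the counts given in the data description (2788 non-spam, 1813 spam). This is a minimal sketch; the vector name `vcnCounts` is hypothetical.

```r
# Class counts taken from the data description above.
vcnCounts <- c(non_spam = 2788, spam = 1813)

# Percentage share of each class, rounded to one decimal place.
vcnShares <- round(100 * vcnCounts / sum(vcnCounts), 1)
vcnShares
```

The shares come to 60.6% non-spam and 39.4% spam, matching the stated class distribution. The classes are moderately imbalanced, which is worth keeping in mind when reading accuracy figures.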
Find Correlations
## find correlations
vcnCorsData <- abs(sapply(colnames(dfrDataset), detectCor))
summary(vcnCorsData)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.002525 0.148800 0.253300 0.276900 0.354700 1.000000
Show Correlations
vcnCorsData
## word_freq_make word_freq_address
## 0.24069974 0.29750940
## word_freq_all word_freq_3d
## 0.33283147 0.09077776
## word_freq_our word_freq_over
## 0.40913946 0.31864550
## word_freq_remove word_freq_internet
## 0.51877779 0.34379623
## word_freq_order word_freq_mail
## 0.30073703 0.29682394
## word_freq_receive word_freq_will
## 0.35496682 0.14847653
## word_freq_people word_freq_report
## 0.21287588 0.14977533
## word_freq_addresses word_freq_free
## 0.26515743 0.50416922
## word_freq_business word_freq_email
## 0.35290749 0.29909391
## word_freq_you word_freq_credit
## 0.36110406 0.32418657
## word_freq_your word_freq_font
## 0.50159062 0.13797471
## word_freq_000 word_freq_money
## 0.42580256 0.47215455
## word_freq_hp word_freq_hpl
## 0.39981558 0.34188069
## word_freq_george word_freq_650
## 0.35393063 0.22619064
## word_freq_lab word_freq_labs
## 0.22068802 0.24580530
## word_freq_telnet word_freq_857
## 0.20467400 0.16983798
## word_freq_data word_freq_415
## 0.15756347 0.15802818
## word_freq_85 word_freq_technology
## 0.21413087 0.16680254
## word_freq_1999 word_freq_parts
## 0.26070752 0.00252536
## word_freq_pm word_freq_direct
## 0.14721389 0.02813193
## word_freq_cs word_freq_meeting
## 0.14453750 0.19574176
## word_freq_original word_freq_project
## 0.10781412 0.14453744
## word_freq_re word_freq_edu
## 0.07176763 0.19702549
## word_freq_table word_freq_conference
## 0.02266674 0.13903044
## char_freq_. char_freq_..1
## 0.05683530 0.03263555
## char_freq_..2 char_freq_..3
## 0.11122690 0.59785363
## char_freq_..4 char_freq_..5
## 0.56563314 0.26668614
## capital_run_length_average capital_run_length_longest
## 0.48794983 0.51515693
## capital_run_length_total status
## 0.44397367 1.00000000
Plot Correlations
corrgram(dfrDataset)
High Correlations
vcnCorsData[vcnCorsData>0.6]
## status
## 1
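The only entry above the 0.6 threshold is status itself, because the label is perfectly correlated with itself. One way to avoid this is to leave the label out before ranking the predictors. The sketch below shows the idea on synthetic data; `dfrToy`, its columns and the 0.5 threshold are made up for illustration.

```r
set.seed(1)
n <- 200
status <- rbinom(n, 1, 0.4)

# Synthetic stand-in for dfrDataset: one informative and one uninformative predictor.
dfrToy <- data.frame(
  strong = status + rnorm(n, sd = 0.3),
  weak   = rnorm(n),
  status = status
)

# Correlate only the predictors with the label, leaving the label itself out.
vcsPredictors <- setdiff(colnames(dfrToy), "status")
vcnCors <- sapply(vcsPredictors, function(x) {
  abs(cor(dfrToy[[x]], dfrToy$status, method = "spearman"))
})

# Rank predictors from most to least correlated with the label.
sort(vcnCors, decreasing = TRUE)

# Threshold filter that can no longer return the label.
vcnCors[vcnCors > 0.5]
```

Applied to the spam data, the same pattern would show that no single predictor exceeds 0.6; the largest value in the output above is about 0.598 (char_freq_..3).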
Create Column result
# result holds the class label as a factor (levels "0" = non-spam, "1" = spam)
dfrDataset <- mutate(dfrDataset, result = as.factor(status))
table(dfrDataset$status)
##
## 0 1
## 2788 1813
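If readable class names are preferred over "0" and "1", the factor can be built with explicit labels. This is a sketch on a made-up status vector, not what the notebook does; its result column keeps the levels 0 and 1, as the later output shows.

```r
# Made-up status values for illustration (1 = spam, 0 = non-spam).
vcnStatus <- c(1, 1, 0, 0, 0, 1)

# Map 0 -> "nonspam" and 1 -> "spam"; the order of `levels` and `labels` must match.
fctResult <- factor(vcnStatus, levels = c(0, 1), labels = c("nonspam", "spam"))

table(fctResult)
```

Naming the levels explicitly makes it harder to swap the two classes by accident, and confusion matrices printed later become easier to read.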
Dataset Split
set.seed(707)
vctTrnRecs <- createDataPartition(y=dfrDataset$status, p=0.5, list=FALSE)
dfrTrnData <- dfrDataset[vctTrnRecs,]
dfrTstData <- dfrDataset[-vctTrnRecs,]
Training Dataset RowCount & ColCount
dim(dfrTrnData)
## [1] 2301 59
Testing Dataset RowCount & ColCount
dim(dfrTstData)
## [1] 2300 59
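The split above passes the numeric status column to createDataPartition. According to the caret documentation, a factor outcome is sampled within each class, while a numeric outcome is grouped by percentiles first. The sketch below uses base R only and a made-up label vector to show the idea of a class-stratified 50/50 split. It is not caret's implementation.

```r
set.seed(707)

# Made-up labels: 60 non-spam (0) and 40 spam (1).
fctStatus <- factor(c(rep(0, 60), rep(1, 40)))

# For each class, draw half of that class's row indices for training.
# (Each class here has many rows; sample() behaves differently when given
# a single number.)
vctTrn <- unlist(lapply(split(seq_along(fctStatus), fctStatus), function(idx) {
  sample(idx, size = round(0.5 * length(idx)))
}))

fctTrn <- fctStatus[vctTrn]
fctTst <- fctStatus[-vctTrn]

# Class shares should be the same in both halves.
prop.table(table(fctTrn))
prop.table(table(fctTst))
```

Comparing the class shares of the two halves in this way is a quick check that the split did not skew the spam rate.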
Training Dataset Head
head(dfrTrnData)
## word_freq_make word_freq_address word_freq_all word_freq_3d
## 4 0.00 0.00 0.00 0
## 10 0.06 0.12 0.77 0
## 11 0.00 0.00 0.00 0
## 12 0.00 0.00 0.25 0
## 15 0.00 0.00 1.42 0
## 18 0.00 0.00 0.00 0
## word_freq_our word_freq_over word_freq_remove word_freq_internet
## 4 0.63 0.00 0.31 0.63
## 10 0.19 0.32 0.38 0.00
## 11 0.00 0.00 0.96 0.00
## 12 0.38 0.25 0.25 0.00
## 15 0.71 0.35 0.00 0.35
## 18 0.00 0.00 0.00 0.00
## word_freq_order word_freq_mail word_freq_receive word_freq_will
## 4 0.31 0.63 0.31 0.31
## 10 0.06 0.00 0.00 0.64
## 11 0.00 1.92 0.96 0.00
## 12 0.00 0.00 0.12 0.12
## 15 0.00 0.71 0.00 0.35
## 18 0.00 0.00 0.00 0.00
## word_freq_people word_freq_report word_freq_addresses word_freq_free
## 4 0.31 0 0.00 0.31
## 10 0.25 0 0.12 0.00
## 11 0.00 0 0.00 0.00
## 12 0.12 0 0.00 0.00
## 15 0.00 0 0.00 5.35
## 18 0.00 0 0.00 0.00
## word_freq_business word_freq_email word_freq_you word_freq_credit
## 4 0 0.00 3.18 0.00
## 10 0 0.12 1.67 0.06
## 11 0 0.96 3.84 0.00
## 12 0 0.00 1.16 0.00
## 15 0 0.00 3.21 0.00
## 18 0 0.00 0.00 0.00
## word_freq_your word_freq_font word_freq_000 word_freq_money
## 4 0.31 0 0.00 0
## 10 0.71 0 0.19 0
## 11 0.96 0 0.00 0
## 12 0.77 0 0.00 0
## 15 2.85 0 0.35 0
## 18 2.11 0 0.70 0
## word_freq_hp word_freq_hpl word_freq_george word_freq_650 word_freq_lab
## 4 0 0 0 0 0
## 10 0 0 0 0 0
## 11 0 0 0 0 0
## 12 0 0 0 0 0
## 15 0 0 0 0 0
## 18 0 0 0 0 0
## word_freq_labs word_freq_telnet word_freq_857 word_freq_data
## 4 0 0 0 0
## 10 0 0 0 0
## 11 0 0 0 0
## 12 0 0 0 0
## 15 0 0 0 0
## 18 0 0 0 0
## word_freq_415 word_freq_85 word_freq_technology word_freq_1999
## 4 0 0 0 0
## 10 0 0 0 0
## 11 0 0 0 0
## 12 0 0 0 0
## 15 0 0 0 0
## 18 0 0 0 0
## word_freq_parts word_freq_pm word_freq_direct word_freq_cs
## 4 0 0 0.00 0
## 10 0 0 0.00 0
## 11 0 0 0.96 0
## 12 0 0 0.00 0
## 15 0 0 0.00 0
## 18 0 0 0.00 0
## word_freq_meeting word_freq_original word_freq_project word_freq_re
## 4 0 0 0.00 0
## 10 0 0 0.06 0
## 11 0 0 0.00 0
## 12 0 0 0.00 0
## 15 0 0 0.00 0
## 18 0 0 0.00 0
## word_freq_edu word_freq_table word_freq_conference char_freq_.
## 4 0 0 0 0.000
## 10 0 0 0 0.040
## 11 0 0 0 0.000
## 12 0 0 0 0.022
## 15 0 0 0 0.000
## 18 0 0 0 0.000
## char_freq_..1 char_freq_..2 char_freq_..3 char_freq_..4 char_freq_..5
## 4 0.137 0 0.137 0.000 0
## 10 0.030 0 0.244 0.081 0
## 11 0.000 0 0.462 0.000 0
## 12 0.044 0 0.663 0.000 0
## 15 0.102 0 0.357 0.000 0
## 18 0.000 0 1.975 0.370 0
## capital_run_length_average capital_run_length_longest
## 4 3.537 40
## 10 1.729 43
## 11 1.312 6
## 12 1.243 11
## 15 1.971 24
## 18 35.461 95
## capital_run_length_total status result
## 4 191 1 1
## 10 749 1 1
## 11 21 1 1
## 12 184 1 1
## 15 205 1 1
## 18 461 1 1
Testing Dataset Head
head(dfrTstData)
## word_freq_make word_freq_address word_freq_all word_freq_3d
## 1 0.00 0.64 0.64 0
## 2 0.21 0.28 0.50 0
## 3 0.06 0.00 0.71 0
## 5 0.00 0.00 0.00 0
## 6 0.00 0.00 0.00 0
## 7 0.00 0.00 0.00 0
## word_freq_our word_freq_over word_freq_remove word_freq_internet
## 1 0.32 0.00 0.00 0.00
## 2 0.14 0.28 0.21 0.07
## 3 1.23 0.19 0.19 0.12
## 5 0.63 0.00 0.31 0.63
## 6 1.85 0.00 0.00 1.85
## 7 1.92 0.00 0.00 0.00
## word_freq_order word_freq_mail word_freq_receive word_freq_will
## 1 0.00 0.00 0.00 0.64
## 2 0.00 0.94 0.21 0.79
## 3 0.64 0.25 0.38 0.45
## 5 0.31 0.63 0.31 0.31
## 6 0.00 0.00 0.00 0.00
## 7 0.00 0.64 0.96 1.28
## word_freq_people word_freq_report word_freq_addresses word_freq_free
## 1 0.00 0.00 0.00 0.32
## 2 0.65 0.21 0.14 0.14
## 3 0.12 0.00 1.75 0.06
## 5 0.31 0.00 0.00 0.31
## 6 0.00 0.00 0.00 0.00
## 7 0.00 0.00 0.00 0.96
## word_freq_business word_freq_email word_freq_you word_freq_credit
## 1 0.00 1.29 1.93 0.00
## 2 0.07 0.28 3.47 0.00
## 3 0.06 1.03 1.36 0.32
## 5 0.00 0.00 3.18 0.00
## 6 0.00 0.00 0.00 0.00
## 7 0.00 0.32 3.85 0.00
## word_freq_your word_freq_font word_freq_000 word_freq_money word_freq_hp
## 1 0.96 0 0.00 0.00 0
## 2 1.59 0 0.43 0.43 0
## 3 0.51 0 1.16 0.06 0
## 5 0.31 0 0.00 0.00 0
## 6 0.00 0 0.00 0.00 0
## 7 0.64 0 0.00 0.00 0
## word_freq_hpl word_freq_george word_freq_650 word_freq_lab
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## 7 0 0 0 0
## word_freq_labs word_freq_telnet word_freq_857 word_freq_data
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## 7 0 0 0 0
## word_freq_415 word_freq_85 word_freq_technology word_freq_1999
## 1 0 0 0 0.00
## 2 0 0 0 0.07
## 3 0 0 0 0.00
## 5 0 0 0 0.00
## 6 0 0 0 0.00
## 7 0 0 0 0.00
## word_freq_parts word_freq_pm word_freq_direct word_freq_cs
## 1 0 0 0.00 0
## 2 0 0 0.00 0
## 3 0 0 0.06 0
## 5 0 0 0.00 0
## 6 0 0 0.00 0
## 7 0 0 0.00 0
## word_freq_meeting word_freq_original word_freq_project word_freq_re
## 1 0 0.00 0 0.00
## 2 0 0.00 0 0.00
## 3 0 0.12 0 0.06
## 5 0 0.00 0 0.00
## 6 0 0.00 0 0.00
## 7 0 0.00 0 0.00
## word_freq_edu word_freq_table word_freq_conference char_freq_.
## 1 0.00 0 0 0.00
## 2 0.00 0 0 0.00
## 3 0.06 0 0 0.01
## 5 0.00 0 0 0.00
## 6 0.00 0 0 0.00
## 7 0.00 0 0 0.00
## char_freq_..1 char_freq_..2 char_freq_..3 char_freq_..4 char_freq_..5
## 1 0.000 0 0.778 0.000 0.000
## 2 0.132 0 0.372 0.180 0.048
## 3 0.143 0 0.276 0.184 0.010
## 5 0.135 0 0.135 0.000 0.000
## 6 0.223 0 0.000 0.000 0.000
## 7 0.054 0 0.164 0.054 0.000
## capital_run_length_average capital_run_length_longest
## 1 3.756 61
## 2 5.114 101
## 3 9.821 485
## 5 3.537 40
## 6 3.000 15
## 7 1.671 4
## capital_run_length_total status result
## 1 278 1 1
## 2 1028 1 1
## 3 2259 1 1
## 5 191 1 1
## 6 54 1 1
## 7 112 1 1
Training Dataset Summary
lapply(dfrTrnData, FUN=summary)
## $word_freq_make
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.09593 0.00000 3.03000
##
## $word_freq_address
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.197 0.000 14.280
##
## $word_freq_all
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2805 0.4200 4.3400
##
## $word_freq_3d
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.02612 0.00000 19.16000
##
## $word_freq_our
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2885 0.3600 6.2500
##
## $word_freq_over
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.09164 0.00000 2.63000
##
## $word_freq_remove
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1161 0.0000 7.2700
##
## $word_freq_internet
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.114 0.000 11.110
##
## $word_freq_order
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.09188 0.00000 5.26000
##
## $word_freq_mail
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2385 0.1700 18.1800
##
## $word_freq_receive
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.05615 0.00000 2.06000
##
## $word_freq_will
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.1100 0.5434 0.7800 7.6900
##
## $word_freq_people
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.09465 0.00000 5.55000
##
## $word_freq_report
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.06062 0.00000 5.12000
##
## $word_freq_addresses
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.04799 0.00000 2.86000
##
## $word_freq_free
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2455 0.0900 16.6600
##
## $word_freq_business
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1405 0.0000 7.1400
##
## $word_freq_email
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1807 0.0000 7.6900
##
## $word_freq_you
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 1.280 1.674 2.640 18.750
##
## $word_freq_credit
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.09603 0.00000 18.18000
##
## $word_freq_your
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.200 0.803 1.280 11.110
##
## $word_freq_font
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1214 0.0000 15.4300
##
## $word_freq_000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1028 0.0000 5.4500
##
## $word_freq_money
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.08717 0.00000 12.50000
##
## $word_freq_hp
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.5438 0.0000 20.8300
##
## $word_freq_hpl
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.262 0.000 10.860
##
## $word_freq_george
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.8086 0.0000 33.3300
##
## $word_freq_650
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1301 0.0000 5.8800
##
## $word_freq_lab
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1014 0.0000 14.2800
##
## $word_freq_labs
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.09868 0.00000 5.88000
##
## $word_freq_telnet
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.06855 0.00000 12.50000
##
## $word_freq_857
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.04944 0.00000 4.76000
##
## $word_freq_data
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1086 0.0000 18.1800
##
## $word_freq_415
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.05072 0.00000 4.76000
##
## $word_freq_85
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1093 0.0000 20.0000
##
## $word_freq_technology
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1041 0.0000 7.6900
##
## $word_freq_1999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1312 0.0000 5.0500
##
## $word_freq_parts
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.01097 0.00000 6.45000
##
## $word_freq_pm
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.08171 0.00000 9.75000
##
## $word_freq_direct
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.06669 0.00000 4.76000
##
## $word_freq_cs
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.0428 0.0000 7.1400
##
## $word_freq_meeting
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1341 0.0000 14.2800
##
## $word_freq_original
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.04902 0.00000 3.57000
##
## $word_freq_project
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.09887 0.00000 20.00000
##
## $word_freq_re
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3006 0.1000 20.0000
##
## $word_freq_edu
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1691 0.0000 22.0500
##
## $word_freq_table
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.00498 0.00000 1.91000
##
## $word_freq_conference
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.02907 0.00000 10.00000
##
## $char_freq_.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.04104 0.00000 4.38500
##
## $char_freq_..1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.065 0.137 0.189 5.277
##
## $char_freq_..2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.01572 0.00000 4.08100
##
## $char_freq_..3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2799 0.3180 32.4800
##
## $char_freq_..4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.07774 0.05600 5.30000
##
## $char_freq_..5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.03824 0.00000 13.13000
##
## $capital_run_length_average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.562 2.250 5.578 3.689 1102.000
##
## $capital_run_length_longest
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 6.00 14.00 50.22 43.00 2204.00
##
## $capital_run_length_total
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 35.0 94.0 273.1 264.0 9163.0
##
## $status
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.4037 1.0000 1.0000
##
## $result
## 0 1
## 1372 929
Testing Dataset Summary
lapply(dfrTstData, FUN=summary)
## $word_freq_make
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1132 0.0000 4.5400
##
## $word_freq_address
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2291 0.0000 14.2800
##
## $word_freq_all
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2808 0.4200 5.1000
##
## $word_freq_3d
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1047 0.0000 42.8100
##
## $word_freq_our
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3359 0.4125 10.0000
##
## $word_freq_over
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1002 0.0000 5.8800
##
## $word_freq_remove
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1123 0.0000 7.2700
##
## $word_freq_internet
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.09655 0.00000 5.88000
##
## $word_freq_order
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.08826 0.00000 3.33000
##
## $word_freq_mail
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2403 0.1400 11.1100
##
## $word_freq_receive
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.0635 0.0000 2.6100
##
## $word_freq_will
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.1000 0.5400 0.8025 9.6700
##
## $word_freq_people
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.09321 0.00000 5.55000
##
## $word_freq_report
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.05663 0.00000 10.00000
##
## $word_freq_addresses
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.05042 0.00000 4.41000
##
## $word_freq_free
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2522 0.1300 20.0000
##
## $word_freq_business
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1447 0.0000 5.1200
##
## $word_freq_email
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1887 0.0000 9.0900
##
## $word_freq_you
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 1.325 1.650 2.642 14.280
##
## $word_freq_credit
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.07512 0.00000 6.25000
##
## $word_freq_your
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.2500 0.8165 1.2700 10.7100
##
## $word_freq_font
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.121 0.000 17.100
##
## $word_freq_000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1005 0.0000 4.3200
##
## $word_freq_money
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1014 0.0000 9.0900
##
## $word_freq_hp
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.5552 0.0000 20.0000
##
## $word_freq_hpl
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2688 0.0000 16.6600
##
## $word_freq_george
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.726 0.000 33.330
##
## $word_freq_650
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1196 0.0000 9.0900
##
## $word_freq_lab
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.09644 0.00000 11.11000
##
## $word_freq_labs
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.107 0.000 4.760
##
## $word_freq_telnet
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.06095 0.00000 4.76000
##
## $word_freq_857
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.04466 0.00000 4.76000
##
## $word_freq_data
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.08586 0.00000 8.33000
##
## $word_freq_415
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.04495 0.00000 4.76000
##
## $word_freq_85
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1015 0.0000 4.7600
##
## $word_freq_technology
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.09083 0.00000 4.76000
##
## $word_freq_1999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1427 0.0000 6.8900
##
## $word_freq_parts
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.01543 0.00000 8.33000
##
## $word_freq_pm
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.07554 0.00000 11.11000
##
## $word_freq_direct
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.06297 0.00000 4.76000
##
## $word_freq_cs
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.04453 0.00000 7.14000
##
## $word_freq_meeting
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1306 0.0000 14.2800
##
## $word_freq_original
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.04318 0.00000 3.57000
##
## $word_freq_project
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.05951 0.00000 5.00000
##
## $word_freq_re
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3019 0.1200 21.4200
##
## $word_freq_edu
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1905 0.0000 16.7000
##
## $word_freq_table
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000000 0.000000 0.000000 0.005909 0.000000 2.170000
##
## $word_freq_conference
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.03467 0.00000 8.33000
##
## $char_freq_.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.03611 0.00000 4.18700
##
## $char_freq_..1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0650 0.1411 0.1860 9.7520
##
## $char_freq_..2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.01823 0.00000 2.77700
##
## $char_freq_..3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2583 0.3113 9.5750
##
## $char_freq_..4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.07388 0.04725 6.00300
##
## $char_freq_..5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.05024 0.00000 19.83000
##
## $capital_run_length_average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.608 2.294 4.805 3.746 667.000
##
## $capital_run_length_longest
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 6.00 15.00 54.13 44.00 9989.00
##
## $capital_run_length_total
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 35.0 97.0 293.5 270.0 15840.0
##
## $status
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3843 1.0000 1.0000
##
## $result
## 0 1
## 1416 884
Create Model - Random Forest (Default)
# set seed
set.seed(707)
# mtry: square root of the number of predictors (all columns except status and result)
myMtry <- sqrt(ncol(dfrTrnData)-2)
# number of trees
myNtrees <- 500
# start time
vctProcStrt <- proc.time()
# random forest with mtry and ntree set explicitly
#mdlRndForDef <- randomForest(result~.-status, data=dfrTrnData)
mdlRndForDef <- randomForest(result~.-status, data=dfrTrnData,
                             mtry=myMtry, ntree=myNtrees)
# end time
vctProcEnds <- proc.time()
# print user CPU time in seconds (element 1 of proc.time())
print(paste("Model Created ...",vctProcEnds[1] - vctProcStrt[1]))
## [1] "Model Created ... 5.896"
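The timing printed above subtracts the first element of proc.time(), which is user CPU time rather than wall-clock time. The sketch below shows the difference using base R only. The loop is a hypothetical stand-in workload, not the model fit, and the variable names are illustrative.

```r
# Stand-in workload (hypothetical; replaces the model fit for illustration)
vctStrt <- proc.time()
x <- 0
for (i in 1:200000) x <- x + sqrt(i)
vctDiff <- proc.time() - vctStrt

# Element 1 is user CPU time; "elapsed" is wall-clock time
names(vctDiff)[1]
vctDiff[["elapsed"]]

# system.time() wraps the same measurement in a single call
tmFit <- system.time(for (i in 1:200000) x <- x + sqrt(i))
tmFit[["elapsed"]]
```

When a fit runs in parallel or waits on I/O, user CPU time and elapsed time can differ noticeably, so it helps to state which one is reported.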
View Model - Default Random Forest
mdlRndForDef
##
## Call:
## randomForest(formula = result ~ . - status, data = dfrTrnData, mtry = myMtry, ntree = myNtrees)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 8
##
## OOB estimate of error rate: 4.87%
## Confusion matrix:
## 0 1 class.error
## 0 1333 39 0.02842566
## 1 73 856 0.07857912
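The OOB error rate and the class.error column can be recomputed from the four counts in the confusion matrix above. This is a minimal base R sketch with the counts typed in by hand; the variable names are illustrative and it does not read them from mdlRndForDef.

```r
# OOB confusion matrix as printed above (rows = actual class, columns = predicted class)
mtxOob <- matrix(c(1333,  39,
                     73, 856),
                 nrow=2, byrow=TRUE,
                 dimnames=list(actual=c("0","1"), predicted=c("0","1")))

# per-class error: share of each actual class that was misclassified
vctClsErr <- 1 - diag(mtxOob) / rowSums(mtxOob)
round(vctClsErr, 8)

# overall OOB error rate in percent: misclassified rows over all rows
numOobErr <- 1 - sum(diag(mtxOob)) / sum(mtxOob)
round(100 * numOobErr, 2)
```

The row totals (1372 and 929) match the class counts of result in the training data, and the spam class (1) is misclassified more often than the non-spam class.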
View Model Summary - Default Random Forest
summary(mdlRndForDef)
## Length Class Mode
## call 5 -none- call
## type 1 -none- character
## predicted 2301 factor numeric
## err.rate 1500 -none- numeric
## confusion 6 -none- numeric
## votes 4602 matrix numeric
## oob.times 2301 -none- numeric
## classes 2 -none- character
## importance 57 -none- numeric
## importanceSD 0 -none- NULL
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 14 -none- list
## y 2301 factor numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## terms 3 terms call
Prediction - Test Data - Random Forest (Default)
vctRndForDef <- predict(mdlRndForDef, newdata=dfrTstData)
cmxRndForDef <- confusionMatrix(vctRndForDef, dfrTstData$result)
cmxRndForDef
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1358 65
## 1 58 819
##
## Accuracy : 0.9465
## 95% CI : (0.9365, 0.9554)
## No Information Rate : 0.6157
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8868
## Mcnemar's Test P-Value : 0.5885
##
## Sensitivity : 0.9590
## Specificity : 0.9265
## Pos Pred Value : 0.9543
## Neg Pred Value : 0.9339
## Prevalence : 0.6157
## Detection Rate : 0.5904
## Detection Prevalence : 0.6187
## Balanced Accuracy : 0.9428
##
## 'Positive' Class : 0
##
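The statistics reported by confusionMatrix follow directly from the four cell counts. Because the positive class is 0 (not spam), Sensitivity here is the share of non-spam e-mails classified correctly and Specificity is the share of spam caught. The base R sketch below recomputes them from counts typed in by hand; the variable names are illustrative.

```r
# Test-set confusion matrix as printed above (rows = prediction, columns = reference)
mtxTst <- matrix(c(1358,  65,
                     58, 819),
                 nrow=2, byrow=TRUE,
                 dimnames=list(prediction=c("0","1"), reference=c("0","1")))

numTp <- mtxTst["0","0"]   # predicted 0, actually 0
numFn <- mtxTst["1","0"]   # predicted 1, actually 0
numFp <- mtxTst["0","1"]   # predicted 0, actually 1
numTn <- mtxTst["1","1"]   # predicted 1, actually 1

numAcc  <- (numTp + numTn) / sum(mtxTst)
numSens <- numTp / (numTp + numFn)
numSpec <- numTn / (numTn + numFp)
numPpv  <- numTp / (numTp + numFp)
numNpv  <- numTn / (numTn + numFn)
round(c(Accuracy=numAcc, Sensitivity=numSens, Specificity=numSpec,
        PosPredValue=numPpv, NegPredValue=numNpv,
        BalancedAccuracy=(numSens + numSpec)/2), 4)
```

To report the metrics with spam as the positive class, confusionMatrix accepts a positive argument, for example positive="1".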
Create Model - Random Forest (RFM)
# set seed
set.seed(707)
# start time
vctProcStrt <- proc.time()
# random forest via caret (method="rf") with 10-fold cross-validation;
# repeats only applies to method="repeatedcv", so it is omitted here
myControl <- trainControl(method="cv", number=10)
myMetric <- "Accuracy"
myMtry <- sqrt(ncol(dfrTrnData)-2)
#myNtrees <- 500
myTuneGrid <- expand.grid(.mtry=myMtry)
mdlRndForRfm <- train(result~.-status, data=dfrTrnData, method="rf",
                      verbose=FALSE, metric=myMetric, trControl=myControl,
                      tuneGrid=myTuneGrid)
# end time
vctProcEnds <- proc.time()
# print user CPU time in seconds (element 1 of proc.time())
print(paste("Model Created ...",vctProcEnds[1] - vctProcStrt[1]))
## [1] "Model Created ... 54.526"
View Model - Random Forest (RFM)
mdlRndForRfm
## Random Forest
##
## 2301 samples
## 58 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 2071, 2071, 2071, 2071, 2071, 2071, ...
## Resampling results:
##
## Accuracy Kappa
## 0.9513307 0.8983645
##
## Tuning parameter 'mtry' was held constant at a value of 7.549834
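The grid value 7.549834 is sqrt(57), which is not an integer, while the randomForest printout earlier reports 8 variables tried at each split. That is consistent with the value being rounded before use. The package's documented default for classification, floor(sqrt(p)), would give 7 instead. A quick base R check, assuming dfrTrnData has 59 columns (57 predictors plus status and result):

```r
numPred <- 59 - 2          # assumed column count of dfrTrnData minus status and result
numMtry <- sqrt(numPred)
numMtry                    # non-integer value passed as mtry
round(numMtry)             # matches the 8 shown in the randomForest printout
floor(numMtry)             # randomForest's documented default for classification
```

Passing an integer such as floor(sqrt(p)) or round(sqrt(p)) makes the value actually used explicit in both the code and the printed output.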
View Model Summary - Random Forest (RFM)
summary(mdlRndForRfm)
## Length Class Mode
## call 5 -none- call
## type 1 -none- character
## predicted 2301 factor numeric
## err.rate 1500 -none- numeric
## confusion 6 -none- numeric
## votes 4602 matrix numeric
## oob.times 2301 -none- numeric
## classes 2 -none- character
## importance 57 -none- numeric
## importanceSD 0 -none- NULL
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 14 -none- list
## y 2301 factor numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## xNames 57 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 2 -none- character
## param 1 -none- list
Prediction - Test Data - Random Forest (RFM)
vctRndForRfm <- predict(mdlRndForRfm, newdata=dfrTstData)
cmxRndForRfm <- confusionMatrix(vctRndForRfm, dfrTstData$result)
cmxRndForRfm
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1361 63
## 1 55 821
##
## Accuracy : 0.9487
## 95% CI : (0.9389, 0.9574)
## No Information Rate : 0.6157
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8914
## Mcnemar's Test P-Value : 0.5193
##
## Sensitivity : 0.9612
## Specificity : 0.9287
## Pos Pred Value : 0.9558
## Neg Pred Value : 0.9372
## Prevalence : 0.6157
## Detection Rate : 0.5917
## Detection Prevalence : 0.6191
## Balanced Accuracy : 0.9449
##
## 'Positive' Class : 0
##
Create Model - Gradient Boosting (GBM)
# set seed
set.seed(707)
# start time
vctProcStrt <- proc.time()
# stochastic gradient boosting via caret (method="gbm"), 10-fold CV repeated 3 times;
# no tuneGrid is passed, so caret tunes over its default grid
# (mtry and ntree are random forest parameters and are not used by gbm)
myControl <- trainControl(method="repeatedcv", number=10, repeats=3)
myMetric <- "Accuracy"
mdlRndForGbm <- train(result~.-status, data=dfrTrnData, method="gbm",
                      verbose=FALSE, metric=myMetric, trControl=myControl)
# end time
vctProcEnds <- proc.time()
# print user CPU time in seconds (element 1 of proc.time())
print(paste("Model Created ...",vctProcEnds[1] - vctProcStrt[1]))
## [1] "Model Created ... 56.971"
View Model - Gradient Boosting (GBM)
mdlRndForGbm
## Stochastic Gradient Boosting
##
## 2301 samples
## 58 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 2071, 2071, 2071, 2071, 2071, 2071, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 1 50 0.9100444 0.8092008
## 1 100 0.9297477 0.8520115
## 1 150 0.9368460 0.8674410
## 2 50 0.9300337 0.8527330
## 2 100 0.9403205 0.8749272
## 2 150 0.9422020 0.8790075
## 3 50 0.9361176 0.8660397
## 3 100 0.9422027 0.8790742
## 3 150 0.9448038 0.8845381
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
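The selection rule in the output above ("largest value" of Accuracy) can be replayed on the printed results. This base R sketch rebuilds the nine-row grid by hand from the table; the variable names are illustrative and it does not query mdlRndForGbm.

```r
# Tuning grid in the same order as the printed table:
# n.trees varies fastest within each interaction.depth
dfrGbmRes <- expand.grid(n.trees=c(50, 100, 150), interaction.depth=1:3)

# Cross-validated accuracies copied from the output above
dfrGbmRes$Accuracy <- c(0.9100444, 0.9297477, 0.9368460,
                        0.9300337, 0.9403205, 0.9422020,
                        0.9361176, 0.9422027, 0.9448038)

# Row with the largest accuracy, matching the reported final values
dfrGbmBest <- dfrGbmRes[which.max(dfrGbmRes$Accuracy), ]
dfrGbmBest
```

The best row sits at the edge of the grid (largest n.trees and largest interaction.depth), and accuracy is still rising along both axes, so a wider grid may be worth trying.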
View Model Summary - Gradient Boosting (GBM)
summary(mdlRndForGbm)
## var rel.inf
## char_freq_..4 char_freq_..4 21.33414947
## char_freq_..3 char_freq_..3 18.20329693
## word_freq_remove word_freq_remove 12.10762354
## word_freq_your word_freq_your 7.97368354
## word_freq_hp word_freq_hp 7.38705712
## word_freq_free word_freq_free 6.25976663
## capital_run_length_longest capital_run_length_longest 4.73039630
## capital_run_length_average capital_run_length_average 3.39442327
## word_freq_george word_freq_george 2.40361452
## capital_run_length_total capital_run_length_total 2.36404283
## word_freq_edu word_freq_edu 2.19214809
## word_freq_money word_freq_money 1.84339567
## word_freq_our word_freq_our 1.84028169
## word_freq_you word_freq_you 1.05016021
## word_freq_000 word_freq_000 0.94263837
## word_freq_1999 word_freq_1999 0.93439454
## word_freq_meeting word_freq_meeting 0.64850358
## word_freq_internet word_freq_internet 0.63113886
## word_freq_re word_freq_re 0.58854735
## word_freq_email word_freq_email 0.41132557
## word_freq_receive word_freq_receive 0.30729744
## char_freq_. char_freq_. 0.26937478
## word_freq_business word_freq_business 0.24949150
## char_freq_..1 char_freq_..1 0.21284875
## word_freq_over word_freq_over 0.20501187
## word_freq_hpl word_freq_hpl 0.19665683
## word_freq_will word_freq_will 0.17120902
## word_freq_font word_freq_font 0.16624945
## word_freq_mail word_freq_mail 0.16585011
## word_freq_650 word_freq_650 0.16258298
## word_freq_technology word_freq_technology 0.15822045
## word_freq_credit word_freq_credit 0.15029950
## word_freq_report word_freq_report 0.08795041
## word_freq_all word_freq_all 0.06550245
## word_freq_project word_freq_project 0.05689405
## word_freq_3d word_freq_3d 0.05645784
## word_freq_conference word_freq_conference 0.04319885
## word_freq_data word_freq_data 0.02197469
## word_freq_order word_freq_order 0.01234096
## word_freq_make word_freq_make 0.00000000
## word_freq_address word_freq_address 0.00000000
## word_freq_people word_freq_people 0.00000000
## word_freq_addresses word_freq_addresses 0.00000000
## word_freq_lab word_freq_lab 0.00000000
## word_freq_labs word_freq_labs 0.00000000
## word_freq_telnet word_freq_telnet 0.00000000
## word_freq_857 word_freq_857 0.00000000
## word_freq_415 word_freq_415 0.00000000
## word_freq_85 word_freq_85 0.00000000
## word_freq_parts word_freq_parts 0.00000000
## word_freq_pm word_freq_pm 0.00000000
## word_freq_direct word_freq_direct 0.00000000
## word_freq_cs word_freq_cs 0.00000000
## word_freq_original word_freq_original 0.00000000
## word_freq_table word_freq_table 0.00000000
## char_freq_..2 char_freq_..2 0.00000000
## char_freq_..5 char_freq_..5 0.00000000
Prediction - Test Data - Gradient Boosting (GBM)
vctRndForGbm <- predict(mdlRndForGbm, newdata=dfrTstData)
cmxRndForGbm <- confusionMatrix(vctRndForGbm, dfrTstData$result)
cmxRndForGbm
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1357 65
## 1 59 819
##
## Accuracy : 0.9461
## 95% CI : (0.9361, 0.955)
## No Information Rate : 0.6157
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8859
## Mcnemar's Test P-Value : 0.6534
##
## Sensitivity : 0.9583
## Specificity : 0.9265
## Pos Pred Value : 0.9543
## Neg Pred Value : 0.9328
## Prevalence : 0.6157
## Detection Rate : 0.5900
## Detection Prevalence : 0.6183
## Balanced Accuracy : 0.9424
##
## 'Positive' Class : 0
##
Create Model - Random Forest (OOB)
# set seed
set.seed(707)
# start time
vctProcStrt <- proc.time()
# random forest via caret (train defaults to method="rf") evaluated with
# out-of-bag estimates; number and repeats do not apply to method="oob"
myControl <- trainControl(method="oob")
myMetric <- "Accuracy"
myMtry <- 10 #sqrt(ncol(dfrTrnData)-2)
myNtrees <- 500
myTuneGrid <- expand.grid(.mtry=myMtry)
mdlRndForoob <- train(result~.-status, data=dfrTrnData,
                      verbose=FALSE, metric=myMetric, trControl=myControl,
                      tuneGrid=myTuneGrid, ntree=myNtrees)
# end time
vctProcEnds <- proc.time()
# print user CPU time in seconds (element 1 of proc.time())
print(paste("Model Created ...",vctProcEnds[1] - vctProcStrt[1]))
## [1] "Model Created ... 12.302"
View Model - Random Forest (OOB)
mdlRndForoob
## Random Forest
##
## 2301 samples
## 58 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling results:
##
## Accuracy Kappa
## 0.9491525 0.8939668
##
## Tuning parameter 'mtry' was held constant at a value of 10
View Model Summary - Random Forest (OOB)
summary(mdlRndForoob)
## Length Class Mode
## call 6 -none- call
## type 1 -none- character
## predicted 2301 factor numeric
## err.rate 1500 -none- numeric
## confusion 6 -none- numeric
## votes 4602 matrix numeric
## oob.times 2301 -none- numeric
## classes 2 -none- character
## importance 57 -none- numeric
## importanceSD 0 -none- NULL
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 14 -none- list
## y 2301 factor numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## xNames 57 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 2 -none- character
## param 2 -none- list
Prediction - Test Data - Random Forest (OOB)
vctRndForoob <- predict(mdlRndForoob, newdata=dfrTstData)
cmxRndForoob <- confusionMatrix(vctRndForoob, dfrTstData$result)
cmxRndForoob
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1358 67
## 1 58 817
##
## Accuracy : 0.9457
## 95% CI : (0.9356, 0.9546)
## No Information Rate : 0.6157
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8849
## Mcnemar's Test P-Value : 0.4743
##
## Sensitivity : 0.9590
## Specificity : 0.9242
## Pos Pred Value : 0.9530
## Neg Pred Value : 0.9337
## Prevalence : 0.6157
## Detection Rate : 0.5904
## Detection Prevalence : 0.6196
## Balanced Accuracy : 0.9416
##
## 'Positive' Class : 0
##
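The four test-set confusion matrices above can be placed side by side to compare the models. This base R sketch copies the counts by hand from the outputs; the model labels and variable names are illustrative.

```r
# Test-set confusion counts copied from the four outputs above
# (tp/tn use class 0 as the positive class, as caret reported)
dfrCmp <- data.frame(
  model = c("RF default", "RF caret cv", "GBM", "RF caret oob"),
  tp = c(1358, 1361, 1357, 1358),
  fp = c(65, 63, 65, 67),
  fn = c(58, 55, 59, 58),
  tn = c(819, 821, 819, 817)
)

# accuracy = correctly classified e-mails over all test e-mails
dfrCmp$accuracy <- round((dfrCmp$tp + dfrCmp$tn) /
                         (dfrCmp$tp + dfrCmp$fp + dfrCmp$fn + dfrCmp$tn), 4)
dfrCmp[order(-dfrCmp$accuracy), c("model", "accuracy")]
```

The best and worst models differ by 7 of 2300 test e-mails, which is well inside the 95% confidence intervals reported for each model, so this single split does not separate them clearly.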
Thank You
print("Thank You")
## [1] "Thank You"