Introduction Random Forest
The steps to predict using Random Forest:
* Step 1
* Step 2
Problem Definition
The “spam” concept is diverse: advertisements for products/web sites, make money fast schemes, chain letters, pornography… Our collection of spam e-mails came from our postmaster and individuals who had filed spam. Our collection of non-spam e-mails came from filed work and personal e-mails, and hence the word ‘george’ and the area code ‘650’ are indicators of non-spam. These are useful when constructing a personalized spam filter. One would either have to blind such non-spam indicators or get a very wide collection of non-spam to generate a general purpose spam filter.
Our problem statement is therefore:
Build a model that predicts whether an e-mail is spam or not.
Test the model on a held-out test dataset.
Check the accuracy of its predictions.
Data Location
This file: ‘spambase.DOCUMENTATION’ at the UCI Machine Learning Repository
http://www.ics.uci.edu/~mlearn/MLRepository.html
Data Description
Number of Attributes: 58 (57 continuous, 1 nominal class label)
The last column of ‘spambase.data’ denotes whether the e-mail was considered spam, i.e. unsolicited commercial e-mail (1), or not (0).
Most of the attributes indicate whether a particular word or character was frequently occurring in the e-mail. The run-length attributes (55-57) measure the length of sequences of consecutive capital letters. For the statistical measures of each attribute, see the end of this file. Here are the definitions of the attributes:
48 continuous real [0,100] attributes of type word_freq_WORD = percentage of words in the e-mail that match WORD, i.e. 100 * (number of times the WORD appears in the e-mail) / total number of words in e-mail. A “word” in this case is any string of alphanumeric characters bounded by non-alphanumeric characters or end-of-string.
6 continuous real [0,100] attributes of type char_freq_CHAR = percentage of characters in the e-mail that match CHAR, i.e. 100 * (number of CHAR occurrences) / total characters in e-mail
1 continuous real [1,…] attribute of type capital_run_length_average = average length of uninterrupted sequences of capital letters
1 continuous integer [1,…] attribute of type capital_run_length_longest = length of longest uninterrupted sequence of capital letters
1 continuous integer [1,…] attribute of type capital_run_length_total = sum of length of uninterrupted sequences of capital letters = total number of capital letters in the e-mail
1 nominal {0,1} class attribute of type spam = denotes whether the e-mail was considered spam, i.e. unsolicited commercial e-mail (1), or not (0).
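The attribute definitions above can be illustrated on a single message. The following is a minimal sketch in base R, not the code that was used to build the dataset: the helpers `word_freq` and `capital_runs` and the sample text `msg` are hypothetical, and case-insensitive word matching is an assumption, since the documentation does not say how case is handled.

```r
# Hypothetical helper: percentage of words in `text` that equal `word`.
# Assumption: matching ignores case.
word_freq <- function(text, word) {
  # A "word" is a run of alphanumeric characters, so split on everything else.
  tokens <- strsplit(tolower(text), "[^[:alnum:]]+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  100 * sum(tokens == tolower(word)) / length(tokens)
}

# Hypothetical helper: lengths of uninterrupted runs of capital letters.
capital_runs <- function(text) {
  runs <- regmatches(text, gregexpr("[[:upper:]]+", text))[[1]]
  nchar(runs)
}

msg <- "FREE money! Click NOW to get your free gift"

# 2 of the 9 words are "free", so this is 100 * 2 / 9.
word_freq(msg, "free")

# Runs are "FREE", "C" and "NOW", i.e. lengths 4, 1 and 3.
runs <- capital_runs(msg)
c(average = mean(runs), longest = max(runs), total = sum(runs))
```

The three run-length attributes are just the mean, maximum, and sum of the same vector of run lengths.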
Setup
Load Libs
library(plyr)
library(tidyr)
library(dplyr)
library(ggplot2)
#install.packages("caret")
library(caret)
#install.packages("randomForest")
library(randomForest)
#install.packages("gbm")
library(gbm)
#install.packages("corrgram")
library(corrgram)
#install.packages("corrplot")
library(corrplot)
Functions
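# Count the NA values in a vector (applied column by column below).
detectNA <- function(inp) {
  sum(is.na(inp))
}
# Spearman correlation between column x of the global dfrDataset and 'status'.
detectCor <- function(x) {
  cor(as.numeric(dfrDataset[, x]),
      as.numeric(dfrDataset$status),
      method="spearman")
}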
Load Dataset
setwd("C:/Users/SarveshKumar/Desktop/R/machine learning/assignment/random forest")
dfrDataset <- read.csv("./spambase.csv", header=T, stringsAsFactors=T)
head(dfrDataset)
## word_freq_make word_freq_address word_freq_all word_freq_3d
## 1 0.00 0.64 0.64 0
## 2 0.21 0.28 0.50 0
## 3 0.06 0.00 0.71 0
## 4 0.00 0.00 0.00 0
## 5 0.00 0.00 0.00 0
## 6 0.00 0.00 0.00 0
## word_freq_our word_freq_over word_freq_remove word_freq_internet
## 1 0.32 0.00 0.00 0.00
## 2 0.14 0.28 0.21 0.07
## 3 1.23 0.19 0.19 0.12
## 4 0.63 0.00 0.31 0.63
## 5 0.63 0.00 0.31 0.63
## 6 1.85 0.00 0.00 1.85
## word_freq_order word_freq_mail word_freq_receive word_freq_will
## 1 0.00 0.00 0.00 0.64
## 2 0.00 0.94 0.21 0.79
## 3 0.64 0.25 0.38 0.45
## 4 0.31 0.63 0.31 0.31
## 5 0.31 0.63 0.31 0.31
## 6 0.00 0.00 0.00 0.00
## word_freq_people word_freq_report word_freq_addresses word_freq_free
## 1 0.00 0.00 0.00 0.32
## 2 0.65 0.21 0.14 0.14
## 3 0.12 0.00 1.75 0.06
## 4 0.31 0.00 0.00 0.31
## 5 0.31 0.00 0.00 0.31
## 6 0.00 0.00 0.00 0.00
## word_freq_business word_freq_email word_freq_you word_freq_credit
## 1 0.00 1.29 1.93 0.00
## 2 0.07 0.28 3.47 0.00
## 3 0.06 1.03 1.36 0.32
## 4 0.00 0.00 3.18 0.00
## 5 0.00 0.00 3.18 0.00
## 6 0.00 0.00 0.00 0.00
## word_freq_your word_freq_font word_freq_000 word_freq_money word_freq_hp
## 1 0.96 0 0.00 0.00 0
## 2 1.59 0 0.43 0.43 0
## 3 0.51 0 1.16 0.06 0
## 4 0.31 0 0.00 0.00 0
## 5 0.31 0 0.00 0.00 0
## 6 0.00 0 0.00 0.00 0
## word_freq_hpl word_freq_george word_freq_650 word_freq_lab
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## word_freq_labs word_freq_telnet word_freq_857 word_freq_data
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## word_freq_415 word_freq_85 word_freq_technology word_freq_1999
## 1 0 0 0 0.00
## 2 0 0 0 0.07
## 3 0 0 0 0.00
## 4 0 0 0 0.00
## 5 0 0 0 0.00
## 6 0 0 0 0.00
## word_freq_parts word_freq_pm word_freq_direct word_freq_cs
## 1 0 0 0.00 0
## 2 0 0 0.00 0
## 3 0 0 0.06 0
## 4 0 0 0.00 0
## 5 0 0 0.00 0
## 6 0 0 0.00 0
## word_freq_meeting word_freq_original word_freq_project word_freq_re
## 1 0 0.00 0 0.00
## 2 0 0.00 0 0.00
## 3 0 0.12 0 0.06
## 4 0 0.00 0 0.00
## 5 0 0.00 0 0.00
## 6 0 0.00 0 0.00
## word_freq_edu word_freq_table word_freq_conference char_freq_.
## 1 0.00 0 0 0.00
## 2 0.00 0 0 0.00
## 3 0.06 0 0 0.01
## 4 0.00 0 0 0.00
## 5 0.00 0 0 0.00
## 6 0.00 0 0 0.00
## char_freq_..1 char_freq_..2 char_freq_..3 char_freq_..4 char_freq_..5
## 1 0.000 0 0.778 0.000 0.000
## 2 0.132 0 0.372 0.180 0.048
## 3 0.143 0 0.276 0.184 0.010
## 4 0.137 0 0.137 0.000 0.000
## 5 0.135 0 0.135 0.000 0.000
## 6 0.223 0 0.000 0.000 0.000
## capital_run_length_average capital_run_length_longest
## 1 3.756 61
## 2 5.114 101
## 3 9.821 485
## 4 3.537 40
## 5 3.537 40
## 6 3.000 15
## capital_run_length_total status
## 1 278 1
## 2 1028 1
## 3 2259 1
## 4 191 1
## 5 191 1
## 6 54 1
Observations
1. Data loaded successfully.
2. 4601 records found in the data set.
Dataframe Structure
str(dfrDataset)
## 'data.frame': 4601 obs. of 58 variables:
## $ word_freq_make : num 0 0.21 0.06 0 0 0 0 0 0.15 0.06 ...
## $ word_freq_address : num 0.64 0.28 0 0 0 0 0 0 0 0.12 ...
## $ word_freq_all : num 0.64 0.5 0.71 0 0 0 0 0 0.46 0.77 ...
## $ word_freq_3d : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_our : num 0.32 0.14 1.23 0.63 0.63 1.85 1.92 1.88 0.61 0.19 ...
## $ word_freq_over : num 0 0.28 0.19 0 0 0 0 0 0 0.32 ...
## $ word_freq_remove : num 0 0.21 0.19 0.31 0.31 0 0 0 0.3 0.38 ...
## $ word_freq_internet : num 0 0.07 0.12 0.63 0.63 1.85 0 1.88 0 0 ...
## $ word_freq_order : num 0 0 0.64 0.31 0.31 0 0 0 0.92 0.06 ...
## $ word_freq_mail : num 0 0.94 0.25 0.63 0.63 0 0.64 0 0.76 0 ...
## $ word_freq_receive : num 0 0.21 0.38 0.31 0.31 0 0.96 0 0.76 0 ...
## $ word_freq_will : num 0.64 0.79 0.45 0.31 0.31 0 1.28 0 0.92 0.64 ...
## $ word_freq_people : num 0 0.65 0.12 0.31 0.31 0 0 0 0 0.25 ...
## $ word_freq_report : num 0 0.21 0 0 0 0 0 0 0 0 ...
## $ word_freq_addresses : num 0 0.14 1.75 0 0 0 0 0 0 0.12 ...
## $ word_freq_free : num 0.32 0.14 0.06 0.31 0.31 0 0.96 0 0 0 ...
## $ word_freq_business : num 0 0.07 0.06 0 0 0 0 0 0 0 ...
## $ word_freq_email : num 1.29 0.28 1.03 0 0 0 0.32 0 0.15 0.12 ...
## $ word_freq_you : num 1.93 3.47 1.36 3.18 3.18 0 3.85 0 1.23 1.67 ...
## $ word_freq_credit : num 0 0 0.32 0 0 0 0 0 3.53 0.06 ...
## $ word_freq_your : num 0.96 1.59 0.51 0.31 0.31 0 0.64 0 2 0.71 ...
## $ word_freq_font : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_000 : num 0 0.43 1.16 0 0 0 0 0 0 0.19 ...
## $ word_freq_money : num 0 0.43 0.06 0 0 0 0 0 0.15 0 ...
## $ word_freq_hp : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_hpl : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_george : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_650 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_lab : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_labs : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_telnet : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_857 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_data : num 0 0 0 0 0 0 0 0 0.15 0 ...
## $ word_freq_415 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_85 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_technology : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_1999 : num 0 0.07 0 0 0 0 0 0 0 0 ...
## $ word_freq_parts : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_pm : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_direct : num 0 0 0.06 0 0 0 0 0 0 0 ...
## $ word_freq_cs : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_meeting : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_original : num 0 0 0.12 0 0 0 0 0 0.3 0 ...
## $ word_freq_project : num 0 0 0 0 0 0 0 0 0 0.06 ...
## $ word_freq_re : num 0 0 0.06 0 0 0 0 0 0 0 ...
## $ word_freq_edu : num 0 0 0.06 0 0 0 0 0 0 0 ...
## $ word_freq_table : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_conference : num 0 0 0 0 0 0 0 0 0 0 ...
## $ char_freq_. : num 0 0 0.01 0 0 0 0 0 0 0.04 ...
## $ char_freq_..1 : num 0 0.132 0.143 0.137 0.135 0.223 0.054 0.206 0.271 0.03 ...
## $ char_freq_..2 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ char_freq_..3 : num 0.778 0.372 0.276 0.137 0.135 0 0.164 0 0.181 0.244 ...
## $ char_freq_..4 : num 0 0.18 0.184 0 0 0 0.054 0 0.203 0.081 ...
## $ char_freq_..5 : num 0 0.048 0.01 0 0 0 0 0 0.022 0 ...
## $ capital_run_length_average: num 3.76 5.11 9.82 3.54 3.54 ...
## $ capital_run_length_longest: int 61 101 485 40 40 15 4 11 445 43 ...
## $ capital_run_length_total : int 278 1028 2259 191 191 54 112 49 1257 749 ...
## $ status : int 1 1 1 1 1 1 1 1 1 1 ...
# Random forest handles non-numeric (factor) predictors as well as numeric ones
Observations
1. All 58 columns are of class numeric or integer.
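A quicker way to confirm the column types than reading the full str() listing is to tabulate the class of every column. This is a minimal sketch on a hypothetical three-column data frame `toy`; the same call should work on dfrDataset.

```r
# Hypothetical stand-in for dfrDataset with one numeric and two integer columns.
toy <- data.frame(
  word_freq_make           = c(0, 0.21),
  capital_run_length_total = c(278L, 1028L),
  status                   = c(1L, 1L)
)

# One class name per column, then a count of how often each class occurs.
class_counts <- table(sapply(toy, class))
class_counts
```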
Dataframe Summary
lapply(dfrDataset, FUN=summary)
## $word_freq_make
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1046 0.0000 4.5400
##
## $word_freq_address
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.213 0.000 14.280
##
## $word_freq_all
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2807 0.4200 5.1000
##
## $word_freq_3d
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.06542 0.00000 42.81000
##
## $word_freq_our
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3122 0.3800 10.0000
##
## $word_freq_over
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.0959 0.0000 5.8800
##
## $word_freq_remove
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1142 0.0000 7.2700
##
## $word_freq_internet
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1053 0.0000 11.1100
##
## $word_freq_order
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.09007 0.00000 5.26000
##
## $word_freq_mail
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2394 0.1600 18.1800
##
## $word_freq_receive
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.05982 0.00000 2.61000
##
## $word_freq_will
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.1000 0.5417 0.8000 9.6700
##
## $word_freq_people
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.09393 0.00000 5.55000
##
## $word_freq_report
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.05863 0.00000 10.00000
##
## $word_freq_addresses
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.0492 0.0000 4.4100
##
## $word_freq_free
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2488 0.1000 20.0000
##
## $word_freq_business
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1426 0.0000 7.1400
##
## $word_freq_email
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1847 0.0000 9.0900
##
## $word_freq_you
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 1.310 1.662 2.640 18.750
##
## $word_freq_credit
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.08558 0.00000 18.18000
##
## $word_freq_your
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.2200 0.8098 1.2700 11.1100
##
## $word_freq_font
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1212 0.0000 17.1000
##
## $word_freq_000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1016 0.0000 5.4500
##
## $word_freq_money
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.09427 0.00000 12.50000
##
## $word_freq_hp
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.5495 0.0000 20.8300
##
## $word_freq_hpl
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2654 0.0000 16.6600
##
## $word_freq_george
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.7673 0.0000 33.3300
##
## $word_freq_650
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1248 0.0000 9.0900
##
## $word_freq_lab
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.09892 0.00000 14.28000
##
## $word_freq_labs
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1029 0.0000 5.8800
##
## $word_freq_telnet
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.06475 0.00000 12.50000
##
## $word_freq_857
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.04705 0.00000 4.76000
##
## $word_freq_data
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.09723 0.00000 18.18000
##
## $word_freq_415
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.04784 0.00000 4.76000
##
## $word_freq_85
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1054 0.0000 20.0000
##
## $word_freq_technology
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.09748 0.00000 7.69000
##
## $word_freq_1999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.137 0.000 6.890
##
## $word_freq_parts
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.0132 0.0000 8.3300
##
## $word_freq_pm
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.07863 0.00000 11.11000
##
## $word_freq_direct
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.06483 0.00000 4.76000
##
## $word_freq_cs
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.04367 0.00000 7.14000
##
## $word_freq_meeting
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1323 0.0000 14.2800
##
## $word_freq_original
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.0461 0.0000 3.5700
##
## $word_freq_project
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.0792 0.0000 20.0000
##
## $word_freq_re
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3012 0.1100 21.4200
##
## $word_freq_edu
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1798 0.0000 22.0500
##
## $word_freq_table
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000000 0.000000 0.000000 0.005444 0.000000 2.170000
##
## $word_freq_conference
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.03187 0.00000 10.00000
##
## $char_freq_.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.03857 0.00000 4.38500
##
## $char_freq_..1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.065 0.139 0.188 9.752
##
## $char_freq_..2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.01698 0.00000 4.08100
##
## $char_freq_..3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2691 0.3150 32.4780
##
## $char_freq_..4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.07581 0.05200 6.00300
##
## $char_freq_..5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.04424 0.00000 19.82900
##
## $capital_run_length_average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.588 2.276 5.191 3.706 1102.500
##
## $capital_run_length_longest
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 6.00 15.00 52.17 43.00 9989.00
##
## $capital_run_length_total
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 35.0 95.0 283.3 266.0 15841.0
##
## $status
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.394 1.000 1.000
Missing Data
lapply(dfrDataset, FUN=detectNA)
## $word_freq_make
## [1] 0
##
## $word_freq_address
## [1] 0
##
## $word_freq_all
## [1] 0
##
## $word_freq_3d
## [1] 0
##
## $word_freq_our
## [1] 0
##
## $word_freq_over
## [1] 0
##
## $word_freq_remove
## [1] 0
##
## $word_freq_internet
## [1] 0
##
## $word_freq_order
## [1] 0
##
## $word_freq_mail
## [1] 0
##
## $word_freq_receive
## [1] 0
##
## $word_freq_will
## [1] 0
##
## $word_freq_people
## [1] 0
##
## $word_freq_report
## [1] 0
##
## $word_freq_addresses
## [1] 0
##
## $word_freq_free
## [1] 0
##
## $word_freq_business
## [1] 0
##
## $word_freq_email
## [1] 0
##
## $word_freq_you
## [1] 0
##
## $word_freq_credit
## [1] 0
##
## $word_freq_your
## [1] 0
##
## $word_freq_font
## [1] 0
##
## $word_freq_000
## [1] 0
##
## $word_freq_money
## [1] 0
##
## $word_freq_hp
## [1] 0
##
## $word_freq_hpl
## [1] 0
##
## $word_freq_george
## [1] 0
##
## $word_freq_650
## [1] 0
##
## $word_freq_lab
## [1] 0
##
## $word_freq_labs
## [1] 0
##
## $word_freq_telnet
## [1] 0
##
## $word_freq_857
## [1] 0
##
## $word_freq_data
## [1] 0
##
## $word_freq_415
## [1] 0
##
## $word_freq_85
## [1] 0
##
## $word_freq_technology
## [1] 0
##
## $word_freq_1999
## [1] 0
##
## $word_freq_parts
## [1] 0
##
## $word_freq_pm
## [1] 0
##
## $word_freq_direct
## [1] 0
##
## $word_freq_cs
## [1] 0
##
## $word_freq_meeting
## [1] 0
##
## $word_freq_original
## [1] 0
##
## $word_freq_project
## [1] 0
##
## $word_freq_re
## [1] 0
##
## $word_freq_edu
## [1] 0
##
## $word_freq_table
## [1] 0
##
## $word_freq_conference
## [1] 0
##
## $char_freq_.
## [1] 0
##
## $char_freq_..1
## [1] 0
##
## $char_freq_..2
## [1] 0
##
## $char_freq_..3
## [1] 0
##
## $char_freq_..4
## [1] 0
##
## $char_freq_..5
## [1] 0
##
## $capital_run_length_average
## [1] 0
##
## $capital_run_length_longest
## [1] 0
##
## $capital_run_length_total
## [1] 0
##
## $status
## [1] 0
Observations
1. None of the 58 columns in the ‘spambase’ dataset contains an NA value.
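The per-column listing above is long for a check whose answer is zero everywhere. A more compact variant, sketched here on a hypothetical data frame `toy`, counts the NA values of all columns in one call and prints only the columns that have any.

```r
# Hypothetical data frame with a single missing value in column `a`.
toy <- data.frame(a = c(1, NA, 3), b = c(0, 0, 0))

# is.na() gives a logical matrix; colSums() counts the TRUE values per column.
na_counts <- colSums(is.na(toy))

# Show only the columns that actually contain NA values.
na_counts[na_counts > 0]

# Single TRUE/FALSE answer for the whole data frame.
anyNA(toy)
```

On a data frame without missing values the filtered vector is empty and anyNA() returns FALSE.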
Check output
dfrQltyFreq <- summarize(group_by(dfrDataset, status), count=n())
# bar chart of the class counts
ggplot(dfrQltyFreq, aes(x=status, y=count)) +
  geom_bar(stat="identity", aes(fill=count)) +
  labs(title="Status Frequency Distribution") +
  labs(x="Status") +
  labs(y="Counts")
Find Correlations
## find correlations
vcnCorsData <- abs(sapply(colnames(dfrDataset), detectCor)) #absolute value
summary(vcnCorsData)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.002525 0.148801 0.253256 0.276879 0.354708 1.000000
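The correlations here use method="spearman", which is the Pearson correlation computed on the ranks of the values. That makes it less sensitive to the heavy right skew of the frequency columns. A minimal sketch with hypothetical vectors `x` and `y` shows the equivalence:

```r
# Hypothetical skewed feature and a 0/1 outcome.
x <- c(0, 0, 0.5, 2, 10)
y <- c(0, 0, 1, 1, 1)

# Spearman correlation as computed by cor().
rho_spearman <- cor(x, y, method = "spearman")

# The same value obtained as Pearson correlation of the ranks
# (rank() averages tied values by default).
rho_ranks <- cor(rank(x), rank(y))

all.equal(rho_spearman, rho_ranks)
```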
Show Correlations
vcnCorsData
## word_freq_make word_freq_address
## 0.24069974 0.29750940
## word_freq_all word_freq_3d
## 0.33283147 0.09077776
## word_freq_our word_freq_over
## 0.40913946 0.31864550
## word_freq_remove word_freq_internet
## 0.51877779 0.34379623
## word_freq_order word_freq_mail
## 0.30073703 0.29682394
## word_freq_receive word_freq_will
## 0.35496682 0.14847653
## word_freq_people word_freq_report
## 0.21287588 0.14977533
## word_freq_addresses word_freq_free
## 0.26515743 0.50416922
## word_freq_business word_freq_email
## 0.35290749 0.29909391
## word_freq_you word_freq_credit
## 0.36110406 0.32418657
## word_freq_your word_freq_font
## 0.50159062 0.13797471
## word_freq_000 word_freq_money
## 0.42580256 0.47215455
## word_freq_hp word_freq_hpl
## 0.39981558 0.34188069
## word_freq_george word_freq_650
## 0.35393063 0.22619064
## word_freq_lab word_freq_labs
## 0.22068802 0.24580530
## word_freq_telnet word_freq_857
## 0.20467400 0.16983798
## word_freq_data word_freq_415
## 0.15756347 0.15802818
## word_freq_85 word_freq_technology
## 0.21413087 0.16680254
## word_freq_1999 word_freq_parts
## 0.26070752 0.00252536
## word_freq_pm word_freq_direct
## 0.14721389 0.02813193
## word_freq_cs word_freq_meeting
## 0.14453750 0.19574176
## word_freq_original word_freq_project
## 0.10781412 0.14453744
## word_freq_re word_freq_edu
## 0.07176763 0.19702549
## word_freq_table word_freq_conference
## 0.02266674 0.13903044
## char_freq_. char_freq_..1
## 0.05683530 0.03263555
## char_freq_..2 char_freq_..3
## 0.11122690 0.59785363
## char_freq_..4 char_freq_..5
## 0.56563314 0.26668614
## capital_run_length_average capital_run_length_longest
## 0.48794983 0.51515693
## capital_run_length_total status
## 0.44397367 1.00000000
Plot Correlations
corrgram(dfrDataset)
Observations
1. Correlations were computed between the ‘status’ variable and every other column, and the pairwise correlations of all columns were plotted.
2. With respect to the ‘status’ variable, several variables show a medium to high correlation.
More than Medium Correlations
vcnCorsData[vcnCorsData>0.4]
## word_freq_our word_freq_remove
## 0.4091395 0.5187778
## word_freq_free word_freq_your
## 0.5041692 0.5015906
## word_freq_000 word_freq_money
## 0.4258026 0.4721546
## char_freq_..3 char_freq_..4
## 0.5978536 0.5656331
## capital_run_length_average capital_run_length_longest
## 0.4879498 0.5151569
## capital_run_length_total status
## 0.4439737 1.0000000
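The filtered vector above still contains ‘status’ itself, whose correlation with itself is trivially 1. One possible refinement, sketched here on a hypothetical named vector `cors` holding four of the values printed above, drops the target and sorts the remaining features from strongest to weakest.

```r
# Hypothetical subset of the correlation vector shown above.
cors <- c(word_freq_remove = 0.5187778,
          char_freq_..3    = 0.5978536,
          word_freq_parts  = 0.00252536,
          status           = 1)

# Keep features above the 0.4 threshold, excluding the target itself,
# and order them from strongest to weakest.
strong <- sort(cors[names(cors) != "status" & cors > 0.4], decreasing = TRUE)
strong
```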
Create Column IsSpam
dfrDataset <- mutate(dfrDataset, IsSpam=ifelse(status==0, 'Not Spam', 'Spam'))
dfrDataset$IsSpam <- as.factor(dfrDataset$IsSpam)
table(dfrDataset$IsSpam)
##
## Not Spam Spam
## 2788 1813
# By this point you should have checked the correlations (removing columns if required) and checked for class imbalance.
# Columns with many NA values should be removed as well.
Observations
1. New column ‘IsSpam’ added, derived from the ‘status’ variable with an if-else condition.
2. 2788 ‘Not Spam’ and 1813 ‘Spam’ entries found.
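The class counts are easier to judge as percentages. The sketch below rebuilds a hypothetical `status` vector from the two counts reported above, converts it to a labelled factor in one step, and prints the class shares. A factor target matters later because randomForest() fits a classification model only when the response is a factor; with a numeric 0/1 response it fits a regression.

```r
# Hypothetical status vector rebuilt from the reported counts.
status <- rep(c(0L, 1L), times = c(2788, 1813))

# factor() with levels and labels replaces the ifelse() + as.factor() pair.
is_spam <- factor(status, levels = c(0, 1), labels = c("Not Spam", "Spam"))

counts <- table(is_spam)
counts

# Class shares in percent: roughly 60.6 % not spam and 39.4 % spam.
shares <- round(100 * prop.table(counts), 1)
shares
```

A split of about 60/40 is a mild imbalance, so plain accuracy is still a usable first metric here.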
Dataset Split
set.seed(707)
vctTrnRecs <- createDataPartition(y=dfrDataset$status, p=0.8, list=FALSE)
dfrTrnData <- dfrDataset[vctTrnRecs,]
dfrTstData <- dfrDataset[-vctTrnRecs,]
Observations
1. The dataset was split into training and test datasets.
2. 80% of the data is now in the training dataset and 20% in the test dataset, on which the trained model's predictions will be checked.
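The idea behind a stratified 80/20 split can be sketched in base R: sample 80% of the row indices within each class so that both parts keep roughly the same spam share. This is an illustration only and not caret's implementation, so the resulting row counts can differ slightly from those of createDataPartition(). The `labels` vector is hypothetical and rebuilt from the class counts reported earlier.

```r
set.seed(707)

# Hypothetical class labels rebuilt from the reported counts.
labels <- factor(rep(c("Not Spam", "Spam"), times = c(2788, 1813)))

# For each class, draw 80 % of its row indices (rounded up) without replacement.
train_idx <- unlist(lapply(split(seq_along(labels), labels), function(idx) {
  idx[sample.int(length(idx), size = ceiling(0.8 * length(idx)))]
}), use.names = FALSE)

test_idx <- setdiff(seq_along(labels), train_idx)

# Sizes of the two parts and the spam share in each.
c(train = length(train_idx), test = length(test_idx))
c(train = mean(labels[train_idx] == "Spam"),
  test  = mean(labels[test_idx] == "Spam"))
```

Sampling within each class is what keeps the spam share nearly identical in the two parts; a plain random sample of rows would match it only on average.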
Training Dataset RowCount & ColCount
dim(dfrTrnData)
## [1] 3681 59
Observations
1. The training dataset contains 3681 records, which is 80% of the data.
Testing Dataset RowCount & ColCount
dim(dfrTstData)
## [1] 920 59
Observations
1. The test dataset contains 920 records, which is 20% of the data.
Training Dataset Head
head(dfrTrnData)
## word_freq_make word_freq_address word_freq_all word_freq_3d
## 2 0.21 0.28 0.50 0
## 3 0.06 0.00 0.71 0
## 4 0.00 0.00 0.00 0
## 10 0.06 0.12 0.77 0
## 11 0.00 0.00 0.00 0
## 12 0.00 0.00 0.25 0
## word_freq_our word_freq_over word_freq_remove word_freq_internet
## 2 0.14 0.28 0.21 0.07
## 3 1.23 0.19 0.19 0.12
## 4 0.63 0.00 0.31 0.63
## 10 0.19 0.32 0.38 0.00
## 11 0.00 0.00 0.96 0.00
## 12 0.38 0.25 0.25 0.00
## word_freq_order word_freq_mail word_freq_receive word_freq_will
## 2 0.00 0.94 0.21 0.79
## 3 0.64 0.25 0.38 0.45
## 4 0.31 0.63 0.31 0.31
## 10 0.06 0.00 0.00 0.64
## 11 0.00 1.92 0.96 0.00
## 12 0.00 0.00 0.12 0.12
## word_freq_people word_freq_report word_freq_addresses word_freq_free
## 2 0.65 0.21 0.14 0.14
## 3 0.12 0.00 1.75 0.06
## 4 0.31 0.00 0.00 0.31
## 10 0.25 0.00 0.12 0.00
## 11 0.00 0.00 0.00 0.00
## 12 0.12 0.00 0.00 0.00
## word_freq_business word_freq_email word_freq_you word_freq_credit
## 2 0.07 0.28 3.47 0.00
## 3 0.06 1.03 1.36 0.32
## 4 0.00 0.00 3.18 0.00
## 10 0.00 0.12 1.67 0.06
## 11 0.00 0.96 3.84 0.00
## 12 0.00 0.00 1.16 0.00
## word_freq_your word_freq_font word_freq_000 word_freq_money
## 2 1.59 0 0.43 0.43
## 3 0.51 0 1.16 0.06
## 4 0.31 0 0.00 0.00
## 10 0.71 0 0.19 0.00
## 11 0.96 0 0.00 0.00
## 12 0.77 0 0.00 0.00
## word_freq_hp word_freq_hpl word_freq_george word_freq_650 word_freq_lab
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 10 0 0 0 0 0
## 11 0 0 0 0 0
## 12 0 0 0 0 0
## word_freq_labs word_freq_telnet word_freq_857 word_freq_data
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 10 0 0 0 0
## 11 0 0 0 0
## 12 0 0 0 0
## word_freq_415 word_freq_85 word_freq_technology word_freq_1999
## 2 0 0 0 0.07
## 3 0 0 0 0.00
## 4 0 0 0 0.00
## 10 0 0 0 0.00
## 11 0 0 0 0.00
## 12 0 0 0 0.00
## word_freq_parts word_freq_pm word_freq_direct word_freq_cs
## 2 0 0 0.00 0
## 3 0 0 0.06 0
## 4 0 0 0.00 0
## 10 0 0 0.00 0
## 11 0 0 0.96 0
## 12 0 0 0.00 0
## word_freq_meeting word_freq_original word_freq_project word_freq_re
## 2 0 0.00 0.00 0.00
## 3 0 0.12 0.00 0.06
## 4 0 0.00 0.00 0.00
## 10 0 0.00 0.06 0.00
## 11 0 0.00 0.00 0.00
## 12 0 0.00 0.00 0.00
## word_freq_edu word_freq_table word_freq_conference char_freq_.
## 2 0.00 0 0 0.000
## 3 0.06 0 0 0.010
## 4 0.00 0 0 0.000
## 10 0.00 0 0 0.040
## 11 0.00 0 0 0.000
## 12 0.00 0 0 0.022
## char_freq_..1 char_freq_..2 char_freq_..3 char_freq_..4 char_freq_..5
## 2 0.132 0 0.372 0.180 0.048
## 3 0.143 0 0.276 0.184 0.010
## 4 0.137 0 0.137 0.000 0.000
## 10 0.030 0 0.244 0.081 0.000
## 11 0.000 0 0.462 0.000 0.000
## 12 0.044 0 0.663 0.000 0.000
## capital_run_length_average capital_run_length_longest
## 2 5.114 101
## 3 9.821 485
## 4 3.537 40
## 10 1.729 43
## 11 1.312 6
## 12 1.243 11
## capital_run_length_total status IsSpam
## 2 1028 1 Spam
## 3 2259 1 Spam
## 4 191 1 Spam
## 10 749 1 Spam
## 11 21 1 Spam
## 12 184 1 Spam
Testing Dataset Head
head(dfrTstData)
## word_freq_make word_freq_address word_freq_all word_freq_3d
## 1 0.00 0.64 0.64 0
## 5 0.00 0.00 0.00 0
## 6 0.00 0.00 0.00 0
## 7 0.00 0.00 0.00 0
## 8 0.00 0.00 0.00 0
## 9 0.15 0.00 0.46 0
## word_freq_our word_freq_over word_freq_remove word_freq_internet
## 1 0.32 0 0.00 0.00
## 5 0.63 0 0.31 0.63
## 6 1.85 0 0.00 1.85
## 7 1.92 0 0.00 0.00
## 8 1.88 0 0.00 1.88
## 9 0.61 0 0.30 0.00
## word_freq_order word_freq_mail word_freq_receive word_freq_will
## 1 0.00 0.00 0.00 0.64
## 5 0.31 0.63 0.31 0.31
## 6 0.00 0.00 0.00 0.00
## 7 0.00 0.64 0.96 1.28
## 8 0.00 0.00 0.00 0.00
## 9 0.92 0.76 0.76 0.92
## word_freq_people word_freq_report word_freq_addresses word_freq_free
## 1 0.00 0 0 0.32
## 5 0.31 0 0 0.31
## 6 0.00 0 0 0.00
## 7 0.00 0 0 0.96
## 8 0.00 0 0 0.00
## 9 0.00 0 0 0.00
## word_freq_business word_freq_email word_freq_you word_freq_credit
## 1 0 1.29 1.93 0.00
## 5 0 0.00 3.18 0.00
## 6 0 0.00 0.00 0.00
## 7 0 0.32 3.85 0.00
## 8 0 0.00 0.00 0.00
## 9 0 0.15 1.23 3.53
## word_freq_your word_freq_font word_freq_000 word_freq_money word_freq_hp
## 1 0.96 0 0 0.00 0
## 5 0.31 0 0 0.00 0
## 6 0.00 0 0 0.00 0
## 7 0.64 0 0 0.00 0
## 8 0.00 0 0 0.00 0
## 9 2.00 0 0 0.15 0
## word_freq_hpl word_freq_george word_freq_650 word_freq_lab
## 1 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## 7 0 0 0 0
## 8 0 0 0 0
## 9 0 0 0 0
## word_freq_labs word_freq_telnet word_freq_857 word_freq_data
## 1 0 0 0 0.00
## 5 0 0 0 0.00
## 6 0 0 0 0.00
## 7 0 0 0 0.00
## 8 0 0 0 0.00
## 9 0 0 0 0.15
## word_freq_415 word_freq_85 word_freq_technology word_freq_1999
## 1 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## 7 0 0 0 0
## 8 0 0 0 0
## 9 0 0 0 0
## word_freq_parts word_freq_pm word_freq_direct word_freq_cs
## 1 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## 7 0 0 0 0
## 8 0 0 0 0
## 9 0 0 0 0
## word_freq_meeting word_freq_original word_freq_project word_freq_re
## 1 0 0.0 0 0
## 5 0 0.0 0 0
## 6 0 0.0 0 0
## 7 0 0.0 0 0
## 8 0 0.0 0 0
## 9 0 0.3 0 0
## word_freq_edu word_freq_table word_freq_conference char_freq_.
## 1 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## 7 0 0 0 0
## 8 0 0 0 0
## 9 0 0 0 0
## char_freq_..1 char_freq_..2 char_freq_..3 char_freq_..4 char_freq_..5
## 1 0.000 0 0.778 0.000 0.000
## 5 0.135 0 0.135 0.000 0.000
## 6 0.223 0 0.000 0.000 0.000
## 7 0.054 0 0.164 0.054 0.000
## 8 0.206 0 0.000 0.000 0.000
## 9 0.271 0 0.181 0.203 0.022
## capital_run_length_average capital_run_length_longest
## 1 3.756 61
## 5 3.537 40
## 6 3.000 15
## 7 1.671 4
## 8 2.450 11
## 9 9.744 445
## capital_run_length_total status IsSpam
## 1 278 1 Spam
## 5 191 1 Spam
## 6 54 1 Spam
## 7 112 1 Spam
## 8 49 1 Spam
## 9 1257 1 Spam
Training Dataset Summary
lapply(dfrTrnData, FUN=summary)
## $word_freq_make
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.09885 0.00000 4.54000
##
## $word_freq_address
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2027 0.0000 14.2800
##
## $word_freq_all
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.276 0.400 5.100
##
## $word_freq_3d
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.07702 0.00000 42.81000
##
## $word_freq_our
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3066 0.3700 10.0000
##
## $word_freq_over
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.09361 0.00000 3.57000
##
## $word_freq_remove
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1128 0.0000 7.2700
##
## $word_freq_internet
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1111 0.0000 11.1100
##
## $word_freq_order
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.09092 0.00000 5.26000
##
## $word_freq_mail
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2315 0.1400 18.1800
##
## $word_freq_receive
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.06053 0.00000 2.61000
##
## $word_freq_will
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.090 0.533 0.780 7.690
##
## $word_freq_people
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.09264 0.00000 5.55000
##
## $word_freq_report
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.05711 0.00000 10.00000
##
## $word_freq_addresses
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.0495 0.0000 4.4100
##
## $word_freq_free
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2504 0.1000 20.0000
##
## $word_freq_business
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 0.14 0.00 7.14
##
## $word_freq_email
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1867 0.0000 9.0900
##
## $word_freq_you
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 1.310 1.681 2.670 18.750
##
## $word_freq_credit
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.08879 0.00000 18.18000
##
## $word_freq_your
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.2000 0.8052 1.2800 11.1100
##
## $word_freq_font
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1237 0.0000 17.1000
##
## $word_freq_000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1015 0.0000 5.4500
##
## $word_freq_money
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.09806 0.00000 12.50000
##
## $word_freq_hp
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.5447 0.0000 20.8300
##
## $word_freq_hpl
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2772 0.0000 16.6600
##
## $word_freq_george
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.7443 0.0000 33.3300
##
## $word_freq_650
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1324 0.0000 9.0900
##
## $word_freq_lab
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.09591 0.00000 14.28000
##
## $word_freq_labs
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1049 0.0000 5.8800
##
## $word_freq_telnet
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.06735 0.00000 12.50000
##
## $word_freq_857
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.04803 0.00000 4.76000
##
## $word_freq_data
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1003 0.0000 18.1800
##
## $word_freq_415
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.04888 0.00000 4.76000
##
## $word_freq_85
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.111 0.000 20.000
##
## $word_freq_technology
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1027 0.0000 7.6900
##
## $word_freq_1999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1399 0.0000 6.8900
##
## $word_freq_parts
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.01477 0.00000 8.33000
##
## $word_freq_pm
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.0794 0.0000 9.7500
##
## $word_freq_direct
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.06296 0.00000 4.76000
##
## $word_freq_cs
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.04437 0.00000 7.14000
##
## $word_freq_meeting
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1302 0.0000 14.2800
##
## $word_freq_original
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.04856 0.00000 3.57000
##
## $word_freq_project
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.08274 0.00000 20.00000
##
## $word_freq_re
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3028 0.1200 21.4200
##
## $word_freq_edu
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1815 0.0000 22.0500
##
## $word_freq_table
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000000 0.000000 0.000000 0.005186 0.000000 2.170000
##
## $word_freq_conference
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.02867 0.00000 10.00000
##
## $char_freq_.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.0403 0.0000 4.3850
##
## $char_freq_..1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0650 0.1398 0.1890 9.7520
##
## $char_freq_..2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.01525 0.00000 4.08100
##
## $char_freq_..3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2693 0.3110 32.4780
##
## $char_freq_..4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.07428 0.05000 5.30000
##
## $char_freq_..5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.03956 0.00000 13.12900
##
## $capital_run_length_average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.576 2.250 5.180 3.697 1102.500
##
## $capital_run_length_longest
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 6.00 14.00 52.23 43.00 9989.00
##
## $capital_run_length_total
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 35.0 93.0 282.7 266.0 15841.0
##
## $status
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3953 1.0000 1.0000
##
## $IsSpam
## Not Spam Spam
## 2226 1455
Testing Dataset Summary
lapply(dfrTstData, FUN=summary)
## $word_freq_make
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1274 0.0000 4.0000
##
## $word_freq_address
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2542 0.0000 14.2800
##
## $word_freq_all
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2994 0.5000 4.5400
##
## $word_freq_3d
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.01904 0.00000 7.18000
##
## $word_freq_our
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3349 0.4300 7.1400
##
## $word_freq_over
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1051 0.0000 5.8800
##
## $word_freq_remove
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1199 0.0000 7.2700
##
## $word_freq_internet
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.08196 0.00000 4.00000
##
## $word_freq_order
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.08664 0.00000 3.33000
##
## $word_freq_mail
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2709 0.2200 11.1100
##
## $word_freq_receive
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.05701 0.00000 2.00000
##
## $word_freq_will
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.1650 0.5763 0.8625 9.6700
##
## $word_freq_people
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.09909 0.00000 2.94000
##
## $word_freq_report
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.06469 0.00000 5.55000
##
## $word_freq_addresses
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.04803 0.00000 2.31000
##
## $word_freq_free
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2427 0.1100 20.0000
##
## $word_freq_business
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1529 0.0000 4.8700
##
## $word_freq_email
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.177 0.000 4.160
##
## $word_freq_you
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 1.310 1.585 2.500 14.000
##
## $word_freq_credit
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.07271 0.00000 6.25000
##
## $word_freq_your
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.270 0.828 1.260 8.000
##
## $word_freq_font
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1112 0.0000 17.1000
##
## $word_freq_000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1022 0.0000 3.3800
##
## $word_freq_money
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.07909 0.00000 4.41000
##
## $word_freq_hp
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.5688 0.0000 20.0000
##
## $word_freq_hpl
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2181 0.0000 7.6900
##
## $word_freq_george
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.8594 0.0000 33.3300
##
## $word_freq_650
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.09478 0.00000 4.76000
##
## $word_freq_lab
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1109 0.0000 10.0000
##
## $word_freq_labs
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.09458 0.00000 4.76000
##
## $word_freq_telnet
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.05438 0.00000 4.76000
##
## $word_freq_857
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.04312 0.00000 4.76000
##
## $word_freq_data
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.08496 0.00000 8.33000
##
## $word_freq_415
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.04366 0.00000 4.76000
##
## $word_freq_85
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.08305 0.00000 4.76000
##
## $word_freq_technology
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.07655 0.00000 4.76000
##
## $word_freq_1999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.125 0.000 3.700
##
## $word_freq_parts
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000000 0.000000 0.000000 0.006913 0.000000 1.560000
##
## $word_freq_pm
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.07554 0.00000 11.11000
##
## $word_freq_direct
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.07235 0.00000 4.76000
##
## $word_freq_cs
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.04086 0.00000 4.75000
##
## $word_freq_meeting
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1408 0.0000 7.6900
##
## $word_freq_original
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.03627 0.00000 1.69000
##
## $word_freq_project
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.06501 0.00000 4.54000
##
## $word_freq_re
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.295 0.070 16.660
##
## $word_freq_edu
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1731 0.0000 9.0900
##
## $word_freq_table
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000000 0.000000 0.000000 0.006478 0.000000 2.120000
##
## $word_freq_conference
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.04469 0.00000 8.33000
##
## $char_freq_.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.03168 0.00000 3.67200
##
## $char_freq_..1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0675 0.1360 0.1782 4.2710
##
## $char_freq_..2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.0239 0.0000 2.7770
##
## $char_freq_..3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2681 0.3352 5.8280
##
## $char_freq_..4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.08193 0.06125 6.00300
##
## $char_freq_..5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.06296 0.00000 19.82900
##
## $capital_run_length_average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.625 2.333 5.236 3.796 443.666
##
## $capital_run_length_longest
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 16.00 51.94 44.00 1325.00
##
## $capital_run_length_total
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 37.0 101.5 285.5 259.2 9088.0
##
## $status
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3891 1.0000 1.0000
##
## $IsSpam
## Not Spam Spam
## 562 358
Random Forest Approach
myCtrl <- trainControl(method="cv", number=10, repeats=3)
Increasing the number of iterations can improve accuracy, but the model then takes longer to build. train() requires the caret package; the random forest itself requires the randomForest package.
m1 <- train(predictor~., data=dataFrame, method="rf",
verbose=F, trControl=myCtrl)
## Boost Method
cvCtrl <- trainControl(method="repeatedcv", number=10, repeats=3)
m2 <- train(predictor~., data=dataFrame, method="gbm", verbose=F, trControl=cvCtrl)
## Custom Algorithm … notice method is not mentioned here, so train() uses its default, "rf"
myCtrl <- trainControl(method="oob", number=10, repeats=3)
m3 <- train(predictor~., data=dataFrame, tuneGrid=data.frame(mtry=10), trControl=myCtrl)
We could also try one of these if required:
## SVM Model (caret has no plain "svm" method; "svmRadial" is one of its SVM variants)
myCtrl <- trainControl(method="cv", number=10, repeats=3)
o1 <- train(predictor~., data=dataFrame, method="svmRadial", verbose=F, trControl=myCtrl)
## KNN Model
myCtrl <- trainControl(method="cv", number=10, repeats=3)
o2 <- train(predictor~., data=dataFrame, method="knn", verbose=F, trControl=myCtrl)
## Bagged Model
myCtrl <- trainControl(method="cv", number=10, repeats=3)
o3 <- train(predictor~., data=dataFrame, method="bag", verbose=F, trControl=myCtrl)
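To make the cost of extra resampling iterations concrete, the sketch below splits 3681 rows (the size of the training set used in this report) into 10 folds using base R only. It illustrates what method="cv" and method="repeatedcv" ask for; it is not caret's own implementation, which also stratifies the folds by class, and the variable names are made up for the example.

```r
# Base-R sketch of 10-fold cross-validation splits (illustrative only;
# caret builds its own, stratified folds internally).
set.seed(707)
nRows    <- 3681   # rows in the training set
nFolds   <- 10     # number=10 in trainControl
nRepeats <- 3      # repeats=3, only meaningful for method="repeatedcv"

# assign every row to one of the folds at random
vctFoldId <- sample(rep(seq_len(nFolds), length.out = nRows))

# each model is trained on all rows except those of one held-out fold
vctTrainSize <- nRows - tabulate(vctFoldId, nbins = nFolds)
print(vctTrainSize)

# number of models fitted per candidate tuning value
nFitsCv         <- nFolds
nFitsRepeatedCv <- nFolds * nRepeats
print(c(cv = nFitsCv, repeatedcv = nFitsRepeatedCv))
```

With repeatedcv the whole 10-fold procedure is run three times, which is why training time grows with the number of repeats.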
Create Model - Random Forest (Default)
## set seed
set.seed(707)
# mtry
myMtry=sqrt(ncol(dfrTrnData)-2) # square root of the number of predictor columns; minus 2 because status and IsSpam are not predictors
myNtrees=500
# start time
vctProcStrt <- proc.time()
# proc.time() returns the user, system and elapsed times of the running R process.
# random forest (default)
#mdlRndForDef <- randomForest(IsSpam~.-status, data=dfrTrnData)
mdlRndForDef <- randomForest(IsSpam~.-status, data=dfrTrnData,
mtry=myMtry, ntree=myNtrees)
# end time
vctProcEnds <- proc.time()
# print
print(paste("Model Created ...",vctProcEnds[1] - vctProcStrt[1]))
## [1] "Model Created ... 10.92"
Observations
1. Creating the model by the default method took about 10 seconds (10.06 s and 9.75 s in earlier runs, 10.92 s in the run shown above).
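The mtry value passed above is not a whole number. The short sketch below reproduces the arithmetic, assuming 57 predictor columns (59 columns minus status and IsSpam). The printed model reports 8 variables tried at each split, which is what rounding gives; treating round() as the rule randomForest applies is an assumption of this sketch.

```r
# Arithmetic behind mtry (names are illustrative)
nPredictors <- 57                  # 59 columns minus status and IsSpam
mtryRaw     <- sqrt(nPredictors)   # value stored in myMtry
mtryRounded <- round(mtryRaw)      # whole number of variables per split
print(mtryRaw)
print(mtryRounded)
```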
View Model - Default Random Forest
mdlRndForDef
##
## Call:
## randomForest(formula = IsSpam ~ . - status, data = dfrTrnData, mtry = myMtry, ntree = myNtrees)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 8
##
## OOB estimate of error rate: 4.62%
## Confusion matrix:
## Not Spam Spam class.error
## Not Spam 2162 64 0.02875112
## Spam 106 1349 0.07285223
# Rule of thumb used here: an accuracy above 70% is considered good.
Observations
1. The OOB error estimate is only 4.62%, which corresponds to a high out-of-bag accuracy of about 95.38%.
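The OOB figures can be recomputed by hand from the confusion matrix printed by the model. The sketch below re-enters those four counts (rows are the true classes, columns the out-of-bag predictions); the object names are illustrative.

```r
# OOB confusion matrix as printed by mdlRndForDef
mtxOob <- matrix(c(2162,   64,
                    106, 1349),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("Not Spam", "Spam"),
                                 c("Not Spam", "Spam")))

oobAccuracy   <- sum(diag(mtxOob)) / sum(mtxOob)    # correct / all rows
oobError      <- 1 - oobAccuracy                    # OOB estimate of error rate
vctClassError <- 1 - diag(mtxOob) / rowSums(mtxOob) # class.error column

print(round(100 * oobError, 2))
print(round(vctClassError, 8))
```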
View Model Summary - Default Random Forest
summary(mdlRndForDef)
## Length Class Mode
## call 5 -none- call
## type 1 -none- character
## predicted 3681 factor numeric
## err.rate 1500 -none- numeric
## confusion 6 -none- numeric
## votes 7362 matrix numeric
## oob.times 3681 -none- numeric
## classes 2 -none- character
## importance 57 -none- numeric
## importanceSD 0 -none- NULL
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 14 -none- list
## y 3681 factor numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## terms 3 terms call
#mdlRndForDef[3]
# summary() lists the components of the model object
# and gives the length of each one.
# To see the value stored in a component,
# type e.g. mdlRndForDef$ntree[1] in the console.
Prediction - Test Data - Random Forest (Default)
vctRndForDef <- predict(mdlRndForDef, newdata=dfrTstData)
cmxRndForDef <- confusionMatrix(vctRndForDef, dfrTstData$IsSpam)
cmxRndForDef
## Confusion Matrix and Statistics
##
## Reference
## Prediction Not Spam Spam
## Not Spam 534 20
## Spam 28 338
##
## Accuracy : 0.9478
## 95% CI : (0.9314, 0.9613)
## No Information Rate : 0.6109
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8907
## Mcnemar's Test P-Value : 0.3123
##
## Sensitivity : 0.9502
## Specificity : 0.9441
## Pos Pred Value : 0.9639
## Neg Pred Value : 0.9235
## Prevalence : 0.6109
## Detection Rate : 0.5804
## Detection Prevalence : 0.6022
## Balanced Accuracy : 0.9472
##
## 'Positive' Class : Not Spam
##
# Look at Accuracy and its 95% CI: here 0.9478 with CI (0.9314, 0.9613).
# P-Value [Acc > NIR] < 0.05, so the accuracy is significantly better than the no-information rate.
Observations
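1. The confusion matrix shows only 20 spam mails predicted as not spam and 28 non-spam mails predicted as spam, which is a low error rate.
2. P-Value [Acc > NIR] is < 2e-16, far below 0.05, so we reject the null hypothesis that the model is no better than the no-information rate (61.09%): the predictors carry real information about the class. McNemar's test p-value (0.3123) is above 0.05, so the two kinds of error are not significantly unbalanced.
3. Accuracy on the test data is also very high, about 94.78% (balanced accuracy 94.72%), close to the 95.38% OOB accuracy on the training data. Hence the model is good.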
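The statistics reported by confusionMatrix can be traced back to the four cell counts. The sketch below re-enters the printed table and recomputes the main figures with base R only. Using binom.test and mcnemar.test to reproduce the two p-values is an assumption about how those values are obtained; the object names are illustrative.

```r
# Test-set confusion matrix as printed (rows = prediction, columns = reference)
mtxTest <- matrix(c(534,  20,
                     28, 338),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(Prediction = c("Not Spam", "Spam"),
                                  Reference  = c("Not Spam", "Spam")))

nTotal      <- sum(mtxTest)
accuracy    <- sum(diag(mtxTest)) / nTotal
# 'Positive' class is Not Spam
sensitivity <- mtxTest["Not Spam", "Not Spam"] / sum(mtxTest[, "Not Spam"])
specificity <- mtxTest["Spam", "Spam"] / sum(mtxTest[, "Spam"])
balancedAcc <- (sensitivity + specificity) / 2
noInfoRate  <- max(colSums(mtxTest)) / nTotal   # always guess the larger class

# one-sided test: is the accuracy above the no-information rate?
pAccGtNir <- binom.test(sum(diag(mtxTest)), nTotal,
                        p = noInfoRate, alternative = "greater")$p.value
# are the two off-diagonal error counts (20 vs 28) unbalanced?
pMcnemar  <- mcnemar.test(mtxTest)$p.value

print(round(c(Accuracy = accuracy, Sensitivity = sensitivity,
              Specificity = specificity, Balanced = balancedAcc,
              NIR = noInfoRate, McNemar = pMcnemar), 4))
```

Only the first p-value speaks to whether the model beats guessing; McNemar's test compares the two kinds of error with each other.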
Create Model - Random Forest (RFM)
## set seed
set.seed(707)
# start time
vctProcStrt <- proc.time()
# random forest through caret's train()
myControl <- trainControl(method="cv", number=10, repeats=3) # change these parameters to try to increase accuracy: method can be "repeatedcv" instead of "cv" (repeats only takes effect then), number can be more than 10 and repeats more than 3
myMetric <- "Accuracy" # metric used to select the best model
myMtry <- sqrt(ncol(dfrTrnData)-2) # can be changed as well
myNtrees <- 500 # increase this to try to improve accuracy; change only one thing at a time
myTuneGrid <- expand.grid(.mtry=myMtry) # this is how mtry is configured when using train()
mdlRndForRfm <- train(IsSpam~.-status, data=dfrTrnData, method="rf",
verbose=F, metric=myMetric, trControl=myControl,
tuneGrid=myTuneGrid, ntree=myNtrees)
# end time
vctProcEnds <- proc.time()
# print
print(paste("Model Created ...",vctProcEnds[1] - vctProcStrt[1]))
## [1] "Model Created ... 94.19"
Observations
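1. It took about 94 seconds (94.71 s in an earlier run, 94.19 s in the run shown above) to create the model by the Random Forest method through train(), much longer than the default method.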
View Model - Random Forest (RFM)
mdlRndForRfm
## Random Forest
##
## 3681 samples
## 58 predictor
## 2 classes: 'Not Spam', 'Spam'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 3312, 3314, 3312, 3313, 3313, 3313, ...
## Resampling results:
##
## Accuracy Kappa
## 0.9521943 0.8994474
##
## Tuning parameter 'mtry' was held constant at a value of 7.549834
Observations
1. A very high cross-validated accuracy of about 95.22% was obtained, slightly lower than the 95.38% OOB accuracy of the default method.
View Model Summary - Random Forest (RFM)
summary(mdlRndForRfm)
## Length Class Mode
## call 6 -none- call
## type 1 -none- character
## predicted 3681 factor numeric
## err.rate 1500 -none- numeric
## confusion 6 -none- numeric
## votes 7362 matrix numeric
## oob.times 3681 -none- numeric
## classes 2 -none- character
## importance 57 -none- numeric
## importanceSD 0 -none- NULL
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 14 -none- list
## y 3681 factor numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## xNames 57 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 2 -none- character
## param 2 -none- list
Prediction - Test Data - Random Forest (RFM)
vctRndForRfm <- predict(mdlRndForRfm, newdata=dfrTstData)
cmxRndForRfm <- confusionMatrix(vctRndForRfm, dfrTstData$IsSpam)
cmxRndForRfm
## Confusion Matrix and Statistics
##
## Reference
## Prediction Not Spam Spam
## Not Spam 534 21
## Spam 28 337
##
## Accuracy : 0.9467
## 95% CI : (0.9302, 0.9603)
## No Information Rate : 0.6109
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8884
## Mcnemar's Test P-Value : 0.3914
##
## Sensitivity : 0.9502
## Specificity : 0.9413
## Pos Pred Value : 0.9622
## Neg Pred Value : 0.9233
## Prevalence : 0.6109
## Detection Rate : 0.5804
## Detection Prevalence : 0.6033
## Balanced Accuracy : 0.9458
##
## 'Positive' Class : Not Spam
##
Observations
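1. The confusion matrix shows only 21 spam mails predicted as not spam and 28 non-spam mails predicted as spam, which is a low error rate.
2. P-Value [Acc > NIR] is < 2e-16, far below 0.05, so we reject the null hypothesis that the model is no better than the no-information rate (61.09%): the predictors carry real information about the class. McNemar's test p-value (0.3914) is above 0.05, so the two kinds of error are not significantly unbalanced.
3. Accuracy on the test data is also very high, about 94.67% (balanced accuracy 94.58%), close to the 95.22% cross-validated accuracy on the training data. Hence the model is good.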
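Create Model - Random Forest (GBM)
GBM is gradient boosting (caret reports it as Stochastic Gradient Boosting), not a bagging method. Like a random forest it builds a group of decision trees, but the trees are grown one after another, each correcting the errors of the previous ones, whereas a random forest grows its trees independently.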
## set seed
set.seed(707)
# start time
vctProcStrt <- proc.time()
# gradient boosting through caret's train() (method="gbm")
myControl <- trainControl(method="repeatedcv", number=10, repeats=3)
myMetric <- "Accuracy"
myMtry <- sqrt(ncol(dfrTrnData)-2)
myNtrees <- 500
myTuneGrid <- expand.grid(.mtry=myMtry)
mdlRndForGbm <- train(IsSpam~.-status, data=dfrTrnData, method="gbm",
verbose=F, metric=myMetric, trControl=myControl)
# ntree and mtry are random forest parameters and do not apply to gbm, so these stay disabled:
#ntree=myNtrees)
# tuneGrid=myTuneGrid)
# end time
vctProcEnds <- proc.time()
# print
print(paste("Model Created ...",vctProcEnds[1] - vctProcStrt[1]))
## [1] "Model Created ... 105.14"
Observations
1. It took about 100 seconds (97.89 s in an earlier run, 105.14 s in the run shown above) to create the model by the GBM method, which is much longer than the default method.
View Model - Random Forest (GBM)
mdlRndForGbm
## Stochastic Gradient Boosting
##
## 3681 samples
## 58 predictor
## 2 classes: 'Not Spam', 'Spam'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 3312, 3314, 3312, 3313, 3313, 3313, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 1 50 0.9125264 0.8130164
## 1 100 0.9307222 0.8532164
## 1 150 0.9367051 0.8663083
## 2 50 0.9312693 0.8546294
## 2 100 0.9405110 0.8744321
## 2 150 0.9432274 0.8803238
## 3 50 0.9363428 0.8654668
## 3 100 0.9449460 0.8839671
## 3 150 0.9471209 0.8886790
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
Observations
1. The best cross-validated accuracy is 94.71% (n.trees = 150, interaction.depth = 3), which is very high but lower than the default method's 95.38%.
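The selection rule stated in the output ("Accuracy was used to select the optimal model using the largest value") can be replayed on the printed grid. The sketch below re-enters the nine rows by hand and picks the best one; the object names are illustrative and nothing is refitted.

```r
# Resampling results copied from the mdlRndForGbm printout
dfrGbmGrid <- data.frame(
  interaction.depth = rep(1:3, each = 3),
  n.trees           = rep(c(50, 100, 150), times = 3),
  Accuracy          = c(0.9125264, 0.9307222, 0.9367051,
                        0.9312693, 0.9405110, 0.9432274,
                        0.9363428, 0.9449460, 0.9471209))

# keep the row with the largest accuracy
dfrBest <- dfrGbmGrid[which.max(dfrGbmGrid$Accuracy), ]
print(dfrBest)
```

In this grid accuracy rises with both more trees and deeper interactions, so the best row sits at the edge of the grid.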
View Model Summary - Random Forest (GBM)
summary(mdlRndForGbm)
## var rel.inf
## char_freq_..3 char_freq_..3 21.71328125
## char_freq_..4 char_freq_..4 19.17273719
## word_freq_remove word_freq_remove 11.66807935
## word_freq_hp word_freq_hp 9.17395240
## word_freq_free word_freq_free 7.94587755
## capital_run_length_longest capital_run_length_longest 5.19101659
## capital_run_length_average capital_run_length_average 4.55363849
## word_freq_your word_freq_your 3.03793007
## word_freq_our word_freq_our 2.49322457
## word_freq_money word_freq_money 2.43499996
## word_freq_george word_freq_george 2.39065969
## word_freq_edu word_freq_edu 1.74627237
## word_freq_1999 word_freq_1999 1.28704265
## capital_run_length_total capital_run_length_total 1.13584861
## word_freq_meeting word_freq_meeting 0.61307150
## word_freq_you word_freq_you 0.59182671
## word_freq_000 word_freq_000 0.56908488
## word_freq_receive word_freq_receive 0.51898504
## word_freq_re word_freq_re 0.44280485
## word_freq_650 word_freq_650 0.40545253
## char_freq_. char_freq_. 0.40499187
## word_freq_business word_freq_business 0.35117130
## word_freq_hpl word_freq_hpl 0.33315065
## word_freq_will word_freq_will 0.31637881
## word_freq_internet word_freq_internet 0.29897659
## word_freq_email word_freq_email 0.23535307
## word_freq_technology word_freq_technology 0.18272811
## char_freq_..1 char_freq_..1 0.15921637
## word_freq_over word_freq_over 0.13902156
## word_freq_font word_freq_font 0.12361794
## word_freq_mail word_freq_mail 0.05960259
## word_freq_report word_freq_report 0.05752083
## word_freq_pm word_freq_pm 0.05072411
## word_freq_project word_freq_project 0.03704862
## word_freq_credit word_freq_credit 0.03502091
## word_freq_order word_freq_order 0.03055389
## word_freq_conference word_freq_conference 0.02965122
## word_freq_3d word_freq_3d 0.02071226
## word_freq_original word_freq_original 0.01781143
## word_freq_address word_freq_address 0.01701609
## word_freq_make word_freq_make 0.01394553
## word_freq_all word_freq_all 0.00000000
## word_freq_people word_freq_people 0.00000000
## word_freq_addresses word_freq_addresses 0.00000000
## word_freq_lab word_freq_lab 0.00000000
## word_freq_labs word_freq_labs 0.00000000
## word_freq_telnet word_freq_telnet 0.00000000
## word_freq_857 word_freq_857 0.00000000
## word_freq_data word_freq_data 0.00000000
## word_freq_415 word_freq_415 0.00000000
## word_freq_85 word_freq_85 0.00000000
## word_freq_parts word_freq_parts 0.00000000
## word_freq_direct word_freq_direct 0.00000000
## word_freq_cs word_freq_cs 0.00000000
## word_freq_table word_freq_table 0.00000000
## char_freq_..2 char_freq_..2 0.00000000
## char_freq_..5 char_freq_..5 0.00000000
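The relative-influence table is long, so it helps to see how much of the total the first few variables carry. The sketch below re-enters the top five rel.inf values from the table above and accumulates them; the object names are illustrative.

```r
# Top five rel.inf values copied from summary(mdlRndForGbm)
vctRelInf <- c(char_freq_..3    = 21.71328125,
               char_freq_..4    = 19.17273719,
               word_freq_remove = 11.66807935,
               word_freq_hp     =  9.17395240,
               word_freq_free   =  7.94587755)

# running total of relative influence, in percent
vctCumInf <- cumsum(vctRelInf)
print(round(vctCumInf, 2))
```

By this measure the first five variables account for roughly 70% of the total relative influence.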
Prediction - Test Data - Random Forest (GBM)
vctRndForGbm <- predict(mdlRndForGbm, newdata=dfrTstData)
cmxRndForGbm <- confusionMatrix(vctRndForGbm, dfrTstData$IsSpam)
cmxRndForGbm
## Confusion Matrix and Statistics
##
## Reference
## Prediction Not Spam Spam
## Not Spam 532 19
## Spam 30 339
##
## Accuracy : 0.9467
## 95% CI : (0.9302, 0.9603)
## No Information Rate : 0.6109
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8886
## Mcnemar's Test P-Value : 0.1531
##
## Sensitivity : 0.9466
## Specificity : 0.9469
## Pos Pred Value : 0.9655
## Neg Pred Value : 0.9187
## Prevalence : 0.6109
## Detection Rate : 0.5783
## Detection Prevalence : 0.5989
## Balanced Accuracy : 0.9468
##
## 'Positive' Class : Not Spam
##
Observations
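1. The confusion matrix shows only 19 spam mails predicted as not spam and 30 non-spam mails predicted as spam, which is a low error rate.
2. P-Value [Acc > NIR] is < 2e-16, far below 0.05, so we reject the null hypothesis that the model is no better than the no-information rate (61.09%): the predictors carry real information about the class. McNemar's test p-value (0.1531) is above 0.05, so the two kinds of error are not significantly unbalanced.
3. Accuracy on the test data is also very high, about 94.67% (balanced accuracy 94.68%), close to the 94.71% cross-validated accuracy on the training data. Hence the model is good.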
Create Model - Random Forest (OOB)
## set seed
set.seed(707)
# start time
vctProcStrt <- proc.time()
# random forest through caret's train() with out-of-bag resampling (method is omitted, so train() uses its default, "rf")
myControl <- trainControl(method="oob", number=10, repeats=3)
myMetric <- "Accuracy"
myMtry <- 10 #sqrt(ncol(dfrTrnData)-2)
myNtrees <- 500
myTuneGrid <- expand.grid(.mtry=myMtry)
mdlRndForOob <- train(IsSpam~.-status, data=dfrTrnData,
verbose=F, metric=myMetric, trControl=myControl,
tuneGrid=myTuneGrid, ntree=myNtrees)
# end time
vctProcEnds <- proc.time()
# print
print(paste("Model Created ...",vctProcEnds[1] - vctProcStrt[1]))
## [1] "Model Created ... 20.48"
Observations
1. It took about 21 seconds (21.5 s in an earlier run, 20.48 s in the run shown above) to create the model by the OOB method, roughly twice as long as the default method.
View Model - Random Forest (OOB)
mdlRndForOob
## Random Forest
##
## 3681 samples
## 58 predictor
## 2 classes: 'Not Spam', 'Spam'
##
## No pre-processing
## Resampling results:
##
## Accuracy Kappa
## 0.9538169 0.9030031
##
## Tuning parameter 'mtry' was held constant at a value of 10
Observations
1. A very high OOB accuracy of 95.38%, essentially the same as the default method (95.38%).
View Model Summary - Random Forest (OOB)
summary(mdlRndForOob)
## Length Class Mode
## call 6 -none- call
## type 1 -none- character
## predicted 3681 factor numeric
## err.rate 1500 -none- numeric
## confusion 6 -none- numeric
## votes 7362 matrix numeric
## oob.times 3681 -none- numeric
## classes 2 -none- character
## importance 57 -none- numeric
## importanceSD 0 -none- NULL
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 14 -none- list
## y 3681 factor numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## xNames 57 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 2 -none- character
## param 2 -none- list
Prediction - Test Data - Random Forest (OOB)
vctRndForOob <- predict(mdlRndForOob, newdata=dfrTstData)
cmxRndForOob <- confusionMatrix(vctRndForOob, dfrTstData$IsSpam)
cmxRndForOob
## Confusion Matrix and Statistics
##
## Reference
## Prediction Not Spam Spam
## Not Spam 534 21
## Spam 28 337
##
## Accuracy : 0.9467
## 95% CI : (0.9302, 0.9603)
## No Information Rate : 0.6109
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8884
## Mcnemar's Test P-Value : 0.3914
##
## Sensitivity : 0.9502
## Specificity : 0.9413
## Pos Pred Value : 0.9622
## Neg Pred Value : 0.9233
## Prevalence : 0.6109
## Detection Rate : 0.5804
## Detection Prevalence : 0.6033
## Balanced Accuracy : 0.9458
##
## 'Positive' Class : Not Spam
##
Observations
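1. The confusion matrix shows only 21 spam mails predicted as not spam and 28 non-spam mails predicted as spam, which is a low error rate.
2. P-Value [Acc > NIR] is < 2e-16, far below 0.05, so we reject the null hypothesis that the model is no better than the no-information rate (61.09%): the predictors carry real information about the class. McNemar's test p-value (0.3914) is above 0.05, so the two kinds of error are not significantly unbalanced.
3. Accuracy on the test data is also very high, about 94.67% (balanced accuracy 94.58%), close to the 95.38% OOB accuracy on the training data. Hence the model is good.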
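To compare the four models side by side, the sketch below re-enters the test-set confusion matrices exactly as printed above and computes the accuracy of each. The helper fnAccuracy and the other names are hypothetical and exist only for this comparison.

```r
# accuracy from the four cells of a 2x2 confusion matrix, given row by row
fnAccuracy <- function(vctCells) {
  mtx <- matrix(vctCells, nrow = 2, byrow = TRUE)
  sum(diag(mtx)) / sum(mtx)
}

dfrCompare <- data.frame(
  Model        = c("Default", "RFM", "GBM", "OOB"),
  TestAccuracy = c(fnAccuracy(c(534, 20, 28, 338)),
                   fnAccuracy(c(534, 21, 28, 337)),
                   fnAccuracy(c(532, 19, 30, 339)),
                   fnAccuracy(c(534, 21, 28, 337))))

dfrCompare$TestAccuracy <- round(dfrCompare$TestAccuracy, 4)
print(dfrCompare)
```

On this test set the four models differ by at most one correctly classified mail out of 920, so training time is the main practical difference between them.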
End of Project