Introduction
Random Forest
The steps to predict using Random Forest:
* Step 1
* Step 2
Problem Definition
The “spam” concept is diverse: advertisements for products/web sites, make money fast schemes, chain letters, pornography… Our collection of spam e-mails came from our postmaster and individuals who had filed spam. Our collection of non-spam e-mails came from filed work and personal e-mails, and hence the word ‘george’ and the area code ‘650’ are indicators of non-spam. These are useful when constructing a personalized spam filter. One would either have to blind such non-spam indicators or get a very wide collection of non-spam to generate a general purpose spam filter.
Our problem statement is:
* Build a model that predicts whether an e-mail is spam or not.
* Test the model on a held-out test dataset.
* Check the model's accuracy.
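The three goals above can be sketched end to end as follows. This is a minimal sketch on synthetic data, not the analysis itself: the toy data frame `dfrToy`, the 70/30 split, the seed and `ntree = 100` are all assumptions made for illustration.

```r
library(randomForest)

set.seed(42)  # assumed seed, only so the sketch is reproducible

# Synthetic stand-in for the spam data (hypothetical): two numeric
# predictors and a binary class label stored as a factor.
n <- 200
dfrToy <- data.frame(x1 = runif(n), x2 = runif(n))
dfrToy$status <- factor(ifelse(dfrToy$x1 + dfrToy$x2 > 1, 1, 0))

# 1. Hold out 30% of the rows as a test set (assumed split ratio).
trainIdx <- sample(seq_len(n), size = round(0.7 * n))
dfrTrain <- dfrToy[trainIdx, ]
dfrTest  <- dfrToy[-trainIdx, ]

# 2. Fit the forest; a factor response makes randomForest do classification.
rfModel <- randomForest(status ~ ., data = dfrTrain, ntree = 100)

# 3. Predict on the test set and compute the share of correct predictions.
predTest <- predict(rfModel, newdata = dfrTest)
accuracy <- mean(predTest == dfrTest$status)
print(table(Predicted = predTest, Actual = dfrTest$status))
print(accuracy)
```

On the real data the same three steps would apply to `dfrDataset`, with `status` converted to a factor first.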
Data Location
http://archive.ics.uci.edu/ml/datasets/Spambase?ref=datanews.io
Data Description
Number of Attributes: 58 (57 continuous, 1 nominal class label)
The last column of ‘spambase.data’ denotes whether the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial e-mail.
Most of the attributes indicate whether a particular word or character was frequently occurring in the e-mail. The run-length attributes (55-57) measure the length of sequences of consecutive capital letters. For the statistical measures of each attribute, see the end of this file. Here are the definitions of the attributes:
48 continuous real [0,100] attributes of type word_freq_WORD = percentage of words in the e-mail that match WORD, i.e. 100 * (number of times the WORD appears in the e-mail) / total number of words in e-mail. A “word” in this case is any string of alphanumeric characters bounded by non-alphanumeric characters or end-of-string.
6 continuous real [0,100] attributes of type char_freq_CHAR = percentage of characters in the e-mail that match CHAR, i.e. 100 * (number of CHAR occurrences) / total characters in e-mail
1 continuous real [1,…] attribute of type capital_run_length_average = average length of uninterrupted sequences of capital letters
1 continuous integer [1,…] attribute of type capital_run_length_longest = length of longest uninterrupted sequence of capital letters
1 continuous integer [1,…] attribute of type capital_run_length_total = sum of length of uninterrupted sequences of capital letters = total number of capital letters in the e-mail
1 nominal {0,1} class attribute of type spam = denotes whether the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial e-mail.
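The attribute definitions above can be made concrete with a small worked example. The helper names (`word_freq`, `char_freq`, `capital_runs`) and the sample message are hypothetical, and the case-insensitive word match is an assumption; the dataset's exact tokenisation may differ.

```r
# Percentage of words in `text` that match `word`.
# A "word" is a run of alphanumeric characters (assumed case-insensitive match).
word_freq <- function(text, word) {
  tokens <- unlist(strsplit(text, "[^[:alnum:]]+"))
  tokens <- tokens[nzchar(tokens)]
  100 * sum(tolower(tokens) == tolower(word)) / length(tokens)
}

# Percentage of characters in `text` that equal `ch`.
char_freq <- function(text, ch) {
  chars <- strsplit(text, "")[[1]]
  100 * sum(chars == ch) / length(chars)
}

# Average, longest and total length of uninterrupted runs of capital letters.
# Assumes `text` contains at least one capital letter.
capital_runs <- function(text) {
  runs <- regmatches(text, gregexpr("[A-Z]+", text))[[1]]
  lens <- nchar(runs)
  c(average = mean(lens), longest = max(lens), total = sum(lens))
}

msg <- "FREE money!!! Get your FREE prize NOW"
print(word_freq(msg, "free"))   # 2 of 7 words
print(char_freq(msg, "!"))      # 3 of 37 characters
print(capital_runs(msg))        # runs: FREE, G, FREE, NOW
```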
Setup
Load Libs
library(plyr)
library(tidyr)
library(dplyr)
library(ggplot2)
library(psych)
library(gridExtra)
#install.packages("caret")
library(caret)
#install.packages("randomForest")
library(randomForest)
#install.packages("gbm")
library(gbm)
#install.packages("corrgram")
library(corrgram)
library(corrplot)
Functions
## Count the NA records in a column
detectNA <- function(inp) {
  sum(is.na(inp))
}
## Spearman correlation between the status variable and another column
## (x is a column name or index; reads the global dfrDataset)
detectCor <- function(x) {
  cor(as.numeric(dfrDataset[, x]),
      as.numeric(dfrDataset$status),
      method = "spearman")
}
## Count the outliers in a column (values beyond 1.5 * IQR from the quartiles).
## NAs already present in inp are included in the count.
detect_outliers <- function(inp, na.rm = TRUE) {
  i.qnt <- quantile(inp, probs = c(.25, .75), na.rm = na.rm)
  i.max <- 1.5 * IQR(inp, na.rm = na.rm)
  otp <- inp
  otp[inp < (i.qnt[1] - i.max)] <- NA
  otp[inp > (i.qnt[2] + i.max)] <- NA
  sum(is.na(otp))
}
## Return the data with outliers set to NA
Except_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}
## Mask outliers: returns z with outliers set to NA (the length is unchanged,
## so the result can be assigned back to a data frame column)
Remove_Outliers <- function(z, na.rm = TRUE) {
  Except_outliers(z, na.rm = na.rm)
}
## Replace outliers with the median
Replace_Outliers_to_Median <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  # Pass na.rm through so the median is not NA when x has missing values
  y[x < (qnt[1] - H)] <- median(x, na.rm = na.rm)
  y[x > (qnt[2] + H)] <- median(x, na.rm = na.rm)
  y
}
## Replace outliers with the mean
Replace_Outliers_to_Mean <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  # Pass na.rm through so the mean is not NA when x has missing values
  y[x < (qnt[1] - H)] <- mean(x, na.rm = na.rm)
  y[x > (qnt[2] + H)] <- mean(x, na.rm = na.rm)
  y
}
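## Box plot of a single column (input is a numeric vector)
Graph_Boxplot <- function(input, na.rm = TRUE) {
  # Build the plot from the vector itself rather than the global dfrDataset
  Plot <- ggplot(data.frame(value = input), aes(x = "", y = value)) +
    geom_boxplot(fill = "white", color = "green", na.rm = na.rm) +
    labs(title = "Outliers")
  Plot
}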
## Data imputation: replace NAs in a vector with the mean of the non-NA values
## and return the imputed vector (assign the result back to the column)
Data_Impute <- function(input) {
  input[is.na(input)] <- mean(input, na.rm = TRUE)
  input
}
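To see what the 1.5 * IQR rule used in `detect_outliers` does on a column that is mostly zeros, here is a small sketch on a made-up vector (the values are hypothetical). When the first and third quartiles are both 0, the IQR is 0 and every non-zero value falls outside the fences, which may explain large outlier counts on sparse word-frequency columns.

```r
# Hypothetical zero-heavy vector: 8 zeros and 2 non-zero values.
x <- c(0, 0, 0, 0, 0, 0, 0, 0, 0.5, 2)

qnt   <- quantile(x, probs = c(0.25, 0.75))
fence <- 1.5 * IQR(x)
lower <- qnt[[1]] - fence
upper <- qnt[[2]] + fence

# Values outside [lower, upper] are flagged as outliers.
flagged <- x < lower | x > upper
print(c(lower = lower, upper = upper))
print(sum(flagged))
```

For columns like this, a count of "outliers" is effectively a count of non-zero entries, so it is worth reading such counts with care before removing or replacing anything.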
Load Dataset
setwd("D:/Welingkar/Trim 4/Machine Learning/Project/Data")
dfrDataset <- read.csv("./Spam_Email_Base.csv", header=T, stringsAsFactors=T)
head(dfrDataset)
## word_freq_make word_freq_address word_freq_all word_freq_3d
## 1 0.00 0.64 0.64 0
## 2 0.21 0.28 0.50 0
## 3 0.06 0.00 0.71 0
## 4 0.00 0.00 0.00 0
## 5 0.00 0.00 0.00 0
## 6 0.00 0.00 0.00 0
## word_freq_our word_freq_over word_freq_remove word_freq_internet
## 1 0.32 0.00 0.00 0.00
## 2 0.14 0.28 0.21 0.07
## 3 1.23 0.19 0.19 0.12
## 4 0.63 0.00 0.31 0.63
## 5 0.63 0.00 0.31 0.63
## 6 1.85 0.00 0.00 1.85
## word_freq_order word_freq_mail word_freq_receive word_freq_will
## 1 0.00 0.00 0.00 0.64
## 2 0.00 0.94 0.21 0.79
## 3 0.64 0.25 0.38 0.45
## 4 0.31 0.63 0.31 0.31
## 5 0.31 0.63 0.31 0.31
## 6 0.00 0.00 0.00 0.00
## word_freq_people word_freq_report word_freq_addresses word_freq_free
## 1 0.00 0.00 0.00 0.32
## 2 0.65 0.21 0.14 0.14
## 3 0.12 0.00 1.75 0.06
## 4 0.31 0.00 0.00 0.31
## 5 0.31 0.00 0.00 0.31
## 6 0.00 0.00 0.00 0.00
## word_freq_business word_freq_email word_freq_you word_freq_credit
## 1 0.00 1.29 1.93 0.00
## 2 0.07 0.28 3.47 0.00
## 3 0.06 1.03 1.36 0.32
## 4 0.00 0.00 3.18 0.00
## 5 0.00 0.00 3.18 0.00
## 6 0.00 0.00 0.00 0.00
## word_freq_your word_freq_font word_freq_000 word_freq_money word_freq_hp
## 1 0.96 0 0.00 0.00 0
## 2 1.59 0 0.43 0.43 0
## 3 0.51 0 1.16 0.06 0
## 4 0.31 0 0.00 0.00 0
## 5 0.31 0 0.00 0.00 0
## 6 0.00 0 0.00 0.00 0
## word_freq_hpl word_freq_george word_freq_650 word_freq_lab
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## word_freq_labs word_freq_telnet word_freq_857 word_freq_data
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## word_freq_415 word_freq_85 word_freq_technology word_freq_1999
## 1 0 0 0 0.00
## 2 0 0 0 0.07
## 3 0 0 0 0.00
## 4 0 0 0 0.00
## 5 0 0 0 0.00
## 6 0 0 0 0.00
## word_freq_parts word_freq_pm word_freq_direct word_freq_cs
## 1 0 0 0.00 0
## 2 0 0 0.00 0
## 3 0 0 0.06 0
## 4 0 0 0.00 0
## 5 0 0 0.00 0
## 6 0 0 0.00 0
## word_freq_meeting word_freq_original word_freq_project word_freq_re
## 1 0 0.00 0 0.00
## 2 0 0.00 0 0.00
## 3 0 0.12 0 0.06
## 4 0 0.00 0 0.00
## 5 0 0.00 0 0.00
## 6 0 0.00 0 0.00
## word_freq_edu word_freq_table word_freq_conference char_freq_.
## 1 0.00 0 0 0.00
## 2 0.00 0 0 0.00
## 3 0.06 0 0 0.01
## 4 0.00 0 0 0.00
## 5 0.00 0 0 0.00
## 6 0.00 0 0 0.00
## char_freq_..1 char_freq_..2 char_freq_..3 char_freq_..4 char_freq_..5
## 1 0.000 0 0.778 0.000 0.000
## 2 0.132 0 0.372 0.180 0.048
## 3 0.143 0 0.276 0.184 0.010
## 4 0.137 0 0.137 0.000 0.000
## 5 0.135 0 0.135 0.000 0.000
## 6 0.223 0 0.000 0.000 0.000
## capital_run_length_average capital_run_length_longest
## 1 3.756 61
## 2 5.114 101
## 3 9.821 485
## 4 3.537 40
## 5 3.537 40
## 6 3.000 15
## capital_run_length_total status
## 1 278 1
## 2 1028 1
## 3 2259 1
## 4 191 1
## 5 191 1
## 6 54 1
Observations
1. Data has been loaded successfully.
2. There are 4601 records in the data set.
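The column names `char_freq_.`, `char_freq_..1` and so on in the output above look like the result of R sanitising header names on import. The sketch below illustrates that behaviour on a small inline CSV; the header names `char_freq_;` and `char_freq_(` are assumptions about the original headers, not taken from the file.

```r
# Hypothetical CSV text with headers that are not valid R names.
csvText <- "char_freq_;,char_freq_(,status\n0.1,0.2,1\n0.0,0.3,0"

# Default: check.names = TRUE passes headers through make.names(),
# replacing invalid characters with "." and de-duplicating the results.
dfrDefault <- read.csv(text = csvText)
print(names(dfrDefault))

# check.names = FALSE keeps the headers exactly as written in the file.
dfrRaw <- read.csv(text = csvText, check.names = FALSE)
print(names(dfrRaw))

# The record count can be confirmed with nrow().
print(nrow(dfrRaw))
```

Columns with non-syntactic names then need backticks or `[[ ]]` to access, which is the trade-off for keeping the original labels.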
Dataframe Structure
str(dfrDataset)
## 'data.frame': 4601 obs. of 58 variables:
## $ word_freq_make : num 0 0.21 0.06 0 0 0 0 0 0.15 0.06 ...
## $ word_freq_address : num 0.64 0.28 0 0 0 0 0 0 0 0.12 ...
## $ word_freq_all : num 0.64 0.5 0.71 0 0 0 0 0 0.46 0.77 ...
## $ word_freq_3d : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_our : num 0.32 0.14 1.23 0.63 0.63 1.85 1.92 1.88 0.61 0.19 ...
## $ word_freq_over : num 0 0.28 0.19 0 0 0 0 0 0 0.32 ...
## $ word_freq_remove : num 0 0.21 0.19 0.31 0.31 0 0 0 0.3 0.38 ...
## $ word_freq_internet : num 0 0.07 0.12 0.63 0.63 1.85 0 1.88 0 0 ...
## $ word_freq_order : num 0 0 0.64 0.31 0.31 0 0 0 0.92 0.06 ...
## $ word_freq_mail : num 0 0.94 0.25 0.63 0.63 0 0.64 0 0.76 0 ...
## $ word_freq_receive : num 0 0.21 0.38 0.31 0.31 0 0.96 0 0.76 0 ...
## $ word_freq_will : num 0.64 0.79 0.45 0.31 0.31 0 1.28 0 0.92 0.64 ...
## $ word_freq_people : num 0 0.65 0.12 0.31 0.31 0 0 0 0 0.25 ...
## $ word_freq_report : num 0 0.21 0 0 0 0 0 0 0 0 ...
## $ word_freq_addresses : num 0 0.14 1.75 0 0 0 0 0 0 0.12 ...
## $ word_freq_free : num 0.32 0.14 0.06 0.31 0.31 0 0.96 0 0 0 ...
## $ word_freq_business : num 0 0.07 0.06 0 0 0 0 0 0 0 ...
## $ word_freq_email : num 1.29 0.28 1.03 0 0 0 0.32 0 0.15 0.12 ...
## $ word_freq_you : num 1.93 3.47 1.36 3.18 3.18 0 3.85 0 1.23 1.67 ...
## $ word_freq_credit : num 0 0 0.32 0 0 0 0 0 3.53 0.06 ...
## $ word_freq_your : num 0.96 1.59 0.51 0.31 0.31 0 0.64 0 2 0.71 ...
## $ word_freq_font : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_000 : num 0 0.43 1.16 0 0 0 0 0 0 0.19 ...
## $ word_freq_money : num 0 0.43 0.06 0 0 0 0 0 0.15 0 ...
## $ word_freq_hp : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_hpl : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_george : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_650 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_lab : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_labs : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_telnet : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_857 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_data : num 0 0 0 0 0 0 0 0 0.15 0 ...
## $ word_freq_415 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_85 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_technology : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_1999 : num 0 0.07 0 0 0 0 0 0 0 0 ...
## $ word_freq_parts : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_pm : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_direct : num 0 0 0.06 0 0 0 0 0 0 0 ...
## $ word_freq_cs : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_meeting : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_original : num 0 0 0.12 0 0 0 0 0 0.3 0 ...
## $ word_freq_project : num 0 0 0 0 0 0 0 0 0 0.06 ...
## $ word_freq_re : num 0 0 0.06 0 0 0 0 0 0 0 ...
## $ word_freq_edu : num 0 0 0.06 0 0 0 0 0 0 0 ...
## $ word_freq_table : num 0 0 0 0 0 0 0 0 0 0 ...
## $ word_freq_conference : num 0 0 0 0 0 0 0 0 0 0 ...
## $ char_freq_. : num 0 0 0.01 0 0 0 0 0 0 0.04 ...
## $ char_freq_..1 : num 0 0.132 0.143 0.137 0.135 0.223 0.054 0.206 0.271 0.03 ...
## $ char_freq_..2 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ char_freq_..3 : num 0.778 0.372 0.276 0.137 0.135 0 0.164 0 0.181 0.244 ...
## $ char_freq_..4 : num 0 0.18 0.184 0 0 0 0.054 0 0.203 0.081 ...
## $ char_freq_..5 : num 0 0.048 0.01 0 0 0 0 0 0.022 0 ...
## $ capital_run_length_average: num 3.76 5.11 9.82 3.54 3.54 ...
## $ capital_run_length_longest: int 61 101 485 40 40 15 4 11 445 43 ...
## $ capital_run_length_total : int 278 1028 2259 191 191 54 112 49 1257 749 ...
## $ status : int 1 1 1 1 1 1 1 1 1 1 ...
#RF deals with non numeric as well as numeric data
Observations
1. All columns are numeric (num) or integer (int); there are no factor or character columns.
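Because `status` is read as an integer, modelling functions may treat it as a numeric response rather than a class label. A minimal sketch of converting it to a factor and checking the class balance, on a hypothetical toy vector (the labels `non_spam` and `spam` are illustrative names):

```r
# Hypothetical stand-in for dfrDataset$status.
status <- c(1L, 1L, 0L, 0L, 0L)

# Convert the 0/1 integer label to a factor with readable levels.
statusFactor <- factor(status, levels = c(0, 1), labels = c("non_spam", "spam"))

# Class counts and proportions.
print(table(statusFactor))
print(prop.table(table(statusFactor)))
```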
Dataframe Summary
lapply(dfrDataset, FUN=summary)
## $word_freq_make
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1046 0.0000 4.5400
##
## $word_freq_address
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.213 0.000 14.280
##
## $word_freq_all
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2807 0.4200 5.1000
##
## $word_freq_3d
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.06542 0.00000 42.81000
##
## $word_freq_our
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3122 0.3800 10.0000
##
## $word_freq_over
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.0959 0.0000 5.8800
##
## $word_freq_remove
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1142 0.0000 7.2700
##
## $word_freq_internet
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1053 0.0000 11.1100
##
## $word_freq_order
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.09007 0.00000 5.26000
##
## $word_freq_mail
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2394 0.1600 18.1800
##
## $word_freq_receive
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.05982 0.00000 2.61000
##
## $word_freq_will
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.1000 0.5417 0.8000 9.6700
##
## $word_freq_people
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.09393 0.00000 5.55000
##
## $word_freq_report
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.05863 0.00000 10.00000
##
## $word_freq_addresses
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.0492 0.0000 4.4100
##
## $word_freq_free
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2488 0.1000 20.0000
##
## $word_freq_business
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1426 0.0000 7.1400
##
## $word_freq_email
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1847 0.0000 9.0900
##
## $word_freq_you
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 1.310 1.662 2.640 18.750
##
## $word_freq_credit
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.08558 0.00000 18.18000
##
## $word_freq_your
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.2200 0.8098 1.2700 11.1100
##
## $word_freq_font
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1212 0.0000 17.1000
##
## $word_freq_000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1016 0.0000 5.4500
##
## $word_freq_money
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.09427 0.00000 12.50000
##
## $word_freq_hp
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.5495 0.0000 20.8300
##
## $word_freq_hpl
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2654 0.0000 16.6600
##
## $word_freq_george
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.7673 0.0000 33.3300
##
## $word_freq_650
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1248 0.0000 9.0900
##
## $word_freq_lab
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.09892 0.00000 14.28000
##
## $word_freq_labs
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1029 0.0000 5.8800
##
## $word_freq_telnet
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.06475 0.00000 12.50000
##
## $word_freq_857
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.04705 0.00000 4.76000
##
## $word_freq_data
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.09723 0.00000 18.18000
##
## $word_freq_415
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.04784 0.00000 4.76000
##
## $word_freq_85
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1054 0.0000 20.0000
##
## $word_freq_technology
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.09748 0.00000 7.69000
##
## $word_freq_1999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.137 0.000 6.890
##
## $word_freq_parts
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.0132 0.0000 8.3300
##
## $word_freq_pm
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.07863 0.00000 11.11000
##
## $word_freq_direct
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.06483 0.00000 4.76000
##
## $word_freq_cs
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.04367 0.00000 7.14000
##
## $word_freq_meeting
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1323 0.0000 14.2800
##
## $word_freq_original
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.0461 0.0000 3.5700
##
## $word_freq_project
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.0792 0.0000 20.0000
##
## $word_freq_re
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3012 0.1100 21.4200
##
## $word_freq_edu
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1798 0.0000 22.0500
##
## $word_freq_table
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000000 0.000000 0.000000 0.005444 0.000000 2.170000
##
## $word_freq_conference
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.03187 0.00000 10.00000
##
## $char_freq_.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.03857 0.00000 4.38500
##
## $char_freq_..1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.065 0.139 0.188 9.752
##
## $char_freq_..2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.01698 0.00000 4.08100
##
## $char_freq_..3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2691 0.3150 32.4780
##
## $char_freq_..4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.07581 0.05200 6.00300
##
## $char_freq_..5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.04424 0.00000 19.82900
##
## $capital_run_length_average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.588 2.276 5.191 3.706 1102.500
##
## $capital_run_length_longest
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 6.00 15.00 52.17 43.00 9989.00
##
## $capital_run_length_total
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 35.0 95.0 283.3 266.0 15841.0
##
## $status
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.394 1.000 1.000
#summary(dfrDataset)
Exploratory Statistics
lapply(dfrDataset, FUN=describe)
## $word_freq_make
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 4601 0.1 0.31 0 0.03 0 0 4.54 4.54 5.67 49.23 0
##
## $word_freq_address
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 4601 0.21 1.29 0 0.02 0 0 14.28 14.28 10.08 105.48
## se
## X1 0.02
##
## $word_freq_all
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 4601 0.28 0.5 0 0.17 0 0 5.1 5.1 3.01 13.29 0.01
##
## $word_freq_3d
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 4601 0.07 1.4 0 0 0 0 42.81 42.81 26.21 725.34
## se
## X1 0.02
##
## $word_freq_our
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 4601 0.31 0.67 0 0.16 0 0 10 10 4.74 37.88 0.01
##
## $word_freq_over
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 4601 0.1 0.27 0 0.03 0 0 5.88 5.88 5.95 68.34 0
##
## $word_freq_remove
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 4601 0.11 0.39 0 0.02 0 0 7.27 7.27 6.76 75.3
## se
## X1 0.01
##
## $word_freq_internet
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 4601 0.11 0.4 0 0.02 0 0 11.11 11.11 9.72 168.9
## se
## X1 0.01
##
## $word_freq_order
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 4601 0.09 0.28 0 0.01 0 0 5.26 5.26 5.22 46.87 0
##
## $word_freq_mail
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 4601 0.24 0.64 0 0.09 0 0 18.18 18.18 8.48 160.97
## se
## X1 0.01
##
## $word_freq_receive
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 4601 0.06 0.2 0 0.01 0 0 2.61 2.61 5.51 39.59 0
##
## $word_freq_will
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 4601 0.54 0.86 0.1 0.36 0.15 0 9.67 9.67 2.87 12.53
## se
## X1 0.01
##
## $word_freq_people
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 4601 0.09 0.3 0 0.02 0 0 5.55 5.55 6.95 84.81 0
##
## $word_freq_report
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 4601 0.06 0.34 0 0 0 0 10 10 11.75 228.85 0
##
## $word_freq_addresses
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 4601 0.05 0.26 0 0 0 0 4.41 4.41 6.97 57.64 0
##
## $word_freq_free
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 4601 0.25 0.83 0 0.08 0 0 20 20 10.76 196.12
## se
## X1 0.01
##
## $word_freq_business
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 4601 0.14 0.44 0 0.03 0 0 7.14 7.14 5.68 45.6
## se
## X1 0.01
##
## $word_freq_email
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 4601 0.18 0.53 0 0.05 0 0 9.09 9.09 5.41 47.89
## se
## X1 0.01
##
## $word_freq_you
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 4601 1.66 1.78 1.31 1.41 1.94 0 18.75 18.75 1.59 5.25
## se
## X1 0.03
##
## $word_freq_credit
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 4601 0.09 0.51 0 0 0 0 18.18 18.18 14.59 382.42
## se
## X1 0.01
##
## $word_freq_your
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 4601 0.81 1.2 0.22 0.56 0.33 0 11.11 11.11 2.43 8.99
## se
## X1 0.02
##
## $word_freq_font
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 4601 0.12 1.03 0 0 0 0 17.1 17.1 9.97 108.97
## se
## X1 0.02
##
## $word_freq_000
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 4601 0.1 0.35 0 0.01 0 0 5.45 5.45 5.71 46.73
## se
## X1 0.01
##
## $word_freq_money
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 4601 0.09 0.44 0 0.01 0 0 12.5 12.5 14.68 301.59
## se
## X1 0.01
##
## $word_freq_hp
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 4601 0.55 1.67 0 0.15 0 0 20.83 20.83 5.71 43.53
## se
## X1 0.02
##
## $word_freq_hpl
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 4601 0.27 0.89 0 0.05 0 0 16.66 16.66 6.35 63.8
## se
## X1 0.01
##
## $word_freq_george
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 4601 0.77 3.37 0 0.05 0 0 33.33 33.33 5.74 34.15
## se
## X1 0.05
##
## $word_freq_650
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 4601 0.12 0.54 0 0 0 0 9.09 9.09 6.6 58.28
## se
## X1 0.01
##
## $word_freq_lab
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 4601 0.1 0.59 0 0 0 0 14.28 14.28 11.36 174.98
## se
## X1 0.01
##
## $word_freq_labs
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 4601 0.1 0.46 0 0 0 0 5.88 5.88 6.63 51.93
## se
## X1 0.01
##
## $word_freq_telnet
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 4601 0.06 0.4 0 0 0 0 12.5 12.5 12.66 253.84
## se
## X1 0.01
##
## $word_freq_857
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 4601 0.05 0.33 0 0 0 0 4.76 4.76 10.54 127.18 0
##
## $word_freq_data
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 4601 0.1 0.56 0 0 0 0 18.18 18.18 13.18 295.64
## se
## X1 0.01
##
## $word_freq_415
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 4601 0.05 0.33 0 0 0 0 4.76 4.76 10.47 125.75 0
##
## $word_freq_85
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 4601 0.11 0.53 0 0 0 0 20 20 15.22 448.69
## se
## X1 0.01
##
## $word_freq_technology
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 4601 0.1 0.4 0 0 0 0 7.69 7.69 7.67 81.08 0.01
##
## $word_freq_1999
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 4601 0.14 0.42 0 0.03 0 0 6.89 6.89 5.32 42.55
## se
## X1 0.01
##
## $word_freq_parts
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 4601 0.01 0.22 0 0 0 0 8.33 8.33 28.24 910.66 0
##
## $word_freq_pm
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 4601 0.08 0.43 0 0 0 0 11.11 11.11 12.05 215.39
## se
## X1 0.01
##
## $word_freq_direct
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 4601 0.06 0.35 0 0 0 0 4.76 4.76 9.14 99.23
## se
## X1 0.01
##
## $word_freq_cs
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 4601 0.04 0.36 0 0 0 0 7.14 7.14 12.58 193.32
## se
## X1 0.01
##
## $word_freq_meeting
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 4601 0.13 0.77 0 0 0 0 14.28 14.28 9.45 115.53
## se
## X1 0.01
##
## $word_freq_original
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 4601 0.05 0.22 0 0 0 0 3.57 3.57 7.62 78.45 0
##
## $word_freq_project
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 4601 0.08 0.62 0 0 0 0 20 20 18.76 479.1
## se
## X1 0.01
##
## $word_freq_re
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 4601 0.3 1.01 0 0.09 0 0 21.42 21.42 9.14 128.67
## se
## X1 0.01
##
## $word_freq_edu
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 4601 0.18 0.91 0 0 0 0 22.05 22.05 10.12 150.67
## se
## X1 0.01
##
## $word_freq_table
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 4601 0.01 0.08 0 0 0 0 2.17 2.17 19.85 458.73 0
##
## $word_freq_conference
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 4601 0.03 0.29 0 0 0 0 10 10 19.71 536.67 0
##
## $char_freq_.
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 4601 0.04 0.24 0 0 0 0 4.38 4.38 13.7 212.74 0
##
## $char_freq_..1
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 4601 0.14 0.27 0.06 0.09 0.1 0 9.75 9.75 13.57 392.81 0
##
## $char_freq_..2
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 4601 0.02 0.11 0 0 0 0 4.08 4.08 21.07 617.53 0
##
## $char_freq_..3
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 4601 0.27 0.82 0 0.13 0 0 32.48 32.48 18.65 606.53
## se
## X1 0.01
##
## $char_freq_..4
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 4601 0.08 0.25 0 0.03 0 0 6 6 11.16 199.65 0
##
## $char_freq_..5
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 4601 0.04 0.43 0 0 0 0 19.83 19.83 31.04 1216.64
## se
## X1 0.01
##
## $capital_run_length_average
## vars n mean sd median trimmed mad min max range skew
## X1 1 4601 5.19 31.73 2.28 2.63 1.31 1 1102.5 1101.5 23.75
## kurtosis se
## X1 669.35 0.47
##
## $capital_run_length_longest
## vars n mean sd median trimmed mad min max range skew
## X1 1 4601 52.17 194.89 15 23.52 16.31 1 9989 9988 30.74
## kurtosis se
## X1 1478.39 2.87
##
## $capital_run_length_total
## vars n mean sd median trimmed mad min max range skew
## X1 1 4601 283.29 606.35 95 156.6 114.16 1 15841 15840 8.7
## kurtosis se
## X1 145.61 8.94
##
## $status
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 4601 0.39 0.49 0 0.37 0 0 1 1 0.43 -1.81 0.01
#describe(dfrDataset)
Observations
1. summary() shows how the values of each column are distributed across the quartiles.
2. describe() reports further statistics for each column, e.g. skewness, kurtosis and the number of values.
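As a rough illustration of what the skew and kurtosis columns measure, the sketch below computes moment-based skewness and excess kurtosis by hand. The helper names and sample vectors are hypothetical, and `describe` may apply a slightly different small-sample correction, so the numbers need not match its output exactly.

```r
# Moment-based skewness: third central moment over variance^(3/2).
skewMoment <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Excess kurtosis: fourth central moment over variance^2, minus 3.
kurtExcess <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

symmetric <- c(1, 2, 3, 4, 5)                 # balanced around the mean
zeroHeavy <- c(0, 0, 0, 0, 0, 0, 0, 0, 1, 10) # long right tail

print(skewMoment(symmetric))
print(skewMoment(zeroHeavy))
print(kurtExcess(zeroHeavy))
```

A large positive skew, as seen for most word-frequency columns, indicates many small values and a few very large ones.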
Missing Data
lapply(dfrDataset, FUN=detectNA)
## $word_freq_make
## [1] 0
##
## $word_freq_address
## [1] 0
##
## $word_freq_all
## [1] 0
##
## $word_freq_3d
## [1] 0
##
## $word_freq_our
## [1] 0
##
## $word_freq_over
## [1] 0
##
## $word_freq_remove
## [1] 0
##
## $word_freq_internet
## [1] 0
##
## $word_freq_order
## [1] 0
##
## $word_freq_mail
## [1] 0
##
## $word_freq_receive
## [1] 0
##
## $word_freq_will
## [1] 0
##
## $word_freq_people
## [1] 0
##
## $word_freq_report
## [1] 0
##
## $word_freq_addresses
## [1] 0
##
## $word_freq_free
## [1] 0
##
## $word_freq_business
## [1] 0
##
## $word_freq_email
## [1] 0
##
## $word_freq_you
## [1] 0
##
## $word_freq_credit
## [1] 0
##
## $word_freq_your
## [1] 0
##
## $word_freq_font
## [1] 0
##
## $word_freq_000
## [1] 0
##
## $word_freq_money
## [1] 0
##
## $word_freq_hp
## [1] 0
##
## $word_freq_hpl
## [1] 0
##
## $word_freq_george
## [1] 0
##
## $word_freq_650
## [1] 0
##
## $word_freq_lab
## [1] 0
##
## $word_freq_labs
## [1] 0
##
## $word_freq_telnet
## [1] 0
##
## $word_freq_857
## [1] 0
##
## $word_freq_data
## [1] 0
##
## $word_freq_415
## [1] 0
##
## $word_freq_85
## [1] 0
##
## $word_freq_technology
## [1] 0
##
## $word_freq_1999
## [1] 0
##
## $word_freq_parts
## [1] 0
##
## $word_freq_pm
## [1] 0
##
## $word_freq_direct
## [1] 0
##
## $word_freq_cs
## [1] 0
##
## $word_freq_meeting
## [1] 0
##
## $word_freq_original
## [1] 0
##
## $word_freq_project
## [1] 0
##
## $word_freq_re
## [1] 0
##
## $word_freq_edu
## [1] 0
##
## $word_freq_table
## [1] 0
##
## $word_freq_conference
## [1] 0
##
## $char_freq_.
## [1] 0
##
## $char_freq_..1
## [1] 0
##
## $char_freq_..2
## [1] 0
##
## $char_freq_..3
## [1] 0
##
## $char_freq_..4
## [1] 0
##
## $char_freq_..5
## [1] 0
##
## $capital_run_length_average
## [1] 0
##
## $capital_run_length_longest
## [1] 0
##
## $capital_run_length_total
## [1] 0
##
## $status
## [1] 0
Observations
1. There are no NA records in the dataset.
2. Data imputation is therefore not required.
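The same per-column check can also be written compactly in base R. The sketch below uses a hypothetical toy data frame rather than the spam data.

```r
# Hypothetical toy data frame with a few missing values.
dfrToy <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA), c = 1:3)

# is.na() on a data frame returns a logical matrix; colSums() counts
# the TRUE entries per column.
naCounts <- colSums(is.na(dfrToy))
print(naCounts)

# Single TRUE/FALSE answer for the whole data frame.
print(anyNA(dfrToy))
```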
Data Imputation
# inline comments
#Data_Impute(dfrDataset$word_freq_make)
#dfrDataset$word_freq_make[is.na(dfrDataset$word_freq_make)] <- mean(dfrDataset$word_freq_make[!is.na(dfrDataset$word_freq_make)])
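Had imputation been needed, the mean replacement shown in the commented line could be applied to every column at once. A minimal sketch, assuming all columns are numeric; `imputeMean` and `dfrToy` are hypothetical names used only for illustration.

```r
# Replace NAs in a numeric vector with the mean of its non-NA values.
imputeMean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Hypothetical toy data frame with one NA per column.
dfrToy <- data.frame(a = c(1, NA, 3), b = c(10, 20, NA))

# Assigning into dfrToy[] keeps the data frame structure while
# replacing every column with its imputed version.
dfrToy[] <- lapply(dfrToy, imputeMean)
print(dfrToy)
```

The function returns the imputed vector, so the result has to be assigned back; modifying a data frame inside a function does not change the caller's copy.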
Box Plot
lapply(dfrDataset, FUN=Graph_Boxplot)
## (List of 58 box plots, one per column from word_freq_make to status; the plot images are not reproduced here.)
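One way to get a titled box plot per column is to loop over the column names instead of the column vectors. This is a sketch under assumptions: `dfrToy` is a hypothetical stand-in for the dataset, and the `.data` pronoun requires a reasonably recent ggplot2.

```r
library(ggplot2)

# Hypothetical toy data standing in for dfrDataset.
dfrToy <- data.frame(a = c(0, 0, 0.5, 2, 9), b = c(1, 2, 3, 4, 50))

# Build one box plot per column name, using the name in the title.
boxplots <- lapply(names(dfrToy), function(colName) {
  ggplot(dfrToy, aes(x = "", y = .data[[colName]])) +
    geom_boxplot(color = "green") +
    labs(title = paste("Outliers:", colName), x = NULL, y = colName)
})
names(boxplots) <- names(dfrToy)

# Print a single plot, e.g. for column "a".
print(boxplots[["a"]])
```

Keeping the plots in a named list makes it easy to print only the columns of interest rather than all of them.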
Outliers
detect_outliers(dfrDataset$word_freq_make)
## [1] 1053
detect_outliers(dfrDataset$word_freq_address)
## [1] 898
detect_outliers(dfrDataset$word_freq_all)
## [1] 338
detect_outliers(dfrDataset$word_freq_3d)
## [1] 47
detect_outliers(dfrDataset$word_freq_our)
## [1] 501
detect_outliers(dfrDataset$word_freq_over)
## [1] 999
detect_outliers(dfrDataset$word_freq_remove)
## [1] 807
detect_outliers(dfrDataset$word_freq_internet)
## [1] 824
detect_outliers(dfrDataset$word_freq_order)
## [1] 773
detect_outliers(dfrDataset$word_freq_mail)
## [1] 852
detect_outliers(dfrDataset$word_freq_receive)
## [1] 709
detect_outliers(dfrDataset$word_freq_will)
## [1] 270
detect_outliers(dfrDataset$word_freq_people)
## [1] 852
detect_outliers(dfrDataset$word_freq_report)
## [1] 357
detect_outliers(dfrDataset$word_freq_addresses)
## [1] 336
detect_outliers(dfrDataset$word_freq_free)
## [1] 957
detect_outliers(dfrDataset$word_freq_business)
## [1] 963
detect_outliers(dfrDataset$word_freq_email)
## [1] 1038
detect_outliers(dfrDataset$word_freq_you)
## [1] 75
detect_outliers(dfrDataset$word_freq_credit)
## [1] 424
detect_outliers(dfrDataset$word_freq_your)
## [1] 229
detect_outliers(dfrDataset$word_freq_font)
## [1] 117
detect_outliers(dfrDataset$word_freq_000)
## [1] 679
detect_outliers(dfrDataset$word_freq_money)
## [1] 735
detect_outliers(dfrDataset$word_freq_hp)
## [1] 1090
detect_outliers(dfrDataset$word_freq_hpl)
## [1] 811
detect_outliers(dfrDataset$word_freq_george)
## [1] 780
detect_outliers(dfrDataset$word_freq_650)
## [1] 463
detect_outliers(dfrDataset$word_freq_lab)
## [1] 372
detect_outliers(dfrDataset$word_freq_labs)
## [1] 469
detect_outliers(dfrDataset$word_freq_telnet)
## [1] 293
detect_outliers(dfrDataset$word_freq_857)
## [1] 205
detect_outliers(dfrDataset$word_freq_data)
## [1] 405
detect_outliers(dfrDataset$word_freq_415)
## [1] 215
detect_outliers(dfrDataset$word_freq_85)
## [1] 485
detect_outliers(dfrDataset$word_freq_technology)
## [1] 599
detect_outliers(dfrDataset$word_freq_1999)
## [1] 829
detect_outliers(dfrDataset$word_freq_parts)
## [1] 83
detect_outliers(dfrDataset$word_freq_pm)
## [1] 384
detect_outliers(dfrDataset$word_freq_direct)
## [1] 453
detect_outliers(dfrDataset$word_freq_cs)
## [1] 148
detect_outliers(dfrDataset$word_freq_meeting)
## [1] 341
detect_outliers(dfrDataset$word_freq_original)
## [1] 375
detect_outliers(dfrDataset$word_freq_project)
## [1] 327
detect_outliers(dfrDataset$word_freq_re)
## [1] 1001
detect_outliers(dfrDataset$word_freq_edu)
## [1] 517
detect_outliers(dfrDataset$word_freq_table)
## [1] 63
detect_outliers(dfrDataset$word_freq_conference)
## [1] 203
detect_outliers(dfrDataset$char_freq_.)
## [1] 790
detect_outliers(dfrDataset$char_freq_..1)
## [1] 296
detect_outliers(dfrDataset$char_freq_..2)
## [1] 529
detect_outliers(dfrDataset$char_freq_..3)
## [1] 411
detect_outliers(dfrDataset$char_freq_..4)
## [1] 811
detect_outliers(dfrDataset$char_freq_..5)
## [1] 750
detect_outliers(dfrDataset$capital_run_length_average)
## [1] 363
detect_outliers(dfrDataset$capital_run_length_longest)
## [1] 463
detect_outliers(dfrDataset$capital_run_length_total)
## [1] 550
detect_outliers(dfrDataset$status)
## [1] 0
Observations
1. The data set contains outliers in every predictor column.
2. Random Forest is insensitive to outliers, so the outliers are retained for this model.
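The column-by-column calls above can be condensed into a single sapply() call. The sketch below is a minimal illustration on toy data, not the report's own helper: count_outliers_iqr() is a hypothetical stand-in that counts values outside the 1.5 * IQR whiskers, which may differ from the rule used by detect_outliers().

```r
# Hypothetical helper: count values outside the 1.5 * IQR whiskers.
# This may differ from the detect_outliers() used in the report.
count_outliers_iqr <- function(x) {
  length(boxplot.stats(x)$out)
}

# Toy zero-inflated data standing in for the word-frequency columns.
set.seed(1)
dfrToy <- data.frame(
  word_freq_a = c(rep(0, 90), runif(10, 0.5, 5)),
  word_freq_b = c(rep(0, 60), runif(40, 0.1, 2)),
  status      = rep(c(0, 1), each = 50)
)

# One call replaces the per-column invocations.
vcnOutliers <- sapply(dfrToy, count_outliers_iqr)
print(vcnOutliers)
```

In a column where at least three quarters of the values are zero, the interquartile range is zero, so every non-zero value is counted as an outlier. This is one plausible reason the counts above are so high.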
Outliers Treatment
#dfrDataset$word_freq_make<- Replace_Outliers_to_Median(dfrDataset$word_freq_make)
#dfrDataset$word_freq_address<- Replace_Outliers_to_Median(dfrDataset$word_freq_address)
#dfrDataset$word_freq_all<- Replace_Outliers_to_Median(dfrDataset$word_freq_all)
#dfrDataset$word_freq_3d<- Replace_Outliers_to_Median(dfrDataset$word_freq_3d)
#dfrDataset$word_freq_our<- Replace_Outliers_to_Median(dfrDataset$word_freq_our)
#dfrDataset$word_freq_over<- Replace_Outliers_to_Median(dfrDataset$word_freq_over)
#dfrDataset$word_freq_remove<- #Replace_Outliers_to_Median(dfrDataset$word_freq_remove)
#dfrDataset$word_freq_internet<- #Replace_Outliers_to_Median(dfrDataset$word_freq_internet)
#dfrDataset$word_freq_order<- #Replace_Outliers_to_Median(dfrDataset$word_freq_order)
#dfrDataset$word_freq_mail<- Replace_Outliers_to_Median(dfrDataset$word_freq_mail)
#dfrDataset$word_freq_receive<- Replace_Outliers_to_Median(dfrDataset$word_freq_receive)
#dfrDataset$word_freq_will<- Replace_Outliers_to_Median(dfrDataset$word_freq_will)
#dfrDataset$word_freq_people<- Replace_Outliers_to_Median(dfrDataset$word_freq_people)
#dfrDataset$word_freq_report<- Replace_Outliers_to_Median(dfrDataset$word_freq_report)
#dfrDataset$word_freq_addresses<- Replace_Outliers_to_Median(dfrDataset$word_freq_addresses)
#dfrDataset$word_freq_free<- Replace_Outliers_to_Median(dfrDataset$word_freq_free)
#dfrDataset$word_freq_business<- Replace_Outliers_to_Median(dfrDataset$word_freq_business)
#dfrDataset$word_freq_email<- Replace_Outliers_to_Median(dfrDataset$word_freq_email)
#dfrDataset$word_freq_you<- Replace_Outliers_to_Median(dfrDataset$word_freq_you)
#dfrDataset$word_freq_credit<- #Replace_Outliers_to_Median(dfrDataset$word_freq_credit)
#dfrDataset$word_freq_your<- Replace_Outliers_to_Median(dfrDataset$word_freq_your)
#dfrDataset$word_freq_font<- Replace_Outliers_to_Median(dfrDataset$word_freq_font)
#dfrDataset$word_freq_000<- Replace_Outliers_to_Median(dfrDataset$word_freq_000)
#dfrDataset$word_freq_money<- Replace_Outliers_to_Median(dfrDataset$word_freq_money)
#dfrDataset$word_freq_hp<- Replace_Outliers_to_Median(dfrDataset$word_freq_hp)
#dfrDataset$word_freq_hpl<- Replace_Outliers_to_Median(dfrDataset$word_freq_hpl)
#dfrDataset$word_freq_george<- Replace_Outliers_to_Median(dfrDataset$word_freq_george)
#dfrDataset$word_freq_650<- Replace_Outliers_to_Median(dfrDataset$word_freq_650)
#dfrDataset$word_freq_lab<- Replace_Outliers_to_Median(dfrDataset$word_freq_lab)
#dfrDataset$word_freq_labs<- Replace_Outliers_to_Median(dfrDataset$word_freq_labs)
#dfrDataset$word_freq_telnet<- Replace_Outliers_to_Median(dfrDataset$word_freq_telnet)
#dfrDataset$word_freq_857<- Replace_Outliers_to_Median(dfrDataset$word_freq_857)
#dfrDataset$word_freq_data<- Replace_Outliers_to_Median(dfrDataset$word_freq_data)
#dfrDataset$word_freq_415<- Replace_Outliers_to_Median(dfrDataset$word_freq_415)
#dfrDataset$word_freq_85<- Replace_Outliers_to_Median(dfrDataset$word_freq_85)
#dfrDataset$word_freq_technology<- Replace_Outliers_to_Median(dfrDataset$word_freq_technology)
#dfrDataset$word_freq_1999<- Replace_Outliers_to_Median(dfrDataset$word_freq_1999)
#dfrDataset$word_freq_parts<- Replace_Outliers_to_Median(dfrDataset$word_freq_parts)
#dfrDataset$word_freq_pm<- Replace_Outliers_to_Median(dfrDataset$word_freq_pm)
#dfrDataset$word_freq_direct<- Replace_Outliers_to_Median(dfrDataset$word_freq_direct)
#dfrDataset$word_freq_cs<- Replace_Outliers_to_Median(dfrDataset$word_freq_cs)
#dfrDataset$word_freq_meeting<- Replace_Outliers_to_Median(dfrDataset$word_freq_meeting)
#dfrDataset$word_freq_original<- Replace_Outliers_to_Median(dfrDataset$word_freq_original)
#dfrDataset$word_freq_project<- Replace_Outliers_to_Median(dfrDataset$word_freq_project)
#dfrDataset$word_freq_re<- Replace_Outliers_to_Median(dfrDataset$word_freq_re)
#dfrDataset$word_freq_edu<- Replace_Outliers_to_Median(dfrDataset$word_freq_edu)
#dfrDataset$word_freq_table<- #Replace_Outliers_to_Median(dfrDataset$word_freq_table)
#dfrDataset$word_freq_conference<- #Replace_Outliers_to_Median(dfrDataset$word_freq_conference)
#dfrDataset$char_freq_.<- Replace_Outliers_to_Median(dfrDataset$char_freq_.)
#dfrDataset$char_freq_..1<- Replace_Outliers_to_Median(dfrDataset$char_freq_..1)
#dfrDataset$char_freq_..2<- Replace_Outliers_to_Median(dfrDataset$char_freq_..2)
#dfrDataset$char_freq_..3<- Replace_Outliers_to_Median(dfrDataset$char_freq_..3)
#dfrDataset$char_freq_..4<- Replace_Outliers_to_Median(dfrDataset$char_freq_..4)
#dfrDataset$char_freq_..5<- Replace_Outliers_to_Median(dfrDataset$char_freq_..5)
#dfrDataset$capital_run_length_average<- #Replace_Outliers_to_Median(dfrDataset$capital_run_length_average)
#dfrDataset$capital_run_length_longest<- Replace_Outliers_to_Median(dfrDataset$capital_run_length_longest)
#dfrDataset$capital_run_length_total<- Replace_Outliers_to_Median(dfrDataset$capital_run_length_total)
#dfrDataset$status<- Replace_Outliers_to_Median(dfrDataset$status)
#dfrDataset <- dfrDataset[complete.cases(dfrDataset), ]
#nrow(dfrDataset)
#head(dfrDataset)
Reasons Why Outlier Handling Is Not Required
1. The data set has 4601 records. After removing the outliers of every column only 177 records remain, which leaves almost all of the data out of the analysis.
2. If outliers are replaced with the mean, the same number of outliers remains, because the median of most columns is zero and the non-zero mean is itself flagged as an outlier.
3. If outliers are replaced with the median, many columns hold nothing but zeroes, because the median of most columns is zero; after that replacement the correlation between variables can no longer be determined.
4. Random Forest is insensitive to outliers.
5. After creating the models with different methods, we found that accuracy is great and that outliers cause no problem in prediction. (This point was added after model creation.)
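Point 3 can be illustrated on a toy column. The sketch below assumes an IQR-based outlier rule (boxplot.stats()) and a zero-inflated vector similar in shape to the word_freq_* attributes; the names vcnFreq and vcnReplaced are made up for the example.

```r
# Toy zero-inflated column: 80% zeros, 20% non-zero frequencies.
vcnFreq <- c(rep(0, 80), seq(0.1, 2.0, by = 0.1))

# Assumed replacement step: overwrite IQR outliers with the median.
vcnOut <- boxplot.stats(vcnFreq)$out
vcnReplaced <- ifelse(vcnFreq %in% vcnOut, median(vcnFreq), vcnFreq)

median(vcnFreq)              # the median of the column
length(unique(vcnReplaced))  # number of distinct values left after replacement
```

Because the median and both quartiles are zero, every non-zero value is flagged and overwritten, leaving a constant column that carries no information for correlation or modelling.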
Spam Distribution
status <- table(dfrDataset$status)
status
##
## 0 1
## 2788 1813
prop.table(status)
##
## 0 1
## 0.6059552 0.3940448
Observations
1. 1813 e-mails are spam, which is 39.40% of all records.
Top Words of Spam & non Spam mails
dfrDataset_All <- aggregate(dfrDataset[, 1:54], list(dfrDataset$status), mean)
dfrDataset_spam <- dfrDataset_All[dfrDataset_All$Group.1==1,]
dfrDataset_spam$Group.1 <- NULL
dfrDataset_spam <- dfrDataset_spam[order(-dfrDataset_spam)[1:3]]
dfrDataset_spam <- unlist(dfrDataset_spam)
barplot(dfrDataset_spam, main="Spam Emails Top 3 Words",ylab="Percentage", xlab="Words")
dfrDataset_notspam <- dfrDataset_All[dfrDataset_All$Group.1==0,]
dfrDataset_notspam$Group.1 <- NULL
dfrDataset_notspam <- dfrDataset_notspam[order(-dfrDataset_notspam)[1:3]]
dfrDataset_notspam <- unlist(dfrDataset_notspam)
barplot(dfrDataset_notspam, main="Non Spam Emails Top 3 Words", ylab="Percentage", xlab="Words")
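The same "top words per class" idea can be written more explicitly with colMeans() and sort(). This is only a sketch on toy data; top_words() and the three toy columns are assumptions for illustration, not part of the original analysis.

```r
# Toy data with three word-frequency columns and a 0/1 label.
dfrToy <- data.frame(
  word_freq_you  = c(2.0, 3.0, 1.0, 0.5),
  word_freq_free = c(1.5, 0.5, 0.0, 0.0),
  word_freq_hp   = c(0.0, 0.0, 2.0, 4.0),
  status         = c(1, 1, 0, 0)
)

# Mean frequency of each word within one class, sorted in descending order.
top_words <- function(dfr, cls, n = 2) {
  vcnMeans <- colMeans(dfr[dfr$status == cls, names(dfr) != "status", drop = FALSE])
  head(sort(vcnMeans, decreasing = TRUE), n)
}

vcnSpamTop <- top_words(dfrToy, cls = 1)
vcnHamTop  <- top_words(dfrToy, cls = 0)
print(vcnSpamTop)
print(vcnHamTop)
```

The returned named vectors can be passed directly to barplot(), as in the code above.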
Output Visualisation
dfrQltyFreq <- summarise(group_by(dfrDataset, status), count=n())
dfrQltyFreq
## # A tibble: 2 x 2
## status count
## <int> <int>
## 1 0 2788
## 2 1 1813
# bar chart of e-mail counts by status
ggplot(dfrQltyFreq, aes(x=status, y=count)) +
geom_bar(stat="identity", aes(fill=count)) +
labs(title="Status Frequency Distribution") +
labs(x="Status") +
labs(y="Counts")
Find Correlations
## find correlations
vcnCorsData <- abs(sapply(colnames(dfrDataset), detectCor)) #absolute value
summary(vcnCorsData)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.002525 0.148801 0.253256 0.276879 0.354708 1.000000
Show Correlations
vcnCorsData
## word_freq_make word_freq_address
## 0.24069974 0.29750940
## word_freq_all word_freq_3d
## 0.33283147 0.09077776
## word_freq_our word_freq_over
## 0.40913946 0.31864550
## word_freq_remove word_freq_internet
## 0.51877779 0.34379623
## word_freq_order word_freq_mail
## 0.30073703 0.29682394
## word_freq_receive word_freq_will
## 0.35496682 0.14847653
## word_freq_people word_freq_report
## 0.21287588 0.14977533
## word_freq_addresses word_freq_free
## 0.26515743 0.50416922
## word_freq_business word_freq_email
## 0.35290749 0.29909391
## word_freq_you word_freq_credit
## 0.36110406 0.32418657
## word_freq_your word_freq_font
## 0.50159062 0.13797471
## word_freq_000 word_freq_money
## 0.42580256 0.47215455
## word_freq_hp word_freq_hpl
## 0.39981558 0.34188069
## word_freq_george word_freq_650
## 0.35393063 0.22619064
## word_freq_lab word_freq_labs
## 0.22068802 0.24580530
## word_freq_telnet word_freq_857
## 0.20467400 0.16983798
## word_freq_data word_freq_415
## 0.15756347 0.15802818
## word_freq_85 word_freq_technology
## 0.21413087 0.16680254
## word_freq_1999 word_freq_parts
## 0.26070752 0.00252536
## word_freq_pm word_freq_direct
## 0.14721389 0.02813193
## word_freq_cs word_freq_meeting
## 0.14453750 0.19574176
## word_freq_original word_freq_project
## 0.10781412 0.14453744
## word_freq_re word_freq_edu
## 0.07176763 0.19702549
## word_freq_table word_freq_conference
## 0.02266674 0.13903044
## char_freq_. char_freq_..1
## 0.05683530 0.03263555
## char_freq_..2 char_freq_..3
## 0.11122690 0.59785363
## char_freq_..4 char_freq_..5
## 0.56563314 0.26668614
## capital_run_length_average capital_run_length_longest
## 0.48794983 0.51515693
## capital_run_length_total status
## 0.44397367 1.00000000
Plot Correlations
#corrplot(cor(dfrDataset[1:10,1:58])[1:58,58, drop=FALSE], cl.pos='n')
corrplot(cor(dfrDataset[c(1:10,58)]))
corrplot(cor(dfrDataset[c(10:20,58)]))
corrplot(cor(dfrDataset[c(20:30,58)]))
corrplot(cor(dfrDataset[c(30:40,58)]))
corrplot(cor(dfrDataset[c(40:50,58)]))
corrplot(cor(dfrDataset[c(50:58)]))
Observations
1. Correlation has been plotted between the status variable and the other variables.
2. Only a few variables show medium to high correlation with the status variable.
More than Medium Correlations
vcnCorsData[vcnCorsData>0.4]
## word_freq_our word_freq_remove
## 0.4091395 0.5187778
## word_freq_free word_freq_your
## 0.5041692 0.5015906
## word_freq_000 word_freq_money
## 0.4258026 0.4721546
## char_freq_..3 char_freq_..4
## 0.5978536 0.5656331
## capital_run_length_average capital_run_length_longest
## 0.4879498 0.5151569
## capital_run_length_total status
## 0.4439737 1.0000000
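The correlation vector and the 0.4 filter can be reproduced with base R alone. The sketch below is an assumption about what a helper such as detectCor computes (absolute Pearson correlation of each column with status); it uses synthetic data so that it runs on its own.

```r
set.seed(707)
n <- 200
status <- rep(c(0, 1), each = n / 2)
dfrToy <- data.frame(
  strong = status * 2 + rnorm(n, sd = 0.5),  # tracks the label closely
  noise  = rnorm(n),                          # unrelated to the label
  status = status
)

# Absolute Pearson correlation of every column with the label.
vcnCors <- abs(cor(dfrToy)[, "status"])
print(round(vcnCors, 3))

# Keep only columns above the chosen threshold (0.4, as in the text).
vcnHigh <- vcnCors[vcnCors > 0.4]
print(names(vcnHigh))
```

As in the output above, status always appears in the filtered result because a variable correlates perfectly with itself; it should be dropped before the list is used for feature selection.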
Create Column IsSpam
dfrDataset <- mutate(dfrDataset, IsSpam= ifelse(dfrDataset$status ==0,'Not Spam',
'Spam'))
dfrDataset$IsSpam <- as.factor(dfrDataset$IsSpam)
table(dfrDataset$IsSpam)
##
## Not Spam Spam
## 2788 1813
#By now you should have checked for correlation and data imbalance and removed columns if required.
#Also remove columns with many NA values.
Observations
1. New variable has been created successfully according to status variable.
2. There are 2788 email records which are not spam while 1813 emails are spam.
Dataset Split
set.seed(707)
vctTrnRecs <- createDataPartition(y=dfrDataset$status, p=0.8, list=FALSE)
dfrTrnData <- dfrDataset[vctTrnRecs,]
dfrTstData <- dfrDataset[-vctTrnRecs,]
Observations
1. The data set is split into training and test data sets.
2. 80% of the data goes to the training set to build a better model, while 20% is held out to test the model.
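What a stratified 80/20 split does can be sketched in base R. This is not caret's implementation of createDataPartition(), only a minimal illustration of sampling within each class so that both partitions keep the class ratio; the toy label vector is an assumption.

```r
set.seed(707)
# Toy label vector with roughly the 60/40 class ratio of the data.
vcnStatus <- c(rep(0, 60), rep(1, 40))

# Sample 80% of the row indices within each class separately.
lstIdx <- split(seq_along(vcnStatus), vcnStatus)
vctTrn <- sort(unlist(lapply(lstIdx, function(idx) {
  idx[sample.int(length(idx), size = round(0.8 * length(idx)))]
}), use.names = FALSE))
vctTst <- setdiff(seq_along(vcnStatus), vctTrn)

# Class proportions are preserved in both partitions.
print(prop.table(table(vcnStatus[vctTrn])))
print(prop.table(table(vcnStatus[vctTst])))
```

When using caret, passing the factor column (here IsSpam) as y is one way to make sure the split is stratified by class rather than by percentile groups of a numeric vector.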
Training Dataset RowCount & ColCount
dim(dfrTrnData)
## [1] 3681 59
Observations
1. There are 3681 data records in the training data set.
Testing Dataset RowCount & ColCount
dim(dfrTstData)
## [1] 920 59
Observations
1. There are 920 data records in the test data set.
Training Dataset Head
head(dfrTrnData)
## word_freq_make word_freq_address word_freq_all word_freq_3d
## 2 0.21 0.28 0.50 0
## 3 0.06 0.00 0.71 0
## 4 0.00 0.00 0.00 0
## 10 0.06 0.12 0.77 0
## 11 0.00 0.00 0.00 0
## 12 0.00 0.00 0.25 0
## word_freq_our word_freq_over word_freq_remove word_freq_internet
## 2 0.14 0.28 0.21 0.07
## 3 1.23 0.19 0.19 0.12
## 4 0.63 0.00 0.31 0.63
## 10 0.19 0.32 0.38 0.00
## 11 0.00 0.00 0.96 0.00
## 12 0.38 0.25 0.25 0.00
## word_freq_order word_freq_mail word_freq_receive word_freq_will
## 2 0.00 0.94 0.21 0.79
## 3 0.64 0.25 0.38 0.45
## 4 0.31 0.63 0.31 0.31
## 10 0.06 0.00 0.00 0.64
## 11 0.00 1.92 0.96 0.00
## 12 0.00 0.00 0.12 0.12
## word_freq_people word_freq_report word_freq_addresses word_freq_free
## 2 0.65 0.21 0.14 0.14
## 3 0.12 0.00 1.75 0.06
## 4 0.31 0.00 0.00 0.31
## 10 0.25 0.00 0.12 0.00
## 11 0.00 0.00 0.00 0.00
## 12 0.12 0.00 0.00 0.00
## word_freq_business word_freq_email word_freq_you word_freq_credit
## 2 0.07 0.28 3.47 0.00
## 3 0.06 1.03 1.36 0.32
## 4 0.00 0.00 3.18 0.00
## 10 0.00 0.12 1.67 0.06
## 11 0.00 0.96 3.84 0.00
## 12 0.00 0.00 1.16 0.00
## word_freq_your word_freq_font word_freq_000 word_freq_money
## 2 1.59 0 0.43 0.43
## 3 0.51 0 1.16 0.06
## 4 0.31 0 0.00 0.00
## 10 0.71 0 0.19 0.00
## 11 0.96 0 0.00 0.00
## 12 0.77 0 0.00 0.00
## word_freq_hp word_freq_hpl word_freq_george word_freq_650 word_freq_lab
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 10 0 0 0 0 0
## 11 0 0 0 0 0
## 12 0 0 0 0 0
## word_freq_labs word_freq_telnet word_freq_857 word_freq_data
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 10 0 0 0 0
## 11 0 0 0 0
## 12 0 0 0 0
## word_freq_415 word_freq_85 word_freq_technology word_freq_1999
## 2 0 0 0 0.07
## 3 0 0 0 0.00
## 4 0 0 0 0.00
## 10 0 0 0 0.00
## 11 0 0 0 0.00
## 12 0 0 0 0.00
## word_freq_parts word_freq_pm word_freq_direct word_freq_cs
## 2 0 0 0.00 0
## 3 0 0 0.06 0
## 4 0 0 0.00 0
## 10 0 0 0.00 0
## 11 0 0 0.96 0
## 12 0 0 0.00 0
## word_freq_meeting word_freq_original word_freq_project word_freq_re
## 2 0 0.00 0.00 0.00
## 3 0 0.12 0.00 0.06
## 4 0 0.00 0.00 0.00
## 10 0 0.00 0.06 0.00
## 11 0 0.00 0.00 0.00
## 12 0 0.00 0.00 0.00
## word_freq_edu word_freq_table word_freq_conference char_freq_.
## 2 0.00 0 0 0.000
## 3 0.06 0 0 0.010
## 4 0.00 0 0 0.000
## 10 0.00 0 0 0.040
## 11 0.00 0 0 0.000
## 12 0.00 0 0 0.022
## char_freq_..1 char_freq_..2 char_freq_..3 char_freq_..4 char_freq_..5
## 2 0.132 0 0.372 0.180 0.048
## 3 0.143 0 0.276 0.184 0.010
## 4 0.137 0 0.137 0.000 0.000
## 10 0.030 0 0.244 0.081 0.000
## 11 0.000 0 0.462 0.000 0.000
## 12 0.044 0 0.663 0.000 0.000
## capital_run_length_average capital_run_length_longest
## 2 5.114 101
## 3 9.821 485
## 4 3.537 40
## 10 1.729 43
## 11 1.312 6
## 12 1.243 11
## capital_run_length_total status IsSpam
## 2 1028 1 Spam
## 3 2259 1 Spam
## 4 191 1 Spam
## 10 749 1 Spam
## 11 21 1 Spam
## 12 184 1 Spam
Testing Dataset Head
head(dfrTstData)
## word_freq_make word_freq_address word_freq_all word_freq_3d
## 1 0.00 0.64 0.64 0
## 5 0.00 0.00 0.00 0
## 6 0.00 0.00 0.00 0
## 7 0.00 0.00 0.00 0
## 8 0.00 0.00 0.00 0
## 9 0.15 0.00 0.46 0
## word_freq_our word_freq_over word_freq_remove word_freq_internet
## 1 0.32 0 0.00 0.00
## 5 0.63 0 0.31 0.63
## 6 1.85 0 0.00 1.85
## 7 1.92 0 0.00 0.00
## 8 1.88 0 0.00 1.88
## 9 0.61 0 0.30 0.00
## word_freq_order word_freq_mail word_freq_receive word_freq_will
## 1 0.00 0.00 0.00 0.64
## 5 0.31 0.63 0.31 0.31
## 6 0.00 0.00 0.00 0.00
## 7 0.00 0.64 0.96 1.28
## 8 0.00 0.00 0.00 0.00
## 9 0.92 0.76 0.76 0.92
## word_freq_people word_freq_report word_freq_addresses word_freq_free
## 1 0.00 0 0 0.32
## 5 0.31 0 0 0.31
## 6 0.00 0 0 0.00
## 7 0.00 0 0 0.96
## 8 0.00 0 0 0.00
## 9 0.00 0 0 0.00
## word_freq_business word_freq_email word_freq_you word_freq_credit
## 1 0 1.29 1.93 0.00
## 5 0 0.00 3.18 0.00
## 6 0 0.00 0.00 0.00
## 7 0 0.32 3.85 0.00
## 8 0 0.00 0.00 0.00
## 9 0 0.15 1.23 3.53
## word_freq_your word_freq_font word_freq_000 word_freq_money word_freq_hp
## 1 0.96 0 0 0.00 0
## 5 0.31 0 0 0.00 0
## 6 0.00 0 0 0.00 0
## 7 0.64 0 0 0.00 0
## 8 0.00 0 0 0.00 0
## 9 2.00 0 0 0.15 0
## word_freq_hpl word_freq_george word_freq_650 word_freq_lab
## 1 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## 7 0 0 0 0
## 8 0 0 0 0
## 9 0 0 0 0
## word_freq_labs word_freq_telnet word_freq_857 word_freq_data
## 1 0 0 0 0.00
## 5 0 0 0 0.00
## 6 0 0 0 0.00
## 7 0 0 0 0.00
## 8 0 0 0 0.00
## 9 0 0 0 0.15
## word_freq_415 word_freq_85 word_freq_technology word_freq_1999
## 1 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## 7 0 0 0 0
## 8 0 0 0 0
## 9 0 0 0 0
## word_freq_parts word_freq_pm word_freq_direct word_freq_cs
## 1 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## 7 0 0 0 0
## 8 0 0 0 0
## 9 0 0 0 0
## word_freq_meeting word_freq_original word_freq_project word_freq_re
## 1 0 0.0 0 0
## 5 0 0.0 0 0
## 6 0 0.0 0 0
## 7 0 0.0 0 0
## 8 0 0.0 0 0
## 9 0 0.3 0 0
## word_freq_edu word_freq_table word_freq_conference char_freq_.
## 1 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## 7 0 0 0 0
## 8 0 0 0 0
## 9 0 0 0 0
## char_freq_..1 char_freq_..2 char_freq_..3 char_freq_..4 char_freq_..5
## 1 0.000 0 0.778 0.000 0.000
## 5 0.135 0 0.135 0.000 0.000
## 6 0.223 0 0.000 0.000 0.000
## 7 0.054 0 0.164 0.054 0.000
## 8 0.206 0 0.000 0.000 0.000
## 9 0.271 0 0.181 0.203 0.022
## capital_run_length_average capital_run_length_longest
## 1 3.756 61
## 5 3.537 40
## 6 3.000 15
## 7 1.671 4
## 8 2.450 11
## 9 9.744 445
## capital_run_length_total status IsSpam
## 1 278 1 Spam
## 5 191 1 Spam
## 6 54 1 Spam
## 7 112 1 Spam
## 8 49 1 Spam
## 9 1257 1 Spam
Training Dataset Summary
lapply(dfrTrnData, FUN=summary)
## $word_freq_make
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.09885 0.00000 4.54000
##
## $word_freq_address
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2027 0.0000 14.2800
##
## $word_freq_all
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.276 0.400 5.100
##
## $word_freq_3d
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.07702 0.00000 42.81000
##
## $word_freq_our
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3066 0.3700 10.0000
##
## $word_freq_over
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.09361 0.00000 3.57000
##
## $word_freq_remove
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1128 0.0000 7.2700
##
## $word_freq_internet
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1111 0.0000 11.1100
##
## $word_freq_order
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.09092 0.00000 5.26000
##
## $word_freq_mail
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2315 0.1400 18.1800
##
## $word_freq_receive
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.06053 0.00000 2.61000
##
## $word_freq_will
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.090 0.533 0.780 7.690
##
## $word_freq_people
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.09264 0.00000 5.55000
##
## $word_freq_report
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.05711 0.00000 10.00000
##
## $word_freq_addresses
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.0495 0.0000 4.4100
##
## $word_freq_free
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2504 0.1000 20.0000
##
## $word_freq_business
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 0.14 0.00 7.14
##
## $word_freq_email
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1867 0.0000 9.0900
##
## $word_freq_you
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 1.310 1.681 2.670 18.750
##
## $word_freq_credit
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.08879 0.00000 18.18000
##
## $word_freq_your
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.2000 0.8052 1.2800 11.1100
##
## $word_freq_font
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1237 0.0000 17.1000
##
## $word_freq_000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1015 0.0000 5.4500
##
## $word_freq_money
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.09806 0.00000 12.50000
##
## $word_freq_hp
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.5447 0.0000 20.8300
##
## $word_freq_hpl
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2772 0.0000 16.6600
##
## $word_freq_george
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.7443 0.0000 33.3300
##
## $word_freq_650
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1324 0.0000 9.0900
##
## $word_freq_lab
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.09591 0.00000 14.28000
##
## $word_freq_labs
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1049 0.0000 5.8800
##
## $word_freq_telnet
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.06735 0.00000 12.50000
##
## $word_freq_857
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.04803 0.00000 4.76000
##
## $word_freq_data
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1003 0.0000 18.1800
##
## $word_freq_415
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.04888 0.00000 4.76000
##
## $word_freq_85
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.111 0.000 20.000
##
## $word_freq_technology
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1027 0.0000 7.6900
##
## $word_freq_1999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1399 0.0000 6.8900
##
## $word_freq_parts
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.01477 0.00000 8.33000
##
## $word_freq_pm
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.0794 0.0000 9.7500
##
## $word_freq_direct
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.06296 0.00000 4.76000
##
## $word_freq_cs
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.04437 0.00000 7.14000
##
## $word_freq_meeting
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1302 0.0000 14.2800
##
## $word_freq_original
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.04856 0.00000 3.57000
##
## $word_freq_project
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.08274 0.00000 20.00000
##
## $word_freq_re
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3028 0.1200 21.4200
##
## $word_freq_edu
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1815 0.0000 22.0500
##
## $word_freq_table
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000000 0.000000 0.000000 0.005186 0.000000 2.170000
##
## $word_freq_conference
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.02867 0.00000 10.00000
##
## $char_freq_.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.0403 0.0000 4.3850
##
## $char_freq_..1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0650 0.1398 0.1890 9.7520
##
## $char_freq_..2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.01525 0.00000 4.08100
##
## $char_freq_..3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2693 0.3110 32.4780
##
## $char_freq_..4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.07428 0.05000 5.30000
##
## $char_freq_..5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.03956 0.00000 13.12900
##
## $capital_run_length_average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.576 2.250 5.180 3.697 1102.500
##
## $capital_run_length_longest
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 6.00 14.00 52.23 43.00 9989.00
##
## $capital_run_length_total
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 35.0 93.0 282.7 266.0 15841.0
##
## $status
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3953 1.0000 1.0000
##
## $IsSpam
## Not Spam Spam
## 2226 1455
Testing Dataset Summary
lapply(dfrTstData, FUN=summary)
## $word_freq_make
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1274 0.0000 4.0000
##
## $word_freq_address
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2542 0.0000 14.2800
##
## $word_freq_all
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2994 0.5000 4.5400
##
## $word_freq_3d
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.01904 0.00000 7.18000
##
## $word_freq_our
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3349 0.4300 7.1400
##
## $word_freq_over
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1051 0.0000 5.8800
##
## $word_freq_remove
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1199 0.0000 7.2700
##
## $word_freq_internet
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.08196 0.00000 4.00000
##
## $word_freq_order
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.08664 0.00000 3.33000
##
## $word_freq_mail
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2709 0.2200 11.1100
##
## $word_freq_receive
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.05701 0.00000 2.00000
##
## $word_freq_will
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.1650 0.5763 0.8625 9.6700
##
## $word_freq_people
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.09909 0.00000 2.94000
##
## $word_freq_report
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.06469 0.00000 5.55000
##
## $word_freq_addresses
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.04803 0.00000 2.31000
##
## $word_freq_free
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2427 0.1100 20.0000
##
## $word_freq_business
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1529 0.0000 4.8700
##
## $word_freq_email
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.177 0.000 4.160
##
## $word_freq_you
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 1.310 1.585 2.500 14.000
##
## $word_freq_credit
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.07271 0.00000 6.25000
##
## $word_freq_your
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.270 0.828 1.260 8.000
##
## $word_freq_font
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1112 0.0000 17.1000
##
## $word_freq_000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1022 0.0000 3.3800
##
## $word_freq_money
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.07909 0.00000 4.41000
##
## $word_freq_hp
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.5688 0.0000 20.0000
##
## $word_freq_hpl
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2181 0.0000 7.6900
##
## $word_freq_george
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.8594 0.0000 33.3300
##
## $word_freq_650
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.09478 0.00000 4.76000
##
## $word_freq_lab
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1109 0.0000 10.0000
##
## $word_freq_labs
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.09458 0.00000 4.76000
##
## $word_freq_telnet
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.05438 0.00000 4.76000
##
## $word_freq_857
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.04312 0.00000 4.76000
##
## $word_freq_data
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.08496 0.00000 8.33000
##
## $word_freq_415
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.04366 0.00000 4.76000
##
## $word_freq_85
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.08305 0.00000 4.76000
##
## $word_freq_technology
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.07655 0.00000 4.76000
##
## $word_freq_1999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.125 0.000 3.700
##
## $word_freq_parts
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000000 0.000000 0.000000 0.006913 0.000000 1.560000
##
## $word_freq_pm
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.07554 0.00000 11.11000
##
## $word_freq_direct
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.07235 0.00000 4.76000
##
## $word_freq_cs
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.04086 0.00000 4.75000
##
## $word_freq_meeting
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1408 0.0000 7.6900
##
## $word_freq_original
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.03627 0.00000 1.69000
##
## $word_freq_project
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.06501 0.00000 4.54000
##
## $word_freq_re
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.295 0.070 16.660
##
## $word_freq_edu
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1731 0.0000 9.0900
##
## $word_freq_table
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000000 0.000000 0.000000 0.006478 0.000000 2.120000
##
## $word_freq_conference
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.04469 0.00000 8.33000
##
## $char_freq_.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.03168 0.00000 3.67200
##
## $char_freq_..1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0675 0.1360 0.1782 4.2710
##
## $char_freq_..2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.0239 0.0000 2.7770
##
## $char_freq_..3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2681 0.3352 5.8280
##
## $char_freq_..4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.08193 0.06125 6.00300
##
## $char_freq_..5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.06296 0.00000 19.82900
##
## $capital_run_length_average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.625 2.333 5.236 3.796 443.666
##
## $capital_run_length_longest
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 16.00 51.94 44.00 1325.00
##
## $capital_run_length_total
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 37.0 101.5 285.5 259.2 9088.0
##
## $status
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3891 1.0000 1.0000
##
## $IsSpam
## Not Spam Spam
## 562 358
myCtrl <- trainControl(method="cv", number=10) # repeats only applies to method="repeatedcv"
Increasing the number of folds or repeats gives a more stable accuracy estimate, but the model takes longer to build. The train() function requires the caret package; method="rf" requires the randomForest package.
m1 <- train(predictor~., data=dataFrame, method="rf",
verbose=FALSE, trControl=myCtrl)
cvCtrl <- trainControl(method="repeatedcv", number=10, repeats=3)
m2 <- train(predictor~., data=dataFrame, method="gbm", verbose=FALSE, trControl=cvCtrl)
oobCtrl <- trainControl(method="oob")
m3 <- train(predictor~., data=dataFrame, method="rf", tuneGrid=data.frame(mtry=10), trControl=oobCtrl)
We could also try one of these if required:
o1 <- train(predictor~., data=dataFrame, method="svmRadial", trControl=cvCtrl)
o2 <- train(predictor~., data=dataFrame, method="knn", trControl=cvCtrl)
o3 <- train(predictor~., data=dataFrame, method="treebag", trControl=cvCtrl)
Create Model - Random Forest (Default)
## set seed
set.seed(707)
# mtry
myMtry=sqrt(ncol(dfrTrnData)-2) # number of predictor columns; minus 2 because status and IsSpam are not predictors
myNtrees=500
# start time
vctProcStrt <- proc.time()
# proc.time() returns the user, system and elapsed times of the R process; element [1] is user CPU time.
# random forest (default)
#mdlRndForDef <- randomForest(IsSpam~.-status, data=dfrTrnData)
mdlRndForDef <- randomForest(IsSpam~.-status, data=dfrTrnData,
mtry=myMtry, ntree=myNtrees)
# end time
vctProcEnds <- proc.time()
# print
print(paste("Model Created ...",vctProcEnds[1] - vctProcStrt[1]))
## [1] "Model Created ... 17.37"
Observations
1. 17.37 seconds (user CPU time) were taken to create the model by the Default method.
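The timing printed above is element [1] of a proc.time() difference, which is user CPU time rather than wall-clock time. The following is a minimal sketch of the difference, using a throwaway computation in place of the model fit; the names tStart, tDiff, userSecs and elapsedSecs are illustrative and do not appear in the original code.

```r
# Time an arbitrary computation; this stands in for the randomForest() call.
tStart <- proc.time()
x <- sum(sqrt(seq_len(1e6)))
tDiff <- proc.time() - tStart

# A proc.time() difference carries named elements.
userSecs    <- tDiff[["user.self"]]  # CPU time spent in R code (what [1] returns)
elapsedSecs <- tDiff[["elapsed"]]    # wall-clock time
print(paste("user:", userSecs, "elapsed:", elapsedSecs))
```

If wall-clock time is what matters for comparing methods, the "elapsed" element is probably the better one to report.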
View Model - Default Random Forest
mdlRndForDef
##
## Call:
## randomForest(formula = IsSpam ~ . - status, data = dfrTrnData, mtry = myMtry, ntree = myNtrees)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 8
##
## OOB estimate of error rate: 4.54%
## Confusion matrix:
## Not Spam Spam class.error
## Not Spam 2168 58 0.02605571
## Spam 109 1346 0.07491409
# Rule of thumb used here: an accuracy above 70% is considered good.
Observations
1. The OOB error rate is only 4.54%, which means accuracy is around 95.46%. This is a very high accuracy.
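The 4.54% figure can be reproduced by hand from the confusion matrix printed above. This is a small sketch in base R using the printed counts; the variable names are illustrative.

```r
# OOB confusion matrix as printed above: rows are actual classes,
# columns are predicted classes (matrix() fills column by column).
cmOob <- matrix(c(2168, 109, 58, 1346), nrow = 2,
                dimnames = list(actual = c("Not Spam", "Spam"),
                                predicted = c("Not Spam", "Spam")))

# Overall OOB error: share of observations off the diagonal.
oobError <- 1 - sum(diag(cmOob)) / sum(cmOob)

# Per-class error: share of each actual class that was misclassified.
classError <- 1 - diag(cmOob) / rowSums(cmOob)

print(round(100 * oobError, 2))
print(classError)
```

The per-class errors show that spam is misclassified more often than non-spam, which the single overall error rate hides.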
View Model Summary - Default Random Forest
summary(mdlRndForDef)
## Length Class Mode
## call 5 -none- call
## type 1 -none- character
## predicted 3681 factor numeric
## err.rate 1500 -none- numeric
## confusion 6 -none- numeric
## votes 7362 matrix numeric
## oob.times 3681 -none- numeric
## classes 2 -none- character
## importance 57 -none- numeric
## importanceSD 0 -none- NULL
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 14 -none- list
## y 3681 factor numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## terms 3 terms call
#mdlRndForDef[3]
# summary() lists the components of the model object and the length of each one.
# To see the value stored in a component, type its name in the console,
# for example mdlRndForDef$ntree
Prediction - Test Data - Random Forest (Default)
vctRndForDef <- predict(mdlRndForDef, newdata=dfrTstData)
cmxRndForDef <- confusionMatrix(vctRndForDef, dfrTstData$IsSpam)
cmxRndForDef
## Confusion Matrix and Statistics
##
## Reference
## Prediction Not Spam Spam
## Not Spam 534 21
## Spam 28 337
##
## Accuracy : 0.9467
## 95% CI : (0.9302, 0.9603)
## No Information Rate : 0.6109
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8884
## Mcnemar's Test P-Value : 0.3914
##
## Sensitivity : 0.9502
## Specificity : 0.9413
## Pos Pred Value : 0.9622
## Neg Pred Value : 0.9233
## Prevalence : 0.6109
## Detection Rate : 0.5804
## Detection Prevalence : 0.6033
## Balanced Accuracy : 0.9458
##
## 'Positive' Class : Not Spam
##
# Look at Accuracy and its 95% CI.
# P-Value [Acc > NIR] < 0.05, so the accuracy is significantly better than the no-information rate.
Observations
1. The P-Value [Acc > NIR] is less than 0.05, so we reject the null hypothesis that the accuracy is no better than the no-information rate (0.6109) at the 5% significance level. The predictors therefore carry real information about the class.
2. Accuracy on the test data is also very high, around 94.67%, and close to the accuracy on the training data, so the model is good.
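The statistics reported by confusionMatrix() follow directly from the four counts in the table. The sketch below recomputes a few of them in base R from the printed counts, treating "Not Spam" as the positive class as the output states; the variable names are illustrative.

```r
# Test-set confusion matrix as printed above: rows are predictions,
# columns are the reference (true) classes.
cmTest <- matrix(c(534, 28, 21, 337), nrow = 2,
                 dimnames = list(prediction = c("Not Spam", "Spam"),
                                 reference = c("Not Spam", "Spam")))

tp <- cmTest["Not Spam", "Not Spam"]  # predicted Not Spam, truly Not Spam
fn <- cmTest["Spam", "Not Spam"]      # predicted Spam, truly Not Spam
fp <- cmTest["Not Spam", "Spam"]      # predicted Not Spam, truly Spam
tn <- cmTest["Spam", "Spam"]          # predicted Spam, truly Spam

accuracy    <- (tp + tn) / sum(cmTest)
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
balancedAcc <- (sensitivity + specificity) / 2

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, balanced = balancedAcc), 4))
```

Because "Not Spam" is the positive class here, specificity is the share of spam that was caught, which is usually the number of interest for a spam filter.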
Create Model - Random Forest (RFM)
## set seed
set.seed(707)
# start time
vctProcStrt <- proc.time()
# random forest via caret train()
myControl <- trainControl(method="cv", number=10, repeats=3) # resampling settings; repeats only takes effect with method="repeatedcv", and number can be more than 10
myMetric <- "Accuracy" # metric used to select the best model
myMtry <- sqrt(ncol(dfrTrnData)-2) # number of variables tried at each split; this can be changed too
myNtrees <- 500 # number of trees; more trees can stabilize accuracy but take longer. Change only one thing at a time
myTuneGrid <- expand.grid(.mtry=myMtry) # train() takes the mtry values to try through tuneGrid
mdlRndForRfm <- train(IsSpam~.-status, data=dfrTrnData, method="rf",
verbose=F, metric=myMetric, trControl=myControl,
tuneGrid=myTuneGrid, ntree=myNtrees)
# end time
vctProcEnds <- proc.time()
# print
print(paste("Model Created ...",vctProcEnds[1] - vctProcStrt[1]))
## [1] "Model Created ... 174.72"
Observations
1. 174.72 seconds (user CPU time) were taken to create the model by the Random Forest (RFM) method, which is much longer than the default method.
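One reason the train() run takes longer is that 10-fold cross-validation fits the model once per fold before fitting the final model. The following is a minimal sketch of how 3681 training rows split into 10 folds, written in base R. It uses a plain random split, whereas caret stratifies the folds by class, so the sizes caret prints can differ by a row or two. The names n, k, foldId, testSizes and trainSizes are illustrative.

```r
set.seed(707)
n <- 3681  # number of training rows, as reported by the model output
k <- 10    # number of folds

# Assign every row to one of k folds at random, keeping fold sizes balanced.
foldId <- sample(rep(seq_len(k), length.out = n))

# Each fold is held out once; the model is trained on the remaining rows.
testSizes  <- table(foldId)
trainSizes <- n - testSizes
print(trainSizes)
```

Each of the ten fits trains on roughly 90% of the rows, so the total cost is about ten times one fit, plus the final model.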
View Model - Random Forest (RFM)
mdlRndForRfm
## Random Forest
##
## 3681 samples
## 58 predictor
## 2 classes: 'Not Spam', 'Spam'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 3312, 3314, 3312, 3313, 3313, 3313, ...
## Resampling results:
##
## Accuracy Kappa
## 0.9521966 0.8995836
##
## Tuning parameter 'mtry' was held constant at a value of 7.549834
Observations
1. Accuracy is around 95.22%, which is very high but slightly lower than the default method.
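The value 7.549834 shown for mtry is the square root of the number of predictors. The direct randomForest() run earlier reports 8 variables tried at each split for the same input, which is consistent with the value being rounded to a whole number. The sketch below shows that arithmetic, assuming 57 predictors as listed in the Data Description; the variable names are illustrative.

```r
# 57 predictor columns remain once status and IsSpam are excluded.
nPredictors <- 57

# Common default for classification: square root of the predictor count.
mtryRaw <- sqrt(nPredictors)

# mtry counts variables, so a whole number is what is effectively used.
mtryRounded <- round(mtryRaw)

print(c(raw = mtryRaw, rounded = mtryRounded))
```

Passing the rounded value explicitly would make the printed tuning parameter match the number of variables actually tried.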
View Model Summary - Random Forest (RFM)
summary(mdlRndForRfm)
## Length Class Mode
## call 6 -none- call
## type 1 -none- character
## predicted 3681 factor numeric
## err.rate 1500 -none- numeric
## confusion 6 -none- numeric
## votes 7362 matrix numeric
## oob.times 3681 -none- numeric
## classes 2 -none- character
## importance 57 -none- numeric
## importanceSD 0 -none- NULL
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 14 -none- list
## y 3681 factor numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## xNames 57 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 2 -none- character
## param 2 -none- list
Prediction - Test Data - Random Forest (RFM)
vctRndForRfm <- predict(mdlRndForRfm, newdata=dfrTstData)
cmxRndForRfm <- confusionMatrix(vctRndForRfm, dfrTstData$IsSpam)
cmxRndForRfm
## Confusion Matrix and Statistics
##
## Reference
## Prediction Not Spam Spam
## Not Spam 534 22
## Spam 28 336
##
## Accuracy : 0.9457
## 95% CI : (0.929, 0.9594)
## No Information Rate : 0.6109
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.886
## Mcnemar's Test P-Value : 0.4795
##
## Sensitivity : 0.9502
## Specificity : 0.9385
## Pos Pred Value : 0.9604
## Neg Pred Value : 0.9231
## Prevalence : 0.6109
## Detection Rate : 0.5804
## Detection Prevalence : 0.6043
## Balanced Accuracy : 0.9444
##
## 'Positive' Class : Not Spam
##
Observations
1. The P-Value [Acc > NIR] is less than 0.05, so we reject the null hypothesis that the accuracy is no better than the no-information rate (0.6109) at the 5% significance level. The predictors therefore carry real information about the class.
2. Accuracy on the test data is also very high, around 94.57%, and close to the accuracy on the training data, so the model is good, but it is slightly less accurate than the default method.
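The Kappa value in the output measures agreement beyond what chance alone would produce. It can be recomputed from the confusion matrix as a check. This is a small sketch in base R using the printed counts; the variable names are illustrative.

```r
# Test-set confusion matrix for the RFM model (rows: prediction, columns: reference).
cmRfm <- matrix(c(534, 28, 22, 336), nrow = 2)
n <- sum(cmRfm)

# Observed agreement: share of cases on the diagonal.
pObserved <- sum(diag(cmRfm)) / n

# Agreement expected by chance, from the row and column totals.
pExpected <- sum(rowSums(cmRfm) * colSums(cmRfm)) / n^2

# Cohen's kappa: agreement achieved beyond chance, rescaled to at most 1.
cohenKappa <- (pObserved - pExpected) / (1 - pExpected)
print(round(cohenKappa, 3))
```

Kappa is lower than accuracy because about 52% agreement would be expected here even from random predictions with the same class proportions.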
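Create Model - Random Forest (GBM)
GBM stands for Gradient Boosting Machine. Like Random Forest it builds a group of decision trees, but it builds them one after another, each tree correcting the errors of the previous ones, whereas Random Forest builds independent trees on bootstrap samples.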
## set seed
set.seed(707)
# start time
vctProcStrt <- proc.time()
# gradient boosting (gbm) via caret train()
myControl <- trainControl(method="repeatedcv", number=10, repeats=3)
myMetric <- "Accuracy"
myMtry <- sqrt(ncol(dfrTrnData)-2)
myNtrees <- 500
myTuneGrid <- expand.grid(.mtry=myMtry)
mdlRndForGbm <- train(IsSpam~.-status, data=dfrTrnData, method="gbm",
verbose=F, metric=myMetric, trControl=myControl)
#ntree=myNtrees)
# tuneGrid=myTuneGrid)
# end time
vctProcEnds <- proc.time()
# print
print(paste("Model Created ...",vctProcEnds[1] - vctProcStrt[1]))
## [1] "Model Created ... 154.41"
Observations
1. 154.41 seconds (user CPU time) were taken to create the model by the GBM method, which is much longer than the default method.
View Model - Random Forest (GBM)
mdlRndForGbm
## Stochastic Gradient Boosting
##
## 3681 samples
## 58 predictor
## 2 classes: 'Not Spam', 'Spam'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 3312, 3314, 3312, 3313, 3313, 3313, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 1 50 0.9125264 0.8130164
## 1 100 0.9307222 0.8532164
## 1 150 0.9367051 0.8663083
## 2 50 0.9312693 0.8546294
## 2 100 0.9405110 0.8744321
## 2 150 0.9432274 0.8803238
## 3 50 0.9363428 0.8654668
## 3 100 0.9449460 0.8839671
## 3 150 0.9471209 0.8886790
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
Observations
1. Accuracy of the selected model is around 94.71%, which is very high but slightly lower than the default method.
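The nine rows in the resampling results correspond to a 3 x 3 grid of interaction.depth and n.trees, with shrinkage and n.minobsinnode held constant. The sketch below builds an equivalent grid in base R and picks the best row from the accuracies printed above. The names gbmGrid, results and best are illustrative; a grid like this could be passed to train() through tuneGrid, although the run above relied on caret's default grid.

```r
# The same combinations caret tried by default in the run above.
gbmGrid <- expand.grid(interaction.depth = 1:3,
                       n.trees = c(50, 100, 150),
                       shrinkage = 0.1,
                       n.minobsinnode = 10)

# Accuracies copied from the printed resampling results, in printed order.
results <- data.frame(
  interaction.depth = rep(1:3, each = 3),
  n.trees = rep(c(50, 100, 150), times = 3),
  Accuracy = c(0.9125264, 0.9307222, 0.9367051,
               0.9312693, 0.9405110, 0.9432274,
               0.9363428, 0.9449460, 0.9471209))

# "Accuracy was used to select the optimal model using the largest value."
best <- results[which.max(results$Accuracy), ]
print(best)
```

The best row sits at the edge of the grid (deepest trees, most trees), which suggests a wider grid might still improve the result.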
View Model Summary - Random Forest (GBM)
summary(mdlRndForGbm)
## var rel.inf
## char_freq_..3 char_freq_..3 21.71328125
## char_freq_..4 char_freq_..4 19.17273719
## word_freq_remove word_freq_remove 11.66807935
## word_freq_hp word_freq_hp 9.17395240
## word_freq_free word_freq_free 7.94587755
## capital_run_length_longest capital_run_length_longest 5.19101659
## capital_run_length_average capital_run_length_average 4.55363849
## word_freq_your word_freq_your 3.03793007
## word_freq_our word_freq_our 2.49322457
## word_freq_money word_freq_money 2.43499996
## word_freq_george word_freq_george 2.39065969
## word_freq_edu word_freq_edu 1.74627237
## word_freq_1999 word_freq_1999 1.28704265
## capital_run_length_total capital_run_length_total 1.13584861
## word_freq_meeting word_freq_meeting 0.61307150
## word_freq_you word_freq_you 0.59182671
## word_freq_000 word_freq_000 0.56908488
## word_freq_receive word_freq_receive 0.51898504
## word_freq_re word_freq_re 0.44280485
## word_freq_650 word_freq_650 0.40545253
## char_freq_. char_freq_. 0.40499187
## word_freq_business word_freq_business 0.35117130
## word_freq_hpl word_freq_hpl 0.33315065
## word_freq_will word_freq_will 0.31637881
## word_freq_internet word_freq_internet 0.29897659
## word_freq_email word_freq_email 0.23535307
## word_freq_technology word_freq_technology 0.18272811
## char_freq_..1 char_freq_..1 0.15921637
## word_freq_over word_freq_over 0.13902156
## word_freq_font word_freq_font 0.12361794
## word_freq_mail word_freq_mail 0.05960259
## word_freq_report word_freq_report 0.05752083
## word_freq_pm word_freq_pm 0.05072411
## word_freq_project word_freq_project 0.03704862
## word_freq_credit word_freq_credit 0.03502091
## word_freq_order word_freq_order 0.03055389
## word_freq_conference word_freq_conference 0.02965122
## word_freq_3d word_freq_3d 0.02071226
## word_freq_original word_freq_original 0.01781143
## word_freq_address word_freq_address 0.01701609
## word_freq_make word_freq_make 0.01394553
## word_freq_all word_freq_all 0.00000000
## word_freq_people word_freq_people 0.00000000
## word_freq_addresses word_freq_addresses 0.00000000
## word_freq_lab word_freq_lab 0.00000000
## word_freq_labs word_freq_labs 0.00000000
## word_freq_telnet word_freq_telnet 0.00000000
## word_freq_857 word_freq_857 0.00000000
## word_freq_data word_freq_data 0.00000000
## word_freq_415 word_freq_415 0.00000000
## word_freq_85 word_freq_85 0.00000000
## word_freq_parts word_freq_parts 0.00000000
## word_freq_direct word_freq_direct 0.00000000
## word_freq_cs word_freq_cs 0.00000000
## word_freq_table word_freq_table 0.00000000
## char_freq_..2 char_freq_..2 0.00000000
## char_freq_..5 char_freq_..5 0.00000000
Prediction - Test Data - Random Forest (GBM)
vctRndForGbm <- predict(mdlRndForGbm, newdata=dfrTstData)
cmxRndForGbm <- confusionMatrix(vctRndForGbm, dfrTstData$IsSpam)
cmxRndForGbm
## Confusion Matrix and Statistics
##
## Reference
## Prediction Not Spam Spam
## Not Spam 532 19
## Spam 30 339
##
## Accuracy : 0.9467
## 95% CI : (0.9302, 0.9603)
## No Information Rate : 0.6109
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8886
## Mcnemar's Test P-Value : 0.1531
##
## Sensitivity : 0.9466
## Specificity : 0.9469
## Pos Pred Value : 0.9655
## Neg Pred Value : 0.9187
## Prevalence : 0.6109
## Detection Rate : 0.5783
## Detection Prevalence : 0.5989
## Balanced Accuracy : 0.9468
##
## 'Positive' Class : Not Spam
##
Observations
1. The P-Value [Acc > NIR] is less than 0.05, so we reject the null hypothesis that the accuracy is no better than the no-information rate (0.6109) at the 5% significance level. The predictors therefore carry real information about the class.
2. Accuracy on the test data is also very high, around 94.67%, and close to the accuracy on the training data, so the model is good.
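The P-Value [Acc > NIR] compares the observed accuracy with the no-information rate, that is, the accuracy obtained by always predicting the larger class. The following sketch reproduces the idea with a one-sided exact binomial test in base R, using the counts printed above. That caret computes its p-value and confidence interval in exactly this way is an assumption here, and the variable names are illustrative.

```r
# Counts from the GBM test-set confusion matrix printed above.
correct <- 532 + 339            # diagonal: correctly classified e-mails
total   <- 532 + 19 + 30 + 339  # all test e-mails

# No-information rate: share of the larger reference class ("Not Spam").
nir <- (532 + 30) / total

# H0: accuracy <= NIR, tested against H1: accuracy > NIR.
accTest <- binom.test(correct, total, p = nir, alternative = "greater")
print(accTest$p.value)

# Two-sided exact confidence interval for the accuracy itself.
accCi <- binom.test(correct, total)$conf.int
print(accCi)
```

A small p-value here says only that the model beats the majority-class baseline; it says nothing about individual predictors.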
Create Model - Random Forest (OOB)
## set seed
set.seed(707)
# start time
vctProcStrt <- proc.time()
# random forest via caret train() with out-of-bag (OOB) error estimation
myControl <- trainControl(method="oob", number=10, repeats=3) # number and repeats are not used by the OOB method
myMetric <- "Accuracy"
myMtry <- 10 #sqrt(ncol(dfrTrnData)-2)
myNtrees <- 500
myTuneGrid <- expand.grid(.mtry=myMtry)
# method is not given, so train() uses its default, "rf"
mdlRndForOob <- train(IsSpam~.-status, data=dfrTrnData,
verbose=F, metric=myMetric, trControl=myControl,
tuneGrid=myTuneGrid, ntree=myNtrees)
# end time
vctProcEnds <- proc.time()
# print
print(paste("Model Created ...",vctProcEnds[1] - vctProcStrt[1]))
## [1] "Model Created ... 33.93"
Observations
1. 33.93 seconds (user CPU time) were taken to create the model by the Random Forest (OOB) method, which is longer than the default method.
View Model - Random Forest (OOB)
mdlRndForOob
## Random Forest
##
## 3681 samples
## 58 predictor
## 2 classes: 'Not Spam', 'Spam'
##
## No pre-processing
## Resampling results:
##
## Accuracy Kappa
## 0.9527302 0.9007209
##
## Tuning parameter 'mtry' was held constant at a value of 10
Observations
1. Accuracy is around 95.27%, which is very high but slightly lower than the default method.
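The OOB estimate needs no separate resampling because each tree is grown on a bootstrap sample that leaves out roughly a third of the rows, and those left-out rows act as a test set for that tree. The sketch below checks that fraction in base R for 3681 training rows; the variable names are illustrative.

```r
set.seed(707)
n <- 3681  # number of training rows

# Probability that a given row is never drawn in n draws with replacement.
pOobTheory <- (1 - 1 / n)^n

# One simulated bootstrap sample: which rows were left out?
bootIdx <- sample(n, n, replace = TRUE)
pOobSim <- mean(!(seq_len(n) %in% bootIdx))

print(c(theory = pOobTheory, simulated = pOobSim))
```

Both values should be close to exp(-1), about 0.37, which is why an OOB estimate behaves much like a held-out test estimate.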
View Model Summary - Random Forest (OOB)
summary(mdlRndForOob)
## Length Class Mode
## call 6 -none- call
## type 1 -none- character
## predicted 3681 factor numeric
## err.rate 1500 -none- numeric
## confusion 6 -none- numeric
## votes 7362 matrix numeric
## oob.times 3681 -none- numeric
## classes 2 -none- character
## importance 57 -none- numeric
## importanceSD 0 -none- NULL
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 14 -none- list
## y 3681 factor numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## xNames 57 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 2 -none- character
## param 2 -none- list
Prediction - Test Data - Random Forest (OOB)
vctRndForOob <- predict(mdlRndForOob, newdata=dfrTstData)
cmxRndForOob <- confusionMatrix(vctRndForOob, dfrTstData$IsSpam)
cmxRndForOob
## Confusion Matrix and Statistics
##
## Reference
## Prediction Not Spam Spam
## Not Spam 533 22
## Spam 29 336
##
## Accuracy : 0.9446
## 95% CI : (0.9278, 0.9585)
## No Information Rate : 0.6109
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8838
## Mcnemar's Test P-Value : 0.4008
##
## Sensitivity : 0.9484
## Specificity : 0.9385
## Pos Pred Value : 0.9604
## Neg Pred Value : 0.9205
## Prevalence : 0.6109
## Detection Rate : 0.5793
## Detection Prevalence : 0.6033
## Balanced Accuracy : 0.9435
##
## 'Positive' Class : Not Spam
##
Observations
1. The P-Value [Acc > NIR] is less than 0.05, so we reject the null hypothesis that the accuracy is no better than the no-information rate (0.6109) at the 5% significance level. The predictors therefore carry real information about the class.
2. Accuracy on the test data is also very high, around 94.46%, and close to the accuracy on the training data, so the model is good, but it is slightly less accurate than the default method.
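The Mcnemar's Test P-Value in the output checks whether the two kinds of error (spam predicted as non-spam, and the reverse) occur at different rates. This is a minimal sketch of that check in base R using the counts printed above; the variable names are illustrative.

```r
# Test-set confusion matrix for the OOB model (rows: prediction, columns: reference).
cmOobTest <- matrix(c(533, 29, 22, 336), nrow = 2)

# McNemar's test uses only the off-diagonal counts (22 and 29 here);
# the continuity correction is applied by default.
mcn <- mcnemar.test(cmOobTest)
print(mcn$p.value)

# The same statistic by hand, with the continuity correction.
statByHand <- (abs(22 - 29) - 1)^2 / (22 + 29)
pByHand <- pchisq(statByHand, df = 1, lower.tail = FALSE)
print(pByHand)
```

A large p-value, as here, gives no evidence that the model favours one type of error over the other.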
Reasons Why Outlier Handling Is Not Required
- The total number of data records is 4601. After removing the outliers of each column, only 177 records remain, which is not acceptable because most of the records would no longer be part of the analysis.
- If outliers are replaced with the mean value, the same number of outliers remains in the data, because the median is zero for most of the columns.
- If outliers are replaced with the median, many columns will contain nothing but zeroes, because the median is zero for most of the columns. After that replacement we would not be able to find any correlation between variables.
- Random Forest is insensitive to outliers.
- After creating the models with different methods, we find that the accuracy is great and the outliers are not causing any problem in prediction.
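The points above about zero medians can be illustrated on a toy column. The document does not state which outlier rule was used, so the sketch below assumes the common 1.5 x IQR rule; the data and variable names are made up for illustration.

```r
# A toy column shaped like most word_freq_* columns: mostly zeroes.
x <- c(rep(0, 90), seq(0.1, 1, length.out = 10))

# With more than 75% zeroes, both quartiles are 0, so the IQR is 0.
q <- quantile(x, c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]

# 1.5 x IQR rule (an assumption): every non-zero value is flagged.
isOutlier <- x < q[[1]] - 1.5 * iqr | x > q[[2]] + 1.5 * iqr
print(sum(isOutlier))

# Replacing the flagged values with the median wipes out the column.
xMedian <- x
xMedian[isOutlier] <- median(x)
print(unique(xMedian))
```

Under this rule every informative (non-zero) value counts as an outlier, so removing or replacing outliers would discard exactly the signal the model needs.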
| Method | Time Taken (user CPU sec) | Train Data Accuracy | P-value | Test Data Accuracy |
|---|---|---|---|---|
| Default Method | 17.37 | 95.46% | <0.05 | 94.67% |
| Random Forest | 174.72 | 95.22% | <0.05 | 94.57% |
| GBM | 154.41 | 94.71% | <0.05 | 94.67% |
| OOB | 33.93 | 95.27% | <0.05 | 94.46% |