Start with a general question
Make it concrete
The data set may depend on your goal or a Question
Spam E-mail Database which we can get from the Kern Lab package in R.http://search.r-project.org/library/kernlab/html/spam.html
# If it isn't installed, install the kernlab package with install.packages()
install.packages("kernlab")
library(kernlab)
data(spam)
str(spam[,1:5])
## 'data.frame': 4601 obs. of 5 variables:
## $ make : num 0 0.21 0.06 0 0 0 0 0 0.15 0.06 ...
## $ address: num 0.64 0.28 0 0 0 0 0 0 0 0.12 ...
## $ all : num 0.64 0.5 0.71 0 0 0 0 0 0.46 0.77 ...
## $ num3d : num 0 0 0 0 0 0 0 0 0 0 ...
## $ our : num 0.32 0.14 1.23 0.63 0.63 1.85 1.92 1.88 0.61 0.19 ...
Once we’ve cleaned the data and we’ve gotten a basic look at it, it’s important to determine of the data are good enough to solve your problems because in some cases they may not be good enough.
If you determined the data are not good enough for your question, then
It’s important to not to just push on with the data you have just because that’s all you’ve got, because that can lead inappropriate inferences or conclusions.
http://search.r-project.org/library/kernlab/html/spam.html
library(kernlab)
data(spam)
str(spam[,1:5])
## 'data.frame': 4601 obs. of 5 variables:
## $ make : num 0 0.21 0.06 0 0 0 0 0 0.15 0.06 ...
## $ address: num 0.64 0.28 0 0 0 0 0 0 0 0.12 ...
## $ all : num 0.64 0.5 0.71 0 0 0 0 0 0.46 0.77 ...
## $ num3d : num 0 0 0 0 0 0 0 0 0 0 ...
## $ our : num 0.32 0.14 1.23 0.63 0.63 1.85 1.92 1.88 0.61 0.19 ...
dim(spam)
## [1] 4601 58
# Perform the subsampling
set.seed(3435)
trainIndicator = rbinom(n = 4601,size = 1, prob = 0.5)
table(trainIndicator)
## trainIndicator
## 0 1
## 2314 2287
So we can see that 2314 are going to be one half and 2287 will be in the other half.
And so the training set will be one set and the test set will be another set of data.
trainSpam <- spam[trainIndicator == 1, ]
testSpam <- spam[trainIndicator == 0, ]
# Checking first 5 Variables
str(trainSpam[,1:5])
## 'data.frame': 2287 obs. of 5 variables:
## $ make : num 0 0 0.15 0 0 0 0 0 0 0 ...
## $ address: num 0.64 0 0 0 0 0.42 0 0 0 0 ...
## $ all : num 0.64 0 0.46 0.25 0 0.42 0 0.55 0 0 ...
## $ num3d : num 0 0 0 0 0 0 0 0 0 0 ...
## $ our : num 0.32 1.92 0.61 0.38 0.9 1.27 0 1.11 2.94 1.16 ...
str(testSpam[,1:5])
## 'data.frame': 2314 obs. of 5 variables:
## $ make : num 0.21 0.06 0 0 0 0 0.06 0 0 0 ...
## $ address: num 0.28 0 0 0 0 0 0.12 0 0.69 0 ...
## $ all : num 0.5 0.71 0 0 0 0 0.77 0 0.34 1.42 ...
## $ num3d : num 0 0 0 0 0 0 0 0 0 0 ...
## $ our : num 0.14 1.23 0.63 0.63 1.85 1.88 0.19 0 0.34 0.71 ...
names(trainSpam)
## [1] "make" "address" "all"
## [4] "num3d" "our" "over"
## [7] "remove" "internet" "order"
## [10] "mail" "receive" "will"
## [13] "people" "report" "addresses"
## [16] "free" "business" "email"
## [19] "you" "credit" "your"
## [22] "font" "num000" "money"
## [25] "hp" "hpl" "george"
## [28] "num650" "lab" "labs"
## [31] "telnet" "num857" "data"
## [34] "num415" "num85" "technology"
## [37] "num1999" "parts" "pm"
## [40] "direct" "cs" "meeting"
## [43] "original" "project" "re"
## [46] "edu" "table" "conference"
## [49] "charSemicolon" "charRoundbracket" "charSquarebracket"
## [52] "charExclamation" "charDollar" "charHash"
## [55] "capitalAve" "capitalLong" "capitalTotal"
## [58] "type"
head(trainSpam)
## make address all num3d our over remove internet order mail receive will
## 1 0.00 0.64 0.64 0 0.32 0.00 0.00 0 0.00 0.00 0.00 0.64
## 7 0.00 0.00 0.00 0 1.92 0.00 0.00 0 0.00 0.64 0.96 1.28
## 9 0.15 0.00 0.46 0 0.61 0.00 0.30 0 0.92 0.76 0.76 0.92
## 12 0.00 0.00 0.25 0 0.38 0.25 0.25 0 0.00 0.00 0.12 0.12
## 14 0.00 0.00 0.00 0 0.90 0.00 0.90 0 0.00 0.90 0.90 0.00
## 16 0.00 0.42 0.42 0 1.27 0.00 0.42 0 0.00 1.27 0.00 0.00
## people report addresses free business email you credit your font num000
## 1 0.00 0 0 0.32 0 1.29 1.93 0.00 0.96 0 0
## 7 0.00 0 0 0.96 0 0.32 3.85 0.00 0.64 0 0
## 9 0.00 0 0 0.00 0 0.15 1.23 3.53 2.00 0 0
## 12 0.12 0 0 0.00 0 0.00 1.16 0.00 0.77 0 0
## 14 0.90 0 0 0.00 0 0.00 2.72 0.00 0.90 0 0
## 16 0.00 0 0 1.27 0 0.00 1.70 0.42 1.27 0 0
## money hp hpl george num650 lab labs telnet num857 data num415 num85
## 1 0.00 0 0 0 0 0 0 0 0 0.00 0 0
## 7 0.00 0 0 0 0 0 0 0 0 0.00 0 0
## 9 0.15 0 0 0 0 0 0 0 0 0.15 0 0
## 12 0.00 0 0 0 0 0 0 0 0 0.00 0 0
## 14 0.00 0 0 0 0 0 0 0 0 0.00 0 0
## 16 0.42 0 0 0 0 0 0 0 0 0.00 0 0
## technology num1999 parts pm direct cs meeting original project re edu table
## 1 0 0.00 0 0 0.00 0 0 0.0 0 0 0 0
## 7 0 0.00 0 0 0.00 0 0 0.0 0 0 0 0
## 9 0 0.00 0 0 0.00 0 0 0.3 0 0 0 0
## 12 0 0.00 0 0 0.00 0 0 0.0 0 0 0 0
## 14 0 0.00 0 0 0.00 0 0 0.0 0 0 0 0
## 16 0 1.27 0 0 0.42 0 0 0.0 0 0 0 0
## conference charSemicolon charRoundbracket charSquarebracket charExclamation
## 1 0 0.000 0.000 0 0.778
## 7 0 0.000 0.054 0 0.164
## 9 0 0.000 0.271 0 0.181
## 12 0 0.022 0.044 0 0.663
## 14 0 0.000 0.000 0 0.000
## 16 0 0.000 0.063 0 0.572
## charDollar charHash capitalAve capitalLong capitalTotal type
## 1 0.000 0.000 3.756 61 278 spam
## 7 0.054 0.000 1.671 4 112 spam
## 9 0.203 0.022 9.744 445 1257 spam
## 12 0.000 0.000 1.243 11 184 spam
## 14 0.000 0.000 2.083 7 25 spam
## 16 0.063 0.000 5.659 55 249 spam
table(trainSpam$type)
##
## nonspam spam
## 1381 906
plot(trainSpam$capitalAve ~ trainSpam$type)
plot(log10(trainSpam$capitalAve + 1) ~ trainSpam$type)
plot(log10(trainSpam[ ,1:4] +1))
We’ve got a pairs plot of a few of the variables
As this is the log transformation of each of the variables we can see that
hCluster <- hclust(dist(t(trainSpam[,1:55])))
plot(hCluster)
hCluster <- hclust(dist(t(log10(trainSpam[,1:55] + 1))))
plot(hCluster)
So here using the reformulate function to create a formula that includes
We’re just going to cycle through all the variables in this data set using this for-loop to build a logistic regression model.
Then subsequently calculate the cross validated error rate of predicting spam emails from a single variable.
Once we’ve done this, we’re going to try to figure out, which of the individual variables has the minimum cross validated error rate.
So we can take this vector of CV error, and just figure out which one is the minimum.
trainSpam$numType <- as.numeric(trainSpam$type) - 1
trainSpam$type[901:910]
## [1] spam spam spam spam spam spam nonspam nonspam nonspam
## [10] nonspam
## Levels: nonspam spam
trainSpam$numType[901:910]
## [1] 1 1 1 1 1 1 0 0 0 0
######################################################
costFunction <- function(x,y)
{
sum(x != (y > 0.5))
}
cvError <- rep(NA, 55)
#######################################################
library(boot)
for(i in 1:55)
{
lmFormula <- reformulate(names(trainSpam)[i], response = "numType")
glmFit <- glm(lmFormula, family = "binomial", data = trainSpam)
cvError[i] <- cv.glm(trainSpam, glmFit, costFunction, 2)$delta[2]
}
#######################################################
## Which predictor has minimum cross-validated error?
names(trainSpam)[which.min(cvError)]
## [1] "charDollar"
charDollar.### Use the best model from the group
predictionModel = glm(numType ~ charDollar, family = "binomial", data = trainSpam)
### Get predictions on the test set
predictionTest = predict(predictionModel, testSpam)
predictedSpam = rep("nonspam", dim(testSpam)[1])
## Classify as 'spam' for those with prob > 0.5
predictedSpam[predictionModel$fitted > 0.5] = "spam"
## Classification Table
table(predictedSpam,testSpam$type)
##
## predictedSpam nonspam spam
## nonspam 1346 458
## spam 61 449
### Error Rate
(61 + 458)/(1346 + 458 + 61 + 449)
## [1] 0.2242869
$$$$ !”R Markdown and knitr and R Studio to document your analyses as you do them.