packages = c(
"dplyr","ggplot2","caTools","tm","SnowballC","ROCR","rpart",
"rpart.plot","randomForest")
existing = as.character(installed.packages()[,1])
for(pkg in packages[!(packages %in% existing)]) install.packages(pkg)
rm(list=ls(all=TRUE))
Sys.setlocale("LC_ALL","C")
[1] "C"
options(digits=5, scipen=10)
library(dplyr)
library(tm)
library(SnowballC)
library(ROCR)
library(caTools)
library(rpart)
library(rpart.plot)
library(randomForest)
Problem 1 - Exploration
D = read.csv("data/emails.csv", stringsAsFactors = FALSE)
1.1 How many emails are in the dataset?
nrow(D)
[1] 5728
1.2 How many of the emails are spam?
table(D$spam)
0 1
4360 1368
1.3 Which word appears at the beginning of every email in the dataset?
substr(D$text[1:5], 1, 60)
[1] "Subject: naturally irresistible your corporate identity lt "
[2] "Subject: the stock trading gunslinger fanny is merrill but "
[3] "Subject: unbelievable new homes made easy im wanting to sho"
[4] "Subject: 4 color printing special request additional inform"
[5] "Subject: do not have money , get software cds from here ! s"
1.4 Words in every document
【P1.4】Could a spam classifier potentially benefit from including the frequency of the word that appears in every email?
- Yes – the number of times the word appears might help us differentiate spam from ham
1.5 How many characters are in the longest email?
nchar(D$text) %>% max
[1] 43952
1.6 Which row contains the shortest email in the dataset?
nchar(D$text) %>% which.min
[1] 1992
Problem 2 - Preparing the Corpus
2.1 Corpus and DTM
corp = Corpus(VectorSource(D$text))
corp = tm_map(corp, content_transformer(tolower))
transformation drops documents
corp = tm_map(corp, removePunctuation)
transformation drops documents
corp = tm_map(corp, removeWords, stopwords("english"))
transformation drops documents
corp = tm_map(corp, stemDocument)
transformation drops documents
dtm = DocumentTermMatrix(corp)
【P2.1】How many terms are in dtm?
dtm
<<DocumentTermMatrix (documents: 5728, terms: 28687)>>
Non-/sparse entries: 481719/163837417
Sparsity : 100%
Maximal term length: 24
Weighting : term frequency (tf)
2.2 Remove less frequent words
Limit dtm to terms that appear in at least 5% of the documents
spdtm = removeSparseTerms(dtm, 0.95)
【P2.2】How many terms are in spdtm?
spdtm
<<DocumentTermMatrix (documents: 5728, terms: 330)>>
Non-/sparse entries: 213551/1676689
Sparsity : 89%
Maximal term length: 10
Weighting : term frequency (tf)
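The 0.95 threshold can be read as a document-frequency filter. Below is a minimal base-R sketch of that idea, assuming a plain count matrix in place of a tm DocumentTermMatrix. The toy matrix and the helper name `keep_frequent_terms` are illustrative and not part of tm.

```r
# Toy document-term counts: 50 documents (rows) x 3 terms (columns).
counts <- cbind(
  common = rep(1, 50),                 # present in all 50 documents
  rare   = c(1, rep(0, 49)),           # present in 1 of 50 documents (2%)
  medium = c(rep(2, 10), rep(0, 40))   # present in 10 of 50 documents (20%)
)

# Keep terms whose share of documents containing them exceeds 1 - sparse.
# (Hypothetical helper that mirrors the idea behind removeSparseTerms().)
keep_frequent_terms <- function(m, sparse = 0.95) {
  doc_share <- colMeans(m > 0)
  m[, doc_share > 1 - sparse, drop = FALSE]
}

kept <- keep_frequent_terms(counts, 0.95)
colnames(kept)
```

Here `rare` is dropped because it occurs in only 2% of the toy documents, while the other two terms survive.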
2.3 Build data frame
Build a data frame ems from spdtm
ems = as.data.frame(as.matrix(spdtm))
【P2.3】What is the most frequent word in spdtm?
colSums(ems) %>% sort %>% tail
hou will vinc subject ect enron
5577 8252 8532 10202 11427 13388
2.4 Most frequent words in HAM emails
Incorporate target variable spam
ems$spam = D$spam
【P2.4】How many word stems appear at least 5000 times in the ham emails in the dataset?
subset(ems, spam==0) %>% colSums %>% sort %>% tail(10)
com pleas kaminski 2000 hou will vinc subject
4444 4494 4801 4935 5569 6802 8531 8625
ect enron
11417 13388
2.5 Most frequent words in SPAM emails
【P2.5】How many word stems appear at least 1000 times in the spam emails in the dataset?
subset(ems, spam==1) %>% colSums %>% {.[. > 1000]}
compani subject will spam
1065 1577 1450 1368
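Note that `spam` in the output above is the label column added in 2.4 (its sum, 1368, is simply the number of spam emails), not a word stem. A small sketch of dropping the label before counting, using a toy data frame with made-up counts:

```r
# Toy stand-in for ems: two stem columns plus the 0/1 label (made-up values).
toy <- data.frame(
  offer = c(3, 0, 5, 0),
  meet  = c(0, 2, 0, 4),
  spam  = c(1, 0, 1, 0)
)

# Column sums over the spam rows, with the label column removed first.
spam_rows <- toy[toy$spam == 1, setdiff(names(toy), "spam"), drop = FALSE]
stem_sums <- colSums(spam_rows)

# Stems whose total count in spam exceeds a cutoff (5 here; 1000 in the text).
frequent_stems <- names(stem_sums)[stem_sums > 5]
frequent_stems
```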
2.6 Observation 1
【P2.6】The lists of most common words are significantly different between the spam and ham emails. What does this likely imply?
- The frequencies of these most common words are likely to help differentiate between spam and ham
2.7 Observation 2
【P2.7】Several of the most common word stems from the ham documents, such as “enron”, “hou” (short for Houston), “vinc” (the word stem of “Vince”) and “kaminski”, are likely specific to Vincent Kaminski’s inbox. What does this mean about the applicability of the text analytics models we will train for the spam filtering problem?
- The models we build are personalized, and would need to be further tested before being used as a spam filter for another person
Problem 3 - Building machine learning models
Split the data and build GLM, CART and random forest models …
ems$spam = factor(ems$spam)
names(ems) = make.names(names(ems))
set.seed(123); spl = sample.split(ems$spam, 0.7)
train = subset(ems, spl == TRUE)
test = subset(ems, spl == FALSE)
table(test$spam) %>% prop.table # baseline accuracy (always predict ham): 0.76135
0 1
0.76135 0.23865
m.glm = glm(spam ~ ., train, family = 'binomial')
glm.fit: algorithm did not converge
glm.fit: fitted probabilities numerically 0 or 1 occurred
m.cart = rpart(spam ~ ., train, method="class")
set.seed(123); m.rf = randomForest(spam ~ ., train)
3.1 Prediction of Logistic Model
【P3.1a】 How many of the training set predicted probabilities from the logistic model (m.glm here, spamLog in the assignment) are less than 0.00001?
【P3.1b】 How many of the training set predicted probabilities from the logistic model are more than 0.99999?
【P3.1c】 How many of the training set predicted probabilities from the logistic model are between 0.00001 and 0.99999?
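One way to tally the three ranges is sketched below. The helper name `count_prob_bins` and the demo probabilities are made up for illustration. With the model fitted above, the input would presumably be `predict(m.glm, type = "response")`.

```r
# Hypothetical helper: count probabilities in the three ranges asked about.
count_prob_bins <- function(p, eps = 1e-5) {
  c(low  = sum(p < eps),
    high = sum(p > 1 - eps),
    mid  = sum(p >= eps & p <= 1 - eps))
}

# Made-up probabilities, only to show the shape of the result.
p_demo <- c(0, 1e-6, 0.5, 0.999999, 1)
bins <- count_prob_bins(p_demo)
bins

# With the fitted model from this notebook, the call would look like:
# count_prob_bins(predict(m.glm, type = "response"))
```

A large share of probabilities in the `low` and `high` bins would be consistent with the "fitted probabilities numerically 0 or 1" warning from `glm`.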
3.2 Significant predictors in the GLM model
【P3.2】How many variables are labeled as significant (at the p=0.05 level) in the logistic regression summary output?
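The count can be read off the coefficient table rather than by eye. A minimal sketch on a built-in dataset (not the email data); the helper name `count_significant` is hypothetical:

```r
# Toy logistic regression on mtcars, standing in for the email model.
m_demo <- glm(am ~ wt + hp, data = mtcars, family = "binomial")

# Hypothetical helper: number of non-intercept coefficients with p < alpha.
count_significant <- function(model, alpha = 0.05) {
  p <- coef(summary(model))[, "Pr(>|z|)"]
  sum(p[names(p) != "(Intercept)"] < alpha, na.rm = TRUE)
}

n_sig <- count_significant(m_demo)
n_sig

# For the model in this notebook: count_significant(m.glm)
```

Since `m.glm` did not converge, its standard errors and p-values should be read with caution.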
3.3 Words in the Decision Tree
【P3.3】How many of the word stems “enron”, “hou”, “vinc”, and “kaminski” appear in the CART tree?
Recall that we suspect these word stems are specific to Vincent Kaminski and might affect the generalizability of a spam filter built with his ham data.
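Besides plotting the tree with `prp()`, the split variables can be listed from the fitted object. A sketch on a built-in dataset (not the email data); `vars_in_tree` is a hypothetical helper:

```r
library(rpart)

# Toy classification tree on iris, standing in for m.cart.
tree_demo <- rpart(Species ~ ., data = iris, method = "class")

# rpart keeps one row per node in $frame; leaf nodes are marked "<leaf>".
# Return the candidate names that are actually used as split variables.
vars_in_tree <- function(tree, candidates) {
  used <- setdiff(unique(as.character(tree$frame$var)), "<leaf>")
  intersect(candidates, used)
}

found <- vars_in_tree(tree_demo, c("Petal.Length", "Sepal.Width"))
found

# For the tree in this notebook:
# vars_in_tree(m.cart, c("enron", "hou", "vinc", "kaminski"))
```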
3.4 What is the training accuracy of the GLM model?
3.5 What is the training AUC of the GLM model?
3.6 What is the training accuracy of the CART model?
3.7 What is the training AUC of the CART model?
3.8 What is the training accuracy of the RF model?
3.9 What is the training AUC of the RF model?
3.10 Which model had the best training set performance, in terms of accuracy & AUC?
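Accuracy and AUC for questions 3.4 to 3.9 all follow the same pattern once each model's predicted spam probabilities are available. The sketch below uses two hypothetical helpers and made-up scores. The notebook loads ROCR, where `performance(prediction(prob, actual), "auc")` is the usual route; the base-R rank formula here is a swapped-in equivalent so the example has no dependencies.

```r
# Accuracy at a probability threshold (labels coded 0/1).
accuracy <- function(prob, actual, threshold = 0.5) {
  mean((prob > threshold) == (actual == 1))
}

# AUC via the rank-sum (Mann-Whitney) identity; ties get average ranks.
auc <- function(prob, actual) {
  r  <- rank(prob)
  n1 <- sum(actual == 1)
  n0 <- sum(actual == 0)
  (sum(r[actual == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Made-up scores and labels, only to exercise the helpers.
prob_demo   <- c(0.1, 0.4, 0.35, 0.8)
actual_demo <- c(0, 0, 1, 1)
acc_demo <- accuracy(prob_demo, actual_demo)   # 0.75
auc_demo <- auc(prob_demo, actual_demo)        # 0.75
```

For the GLM in this notebook, the calls would take the form `accuracy(predict(m.glm, type = "response"), train$spam)`.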
Problem 4 - Evaluating on the Test Set
Obtain predicted probabilities for the testing set for each of the models.
4.1 ~ 4.6 ACC/AUC of the GLM/CART/RF models
see the table above
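Each model type returns test-set probabilities through a slightly different `predict()` call. A sketch on a built-in dataset (not the email data); the split and the object names ending in `_demo` are illustrative:

```r
library(rpart)

# Toy binary problem on mtcars with a simple train/test split.
d <- transform(mtcars, am = factor(am))
set.seed(1)
idx <- sample(nrow(d), 22)
tr <- d[idx, ]
te <- d[-idx, ]

glm_demo  <- glm(am ~ wt + hp, data = tr, family = "binomial")
cart_demo <- rpart(am ~ wt + hp, data = tr, method = "class")

# Probability of the positive class ("1") on the held-out rows.
p_glm  <- predict(glm_demo, newdata = te, type = "response")
p_cart <- predict(cart_demo, newdata = te)[, "1"]  # column of the class-probability matrix

# For a randomForest classifier the analogous call is
# predict(rf_model, newdata = te, type = "prob")[, "1"]
range(p_glm)
range(p_cart)
```

These probability vectors can then be scored for accuracy and AUC in the same way as the training-set predictions.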
4.7 Which model demonstrated the greatest degree of overfitting?
