Source file ⇒ The_Analytics_Edge_edX_MIT15.071x_June2015_4.rmd
These are my notes for the lectures of "The_Analytics_Edge_edX_MIT15.071x_June2015" by Professor Dimitris Bertsimas. The goal of these notes is to provide reproducible R code for all the lectures.
A good list of resources about using R for Text Analytics is given below:
NOTE: I have included some summary outputs that I could have left out, but the main intention was to check that each function was behaving as expected, since there are some differences between the tm package version used in the lecture and the latest version.
We will be trying to understand sentiment of tweets about the company Apple.
While Apple has a large number of fans, they also have a large number of people who don’t like their products. They also have several competitors.
To better understand public perception, Apple wants to monitor how people feel over time and how people receive new announcements.
Our challenge in this lecture is to see if we can correctly classify tweets as being negative, positive, or neither about Apple.
The Data
To collect the data needed for this task, we had to perform two steps.
The first was to collect data about tweets from the internet.
Twitter data is publicly available, and it can be collected through scraping the website or via the Twitter API.
The sender of the tweet might be useful to predict sentiment, but we will ignore it to keep our data anonymized.
So we will just be using the text of the tweet.
Then we need to construct the outcome variable for these tweets, which means that we have to label them as positive, negative, or neutral sentiment.
We would like to label thousands of tweets, and we know that two people might disagree over the correct classification of a tweet. To do this efficiently, one option is to use the Amazon Mechanical Turk.
The task that we put on the Amazon Mechanical Turk was to judge the sentiment expressed by the following item toward the software company Apple.
The items we gave them were tweets that we had collected. The workers could pick from the following options as their response:
These outcomes were represented as a number on the scale from -2 to 2.
Each tweet was labeled by five workers. For each tweet, we take the average of the five scores given by the five workers, hence the final scores can range from -2 to 2 in increments of 0.2.
The following graph shows the distribution of the number of tweets classified into each of the categories. We can see here that the majority of tweets were classified as neutral, with a small number classified as strongly negative or strongly positive.
(Figure: distribution of the tweet sentiment scores)
So now we have a bunch of tweets that are labeled with their sentiment. But how do we build independent variables from the text of a tweet to be used to predict the sentiment?
A Bag of Words
One of the most widely used techniques for transforming text into independent variables is called Bag of Words.
Fully understanding text is difficult, but Bag of Words provides a very simple approach: it just counts the number of times each word appears in the text and uses these counts as the independent variables.
For example, in the sentence,
"This course is great. I would recommend this course to my friends,"
the word this is seen twice, the word course is seen twice, the word great is seen once, et cetera.
(Figure: bag-of-words word counts for the example sentence)
In Bag of Words, there is one feature for each word. This is a very simple approach, but is often very effective, too. It is used as a baseline in text analytics projects and for Natural Language Processing.
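To make this concrete, here is a minimal base-R sketch (my own illustration, not part of the lecture) that counts word frequencies for the example sentence above; the lower-casing and punctuation removal anticipate the preprocessing steps discussed next.

# Count word frequencies for the example sentence (illustration only)
sentence = "This course is great. I would recommend this course to my friends."
words = strsplit(tolower(gsub("[[:punct:]]", "", sentence)), "\\s+")[[1]]  # clean and split into words
table(words)   # "course" and "this" appear twice, the other words once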
This is not the whole story, though. Preprocessing the text can dramatically improve the performance of the Bag of Words method.
Cleaning Up Irregularities
One part of preprocessing the text is to clean up irregularities.
Text data often has many inconsistencies that will cause algorithms trouble. Computers are very literal by default.
One common irregularity concerns the case of the letters, and it is customary to change all words to either lower-case or upper-case.
Punctuation also causes problems, and the basic approach is to remove everything that is not a letter. However some punctuation is meaningful, and therefore the removal of punctuation should be tailored to the specific problem.
There are also unhelpful terms, known as stop words: words like "the", "is", "at", and "which" that appear frequently but carry little meaning on their own, and are therefore usually removed.
Stemming: This step is motivated by the desire to represent words with different endings as the same word. We probably do not need to draw a distinction between argue, argued, argues, and arguing. They could all be represented by a common stem, argu. The algorithmic process of performing this reduction is called stemming.
There are several ways to approach the problem: one is to build a database of words and their stems, while another is to write a rule-based algorithm that strips common suffixes.
This second approach is widely popular and is called the Porter Stemmer, designed by Martin Porter in 1980, and it is still used today.
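As a quick illustration (not from the lecture), the wordStem() function from the SnowballC package, which we load later in this lecture, applies a Porter-style stemmer:

# Stem a few related words with SnowballC's Porter stemmer (illustration only)
library(SnowballC)
wordStem(c("argue", "argued", "argues", "arguing"))   # all reduce to the common stem "argu"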
QUICK QUESTION
Which of these problems is the LEAST likely to be a good application of natural language processing?
Ans:Judging the winner of a poetry contest
EXPLANATION:Judging the winner of a poetry contest requires a deep level of human understanding and emotion. Perhaps someday a computer will be able to accurately judge the winner of a poetry contest, but currently the other three tasks are much better suited for natural language processing.
QUICK QUESTION
For each tweet, we computed an overall score by averaging all five scores assigned by the Amazon Mechanical Turk workers. However, Amazon Mechanical Turk workers might make significant mistakes when labeling a tweet. The mean could be highly affected by this.
Which of the three alternative metrics below would best capture the typical opinion of the five Amazon Mechanical Turk workers, would be less affected by mistakes, and is well-defined regardless of the five labels?
Ans:An overall score equal to the median (middle) score
EXPLANATION:The correct answer is the first one - the median would capture the typical opinion of the workers and tends to be less affected by significant mistakes. The majority score might not have given a score to all tweets because they might not all have a majority score (consider a tweet with scores 0, 0, 1, 1, and 2). The minimum score does not necessarily capture the typical opinion and could be highly affected by mistakes (consider a tweet with scores -2, 1, 1, 1, 1).
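A quick R illustration of the reasoning in the explanation above, using the hypothetical scores mentioned there:

# Five hypothetical worker scores containing one significant mistake (-2)
scores = c(-2, 1, 1, 1, 1)
mean(scores)    # 0.4 -- pulled down by the mistaken label
median(scores)  # 1   -- still reflects the typical opinion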
QUICK QUESTION
For each of the following questions, pick the preprocessing task that we discussed in the previous video that would change the sentence “Data is useful AND powerful!” to the new sentence listed in the question.
New sentence: Data useful powerful!
Ans:Removing stop words
New sentence: data is useful and powerful
Ans:Cleaning up irregularities (changing to lowercase and removing punctuation)
New sentence: Data is use AND power!
Ans:Stemming
EXPLANATION:The first new sentence has the stop words "is" and "and" removed. The second new sentence has the irregularities removed (no capital letters or punctuation). The third new sentence has the words stemmed - the "ful" is removed from "useful" and "powerful".
Sys.setlocale("LC_ALL", "C")
## [1] "C"
# Unit 5 - Twitter
# VIDEO 5
#LOADING AND PROCESSING DATA IN R
tweets = read.csv("tweets.csv", stringsAsFactors=FALSE)
#Note: when working on a text analytics problem it is important (necessary!) to add the extra argument stringsAsFactors = FALSE, so that the text is read in properly.
#Let's take a look at the structure of our data:
str(tweets)
## 'data.frame': 1181 obs. of 2 variables:
## $ Tweet: chr "I have to say, Apple has by far the best customer care service I have ever received! @Apple @AppStore" "iOS 7 is so fricking smooth & beautiful!! #ThanxApple @Apple" "LOVE U @APPLE" "Thank you @apple, loving my new iPhone 5S!!!!! #apple #iphone5S pic.twitter.com/XmHJCU4pcb" ...
## $ Avg : num 2 2 1.8 1.8 1.8 1.8 1.8 1.6 1.6 1.6 ...
#We have 1181 observations of 2 variables:
##Tweet: the text of the tweet.
##Avg: the average sentiment score.
#The tweet texts are real tweets gathered on the internet directed at Apple, with a few words cleaned up. We are more interested in being able to detect tweets with clear negative sentiment, so let's define a new variable in our data set called Negative:
#equal to TRUE if the average sentiment score is less than or equal to -1
#equal to FALSE if the average sentiment score is greater than -1.
# Create dependent variable
tweets$Negative = as.factor(tweets$Avg <= -1)
table(tweets$Negative)
##
## FALSE TRUE
## 999 182
#Now, to pre-process our text data so that we can use the 'Bag of Words' approach, we will use the 'tm' (text mining) package.
#install.packages("tm")
library(tm)
#install.packages("SnowballC")
library(SnowballC)
#One of the concepts introduced by the tm package is that of a corpus. A corpus is a collection of documents. We need to convert our tweets into a corpus for pre-processing.
#Various functions in the tm package can be used to create a corpus in different ways. We will create it from the Tweet column of our data frame using two functions, Corpus() and VectorSource(); we feed the Tweet variable of the tweets data frame to VectorSource().
# Create corpus
corpus = Corpus(VectorSource(tweets$Tweet))
# Look at corpus
corpus
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 1181
#We can check that the documents match our tweets by using double brackets [[.
#To inspect the first (or 10th) tweet in our corpus, we select the first (or 10th) element as:
attributes(corpus[[1]])
## $names
## [1] "content" "meta"
##
## $class
## [1] "PlainTextDocument" "TextDocument"
corpus[[1]]$content
## [1] "I have to say, Apple has by far the best customer care service I have ever received! @Apple @AppStore"
corpus[[10]]$content
## [1] "Just checked out the specs on the new iOS 7...wow is all I have to say! I can't wait to get the new update ?? Bravo @Apple"
# IMPORTANT NOTE: If you are using the latest version of the tm package, you will need to run the following line before continuing (it converts corpus to a Plain Text Document). This is a recent change having to do with the tolower function that occurred after this video was recorded.
corpus = tm_map(corpus, PlainTextDocument)
#Converting text to lower case
#Pre-processing is easy in tm.
#Each operation, like stemming or removing stop words, can be done with one line in R, where we use the tm_map() function which takes as its first argument the name of a corpus and as second argument a function performing the transformation that we want to apply to the text.
#To transform all text to lower case:
corpus = tm_map(corpus, content_transformer(tolower))
#Checking the same two "documents" as before:
corpus[[1]]$content
## [1] "i have to say, apple has by far the best customer care service i have ever received! @apple @appstore"
corpus[[10]]$content
## [1] "just checked out the specs on the new ios 7...wow is all i have to say! i can't wait to get the new update ?? bravo @apple"
# Removing punctuation
corpus = tm_map(corpus, removePunctuation)
corpus[[1]]$content
## [1] "i have to say apple has by far the best customer care service i have ever received apple appstore"
corpus[[10]]$content
## [1] "just checked out the specs on the new ios 7wow is all i have to say i cant wait to get the new update bravo apple"
# Look at the stop words provided by the tm package. It is necessary to define a list of words that we regard as stop words, and for this the tm package provides a default list for the English language. We can check it out with:
stopwords("english")[1:10]
## [1] "i" "me" "my" "myself" "we"
## [6] "our" "ours" "ourselves" "you" "your"
length(stopwords("english"))
## [1] 174
#Next we want to remove the stop words in our tweets.
#Removing words can be done by passing removeWords to the tm_map() function, together with an extra argument specifying which words we want to remove.
#We will remove all of these English stop words, but we will also remove the word "apple" since all of these tweets have the word "apple" and it probably won't be very useful in our prediction problem.
# Removing stopwords and apple
corpus = tm_map(corpus, removeWords, c("apple", stopwords("english")))
corpus[[1]]$content
## [1] " say far best customer care service ever received appstore"
corpus[[10]]$content
## [1] "just checked specs new ios 7wow say cant wait get new update bravo "
#Stemming
#Lastly, we want to stem our documents by passing the stemDocument function to tm_map().
# Stem document
corpus = tm_map(corpus, stemDocument)
corpus[[1]]$content
## [1] " say far best custom care servic ever receiv appstor"
corpus[[10]]$content
## [1] "just check spec new io 7wow say cant wait get new updat bravo"
#We can see that this took off the ending of "customer," "service," "received," and "appstore."
##################################
#QUICK QUESTION
#Q:Given a corpus in R, how many commands do you need to run in R to clean up the irregularities (removing capital letters and punctuation)?
#Ans:2
#Q:How many commands do you need to run to stem the document?
#Ans:1
#EXPLANATION:In R, you can clean up the irregularities with two lines:
#corpus = tm_map(corpus, tolower)
#corpus = tm_map(corpus, removePunctuation) And you can stem the document with one line:
#corpus = tm_map(corpus, stemDocument)
# Video 6
#Create a Document Term Matrix
#We are now ready to extract the word frequencies to be used in our prediction problem. The tm package provides a function called DocumentTermMatrix() that generates a matrix where:
#the rows correspond to documents, in our case tweets, and
#the columns correspond to words in those tweets.
#The values in the matrix are the number of times that word appears in each document.
corpus = tm_map(corpus, PlainTextDocument)
# Create matrix
frequencies=DocumentTermMatrix(corpus)
frequencies
## <<DocumentTermMatrix (documents: 1181, terms: 3289)>>
## Non-/sparse entries: 8980/3875329
## Sparsity : 100%
## Maximal term length: 115
## Weighting : term frequency (tf)
#We see that in the corpus there are 3289 unique words.
#Let's see what this matrix looks like using the inspect() function, in particular selecting a block of rows and columns of the Document Term Matrix by their indices:
# Look at matrix
inspect(frequencies[1000:1005,505:515])
## <<DocumentTermMatrix (documents: 6, terms: 11)>>
## Non-/sparse entries: 1/65
## Sparsity : 98%
## Maximal term length: 9
## Weighting : term frequency (tf)
##
## Terms
## Docs cheapen cheaper check cheep cheer cheerio cherylcol chief
## character(0) 0 0 0 0 0 0 0 0
## character(0) 0 0 0 0 0 0 0 0
## character(0) 0 0 0 0 0 0 0 0
## character(0) 0 0 0 0 0 0 0 0
## character(0) 0 0 0 0 0 0 0 0
## character(0) 0 0 0 0 1 0 0 0
## Terms
## Docs chiiiiqu child children
## character(0) 0 0 0
## character(0) 0 0 0
## character(0) 0 0 0
## character(0) 0 0 0
## character(0) 0 0 0
## character(0) 0 0 0
#In this range we see that the word "cheer" appears in the tweet 1005, but "cheap" does not appear in any of these tweets. This data is what we call sparse. This means that there are many zeros in our matrix.
#We can look at what the most popular terms are, or words, with the function findFreqTerms(), selecting a minimum number of 20 occurrences over the whole corpus:
# Check for sparsity
findFreqTerms(frequencies, lowfreq=20)
## [1] "android" "anyon" "app"
## [4] "appl" "back" "batteri"
## [7] "better" "buy" "can"
## [10] "cant" "come" "dont"
## [13] "fingerprint" "freak" "get"
## [16] "googl" "ios7" "ipad"
## [19] "iphon" "iphone5" "iphone5c"
## [22] "ipod" "ipodplayerpromo" "itun"
## [25] "just" "like" "lol"
## [28] "look" "love" "make"
## [31] "market" "microsoft" "need"
## [34] "new" "now" "one"
## [37] "phone" "pleas" "promo"
## [40] "promoipodplayerpromo" "realli" "releas"
## [43] "samsung" "say" "store"
## [46] "thank" "think" "time"
## [49] "twitter" "updat" "use"
## [52] "via" "want" "well"
## [55] "will" "work"
#Out of the 3289 words in our matrix, only 56 words appear at least 20 times in our tweets.
#This means that we probably have a lot of terms that will be pretty useless for our prediction model. The number of terms is an issue for two main reasons:
#One is computational: more terms means more independent variables, which usually means it takes longer to build our models.
#The other is that in building models the ratio of independent variables to observations will affect how well the model will generalize.
# Remove sparse terms (i.e. remove terms that don't appear very often)
sparse = removeSparseTerms(frequencies, 0.995)
#This function takes a second parameter, the sparsity threshold, which works as follows:
#If we say 0.98, this means to only keep terms that appear in 2% or more of the tweets.
#If we say 0.99, that means to only keep terms that appear in 1% or more of the tweets.
#If we say 0.995, that means to only keep terms that appear in 0.5% or more of the tweets, about six or more tweets.
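#As an illustration (not run in the lecture), we can compare how many terms survive at each threshold; exact counts depend on your tm version and preprocessing:
ncol(removeSparseTerms(frequencies, 0.98))    # terms appearing in 2% or more of the tweets
ncol(removeSparseTerms(frequencies, 0.99))    # terms appearing in 1% or more of the tweets
ncol(removeSparseTerms(frequencies, 0.995))   # terms appearing in 0.5% or more of the tweets (the threshold used above)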
#Let's see what the new Document Term Matrix properties look like:
sparse
## <<DocumentTermMatrix (documents: 1181, terms: 309)>>
## Non-/sparse entries: 4669/360260
## Sparsity : 99%
## Maximal term length: 20
## Weighting : term frequency (tf)
#It only contains 309 unique terms, i.e. only about 9.4% of the full set.
# Convert sparse to a data frame to use for predictive modeling
tweetsSparse = as.data.frame(as.matrix(sparse))
#Fix variable names in the data frame
#Since R struggles with variable names that start with a number, and we probably have some words here that start with a number, we should run the make.names() function to make sure all of our words are appropriate variable names. It will convert the variable names to make sure they are all appropriate names for R before we build our predictive models. You should do this each time you build a data frame using text analytics.
# Make all variable names R-friendly
colnames(tweetsSparse) = make.names(colnames(tweetsSparse))
# Add dependent variable
#We should add our dependent variable back to this data frame. We'll call it tweetsSparse$Negative and set it equal to the original Negative variable from the tweets data frame.
tweetsSparse$Negative = tweets$Negative
# Split the data in training/testing sets
library(caTools)
set.seed(123)
split = sample.split(tweetsSparse$Negative, SplitRatio = 0.7)
trainSparse = subset(tweetsSparse, split==TRUE)
testSparse = subset(tweetsSparse, split==FALSE)
#QUICK QUESTION
#In the previous video, we showed a list of all words that appear at least 20 times in our tweets. Which of the following words appear at least 100 times? Select all that apply. (HINT: use the findFreqTerms function)
findFreqTerms(frequencies, lowfreq=100)
## [1] "iphon" "itun" "new"
#Ans:"iphon", "itun", and "new"
# Video 7
# Build a CART model
library(rpart)
library(rpart.plot)
#Let's first use CART to build a predictive model, using the rpart() function to predict Negative using all of the other variables as our independent variables and the data set trainSparse.
#We'll add one more argument here, which is method = "class" so that the rpart() function knows to build a classification model. We keep default settings for all other parameters, in particular we are not adding anything for minbucket or cp.
#Building the classification model with all the IVs
tweetCART = rpart(Negative ~ ., data=trainSparse, method="class")
#plotting the tree
prp(tweetCART)
#The tree says that
#if the word "freak" is in the tweet, then predict TRUE, or negative sentiment.
#If the word "freak" is not in the tweet, but the word "hate" is again predict TRUE.
#If neither of these two words are in the tweet, but the word "wtf" is, also predict TRUE, or negative sentiment.
#If none of these three words are in the tweet, then predict FALSE, or non-negative sentiment.
#This tree makes sense intuitively since these three words are generally seen as negative words.
# Evaluate the Out-of-Sample numerical performance of the model to get class predictions
#Using the predict() function we compute the predictions of our model tweetCART on the new data set testSparse. Be careful to add the argument type = "class" to make sure we get class predictions.
predictCART = predict(tweetCART, newdata=testSparse, type="class")
#computing the confusion matrix from the predictions
cmat_CART<-table(testSparse$Negative, predictCART)
cmat_CART
## predictCART
## FALSE TRUE
## FALSE 294 6
## TRUE 37 18
# Compute accuracy
accu_CART <- (cmat_CART[1,1] + cmat_CART[2,2])/sum(cmat_CART) #(294+18)/(294+6+37+18)=0.8788732
#Ans:Overall accuracy=0.8788732
#Sensitivity = 18 / 55 = 0.3273 ( = TP rate)
#Specificity = 294 / 300 = 0.98
#FP rate = 6 / 300 = 0.02
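#For reference, the same rates can be computed directly from the confusion matrix above (rows are actual values, columns are predictions):
sens_CART <- cmat_CART[2,2] / sum(cmat_CART[2,])  # sensitivity = TP/(TP+FN) = 18/55
spec_CART <- cmat_CART[1,1] / sum(cmat_CART[1,])  # specificity = TN/(TN+FP) = 294/300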
#Comparison with the baseline accuracy
#Let's compare this to a simple baseline model that always predicts non-negative (i.e. the most common value of the dependent variable).
#To compute the accuracy of the baseline model, let's make a table of just the outcome variable Negative.
cmat_baseline<-table(testSparse$Negative)
cmat_baseline
##
## FALSE TRUE
## 300 55
accu_baseline <- max(cmat_baseline)/sum(cmat_baseline) #300/(300+55)=0.8450704
#Ans:Baseline model accuracy=0.8450704
#So our CART model does better than the baseline model. Let's see how a random forest does.
#Random forest model
library(randomForest)
set.seed(123)
#Building the random forest model with all the IVs (this takes considerably longer since we have a large number of IVs)
#We use the randomForest() function to predict Negative again using all of our other variables as independent variables and the data set trainSparse. Again we use the default parameter settings:
tweetRF = randomForest(Negative ~ ., data=trainSparse)
# Make Out-of-Sample predictions:
predictRF = predict(tweetRF, newdata=testSparse)
#computing the confusion matrix
cmat_RF<-table(testSparse$Negative, predictRF)
cmat_RF
## predictRF
## FALSE TRUE
## FALSE 293 7
## TRUE 34 21
#Overall model Accuracy:
accu_RF <- (cmat_RF[1,1] + cmat_RF[2,2])/sum(cmat_RF)
accu_RF #(293+21)/(293+7+34+21)=0.884507
## [1] 0.884507
#The overall accuracy of this Random Forest model is 0.884507
#The accuracy is slightly better than the CART model, but the CART model is much more interpretable than the random forest, so I would probably prefer the CART model.
#If you were to use cross-validation to pick the cp parameter for the CART model, the accuracy would increase to about the same as the random forest model. So by using a bag-of-words approach and these models, we can reasonably predict sentiment even with a relatively small data set of tweets.
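#A possible sketch of that cross-validation step, using the caret package as in the earlier CART lectures (a sketch only, not run here; the cp grid is an illustrative assumption, not a value from the course):
library(caret)
set.seed(123)
numFolds = trainControl(method = "cv", number = 10)          # 10-fold cross-validation
cpGrid = expand.grid(.cp = seq(0.001, 0.05, 0.001))          # candidate cp values (illustrative)
train(Negative ~ ., data = trainSparse, method = "rpart", trControl = numFolds, tuneGrid = cpGrid)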
##################################
#QUICK QUESTION
#Comparison with logistic regression model
#In the previous video, we used CART and Random Forest to predict sentiment. Let's see how well logistic regression does. Build a logistic regression model (using the training set) to predict "Negative" using all of the independent variables. You may get a warning message after building your model - don't worry (we explain what it means in the explanation).
#Build the model, using all independent variables as predictors:
tweetLog<- glm(Negative ~ . , data =trainSparse, family = binomial)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#summary(tweetLog)
#Now, make predictions on the testing set using the logistic regression model:
predictLog= predict(tweetLog, newdata=testSparse, type="response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
#where "tweetLog" should be the name of your logistic regression model. You might also get a warning message after this command, but don't worry - it is due to the same problem as the previous warning message.
#Build a confusion matrix (with a threshold of 0.5) and compute the accuracy of the model. What is the accuracy?
# Confusion matrix with threshold of 0.5
cmat_log<-table(testSparse$Negative, predictLog> 0.5)
cmat_log
##
## FALSE TRUE
## FALSE 253 47
## TRUE 22 33
#Let's now compute the overall accuracy
accu_log <- (cmat_log[1,1] + cmat_log[2,2])/sum(cmat_log)
accu_log #(253+33)/(253+47+22+33) = 0.8056338
## [1] 0.8056338
#Ans:0.8056338
#EXPLANATION:The accuracy is (253+33)/(253+47+22+33) = 0.8056338, which is worse than the baseline.
#The Perils of Over-fitting:
#If you were to compute the accuracy on the training set instead, you would see that the model does really well on the training set - this is an example of over-fitting. The model fits the training set really well, but does not perform well on the test set. A logistic regression model with a large number of variables is particularly at risk for overfitting.
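#To see this over-fitting yourself, you could compute the training-set accuracy of the logistic regression model (a quick sketch, not part of the lecture):
predictLogTrain = predict(tweetLog, type="response")                 # fitted probabilities on the training set
cmat_log_train = table(trainSparse$Negative, predictLogTrain > 0.5)  # confusion matrix on the training set
sum(diag(cmat_log_train))/sum(cmat_log_train)                        # training accuracy, much higher than the test-set accuracy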
#Note that you might have gotten a different answer than us, because the glm function struggles with this many variables. The warning messages that you might have seen in this problem have to do with the number of variables, and the fact that the model is overfitting to the training set. We'll discuss this in more detail in the Homework Assignment.
THE ANALYTICS EDGE
How IBM Built a Jeopardy! Champion
A Grand Challenge
Why was everyone so interested?
A Tradition of Challenges
The Challenge Begins
The Contestants
Ken Jennings
Brad Rutter
Watson
The Match Begins
QUICK QUESTION
What were the goals of IBM when they set out to build Watson? Select all that apply.
Ans:To build a computer that could compete with the best human players at Jeopardy!, and to build a computer that could answer questions that are commonly believed to require human intelligence.
EXPLANATION:The main goals of IBM were to build a computer that could answer questions that are commonly believed to require human intelligence, and to therefore compete with the best human players at Jeopardy!.
Overview of the Jeopardy! game
Example Round
Jeopardy! Questions
QUICK QUESTION
For which of the following reasons is Jeopardy! challenging? Select all that apply.
Ans:A wide variety of categories; speed is required - you have to buzz in faster than your competitors; the categories and clues are often cryptic.
EXPLANATION:Jeopardy! is challenging because there are a wide variety of categories, speed is required, and the categories and clues are cryptic. Expert knowledge is not generally required.
Why is Jeopardy Hard?
A Straightforward Approach
Using Analytics
Watson’s Database and Tools
How Watson Works
QUICK QUESTION
Which of the following two questions do you think would be EASIEST for a computer to answer?
Ans:What year was Abraham Lincoln born?
EXPLANATION:The question about the year Lincoln was born would be the easiest, because the answer is a fact. The other question is much more subjective.
Step 1: Question Analysis
Step 2: Hypothesis Generation
QUICK QUESTION
Select the LAT of the following Jeopardy question: NICHOLAS II WAS THE LAST RULING CZAR OF THIS ROYAL FAMILY (Hint: The answer is “The Romanovs”)
Ans:THIS ROYAL FAMILY
Select the LAT of the following Jeopardy question: REGARDING THIS DEVICE, ARCHIMEDES SAID, “GIVE ME A PLACE TO STAND ON, AND I WILL MOVE THE EARTH” (Hint: The answer is “A lever”)
Ans: THIS DEVICE
EXPLANATION:The LAT in the first question is “THIS ROYAL FAMILY” and the LAT in the second question is “THIS DEVICE”. Remember that if you replace the LAT with the correct answer, the sentence should make sense.
Step 3: Scoring Hypotheses
Lightweight Scoring Algorithms
Scoring Analytics
Passage Search
Scoring Analytics
Step 4: Final Merging and Ranking
Ranking and Confidence Estimation
The Watson System
QUICK QUESTION
To predict which candidate answer is correct, we said that Watson uses logistic regression. Which of the other techniques that we have learned could be used instead? Select all that apply.
Ans:CART , Random Forests
EXPLANATION:CART and Random Forests are both techniques that are also used for classification, and could provide confidence probabilities.
Progress from 2006 - 2010
Let the games begin!
The Results
What’s Next for Watson
The Analytics Edge
We will be looking into how to use the text of emails in the inboxes of Enron executives to predict if those emails are relevant to an investigation into the company.
We will be extracting word frequencies from the text of the documents, and then integrating those frequencies into predictive models.
We are going to talk about predictive coding – an emerging use of text analytics in the area of criminal justice.
The case we will consider concerns Enron, a US energy company based out of Houston, Texas, that was involved in a number of electricity production and distribution markets and that collapsed in the early 2000s after widespread accounting fraud was exposed. To this day, Enron remains a stunning symbol of corporate corruption.
While Enron’s collapse stemmed largely from accounting fraud, the firm also faced sanctions for its involvement in the California electricity crisis.
In 2000 to 2001, California experienced a number of power blackouts, despite having sufficient generating capacity.
It later surfaced that Enron played a key role in this energy crisis by artificially reducing power supply to spike prices and then making a profit from this market instability.
The Federal Energy Regulatory Commission, or FERC, investigated Enron’s involvement in the crisis, and its investigation eventually led to a $1.52 billion settlement.
FERC’s investigation into Enron will be the topic of today’s recitation.
The eDiscovery Problem
Enron was a huge company, and its corporate servers contained millions of emails and other electronic files. Sifting through these documents to find the ones relevant to an investigation is no simple task.
In law, this electronic document retrieval process is called the eDiscovery problem, and relevant files are called responsive documents.
Traditionally, the eDiscovery problem has been solved by using keyword search - in our case, perhaps searching for phrases like "electricity bid" or "energy schedule" - followed by an expensive and time-consuming manual review process, in which attorneys read through thousands of documents to determine which ones are responsive.
Predictive Coding
Predictive coding is a new technique in which attorneys manually label some documents and then use text analytics models trained on the manually labeled documents to predict which of the remaining documents are responsive.
The Data
As part of its investigation, the FERC released hundreds of thousands of emails from top executives at Enron, creating the largest publicly available set of emails to date.
We will use this data set called the Enron Corpus to perform predictive coding in this recitation.
The data set contains just two fields: email, the text of the message, and responsive, a binary variable (0/1) indicating whether the email was relevant to the investigation.
The labels for these emails were made by attorneys as part of the 2010 Text REtrieval Conference (TREC) Legal Track, a predictive coding competition.
# Unit 5 - Recitation
# Video 2
# Load the dataset
emails = read.csv("energy_bids.csv", stringsAsFactors=FALSE)
#Let's look at the structure:
str(emails)
## 'data.frame': 855 obs. of 2 variables:
## $ email : chr "North America's integrated electricity market requires cooperation on environmental policies Commission for Environmental Coope"| __truncated__ "FYI -----Original Message----- From: \t\"Ginny Feliciano\" <gfeliciano@earthlink.net>@ENRON [mailto:IMCEANOTES-+22Ginny+20Felic"| __truncated__ "14:13:53 Synchronizing Mailbox 'Kean, Steven J.' 14:13:53 Synchronizing Hierarchy 14:13:53 Synchronizing Favorites 14:13:53 Syn"| __truncated__ "^ ----- Forwarded by Steven J Kean/NA/Enron on 03/02/2001 12:27 PM ----- Suzanne_Nimocks@mckinsey.com Sent by: Carol_Benter@mck"| __truncated__ ...
## $ responsive: int 0 1 0 1 0 0 1 0 0 0 ...
#Let's look at a few examples (using the strwrap() function for easier-to-read formatting):
strwrap(emails$email[1])
## [1] "North America's integrated electricity market requires cooperation"
## [2] "on environmental policies Commission for Environmental Cooperation"
## [3] "releases working paper on North America's electricity market"
## [4] "Montreal, 27 November 2001 -- The North American Commission for"
## [5] "Environmental Cooperation (CEC) is releasing a working paper"
## [6] "highlighting the trend towards increasing trade, competition and"
## [7] "cross-border investment in electricity between Canada, Mexico and"
## [8] "the United States. It is hoped that the working paper,"
## [9] "Environmental Challenges and Opportunities in the Evolving North"
## [10] "American Electricity Market, will stimulate public discussion"
## [11] "around a CEC symposium of the same title about the need to"
## [12] "coordinate environmental policies trinationally as a North"
## [13] "America-wide electricity market develops. The CEC symposium will"
## [14] "take place in San Diego on 29-30 November, and will bring together"
## [15] "leading experts from industry, academia, NGOs and the governments"
## [16] "of Canada, Mexico and the United States to consider the impact of"
## [17] "the evolving continental electricity market on human health and"
## [18] "the environment. \"Our goal [with the working paper and the"
## [19] "symposium] is to highlight key environmental issues that must be"
## [20] "addressed as the electricity markets in North America become more"
## [21] "and more integrated,\" said Janine Ferretti, executive director of"
## [22] "the CEC. \"We want to stimulate discussion around the important"
## [23] "policy questions being raised so that countries can cooperate in"
## [24] "their approach to energy and the environment.\" The CEC, an"
## [25] "international organization created under an environmental side"
## [26] "agreement to NAFTA known as the North American Agreement on"
## [27] "Environmental Cooperation, was established to address regional"
## [28] "environmental concerns, help prevent potential trade and"
## [29] "environmental conflicts, and promote the effective enforcement of"
## [30] "environmental law. The CEC Secretariat believes that greater North"
## [31] "American cooperation on environmental policies regarding the"
## [32] "continental electricity market is necessary to: * protect air"
## [33] "quality and mitigate climate change, * minimize the possibility of"
## [34] "environment-based trade disputes, * ensure a dependable supply of"
## [35] "reasonably priced electricity across North America * avoid"
## [36] "creation of pollution havens, and * ensure local and national"
## [37] "environmental measures remain effective. The Changing Market The"
## [38] "working paper profiles the rapid changing North American"
## [39] "electricity market. For example, in 2001, the US is projected to"
## [40] "export 13.1 thousand gigawatt-hours (GWh) of electricity to Canada"
## [41] "and Mexico. By 2007, this number is projected to grow to 16.9"
## [42] "thousand GWh of electricity. \"Over the past few decades, the North"
## [43] "American electricity market has developed into a complex array of"
## [44] "cross-border transactions and relationships,\" said Phil Sharp,"
## [45] "former US congressman and chairman of the CEC's Electricity"
## [46] "Advisory Board. \"We need to achieve this new level of cooperation"
## [47] "in our environmental approaches as well.\" The Environmental"
## [48] "Profile of the Electricity Sector The electricity sector is the"
## [49] "single largest source of nationally reported toxins in the United"
## [50] "States and Canada and a large source in Mexico. In the US, the"
## [51] "electricity sector emits approximately 25 percent of all NOx"
## [52] "emissions, roughly 35 percent of all CO2 emissions, 25 percent of"
## [53] "all mercury emissions and almost 70 percent of SO2 emissions."
## [54] "These emissions have a large impact on airsheds, watersheds and"
## [55] "migratory species corridors that are often shared between the"
## [56] "three North American countries. \"We want to discuss the possible"
## [57] "outcomes from greater efforts to coordinate federal, state or"
## [58] "provincial environmental laws and policies that relate to the"
## [59] "electricity sector,\" said Ferretti. \"How can we develop more"
## [60] "compatible environmental approaches to help make domestic"
## [61] "environmental policies more effective?\" The Effects of an"
## [62] "Integrated Electricity Market One key issue raised in the paper is"
## [63] "the effect of market integration on the competitiveness of"
## [64] "particular fuels such as coal, natural gas or renewables. Fuel"
## [65] "choice largely determines environmental impacts from a specific"
## [66] "facility, along with pollution control technologies, performance"
## [67] "standards and regulations. The paper highlights other impacts of a"
## [68] "highly competitive market as well. For example, concerns about so"
## [69] "called \"pollution havens\" arise when significant differences in"
## [70] "environmental laws or enforcement practices induce power companies"
## [71] "to locate their operations in jurisdictions with lower standards."
## [72] "\"The CEC Secretariat is exploring what additional environmental"
## [73] "policies will work in this restructured market and how these"
## [74] "policies can be adapted to ensure that they enhance"
## [75] "competitiveness and benefit the entire region,\" said Sharp."
## [76] "Because trade rules and policy measures directly influence the"
## [77] "variables that drive a successfully integrated North American"
## [78] "electricity market, the working paper also addresses fuel choice,"
## [79] "technology, pollution control strategies and subsidies. The CEC"
## [80] "will use the information gathered during the discussion period to"
## [81] "develop a final report that will be submitted to the Council in"
## [82] "early 2002. For more information or to view the live video webcast"
## [83] "of the symposium, please go to: http://www.cec.org/electricity."
## [84] "You may download the working paper and other supporting documents"
## [85] "from:"
## [86] "http://www.cec.org/programs_projects/other_initiatives/electricity/docs.cfm?varlan=english."
## [87] "Commission for Environmental Cooperation 393, rue St-Jacques"
## [88] "Ouest, Bureau 200 Montr<U+00C3><U+00A9>al (Qu<U+00C3><U+00A9>bec) Canada H2Y 1N9 Tel: (514)"
## [89] "350-4300; Fax: (514) 350-4314 E-mail: info@ccemtl.org ***********"
# Look at emails
emails$email[1]
## [1] "North America's integrated electricity market requires cooperation on environmental policies Commission for Environmental Cooperation releases working paper on North America's electricity market Montreal, 27 November 2001 -- The North American Commission for Environmental Cooperation (CEC) is releasing a working paper highlighting the trend towards increasing trade, competition and cross-border investment in electricity between Canada, Mexico and the United States. It is hoped that the working paper, Environmental Challenges and Opportunities in the Evolving North American Electricity Market, will stimulate public discussion around a CEC symposium of the same title about the need to coordinate environmental policies trinationally as a North America-wide electricity market develops. The CEC symposium will take place in San Diego on 29-30 November, and will bring together leading experts from industry, academia, NGOs and the governments of Canada, Mexico and the United States to consider the impact of the evolving continental electricity market on human health and the environment. \"Our goal [with the working paper and the symposium] is to highlight key environmental issues that must be addressed as the electricity markets in North America become more and more integrated,\" said Janine Ferretti, executive director of the CEC. \"We want to stimulate discussion around the important policy questions being raised so that countries can cooperate in their approach to energy and the environment.\" The CEC, an international organization created under an environmental side agreement to NAFTA known as the North American Agreement on Environmental Cooperation, was established to address regional environmental concerns, help prevent potential trade and environmental conflicts, and promote the effective enforcement of environmental law. The CEC Secretariat believes that greater North American cooperation on environmental policies regarding the continental electricity market is necessary to: * protect air quality and mitigate climate change, * minimize the possibility of environment-based trade disputes, * ensure a dependable supply of reasonably priced electricity across North America * avoid creation of pollution havens, and * ensure local and national environmental measures remain effective. The Changing Market The working paper profiles the rapid changing North American electricity market. For example, in 2001, the US is projected to export 13.1 thousand gigawatt-hours (GWh) of electricity to Canada and Mexico. By 2007, this number is projected to grow to 16.9 thousand GWh of electricity. \"Over the past few decades, the North American electricity market has developed into a complex array of cross-border transactions and relationships,\" said Phil Sharp, former US congressman and chairman of the CEC's Electricity Advisory Board. \"We need to achieve this new level of cooperation in our environmental approaches as well.\" The Environmental Profile of the Electricity Sector The electricity sector is the single largest source of nationally reported toxins in the United States and Canada and a large source in Mexico. In the US, the electricity sector emits approximately 25 percent of all NOx emissions, roughly 35 percent of all CO2 emissions, 25 percent of all mercury emissions and almost 70 percent of SO2 emissions. These emissions have a large impact on airsheds, watersheds and migratory species corridors that are often shared between the three North American countries. 
\"We want to discuss the possible outcomes from greater efforts to coordinate federal, state or provincial environmental laws and policies that relate to the electricity sector,\" said Ferretti. \"How can we develop more compatible environmental approaches to help make domestic environmental policies more effective?\" The Effects of an Integrated Electricity Market One key issue raised in the paper is the effect of market integration on the competitiveness of particular fuels such as coal, natural gas or renewables. Fuel choice largely determines environmental impacts from a specific facility, along with pollution control technologies, performance standards and regulations. The paper highlights other impacts of a highly competitive market as well. For example, concerns about so called \"pollution havens\" arise when significant differences in environmental laws or enforcement practices induce power companies to locate their operations in jurisdictions with lower standards. \"The CEC Secretariat is exploring what additional environmental policies will work in this restructured market and how these policies can be adapted to ensure that they enhance competitiveness and benefit the entire region,\" said Sharp. Because trade rules and policy measures directly influence the variables that drive a successfully integrated North American electricity market, the working paper also addresses fuel choice, technology, pollution control strategies and subsidies. The CEC will use the information gathered during the discussion period to develop a final report that will be submitted to the Council in early 2002. For more information or to view the live video webcast of the symposium, please go to: http://www.cec.org/electricity. You may download the working paper and other supporting documents from: http://www.cec.org/programs_projects/other_initiatives/electricity/docs.cfm?varlan=english. Commission for Environmental Cooperation 393, rue St-Jacques Ouest, Bureau 200 Montr<U+00C3><U+00A9>al (Qu<U+00C3><U+00A9>bec) Canada H2Y 1N9 Tel: (514) 350-4300; Fax: (514) 350-4314 E-mail: info@ccemtl.org ***********"
#We can see just by parsing through the first couple of lines that this is an email about a new working paper, "Environmental Challenges and Opportunities in the Evolving North American Electricity Market", released by the Commission for Environmental Cooperation, or CEC. While this certainly deals with electricity markets, it doesn't have to do with energy schedules or bids, hence it is not responsive to our query.
#If we look at the value in the responsive variable for this email:
emails$responsive[1]
## [1] 0
#we see that its value is 0, as expected.
#Let's check the second email:
emails$email[2]
## [1] "FYI -----Original Message----- From: \t\"Ginny Feliciano\" <gfeliciano@earthlink.net>@ENRON [mailto:IMCEANOTES-+22Ginny+20Feliciano+22+20+3Cgfeliciano+40earthlink+2Enet+3E+40ENRON@ENRON.com] Sent:\tThursday, June 28, 2001 3:40 PM To:\tSilvia Woodard; Paul Runci; Katrin Thomas; John A. Riggs; Kurt E. Yeager; Gregg Ward; Philip K. Verleger; Admiral Richard H. Truly; Susan Tomasky; Tsutomu Toichi; Susan F. Tierney; John A. Strom; Gerald M. Stokes; Kevin Stoffer; Edward M. Stern; Irwin M. Stelzer; Hoff Stauffer; Steven R. Spencer; Robert Smart; Bernie Schroeder; George A. Schreiber, Jr.; Robert N. Schock; James R. Schlesinger; Roger W. Sant; John W. Rowe; James E. Rogers; John F. Riordan; James Ragland; Frank J. Puzio; Tony Prophet; Robert Priddle; Michael Price; John B. Phillips; Robert Perciasepe; D. Louis Peoples; Robert Nordhaus; Walker Nolan; William A. Nitze; Kazutoshi Muramatsu; Ernest J. Moniz; Nancy C. Mohn; Callum McCarthy; Thomas R. Mason; Edward P. Martin; Jan W. Mares; James K. Malernee; S. David Freeman; Edwin Lupberger; Amory B. Lovins; Lynn LeMaster; Hoesung Lee; Lay, Kenneth; Lester Lave; Wilfrid L. Kohl; Soo Kyung Kim; Melanie Kenderdine; Paul L. Joskow; Ira H. Jolles; Frederick E. John; John Jimison; William W. Hogan; Robert A. Hefner, III; James K. Gray; Craig G. Goodman; Charles F. Goff, Jr.; Jerry D. Geist; Fritz Gautschi; Larry G. Garberding; Roger Gale; William Fulkerson; Stephen E. Frank; George Frampton; Juan Eibenschutz; Theodore R. Eck; Congressman John Dingell; Brian N. Dickie; William E. Dickenson; Etienne Deffarges; Wilfried Czernie; Loren C. Cox; Anne Cleary; Bernard H. Cherry; Red Cavaney; Ralph Cavanagh; Thomas R. Casten; Peter Bradford; Peter D. Blair; Ellen Berman; Roger A. Berliner; Michael L. Beatty; Vicky A. Bailey; Merribel S. Ayres; Catherine G. Abbott Subject:\tEnergy Deregulation - California State Auditor Report Attached is my report prepared on behalf of the California State Auditor. I look forward to seeing you at The Aspen Institute Energy Policy Forum. Charles J. Cicchetti Pacific Economics Group, LLC - ca report new.pdf ***********"
#The original message is actually very short, it just says FYI, and most of it is a forwarded message. We have the list of recipients, and down at the very bottom is the message itself. "Attached is my report prepared on behalf of the California State auditor." There is also an attached report.
#Our data set contains just the text of the emails and not the text of the attachments. It turns out, as we might expect, that this attachment had to do with Enron's electricity bids in California, and therefore this email is responsive to our query.
#We can check this in the value of the responsive variable.
emails$responsive[2]
## [1] 1
#We see that it is indeed 1.
#Let's look at the breakdown of the number of emails that are responsive to our query.
# Responsive emails
table(emails$responsive)
##
## 0 1
## 716 139
#We see that the data set is unbalanced, with a relatively small proportion of emails responsive to the query. This is typical in predictive coding problems.
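#As a rough illustration of the keyword-search approach described earlier (the phrase below is just an example, not a query used in the recitation):
table(grepl("energy schedule", emails$email, ignore.case = TRUE), emails$responsive)  # naive keyword hits vs. responsiveness
max(table(emails$responsive)) / nrow(emails)   # accuracy of a baseline that always predicts "not responsive" (716/855)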
# Video 3
# Load tm package
library(tm)
#CREATING A CORPUS
#We will need to convert our emails to a corpus for pre-processing. Various functions in the tm package can be used to create a corpus in different ways.
#We will create it from the email column of our data frame using two functions, Corpus() and VectorSource(); we feed the email variable of the emails data frame to VectorSource().
corpus = Corpus(VectorSource(emails$email))
#Let's take a look at corpus:
corpus
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 855
#Let's see the first email in our corpus:
corpus[[1]]$content
## [1] "North America's integrated electricity market requires cooperation on environmental policies Commission for Environmental Cooperation releases working paper on North America's electricity market Montreal, 27 November 2001 -- The North American Commission for Environmental Cooperation (CEC) is releasing a working paper highlighting the trend towards increasing trade, competition and cross-border investment in electricity between Canada, Mexico and the United States. It is hoped that the working paper, Environmental Challenges and Opportunities in the Evolving North American Electricity Market, will stimulate public discussion around a CEC symposium of the same title about the need to coordinate environmental policies trinationally as a North America-wide electricity market develops. The CEC symposium will take place in San Diego on 29-30 November, and will bring together leading experts from industry, academia, NGOs and the governments of Canada, Mexico and the United States to consider the impact of the evolving continental electricity market on human health and the environment. \"Our goal [with the working paper and the symposium] is to highlight key environmental issues that must be addressed as the electricity markets in North America become more and more integrated,\" said Janine Ferretti, executive director of the CEC. \"We want to stimulate discussion around the important policy questions being raised so that countries can cooperate in their approach to energy and the environment.\" The CEC, an international organization created under an environmental side agreement to NAFTA known as the North American Agreement on Environmental Cooperation, was established to address regional environmental concerns, help prevent potential trade and environmental conflicts, and promote the effective enforcement of environmental law. The CEC Secretariat believes that greater North American cooperation on environmental policies regarding the continental electricity market is necessary to: * protect air quality and mitigate climate change, * minimize the possibility of environment-based trade disputes, * ensure a dependable supply of reasonably priced electricity across North America * avoid creation of pollution havens, and * ensure local and national environmental measures remain effective. The Changing Market The working paper profiles the rapid changing North American electricity market. For example, in 2001, the US is projected to export 13.1 thousand gigawatt-hours (GWh) of electricity to Canada and Mexico. By 2007, this number is projected to grow to 16.9 thousand GWh of electricity. \"Over the past few decades, the North American electricity market has developed into a complex array of cross-border transactions and relationships,\" said Phil Sharp, former US congressman and chairman of the CEC's Electricity Advisory Board. \"We need to achieve this new level of cooperation in our environmental approaches as well.\" The Environmental Profile of the Electricity Sector The electricity sector is the single largest source of nationally reported toxins in the United States and Canada and a large source in Mexico. In the US, the electricity sector emits approximately 25 percent of all NOx emissions, roughly 35 percent of all CO2 emissions, 25 percent of all mercury emissions and almost 70 percent of SO2 emissions. These emissions have a large impact on airsheds, watersheds and migratory species corridors that are often shared between the three North American countries. 
\"We want to discuss the possible outcomes from greater efforts to coordinate federal, state or provincial environmental laws and policies that relate to the electricity sector,\" said Ferretti. \"How can we develop more compatible environmental approaches to help make domestic environmental policies more effective?\" The Effects of an Integrated Electricity Market One key issue raised in the paper is the effect of market integration on the competitiveness of particular fuels such as coal, natural gas or renewables. Fuel choice largely determines environmental impacts from a specific facility, along with pollution control technologies, performance standards and regulations. The paper highlights other impacts of a highly competitive market as well. For example, concerns about so called \"pollution havens\" arise when significant differences in environmental laws or enforcement practices induce power companies to locate their operations in jurisdictions with lower standards. \"The CEC Secretariat is exploring what additional environmental policies will work in this restructured market and how these policies can be adapted to ensure that they enhance competitiveness and benefit the entire region,\" said Sharp. Because trade rules and policy measures directly influence the variables that drive a successfully integrated North American electricity market, the working paper also addresses fuel choice, technology, pollution control strategies and subsidies. The CEC will use the information gathered during the discussion period to develop a final report that will be submitted to the Council in early 2002. For more information or to view the live video webcast of the symposium, please go to: http://www.cec.org/electricity. You may download the working paper and other supporting documents from: http://www.cec.org/programs_projects/other_initiatives/electricity/docs.cfm?varlan=english. Commission for Environmental Cooperation 393, rue St-Jacques Ouest, Bureau 200 Montr<U+00C3><U+00A9>al (Qu<U+00C3><U+00A9>bec) Canada H2Y 1N9 Tel: (514) 350-4300; Fax: (514) 350-4314 E-mail: info@ccemtl.org ***********"
#or
strwrap(corpus[[1]])
## [1] "North America's integrated electricity market requires cooperation"
## [2] "on environmental policies Commission for Environmental Cooperation"
## [3] "releases working paper on North America's electricity market"
## [4] "Montreal, 27 November 2001 -- The North American Commission for"
## [5] "Environmental Cooperation (CEC) is releasing a working paper"
## [6] "highlighting the trend towards increasing trade, competition and"
## [7] "cross-border investment in electricity between Canada, Mexico and"
## [8] "the United States. It is hoped that the working paper,"
## [9] "Environmental Challenges and Opportunities in the Evolving North"
## [10] "American Electricity Market, will stimulate public discussion"
## [11] "around a CEC symposium of the same title about the need to"
## [12] "coordinate environmental policies trinationally as a North"
## [13] "America-wide electricity market develops. The CEC symposium will"
## [14] "take place in San Diego on 29-30 November, and will bring together"
## [15] "leading experts from industry, academia, NGOs and the governments"
## [16] "of Canada, Mexico and the United States to consider the impact of"
## [17] "the evolving continental electricity market on human health and"
## [18] "the environment. \"Our goal [with the working paper and the"
## [19] "symposium] is to highlight key environmental issues that must be"
## [20] "addressed as the electricity markets in North America become more"
## [21] "and more integrated,\" said Janine Ferretti, executive director of"
## [22] "the CEC. \"We want to stimulate discussion around the important"
## [23] "policy questions being raised so that countries can cooperate in"
## [24] "their approach to energy and the environment.\" The CEC, an"
## [25] "international organization created under an environmental side"
## [26] "agreement to NAFTA known as the North American Agreement on"
## [27] "Environmental Cooperation, was established to address regional"
## [28] "environmental concerns, help prevent potential trade and"
## [29] "environmental conflicts, and promote the effective enforcement of"
## [30] "environmental law. The CEC Secretariat believes that greater North"
## [31] "American cooperation on environmental policies regarding the"
## [32] "continental electricity market is necessary to: * protect air"
## [33] "quality and mitigate climate change, * minimize the possibility of"
## [34] "environment-based trade disputes, * ensure a dependable supply of"
## [35] "reasonably priced electricity across North America * avoid"
## [36] "creation of pollution havens, and * ensure local and national"
## [37] "environmental measures remain effective. The Changing Market The"
## [38] "working paper profiles the rapid changing North American"
## [39] "electricity market. For example, in 2001, the US is projected to"
## [40] "export 13.1 thousand gigawatt-hours (GWh) of electricity to Canada"
## [41] "and Mexico. By 2007, this number is projected to grow to 16.9"
## [42] "thousand GWh of electricity. \"Over the past few decades, the North"
## [43] "American electricity market has developed into a complex array of"
## [44] "cross-border transactions and relationships,\" said Phil Sharp,"
## [45] "former US congressman and chairman of the CEC's Electricity"
## [46] "Advisory Board. \"We need to achieve this new level of cooperation"
## [47] "in our environmental approaches as well.\" The Environmental"
## [48] "Profile of the Electricity Sector The electricity sector is the"
## [49] "single largest source of nationally reported toxins in the United"
## [50] "States and Canada and a large source in Mexico. In the US, the"
## [51] "electricity sector emits approximately 25 percent of all NOx"
## [52] "emissions, roughly 35 percent of all CO2 emissions, 25 percent of"
## [53] "all mercury emissions and almost 70 percent of SO2 emissions."
## [54] "These emissions have a large impact on airsheds, watersheds and"
## [55] "migratory species corridors that are often shared between the"
## [56] "three North American countries. \"We want to discuss the possible"
## [57] "outcomes from greater efforts to coordinate federal, state or"
## [58] "provincial environmental laws and policies that relate to the"
## [59] "electricity sector,\" said Ferretti. \"How can we develop more"
## [60] "compatible environmental approaches to help make domestic"
## [61] "environmental policies more effective?\" The Effects of an"
## [62] "Integrated Electricity Market One key issue raised in the paper is"
## [63] "the effect of market integration on the competitiveness of"
## [64] "particular fuels such as coal, natural gas or renewables. Fuel"
## [65] "choice largely determines environmental impacts from a specific"
## [66] "facility, along with pollution control technologies, performance"
## [67] "standards and regulations. The paper highlights other impacts of a"
## [68] "highly competitive market as well. For example, concerns about so"
## [69] "called \"pollution havens\" arise when significant differences in"
## [70] "environmental laws or enforcement practices induce power companies"
## [71] "to locate their operations in jurisdictions with lower standards."
## [72] "\"The CEC Secretariat is exploring what additional environmental"
## [73] "policies will work in this restructured market and how these"
## [74] "policies can be adapted to ensure that they enhance"
## [75] "competitiveness and benefit the entire region,\" said Sharp."
## [76] "Because trade rules and policy measures directly influence the"
## [77] "variables that drive a successfully integrated North American"
## [78] "electricity market, the working paper also addresses fuel choice,"
## [79] "technology, pollution control strategies and subsidies. The CEC"
## [80] "will use the information gathered during the discussion period to"
## [81] "develop a final report that will be submitted to the Council in"
## [82] "early 2002. For more information or to view the live video webcast"
## [83] "of the symposium, please go to: http://www.cec.org/electricity."
## [84] "You may download the working paper and other supporting documents"
## [85] "from:"
## [86] "http://www.cec.org/programs_projects/other_initiatives/electricity/docs.cfm?varlan=english."
## [87] "Commission for Environmental Cooperation 393, rue St-Jacques"
## [88] "Ouest, Bureau 200 Montr<U+00C3><U+00A9>al (Qu<U+00C3><U+00A9>bec) Canada H2Y 1N9 Tel: (514)"
## [89] "350-4300; Fax: (514) 350-4314 E-mail: info@ccemtl.org ***********"
# Pre-process data
#IMPORTANT NOTE: If you are using the latest version of the tm package, you will need to run the following line before continuing (it converts corpus to a Plain Text Document). This is a recent change having to do with the tolower function that occurred after this video was recorded.
corpus = tm_map(corpus, PlainTextDocument)
#Converting text to lower case
#We use the tm_map() function, which takes as its first argument the name of a corpus and as its second argument a function performing the transformation that we want to apply to the text.
#To transform all text to lower case:
corpus = tm_map(corpus, content_transformer(tolower))
#Checking the same "documents" as before:
corpus[[1]]$content
## [1] "north america's integrated electricity market requires cooperation on environmental policies commission for environmental cooperation releases working paper on north america's electricity market montreal, 27 november 2001 -- the north american commission for environmental cooperation (cec) is releasing a working paper highlighting the trend towards increasing trade, competition and cross-border investment in electricity between canada, mexico and the united states. it is hoped that the working paper, environmental challenges and opportunities in the evolving north american electricity market, will stimulate public discussion around a cec symposium of the same title about the need to coordinate environmental policies trinationally as a north america-wide electricity market develops. the cec symposium will take place in san diego on 29-30 november, and will bring together leading experts from industry, academia, ngos and the governments of canada, mexico and the united states to consider the impact of the evolving continental electricity market on human health and the environment. \"our goal [with the working paper and the symposium] is to highlight key environmental issues that must be addressed as the electricity markets in north america become more and more integrated,\" said janine ferretti, executive director of the cec. \"we want to stimulate discussion around the important policy questions being raised so that countries can cooperate in their approach to energy and the environment.\" the cec, an international organization created under an environmental side agreement to nafta known as the north american agreement on environmental cooperation, was established to address regional environmental concerns, help prevent potential trade and environmental conflicts, and promote the effective enforcement of environmental law. the cec secretariat believes that greater north american cooperation on environmental policies regarding the continental electricity market is necessary to: * protect air quality and mitigate climate change, * minimize the possibility of environment-based trade disputes, * ensure a dependable supply of reasonably priced electricity across north america * avoid creation of pollution havens, and * ensure local and national environmental measures remain effective. the changing market the working paper profiles the rapid changing north american electricity market. for example, in 2001, the us is projected to export 13.1 thousand gigawatt-hours (gwh) of electricity to canada and mexico. by 2007, this number is projected to grow to 16.9 thousand gwh of electricity. \"over the past few decades, the north american electricity market has developed into a complex array of cross-border transactions and relationships,\" said phil sharp, former us congressman and chairman of the cec's electricity advisory board. \"we need to achieve this new level of cooperation in our environmental approaches as well.\" the environmental profile of the electricity sector the electricity sector is the single largest source of nationally reported toxins in the united states and canada and a large source in mexico. in the us, the electricity sector emits approximately 25 percent of all nox emissions, roughly 35 percent of all co2 emissions, 25 percent of all mercury emissions and almost 70 percent of so2 emissions. these emissions have a large impact on airsheds, watersheds and migratory species corridors that are often shared between the three north american countries. 
\"we want to discuss the possible outcomes from greater efforts to coordinate federal, state or provincial environmental laws and policies that relate to the electricity sector,\" said ferretti. \"how can we develop more compatible environmental approaches to help make domestic environmental policies more effective?\" the effects of an integrated electricity market one key issue raised in the paper is the effect of market integration on the competitiveness of particular fuels such as coal, natural gas or renewables. fuel choice largely determines environmental impacts from a specific facility, along with pollution control technologies, performance standards and regulations. the paper highlights other impacts of a highly competitive market as well. for example, concerns about so called \"pollution havens\" arise when significant differences in environmental laws or enforcement practices induce power companies to locate their operations in jurisdictions with lower standards. \"the cec secretariat is exploring what additional environmental policies will work in this restructured market and how these policies can be adapted to ensure that they enhance competitiveness and benefit the entire region,\" said sharp. because trade rules and policy measures directly influence the variables that drive a successfully integrated north american electricity market, the working paper also addresses fuel choice, technology, pollution control strategies and subsidies. the cec will use the information gathered during the discussion period to develop a final report that will be submitted to the council in early 2002. for more information or to view the live video webcast of the symposium, please go to: http://www.cec.org/electricity. you may download the working paper and other supporting documents from: http://www.cec.org/programs_projects/other_initiatives/electricity/docs.cfm?varlan=english. commission for environmental cooperation 393, rue st-jacques ouest, bureau 200 montr<U+00E3><U+00A9>al (qu<U+00E3><U+00A9>bec) canada h2y 1n9 tel: (514) 350-4300; fax: (514) 350-4314 e-mail: info@ccemtl.org ***********"
#Removing punctuation
corpus = tm_map(corpus, removePunctuation)
corpus[[1]]$content
## [1] "north americas integrated electricity market requires cooperation on environmental policies commission for environmental cooperation releases working paper on north americas electricity market montreal 27 november 2001 the north american commission for environmental cooperation cec is releasing a working paper highlighting the trend towards increasing trade competition and crossborder investment in electricity between canada mexico and the united states it is hoped that the working paper environmental challenges and opportunities in the evolving north american electricity market will stimulate public discussion around a cec symposium of the same title about the need to coordinate environmental policies trinationally as a north americawide electricity market develops the cec symposium will take place in san diego on 2930 november and will bring together leading experts from industry academia ngos and the governments of canada mexico and the united states to consider the impact of the evolving continental electricity market on human health and the environment our goal with the working paper and the symposium is to highlight key environmental issues that must be addressed as the electricity markets in north america become more and more integrated said janine ferretti executive director of the cec we want to stimulate discussion around the important policy questions being raised so that countries can cooperate in their approach to energy and the environment the cec an international organization created under an environmental side agreement to nafta known as the north american agreement on environmental cooperation was established to address regional environmental concerns help prevent potential trade and environmental conflicts and promote the effective enforcement of environmental law the cec secretariat believes that greater north american cooperation on environmental policies regarding the continental electricity market is necessary to protect air quality and mitigate climate change minimize the possibility of environmentbased trade disputes ensure a dependable supply of reasonably priced electricity across north america avoid creation of pollution havens and ensure local and national environmental measures remain effective the changing market the working paper profiles the rapid changing north american electricity market for example in 2001 the us is projected to export 131 thousand gigawatthours gwh of electricity to canada and mexico by 2007 this number is projected to grow to 169 thousand gwh of electricity over the past few decades the north american electricity market has developed into a complex array of crossborder transactions and relationships said phil sharp former us congressman and chairman of the cecs electricity advisory board we need to achieve this new level of cooperation in our environmental approaches as well the environmental profile of the electricity sector the electricity sector is the single largest source of nationally reported toxins in the united states and canada and a large source in mexico in the us the electricity sector emits approximately 25 percent of all nox emissions roughly 35 percent of all co2 emissions 25 percent of all mercury emissions and almost 70 percent of so2 emissions these emissions have a large impact on airsheds watersheds and migratory species corridors that are often shared between the three north american countries we want to discuss the possible outcomes from greater efforts to coordinate federal state or provincial 
environmental laws and policies that relate to the electricity sector said ferretti how can we develop more compatible environmental approaches to help make domestic environmental policies more effective the effects of an integrated electricity market one key issue raised in the paper is the effect of market integration on the competitiveness of particular fuels such as coal natural gas or renewables fuel choice largely determines environmental impacts from a specific facility along with pollution control technologies performance standards and regulations the paper highlights other impacts of a highly competitive market as well for example concerns about so called pollution havens arise when significant differences in environmental laws or enforcement practices induce power companies to locate their operations in jurisdictions with lower standards the cec secretariat is exploring what additional environmental policies will work in this restructured market and how these policies can be adapted to ensure that they enhance competitiveness and benefit the entire region said sharp because trade rules and policy measures directly influence the variables that drive a successfully integrated north american electricity market the working paper also addresses fuel choice technology pollution control strategies and subsidies the cec will use the information gathered during the discussion period to develop a final report that will be submitted to the council in early 2002 for more information or to view the live video webcast of the symposium please go to httpwwwcecorgelectricity you may download the working paper and other supporting documents from httpwwwcecorgprogramsprojectsotherinitiativeselectricitydocscfmvarlanenglish commission for environmental cooperation 393 rue stjacques ouest bureau 200 montr<U+00E3>al qu<U+00E3>bec canada h2y 1n9 tel 514 3504300 fax 514 3504314 email infoccemtlorg "
#Removing stop words
#Words can be removed with the removeWords transformation passed to tm_map(), together with an extra argument specifying which words to remove; here we use the English stop word list provided by the tm package.
#We remove all of these English stop words because they are unlikely to be useful in our prediction problem.
corpus = tm_map(corpus, removeWords, stopwords("english"))
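#As a quick, optional check (not part of the lecture code), we can peek at the stop word list that tm supplies; the exact count can vary slightly between tm versions.
length(stopwords("english"))    #number of English stop words provided by tm (around 170)
head(stopwords("english"), 10)  #the first few entries, e.g. "i", "me", "my", ...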
#Stemming
#Lastly, we stem the documents by passing the stemDocument transformation to tm_map().
corpus = tm_map(corpus, stemDocument)
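#To see what stemming does to individual words, SnowballC's wordStem() can be applied directly to a character vector (an illustrative aside, assuming library(SnowballC) has been loaded; stemDocument relies on this package).
wordStem(c("argue", "argued", "argues", "arguing"))  #all four are reduced to the stem "argu"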
# Now that we have gone through those four preprocessing steps, we can take a second look at the first email in the corpus.
corpus[[1]]$content
## [1] "north america integr electr market requir cooper environment polici commiss environment cooper releas work paper north america electr market montreal 27 novemb 2001 north american commiss environment cooper cec releas work paper highlight trend toward increas trade competit crossbord invest electr canada mexico unit state hope work paper environment challeng opportun evolv north american electr market will stimul public discuss around cec symposium titl need coordin environment polici trinat north americawid electr market develop cec symposium will take place san diego 2930 novemb will bring togeth lead expert industri academia ngos govern canada mexico unit state consid impact evolv continent electr market human health environ goal work paper symposium highlight key environment issu must address electr market north america becom integr said janin ferretti execut director cec want stimul discuss around import polici question rais countri can cooper approach energi environ cec intern organ creat environment side agreement nafta known north american agreement environment cooper establish address region environment concern help prevent potenti trade environment conflict promot effect enforc environment law cec secretariat believ greater north american cooper environment polici regard continent electr market necessari protect air qualiti mitig climat chang minim possibl environmentbas trade disput ensur depend suppli reason price electr across north america avoid creation pollut haven ensur local nation environment measur remain effect chang market work paper profil rapid chang north american electr market exampl 2001 us project export 131 thousand gigawatthour gwh electr canada mexico 2007 number project grow 169 thousand gwh electr past decad north american electr market develop complex array crossbord transact relationship said phil sharp former us congressman chairman cec electr advisori board need achiev new level cooper environment approach well environment profil electr sector electr sector singl largest sourc nation report toxin unit state canada larg sourc mexico us electr sector emit approxim 25 percent nox emiss rough 35 percent co2 emiss 25 percent mercuri emiss almost 70 percent so2 emiss emiss larg impact airsh watersh migratori speci corridor often share three north american countri want discuss possibl outcom greater effort coordin feder state provinci environment law polici relat electr sector said ferretti can develop compat environment approach help make domest environment polici effect effect integr electr market one key issu rais paper effect market integr competit particular fuel coal natur gas renew fuel choic larg determin environment impact specif facil along pollut control technolog perform standard regul paper highlight impact high competit market well exampl concern call pollut haven aris signific differ environment law enforc practic induc power compani locat oper jurisdict lower standard cec secretariat explor addit environment polici will work restructur market polici can adapt ensur enhanc competit benefit entir region said sharp trade rule polici measur direct influenc variabl drive success integr north american electr market work paper also address fuel choic technolog pollut control strategi subsidi cec will use inform gather discuss period develop final report will submit council earli 2002 inform view live video webcast symposium pleas go httpwwwcecorgelectr may download work paper support document 
httpwwwcecorgprogramsprojectsotherinitiativeselectricitydocscfmvarlanenglish commiss environment cooper 393 rue stjacqu ouest bureau 200 montrcal qucbec canada h2i 1n9 tel 514 3504300 fax 514 3504314 email infoccemtlorg"
#or
strwrap(corpus[[1]])
## [1] "north america integr electr market requir cooper environment"
## [2] "polici commiss environment cooper releas work paper north america"
## [3] "electr market montreal 27 novemb 2001 north american commiss"
## [4] "environment cooper cec releas work paper highlight trend toward"
## [5] "increas trade competit crossbord invest electr canada mexico unit"
## [6] "state hope work paper environment challeng opportun evolv north"
## [7] "american electr market will stimul public discuss around cec"
## [8] "symposium titl need coordin environment polici trinat north"
## [9] "americawid electr market develop cec symposium will take place san"
## [10] "diego 2930 novemb will bring togeth lead expert industri academia"
## [11] "ngos govern canada mexico unit state consid impact evolv continent"
## [12] "electr market human health environ goal work paper symposium"
## [13] "highlight key environment issu must address electr market north"
## [14] "america becom integr said janin ferretti execut director cec want"
## [15] "stimul discuss around import polici question rais countri can"
## [16] "cooper approach energi environ cec intern organ creat environment"
## [17] "side agreement nafta known north american agreement environment"
## [18] "cooper establish address region environment concern help prevent"
## [19] "potenti trade environment conflict promot effect enforc"
## [20] "environment law cec secretariat believ greater north american"
## [21] "cooper environment polici regard continent electr market necessari"
## [22] "protect air qualiti mitig climat chang minim possibl"
## [23] "environmentbas trade disput ensur depend suppli reason price"
## [24] "electr across north america avoid creation pollut haven ensur"
## [25] "local nation environment measur remain effect chang market work"
## [26] "paper profil rapid chang north american electr market exampl 2001"
## [27] "us project export 131 thousand gigawatthour gwh electr canada"
## [28] "mexico 2007 number project grow 169 thousand gwh electr past decad"
## [29] "north american electr market develop complex array crossbord"
## [30] "transact relationship said phil sharp former us congressman"
## [31] "chairman cec electr advisori board need achiev new level cooper"
## [32] "environment approach well environment profil electr sector electr"
## [33] "sector singl largest sourc nation report toxin unit state canada"
## [34] "larg sourc mexico us electr sector emit approxim 25 percent nox"
## [35] "emiss rough 35 percent co2 emiss 25 percent mercuri emiss almost"
## [36] "70 percent so2 emiss emiss larg impact airsh watersh migratori"
## [37] "speci corridor often share three north american countri want"
## [38] "discuss possibl outcom greater effort coordin feder state provinci"
## [39] "environment law polici relat electr sector said ferretti can"
## [40] "develop compat environment approach help make domest environment"
## [41] "polici effect effect integr electr market one key issu rais paper"
## [42] "effect market integr competit particular fuel coal natur gas renew"
## [43] "fuel choic larg determin environment impact specif facil along"
## [44] "pollut control technolog perform standard regul paper highlight"
## [45] "impact high competit market well exampl concern call pollut haven"
## [46] "aris signific differ environment law enforc practic induc power"
## [47] "compani locat oper jurisdict lower standard cec secretariat explor"
## [48] "addit environment polici will work restructur market polici can"
## [49] "adapt ensur enhanc competit benefit entir region said sharp trade"
## [50] "rule polici measur direct influenc variabl drive success integr"
## [51] "north american electr market work paper also address fuel choic"
## [52] "technolog pollut control strategi subsidi cec will use inform"
## [53] "gather discuss period develop final report will submit council"
## [54] "earli 2002 inform view live video webcast symposium pleas go"
## [55] "httpwwwcecorgelectr may download work paper support document"
## [56] "httpwwwcecorgprogramsprojectsotherinitiativeselectricitydocscfmvarlanenglish"
## [57] "commiss environment cooper 393 rue stjacqu ouest bureau 200"
## [58] "montrcal qucbec canada h2i 1n9 tel 514 3504300 fax 514 3504314"
## [59] "email infoccemtlorg"
#It looks quite a bit different now. The text is much harder to read now that we have removed the stop words and punctuation and stemmed the remaining words, but the emails in this corpus are now ready for our machine learning algorithms.
# Video 4
#Create a Document Term Matrix
corpus = tm_map(corpus, PlainTextDocument)
#We are now ready to extract the word frequencies to be used in our prediction problem. The tm package provides a function called DocumentTermMatrix() that generates a matrix where:
#the rows correspond to documents, in our case emails, and
#the columns correspond to words in those emails.
#The values in the matrix are the number of times that word appears in each document.
dtm = DocumentTermMatrix(corpus)
dtm
## <<DocumentTermMatrix (documents: 855, terms: 22162)>>
## Non-/sparse entries: 102858/18845652
## Sparsity : 99%
## Maximal term length: 156
## Weighting : term frequency (tf)
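#Before trimming the matrix we can optionally get a feel for it with a couple of tm helpers (output not shown here; the terms returned depend on the corpus).
findFreqTerms(dtm, lowfreq = 200)  #terms that appear at least 200 times in total across all emails
inspect(dtm[1:3, 1:5])             #a small corner of the document-term matrix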
#What we can see is that even though we have only 855 emails in the corpus, we have 22,162 terms that showed up at least once, which is clearly too many variables for the number of observations we have.
#So we want to remove the terms that don't appear very often in our data set.
# Remove sparse terms: a sparsity threshold of 0.97 keeps only the terms that appear in at least 3% of the documents.
dtm = removeSparseTerms(dtm, 0.97)
dtm
## <<DocumentTermMatrix (documents: 855, terms: 788)>>
## Non-/sparse entries: 51612/622128
## Sparsity : 92%
## Maximal term length: 19
## Weighting : term frequency (tf)
#We can see that we have decreased the number of terms to 788, which is a much more reasonable number.
# Create data frame from dtm
#Let's convert the sparse matrix into a data frame that we will be able to use for our predictive models.
labeledTerms = as.data.frame(as.matrix(dtm))
#To make all variable names R-friendly use:
colnames(labeledTerms) <- make.names(colnames(labeledTerms))
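#make.names() is needed because some terms are not syntactically valid R variable names (for example, terms that start with a digit). A tiny illustration with hypothetical terms:
make.names(c("2002", "351", "next"))  #becomes "X2002", "X351", "next."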
# We also have to add back in the outcome variable
labeledTerms$responsive = emails$responsive
#str(labeledTerms)
#The data frame contains an awful lot of variables, 789 in total, of which 788 are the frequencies of various words in the emails, and the last one is responsive, i.e. the outcome variable
# Video 5
#Split data in training/testing sets
#let's split our data into a training set and a testing set, putting 70% of the data in the training set.
library(caTools)
set.seed(144)
spl = sample.split(labeledTerms$responsive, 0.7)
train = subset(labeledTerms, spl == TRUE)
test = subset(labeledTerms, spl == FALSE)
# Build a CART model
#Now we are ready to build the model, and we will build a simple CART model using the default parameters. A random forest would be another good choice from our toolset.
library(rpart)
library(rpart.plot)
#Let's use CART to build a predictive model, using the rpart() function to predict responsive using all of the other variables as our independent variables and the data set train.
emailCART = rpart(responsive~., data=train, method="class")
prp(emailCART)
#At the very top of the tree we see the word California (stemmed to californ).
#If californ appears at least twice in an email, we are going to take the right path and predict that a document is responsive.
#It is somewhat unsurprising that California shows up, because we know that Enron had a heavy involvement in the California energy markets.
#Further down the tree, we see a number of other terms that we could plausibly expect to be related to energy bids and energy scheduling, like system, demand, bid, and gas.
#Down at the bottom is jeff, which is perhaps a reference to Enron's CEO, Jeff Skilling, who ended up being jailed for his involvement in the fraud at the company.
# Video 6
# Out-of-Sample Performance of the Model (Make predictions on the test set)
#Now that we have trained a model, we need to evaluate it on the test set.
#We build an object pred that contains the predicted probabilities for each class from our CART model, by using the predict() function on the model emailCART and the test data with newdata = test.
pred = predict(emailCART, newdata=test)
#This new object gives us the predicted probabilities on the test set. We can look at the first 10 rows with
pred[1:10,]
## 0 1
## character(0) 0.2156863 0.78431373
## character(0).1 0.9557522 0.04424779
## character(0).2 0.9557522 0.04424779
## character(0).3 0.8125000 0.18750000
## character(0).4 0.4000000 0.60000000
## character(0).5 0.9557522 0.04424779
## character(0).6 0.9557522 0.04424779
## character(0).7 0.9557522 0.04424779
## character(0).8 0.1250000 0.87500000
## character(0).9 0.1250000 0.87500000
#The first column is the predicted probability of the document being non-responsive.
#The second column is the predicted probability of the document being responsive.
#They sum to 1.
#In our case we are interested in the predicted probability of the document being responsive, and it is convenient to store it as a separate variable.
pred.prob = pred[,2]
#This new object contains our test set predicted probabilities.
#We are interested in the accuracy of our model on the test set, i.e. out-of-sample.
#First we compute the confusion matrix:
cmat_CART<-table(test$responsive, pred.prob >= 0.5)
cmat_CART
##
## FALSE TRUE
## 0 195 20
## 1 17 25
#Compute accuracy
accu_CART <- (cmat_CART[1,1] + cmat_CART[2,2])/sum(cmat_CART)
accu_CART #(195+25)/(195+25+17+20)=0.8560311
## [1] 0.8560311
#Overall accuracy of this CART model is 0.856
#Sensitivity = TP rate = 25 / 42 = 0.595
#Specificity = (1 - FP rate) = 195 / 215 = 0.907
#FP rate = 20 / 215 = 0.093
#FN rate = 17 / 42 = 0.405
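#These rates can also be computed directly from the confusion matrix above (a small sketch; rows of cmat_CART are the true classes, columns the predictions).
sens_CART <- cmat_CART[2,2] / sum(cmat_CART[2,])  #sensitivity = TP/(TP+FN) = 25/42
spec_CART <- cmat_CART[1,1] / sum(cmat_CART[1,])  #specificity = TN/(TN+FP) = 195/215
sens_CART  #about 0.595
spec_CART  #about 0.907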
#Comparison with the baseline model
#Let's compare this to a simple baseline model that always predicts non-responsive (i.e. the most common value of the dependent variable).
#To compute the accuracy of the baseline model, let's make a table of just the outcome variable responsive.
cmat_baseline <-table(test$responsive)
cmat_baseline
##
## 0 1
## 215 42
#Baseline accuracy
accu_baseline <- max(cmat_baseline)/sum(cmat_baseline)
accu_baseline #215/(215+42)=0.8365759
## [1] 0.8365759
#The accuracy of the baseline model is then 0.8366.
#We see just a small improvement in accuracy using the CART model, which is a common case in unbalanced data sets.
#However, as in most document retrieval applications, there are uneven costs for different types of errors here.
#Typically, a human will still have to manually review all of the predicted responsive documents to make sure they are actually responsive.
#Therefore:
#If we have a false positive, i.e. a non-responsive document labeled as responsive, the mistake translates to a bit of additional work in the manual review process but no further harm, since the manual review process will remove this erroneous result.
#On the other hand, if we have a false negative, i.e. a responsive document labeled as non-responsive by our model, we will miss the document entirely in our predictive coding process.
#Therefore, we are going to assign a higher cost to false negatives than to false positives, which makes this a good time to look at other cut-offs on our ROC curve.
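#One way to build this asymmetry into CART itself, rather than only lowering the prediction cutoff afterwards, is rpart's loss matrix. The sketch below is not part of the lecture: the object name emailCARTcost and the weight of 4 for a false negative are arbitrary illustrations, and it assumes the first level of responsive is 0 (non-responsive).
#Rows of the loss matrix are true classes, columns are predicted classes; the diagonal is zero.
emailCARTcost = rpart(responsive ~ ., data = train, method = "class",
                      parms = list(loss = matrix(c(0, 4, 1, 0), nrow = 2)))
#With these costs the tree is grown to avoid false negatives more aggressively than false positives.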
# Video 7
# ROC curve
library(ROCR)
##Let's evaluate our model using the ROC curve
#Let's look at the ROC curve so we can understand the performance of our model at different cutoffs.
#To plot the ROC curve we use the performance() function to extract the true positive rate and false positive rate.
##First we use the prediction() function, whose first argument is the vector of predicted probabilities for the responsive class (pred.prob, the second column of pred) and whose second argument is the vector of true outcomes, test$responsive.
predROCR = prediction(pred.prob, test$responsive)
##We pass the output of prediction() to performance(), along with two more arguments specifying what we want on the Y and X axes of our ROC curve: the true positive rate and the false positive rate.
perfROCR = performance(predROCR, "tpr", "fpr")
#We then plot the ROC curve, with the option that color-codes the different cutoff thresholds.
plot(perfROCR, colorize=TRUE)
#The best cutoff to select is entirely dependent on the costs assigned by the decision maker to false positives and false negatives.
#However, we do favor cutoffs that give us a high sensitivity, i.e. we want to identify a large number of the responsive documents.
#Therefore a promising choice lies in the part of the curve where it starts to flatten out (moving towards the right), with a true positive rate of around 70% (we catch about 70% of all responsive documents) and a false positive rate of about 20% (we incorrectly flag about 20% of the non-responsive documents as responsive).
#Since, typically, the vast majority of documents are non-responsive, operating at this cutoff would result in a large decrease in the amount of manual effort needed in the eDiscovery process.
#From the colour coding of the curve at this location we can infer a threshold of around 0.15, significantly lower than 0.5, which is what we would expect since we prefer false positives over false negatives.
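#To see the effect of such a lower cutoff concretely, we can rebuild the confusion matrix at a threshold of 0.15 (an illustrative value read off the colorized curve; output not reproduced here).
cmat_CART_015 <- table(test$responsive, pred.prob >= 0.15)
cmat_CART_015
cmat_CART_015[2,2] / sum(cmat_CART_015[2,])  #true positive rate (sensitivity) at this cutoff
cmat_CART_015[1,2] / sum(cmat_CART_015[1,])  #false positive rate at this cutoff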
#Compute Area Under the Curve (AUC)
performance(predROCR, "auc")@y.values
## [[1]]
## [1] 0.7936323
#The AUC of the CART model is 0.7936, which means that our model can differentiate between a randomly selected responsive document and a randomly selected non-responsive document about 79.4% of the time.
Wikipedia is a free online encyclopedia that anyone can edit and contribute to. It is available in many languages and is growing all the time. The English language version alone contains millions of articles and receives a very large number of edits every day.
One of the consequences of being editable by anyone is that some people vandalize pages. This can take the form of removing content, adding promotional or inappropriate content, or more subtle shifts that change the meaning of the article. With this many articles and edits per day it is difficult for humans to detect all instances of vandalism and revert (undo) them. As a result, Wikipedia uses bots - computer programs that automatically revert edits that look like vandalism. In this assignment we will attempt to develop a vandalism detector that uses machine learning to distinguish between a valid edit and vandalism.
The data for this problem is based on the revision history of the page Language. Wikipedia provides a history for each page that consists of the state of the page at each revision. Rather than manually considering each revision, a script was run that checked whether edits stayed or were reverted. If a change was eventually reverted then that revision is marked as vandalism. This may result in some misclassifications, but the script performs well enough for our needs.
As a result of this preprocessing, some common processing tasks have already been done, including lower-casing and punctuation removal. The columns in the dataset are Vandal (1 if the edit was vandalism), Minor (1 if the user marked the edit as a minor edit), Loggedin (1 if the edit was made by a logged-in user), Added (the unique words added) and Removed (the unique words removed).
Notice the repeated use of unique. The data we have available is not the traditional bag of words - rather it is the set of words that were removed or added. For example, if a word was removed multiple times in a revision it will only appear one time in the “Removed” column.
#PROBLEM 1.1 - BAGS OF WORDS
#Load the data wiki.csv with the option stringsAsFactors=FALSE, calling the data frame "wiki". Convert the "Vandal" column to a factor using the command wiki$Vandal = as.factor(wiki$Vandal).
#LOADING AND PROCESSING DATA IN R
wiki<-read.csv("wiki.csv",stringsAsFactors=FALSE)
str(wiki)
## 'data.frame': 3876 obs. of 7 variables:
## $ X.1 : int 1 2 3 4 5 6 7 8 9 10 ...
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Vandal : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Minor : int 1 1 0 1 1 0 0 0 1 0 ...
## $ Loggedin: int 1 1 1 0 1 1 1 1 1 0 ...
## $ Added : chr " represent psycholinguisticspsycholinguistics orthographyorthography help text all actions through human ethnologue relationsh"| __truncated__ " website external links" " " " afghanistan used iran mostly that farsiis is countries some xmlspacepreservepersian parts tajikestan region" ...
## $ Removed : chr " " " talklanguagetalk" " regarded as technologytechnologies human first" " represent psycholinguisticspsycholinguistics orthographyorthography help all actions through ethnologue relationships linguis"| __truncated__ ...
#We have 3876 observations of 7 variables
#Convert the "Vandal" column to a factor
wiki$Vandal = as.factor(wiki$Vandal)
str(wiki)
## 'data.frame': 3876 obs. of 7 variables:
## $ X.1 : int 1 2 3 4 5 6 7 8 9 10 ...
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Vandal : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Minor : int 1 1 0 1 1 0 0 0 1 0 ...
## $ Loggedin: int 1 1 1 0 1 1 1 1 1 0 ...
## $ Added : chr " represent psycholinguisticspsycholinguistics orthographyorthography help text all actions through human ethnologue relationsh"| __truncated__ " website external links" " " " afghanistan used iran mostly that farsiis is countries some xmlspacepreservepersian parts tajikestan region" ...
## $ Removed : chr " " " talklanguagetalk" " regarded as technologytechnologies human first" " represent psycholinguisticspsycholinguistics orthographyorthography help all actions through ethnologue relationships linguis"| __truncated__ ...
#How many cases of vandalism were detected in the history of this page?
table(wiki$Vandal)
##
## 0 1
## 2061 1815
#Ans:1815
#EXPLANATION:There are 1815 observations with value 1, which denotes vandalism.
##############################
#PROBLEM 1.2 - BAGS OF WORDS
#We will now use the bag of words approach to build a model. We have two columns of textual data, with different meanings. For example, adding rude words has a different meaning to removing rude words. We'll start like we did in class by building a document term matrix from the Added column. The text already is lowercase and stripped of punctuation. So to pre-process the data, just complete the following four steps:
#1) Create the corpus for the Added column, and call it "corpusAdded".
library(tm)
library(SnowballC)
##Various functions in the tm package can be used to create a corpus in many different ways. We will create it from the Added column of our data frame using two functions, Corpus() and VectorSource(); we feed the latter the Added variable of the wiki data frame.
# Create corpus
corpusAdded= Corpus(VectorSource(wiki$Added))
##We can check that the documents match the text in the Added column by using double brackets [[.
#To inspect the first document in our corpus, we select the first element as:
corpusAdded[[1]]$content
## [1] " represent psycholinguisticspsycholinguistics orthographyorthography help text all actions through human ethnologue relationships linguistics regarded writing languages to other listing xmlspacepreservelanguages metaverse formal term philology common each including phonologyphonology often ten list humans affiliation see computer are speechpathologyspeech our what for ways dialects please artificial written body be of quite hypothesis found alone refers by about language profanity study programming priorities rosenfelders technologytechnologies makes or first among useful languagephilosophy one sounds use area create phrases mark their genetic basic families complete but sapirwhorfhypothesissapirwhorf with talklanguagetalk population animals this science up vocal can concepts called at and topics locations as numbers have in pathology different develop 4000 things ideas grouped complex animal mathematics fairly literature httpwwwzompistcom philosophy most important meaningful a historicallinguisticsorphilologyhistorical semanticssemantics patterns the oral"
corpusAdded= tm_map(corpusAdded, PlainTextDocument)
#2) Remove the English-language stopwords.
##Removing stop words
#Words can be removed with the removeWords transformation passed to tm_map(), together with an extra argument specifying which words to remove; here we use the English stop word list provided by the tm package.
#We remove all of these English stop words because they are unlikely to be useful in our prediction problem.
corpusAdded= tm_map(corpusAdded, removeWords, stopwords("english"))
#3) Stem the words.
#Lastly, we stem the documents by passing the stemDocument transformation to tm_map().
corpusAdded= tm_map(corpusAdded, stemDocument)
# Now that we have gone through the preprocessing steps, we can take a second look at the first document in the corpus.
corpusAdded[[1]]$content
## [1] " repres psycholinguisticspsycholinguist orthographyorthographi help text action human ethnologu relationship linguist regard write languag list xmlspacepreservelanguag metavers formal term philolog common includ phonologyphonolog often ten list human affili see comput speechpathologyspeech way dialect pleas artifici written bodi quit hypothesi found alon refer languag profan studi program prioriti rosenfeld technologytechnolog make first among use languagephilosophi one sound use area creat phrase mark genet basic famili complet sapirwhorfhypothesissapirwhorf talklanguagetalk popul anim scienc vocal can concept call topic locat number patholog differ develop 4000 thing idea group complex anim mathemat fair literatur httpwwwzompistcom philosophi import meaning historicallinguisticsorphilologyhistor semanticssemant pattern oral"
#BAG OF WORDS
#4) Build the DocumentTermMatrix, and call it dtmAdded.
#Create a Document Term Matrix
corpusAdded= tm_map(corpusAdded, PlainTextDocument)
dtmAdded= DocumentTermMatrix(corpusAdded)
dtmAdded
## <<DocumentTermMatrix (documents: 3876, terms: 6675)>>
## Non-/sparse entries: 15368/25856932
## Sparsity : 100%
## Maximal term length: 784
## Weighting : term frequency (tf)
#How many terms appear in dtmAdded?
#Ans:6675
#####################################
#PROBLEM 1.3 - BAGS OF WORDS
#Filter out sparse terms by keeping only terms that appear in 0.3% or more of the revisions, and call the new matrix sparseAdded. How many terms appear in sparseAdded?
#Even though we have only 3876 documents in the corpus, we have 6675 terms that showed up at least once, which is clearly too many variables for the number of observations we have.
#So we want to remove the terms that don't appear very often in our data set.
# Remove sparse terms: the sparsity threshold of 0.997 keeps only terms that appear in 0.3% or more of the revisions.
sparseAdded= removeSparseTerms(dtmAdded, 0.997)
sparseAdded
## <<DocumentTermMatrix (documents: 3876, terms: 166)>>
## Non-/sparse entries: 2681/640735
## Sparsity : 100%
## Maximal term length: 28
## Weighting : term frequency (tf)
#Ans:166
######################################
#PROBLEM 1.4 - BAGS OF WORDS
#Convert sparseAdded to a data frame called wordsAdded, and then prepend all the words with the letter A, by using the command:
# Create data frame from sparseAdded
#Let's convert the sparse matrix into a data frame that we will be able to use for our predictive models.
wordsAdded= as.data.frame(as.matrix(sparseAdded))
#prepend all the words with the letter A
colnames(wordsAdded) = paste("A", colnames(wordsAdded))
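#Note that paste() separates its arguments with a space by default, so the new column names look like "A word"; paste0() would glue them together without the space. Either works here, since we never type these names into a formula by hand.
paste("A", "apple")   # "A apple"
paste0("A", "apple")  # "Aapple"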
#str(wordsAdded)
dim(wordsAdded)
## [1] 3876 166
########################
#Now repeat all of the steps we've done so far (create a corpus, remove stop words, stem the document, create a sparse document term matrix, and convert it to a data frame) to create a Removed bag-of-words dataframe, called wordsRemoved, except this time, prepend all of the words with the letter R:
# Create corpus
corpusRemoved= Corpus(VectorSource(wiki$Removed))
##We can check that the documents match the text in the Removed column by using double brackets [[.
#To inspect the first document in our corpus, we select the first element as:
corpusRemoved[[1]]$content
## [1] " "
corpusRemoved= tm_map(corpusRemoved, PlainTextDocument)
#2) Remove the English-language stopwords.
##Removing stop words
#Words can be removed with the removeWords transformation passed to tm_map(), together with an extra argument specifying which words to remove; here we use the English stop word list provided by the tm package.
#We remove all of these English stop words because they are unlikely to be useful in our prediction problem.
corpusRemoved= tm_map(corpusRemoved, removeWords, stopwords("english"))
#3) Stem the words.
#Lastly, we stem the documents by passing the stemDocument transformation to tm_map().
corpusRemoved= tm_map(corpusRemoved, stemDocument)
# Now that we have gone through the preprocessing steps, we can take a second look at the second document in the corpus (the first one is just a blank string).
corpusRemoved[[2]]$content
## [1] " talklanguagetalk"
#BAG OF WORDS
#4) Build the DocumentTermMatrix, and call it dtmRemoved.
#Create a Document Term Matrix
corpusRemoved= tm_map(corpusRemoved, PlainTextDocument)
dtmRemoved= DocumentTermMatrix(corpusRemoved)
dtmRemoved
## <<DocumentTermMatrix (documents: 3876, terms: 5403)>>
## Non-/sparse entries: 13293/20928735
## Sparsity : 100%
## Maximal term length: 784
## Weighting : term frequency (tf)
#BAGS OF WORDS
#So we want to remove the terms that don't appear too often in our data set.
# Remove sparse terms: keep only terms that appear in 0.3% or more of the revisions (sparsity threshold 0.997).
sparseRemoved= removeSparseTerms(dtmRemoved, 0.997)
sparseRemoved
## <<DocumentTermMatrix (documents: 3876, terms: 162)>>
## Non-/sparse entries: 2552/625360
## Sparsity : 100%
## Maximal term length: 28
## Weighting : term frequency (tf)
#Create data frame from sparseRemoved
#Let's convert the sparse matrix into a data frame that we will be able to use for our predictive models.
wordsRemoved= as.data.frame(as.matrix(sparseRemoved))
#prepend all the words with the letter R
colnames(wordsRemoved) = paste("R", colnames(wordsRemoved))
#How many words are in the wordsRemoved data frame?
#str(wordsRemoved)
#or
ncol(wordsRemoved)
## [1] 162
#Ans:162
#############################################
#PROBLEM 1.5 - BAGS OF WORDS
#Combine the two data frames into a data frame called wikiWords with the following line of code:
wikiWords = cbind(wordsAdded, wordsRemoved)
#The cbind function combines two sets of variables for the same observations into one data frame.
# We also have to add back in the vandal variable
wikiWords$vandal= wiki$Vandal
#Split data in training/testing sets
#let's split our data into a training set and a testing set, putting 70% of the data in the training set.
library(caTools)
set.seed(123)
#split the data set using sample.split from the "caTools" package to put 70% in the training set.
spl = sample.split(wikiWords$vandal, 0.7)
wikiTrain = subset(wikiWords, spl == TRUE)
wikiTest=subset(wikiWords, spl == FALSE)
#str(wikiTest)
#What is the accuracy on the test set of a baseline method that always predicts "not vandalism" (the most frequent outcome)?
baseline<-table(wikiTest$vandal)
baseline
##
## 0 1
## 618 545
#Baseline accuracy
accu_baseline <- max(baseline)/sum(baseline)
accu_baseline #618/(618+545) = 0.531
## [1] 0.5313844
#Ans:0.5313844
######################################
#PROBLEM 1.6 - BAGS OF WORDS
#Build a CART model to predict Vandal, using all of the other variables as independent variables. Use the training set to build the model and the default parameters (don't set values for minbucket or cp).
#Now we are ready to build the model, and we will build a simple CART model using the default parameters.
library(rpart)
library(rpart.plot)
#Let's use CART to build a predictive model, using the rpart() function to predict vandal using all of the other variables as our independent variables and the data set wikiTrain.
wikiCART = rpart(vandal~., data=wikiTrain, method="class")
#What is the accuracy of the model on the test set, using a threshold of 0.5? (Remember that if you add the argument type="class" when making predictions, the output of predict will automatically use a threshold of 0.5.)
# Out-of-Sample Performance of the Model (Make predictions on the test set wikiTest)
testPredictCART= predict(wikiCART, newdata=wikiTest,type="class") #if you add the argument type="class" when making predictions, the output of predict will automatically use a threshold of 0.5.
#We are interested in the accuracy of our model on the test set, i.e. out-of-sample.
#First we compute the confusion matrix
cmat_CART<-table(wikiTest$vandal,testPredictCART)
cmat_CART
## testPredictCART
## 0 1
## 0 618 0
## 1 533 12
#lets now compute the overall accuracy
accu_CART <- (cmat_CART[1,1] + cmat_CART[2,2])/sum(cmat_CART)
accu_CART #(618+12)/(618+533+12) = 0.5417
## [1] 0.5417025
#Ans:0.5417025
##################################################
#PROBLEM 1.7 - BAGS OF WORDS
#Plot the CART tree. How many word stems does the CART model use?
prp(wikiCART)
#Ans:2
#EXPLANATION:If you plot the tree with prp(wikiCART), you can see that the tree uses two words: "R arbitr" and "R thousa".
############################################
#PROBLEM 1.8 - BAGS OF WORDS
#Given the performance of the CART model relative to the baseline, what is the best explanation of these results?
#Ans:Although it beats the baseline, bag of words is not very predictive for this problem.
#EXPLANATION:There is no reason to think there was anything wrong with the split. CART did not overfit, which you can check by computing the accuracy of the model on the training set. Over-sparsification is plausible but unlikely, since we selected a very high sparsity parameter. The only conclusion left is simply that bag of words didn't work very well in this case.
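#As a quick check on the overfitting point above, we can also compute the model's accuracy on the training set (a small sketch; output not shown, but it should be close to the test-set accuracy).
trainPredictCART = predict(wikiCART, newdata = wikiTrain, type = "class")
cmat_train <- table(wikiTrain$vandal, trainPredictCART)
sum(diag(cmat_train)) / sum(cmat_train)  #training-set accuracy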
##########################################
#PROBLEM 2.1 - PROBLEM-SPECIFIC KNOWLEDGE
#We weren't able to improve on the baseline using the raw textual information. More specifically, the words themselves were not useful. There are other options though, and in this section we will try two techniques - identifying a key class of words, and counting words.
#The key class of words we will use are website addresses. "Website addresses" (also known as URLs - Uniform Resource Locators) consist of two main parts. An example would be "http://www.google.com". The first part is the protocol, which is usually "http" (HyperText Transfer Protocol). The second part is the address of the site, e.g. "www.google.com". We have stripped all punctuation, so links to websites appear in the data as one word, e.g. "httpwwwgooglecom". We hypothesize that, since a lot of vandalism seems to involve adding links to promotional or irrelevant websites, the presence of a web address is a sign of vandalism.
#We can search for the presence of a web address in the words added by searching for "http" in the Added column. The grepl function returns TRUE if a string is found in another string, e.g.
grepl("cat","dogs and cats",fixed=TRUE) # TRUE
## [1] TRUE
grepl("cat","dogs and rats",fixed=TRUE) # FALSE
## [1] FALSE
#Create a copy of your dataframe from the previous question:
wikiWords2 = wikiWords
#Make a new column in wikiWords2 that is 1 if "http" was in Added:
wikiWords2$HTTP = ifelse(grepl("http",wiki$Added,fixed=TRUE), 1, 0)
#Based on this new column, how many revisions added a link?
table(wikiWords2$HTTP)
##
## 0 1
## 3659 217
#Ans: 217
#EXPLANATION:You can find this number by typing table(wikiWords2$HTTP), and seeing that there are 217 observations with value 1.
###############################
#PROBLEM 2.2 - PROBLEM-SPECIFIC KNOWLEDGE
#In problem 1.5, you computed a vector called "spl" that identified the observations to put in the training and testing sets. Use that variable (do not recompute it with sample.split) to make new training and testing sets:
wikiTrain2 = subset(wikiWords2, spl==TRUE)
wikiTest2 = subset(wikiWords2, spl==FALSE)
#Then create a new CART model using this new variable as one of the independent variables.
#Now we are ready to build the model, and we will build a simple CART model using the default parameters.
library(rpart)
library(rpart.plot)
#Let's use CART to build a predictive model, using the rpart() function to predict vandal using all of the other variables as our independent variables and the data set wikiTrain2.
wikiCART2 = rpart(vandal~., data=wikiTrain2, method="class")
# Out-of-Sample Performance of the Model (Make predictions on the test set wikiTest2)
testPredictCART2= predict(wikiCART2, newdata=wikiTest2,type="class") #if you add the argument type="class" when making predictions, the output of predict will automatically use a threshold of 0.5.
#We are interested in the accuracy of our model on the test set, i.e. out-of-sample.
#First we compute the confusion matrix
cmat_CART<-table(wikiTest2$vandal,testPredictCART2)
cmat_CART
## testPredictCART2
## 0 1
## 0 609 9
## 1 488 57
#What is the new accuracy of the CART model on the test set, using a threshold of 0.5?
#lets now compute the overall accuracy
accu_CART <- (cmat_CART[1,1] + cmat_CART[2,2])/sum(cmat_CART)
accu_CART #(609+57)/(609+9+488+57) = 0.5726569
## [1] 0.5726569
#Ans:0.5726569
###############################
#PROBLEM 2.3 - PROBLEM-SPECIFIC KNOWLEDGE
#Another possibility is that the number of words added and removed is predictive, perhaps more so than the actual words themselves. We already have a word count available in the form of the document-term matrices (DTMs).
#Sum the rows of dtmAdded and dtmRemoved and add them as new variables in your data frame wikiWords2 (called NumWordsAdded and NumWordsRemoved) by using the following commands:
wikiWords2$NumWordsAdded = rowSums(as.matrix(dtmAdded))
wikiWords2$NumWordsRemoved = rowSums(as.matrix(dtmRemoved))
#What is the average number of words added?
mean(wikiWords2$NumWordsAdded)
## [1] 4.050052
#Ans:4.050052
#################################################
#PROBLEM 2.4 - PROBLEM-SPECIFIC KNOWLEDGE
#In problem 1.5, you computed a vector called "spl" that identified the observations to put in the training and testing sets. Use that variable (do not recompute it with sample.split) to make new training and testing sets with wikiWords2. Create the CART model again (using the training set and the default parameters).
wikiTrain3 = subset(wikiWords2, spl==TRUE)
wikiTest3 = subset(wikiWords2, spl==FALSE)
#Then create a new CART model using this new variable as one of the independent variables.
#Now we are ready to build the model, and we will build a simple CART model using the default parameters.
library(rpart)
library(rpart.plot)
#Let's use CART to build a predictive model, using the rpart() function to predict vandal using all of the other variables as our independent variables and the data set wikiTrain3.
wikiCART3 = rpart(vandal~., data=wikiTrain3, method="class")
# Out-of-Sample Performance of the Model (Make predictions on the test set wikiTest3)
testPredictCART3= predict(wikiCART3, newdata=wikiTest3,type="class") #if you add the argument type="class" when making predictions, the output of predict will automatically use a threshold of 0.5.
#We are interested in the accuracy of our model on the test set, i.e. out-of-sample.
#First we compute the confusion matrix
cmat_CART<-table(wikiTest3$vandal,testPredictCART3)
cmat_CART
## testPredictCART3
## 0 1
## 0 514 104
## 1 297 248
#What is the new accuracy of the CART model on the test set, using a threshold of 0.5?
#lets now compute the overall accuracy
accu_CART <- (cmat_CART[1,1] + cmat_CART[2,2])/sum(cmat_CART)
accu_CART #(514+248)/(514+104+297+248) = 0.6552021
## [1] 0.6552021
#Ans:0.6552021
#What is the new accuracy of the CART model on the test set?
#Ans:0.6552021
##########################################
#PROBLEM 3.1 - USING NON-TEXTUAL DATA
#We have two pieces of "metadata" (data about data) that we haven't yet used. Make a copy of wikiWords2, and call it wikiWords3:
wikiWords3 = wikiWords2
#Then add the two original variables Minor and Loggedin to this new data frame:
wikiWords3$Minor = wiki$Minor
wikiWords3$Loggedin = wiki$Loggedin
#In problem 1.5, you computed a vector called "spl" that identified the observations to put in the training and testing sets. Use that variable (do not recompute it with sample.split) to make new training and testing sets with wikiWords3.
wikiTrain4 = subset(wikiWords3, spl==TRUE)
wikiTest4 = subset(wikiWords3, spl==FALSE)
#Build a CART model using all the training data. What is the accuracy of the model on the test set?
#Now we are ready to build the model, and we will build a simple CART model using the default parameters.
library(rpart)
library(rpart.plot)
#Let's use CART to build a predictive model, using the rpart() function to predict vandal using all of the other variables as our independent variables and the data set wikiTrain4.
wikiCART4 = rpart(vandal~., data=wikiTrain4, method="class")
# Out-of-Sample Performance of the Model (Make predictions on the test set wikiTest4)
testPredictCART4= predict(wikiCART4, newdata=wikiTest4,type="class") #if you add the argument type="class" when making predictions, the output of predict will automatically use a threshold of 0.5.
#We are interested in the accuracy of our model on the test set, i.e. out-of-sample.
#First we compute the confusion matrix
cmat_CART<-table(wikiTest4$vandal,testPredictCART4)
cmat_CART
## testPredictCART4
## 0 1
## 0 595 23
## 1 304 241
#What is the new accuracy of the CART model on the test set, using a threshold of 0.5?
#Let's now compute the overall accuracy
accu_CART <- (cmat_CART[1,1] + cmat_CART[2,2])/sum(cmat_CART)
accu_CART #(595+241)/(595+23+304+241) = 0.7188306.
## [1] 0.7188306
#Ans:0.7188306
######################################
#PROBLEM 3.2 - USING NON-TEXTUAL DATA
#There is a substantial difference in the accuracy of the model using the meta data. Is this because we made a more complicated model?
#Plot the CART tree. How many splits are there in the tree?
prp(wikiCART4)
#Ans:3
#EXPLANATION:You can plot the tree with prp(wikiCART4). The first split is on the variable "Loggedin", the second split is on the number of words added, and the third split is on the number of words removed.By adding new independent variables, we were able to significantly improve our accuracy without making the model more complicated!
AUTOMATING REVIEWS IN MEDICINE
The medical literature is enormous. Pubmed, a database of medical publications maintained by the U.S. National Library of Medicine, has indexed over 23 million medical publications. Further, the rate of medical publication has increased over time, and now there are nearly 1 million new publications in the field each year, or more than one per minute.
The large size and fast-changing nature of the medical literature has increased the need for reviews, which search databases like Pubmed for papers on a particular topic and then report results from the papers found. While such reviews are often performed manually, with multiple people reviewing each search result, this is tedious and time consuming. In this problem, we will see how text analytics can be used to automate the process of information retrieval.
The dataset consists of the titles (variable title) and abstracts (variable abstract) of papers retrieved in a Pubmed search. Each search result is labeled with whether the paper is a clinical trial testing a drug therapy for cancer (variable trial). These labels were obtained by two people reviewing each search result and accessing the actual paper if necessary, as part of a literature review of clinical trials testing drug therapies for advanced and metastatic breast cancer.
#PROBLEM 1.1 - LOADING THE DATA
trials<-read.csv("clinical_trial.csv", stringsAsFactors=FALSE)
str(trials)
## 'data.frame': 1860 obs. of 3 variables:
## $ title : chr "Treatment of Hodgkin's disease and other cancers with 1,3-bis(2-chloroethyl)-1-nitrosourea (BCNU; NSC-409962)." "Cell mediated immune status in malignancy--pretherapy and post-therapy assessment." "Neoadjuvant vinorelbine-capecitabine versus docetaxel-doxorubicin-cyclophosphamide in early nonresponsive breast cancer: phase "| __truncated__ "Randomized phase 3 trial of fluorouracil, epirubicin, and cyclophosphamide alone or followed by Paclitaxel for early breast can"| __truncated__ ...
## $ abstract: chr "" "Twenty-eight cases of malignancies of different kinds were studied to assess T-cell activity and population before and after in"| __truncated__ "BACKGROUND: Among breast cancer patients, nonresponse to initial neoadjuvant chemotherapy is associated with unfavorable outcom"| __truncated__ "BACKGROUND: Taxanes are among the most active drugs for the treatment of metastatic breast cancer, and, as a consequence, they "| __truncated__ ...
## $ trial : int 1 0 1 1 1 0 1 0 0 0 ...
summary(trials)
## title abstract trial
## Length:1860 Length:1860 Min. :0.0000
## Class :character Class :character 1st Qu.:0.0000
## Mode :character Mode :character Median :0.0000
## Mean :0.4392
## 3rd Qu.:1.0000
## Max. :1.0000
#We can use R's string functions to learn more about the titles and abstracts of the located papers. The nchar() function counts the number of characters in a piece of text. Using the nchar() function on the variables in the data frame, answer the following questions:
#How many characters are there in the longest abstract? (Longest here is defined as the abstract with the largest number of characters.)
max(nchar(trials$abstract))
## [1] 3708
#or
summary(nchar(trials$abstract))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 1196 1583 1481 1821 3708
#Ans:3708
##############################
#PROBLEM 1.2 - LOADING THE DATA
#How many search results provided no abstract? (HINT: A search result provided no abstract if the number of characters in the abstract field is zero.)
table(nchar(trials$abstract) == 0)
##
## FALSE TRUE
## 1748 112
#or
sum(nchar(trials$abstract)==0)
## [1] 112
#Ans:112
################################
#PROBLEM 1.3 - LOADING THE DATA
#Find the observation with the minimum number of characters in the title (the variable "title") out of all of the observations in this dataset. What is the text of the title of this article? Include capitalization and punctuation in your response, but don't include the quotes.
trials$title[which.min(nchar(trials$title))]
## [1] "A decade of letrozole: FACE."
#Ans:A decade of letrozole: FACE.
##########################################
#PROBLEM 2.1 - PREPARING THE CORPUS
#Because we have both title and abstract information for trials, we need to build two corpora instead of one. Name them corpusTitle and corpusAbstract.
# Load tm package
library(tm)
#1) Convert the title variable to corpusTitle and the abstract variable to corpusAbstract.
#CREATING the Corpora
#We will need to convert the titles and abstracts to corpora for pre-processing. Various functions in the tm package can be used to create a corpus in many different ways.
#We will create each corpus using two functions, Corpus() and VectorSource(). We feed the latter the title and abstract variables of the trials data frame.
corpusTitle= Corpus(VectorSource(trials$title))
corpusAbstract= Corpus(VectorSource(trials$abstract))
#Let's see the first title & abstract in our corpora
corpusTitle[[1]]$content
## [1] "Treatment of Hodgkin's disease and other cancers with 1,3-bis(2-chloroethyl)-1-nitrosourea (BCNU; NSC-409962)."
corpusAbstract[[1]]$content
## [1] ""
#2) Convert corpusTitle and corpusAbstract to lowercase. After performing this step, remember to run the lines:
#corpusTitle = tm_map(corpusTitle, PlainTextDocument)
#corpusAbstract = tm_map(corpusAbstract, PlainTextDocument)
#To transform all text to lower case:
corpusTitle= tm_map(corpusTitle, content_transformer(tolower))
corpusAbstract = tm_map(corpusAbstract, content_transformer(tolower))
#Checking the same "documents" as before:
corpusTitle[[1]]$content
## [1] "treatment of hodgkin's disease and other cancers with 1,3-bis(2-chloroethyl)-1-nitrosourea (bcnu; nsc-409962)."
corpusAbstract[[1]]$content
## [1] ""
#3) Remove the punctuation in corpusTitle and corpusAbstract.
#Removing punctuation
corpusTitle= tm_map(corpusTitle, removePunctuation)
corpusAbstract= tm_map(corpusAbstract, removePunctuation)
#Checking the same "documents" as before:
corpusTitle[[1]]$content
## [1] "treatment of hodgkins disease and other cancers with 13bis2chloroethyl1nitrosourea bcnu nsc409962"
corpusAbstract[[1]]$content
## [1] ""
#4) Remove the English language stop words from corpusTitle and corpusAbstract.
#Removing stop words
#Removing words can be done by passing the removeWords function to tm_map(), along with an extra argument specifying which words to remove; here we simply use the English stop word list provided by the tm package.
#We will remove all of these English stop words as they probably won't be very useful in our prediction problem.
corpusTitle= tm_map(corpusTitle, removeWords, stopwords("english"))
corpusAbstract= tm_map(corpusAbstract, removeWords, stopwords("english"))
#Checking the same "documents" as before:
corpusTitle[[1]]$content
## [1] "treatment hodgkins disease cancers 13bis2chloroethyl1nitrosourea bcnu nsc409962"
corpusAbstract[[1]]$content
## [1] ""
#5) Stem the words in corpusTitle and corpusAbstract (each stemming might take a few minutes).
#Stemming
#Lastly, we want to stem our documents by passing stemDocument to tm_map().
corpusTitle= tm_map(corpusTitle, stemDocument)
corpusAbstract= tm_map(corpusAbstract, stemDocument)
#Checking the same "documents" as before:
corpusTitle[[1]]$content
## [1] "treatment hodgkin diseas cancer 13bis2chloroethyl1nitrosourea bcnu nsc409962"
corpusAbstract[[1]]$content
## [1] ""
#6) Build a document term matrix called dtmTitle from corpusTitle and dtmAbstract from corpusAbstract.
#Create a Document Term Matrix
#If you are using an older version of the tm package, you may also need to convert the corpora to PlainTextDocument before building the document term matrices:
#corpusTitle = tm_map(corpusTitle, PlainTextDocument)
#corpusAbstract = tm_map(corpusAbstract, PlainTextDocument)
#The tm package provides a function called DocumentTermMatrix() that generates a matrix
#The values in the matrix are the number of times that word appears in each document.
dtmTitle= DocumentTermMatrix(corpusTitle)
dtmTitle
## <<DocumentTermMatrix (documents: 1860, terms: 2834)>>
## Non-/sparse entries: 23416/5247824
## Sparsity : 100%
## Maximal term length: 49
## Weighting : term frequency (tf)
dtmAbstract= DocumentTermMatrix(corpusAbstract)
dtmAbstract
## <<DocumentTermMatrix (documents: 1860, terms: 12343)>>
## Non-/sparse entries: 153241/22804739
## Sparsity : 99%
## Maximal term length: 67
## Weighting : term frequency (tf)
#7) Limit dtmTitle and dtmAbstract to terms with sparseness of at most 95% (aka terms that appear in at least 5% of documents).
# Remove sparse terms which don't appear very often.
dtmTitle= removeSparseTerms(dtmTitle, 0.95)
dtmTitle
## <<DocumentTermMatrix (documents: 1860, terms: 31)>>
## Non-/sparse entries: 10683/46977
## Sparsity : 81%
## Maximal term length: 15
## Weighting : term frequency (tf)
dtmAbstract= removeSparseTerms(dtmAbstract, 0.95)
dtmAbstract
## <<DocumentTermMatrix (documents: 1860, terms: 335)>>
## Non-/sparse entries: 92007/531093
## Sparsity : 85%
## Maximal term length: 15
## Weighting : term frequency (tf)
#8) Convert dtmTitle and dtmAbstract to data frames (keep the names dtmTitle and dtmAbstract).
#Let's convert the sparse matrix into a data frame that we will be able to use for our predictive models.
dtmTitle= as.data.frame(as.matrix(dtmTitle))
dtmAbstract = as.data.frame(as.matrix(dtmAbstract))
#How many terms remain in dtmTitle after removing sparse terms (aka how many columns does it have)?
ncol(dtmTitle)
## [1] 31
#or
dim(dtmTitle)
## [1] 1860 31
#Ans:31
#How many terms remain in dtmAbstract?
ncol(dtmAbstract)
## [1] 335
#or
dim(dtmAbstract)
## [1] 1860 335
#Ans:335
############################################
#PROBLEM 2.2 - PREPARING THE CORPUS
#What is the most likely reason why dtmAbstract has so many more terms than dtmTitle?
#Ans:Abstracts tend to have many more words than titles
#EXPLANATION:Because titles are so short, a word needs to be very common to appear in 5% of titles. Because abstracts have many more words, a word can be much less common and still appear in 5% of abstracts.While abstracts may have wider vocabulary, this is a secondary effect. As we saw in the previous subsection, all papers have titles, but not all have abstracts.
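#A rough optional check of this explanation using the data frames we just built (these counts only cover the frequent terms we kept after removeSparseTerms, but the contrast is still visible):
mean(rowSums(dtmTitle))     #average count of retained title terms per paper
mean(rowSums(dtmAbstract))  #average count of retained abstract terms per paper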
###########################################
#PROBLEM 2.3 - PREPARING THE CORPUS
#What is the most frequent word stem across all the abstracts? Hint: you can use colSums() to compute the frequency of a word across all the abstracts
which.max(colSums(dtmAbstract))
## patient
## 212
#Ans:patient
###############################
#PROBLEM 3.1 - BUILDING A MODEL
#We want to combine dtmTitle and dtmAbstract into a single data frame to make predictions. However, some of the variables in these data frames have the same names. To fix this issue, run the following commands:
colnames(dtmTitle) = paste0("T", colnames(dtmTitle))
colnames(dtmAbstract) = paste0("A", colnames(dtmAbstract))
#What was the effect of these functions?
#Ans:Adding the letter T in front of all the title variable names and adding the letter A in front of all the abstract variable names.
#EXPLANATION:The first line pastes a T at the beginning of each column name for dtmTitle, which are the variable names. The second line does something similar for the Abstract variables - it pastes an A at the beginning of each column name for dtmAbstract, which are the variable names.
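#An optional quick check of the renaming: peek at the new column names, which should now start with "T" and "A" respectively.
head(colnames(dtmTitle))
head(colnames(dtmAbstract))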
#############################
#PROBLEM 3.2 - BUILDING A MODEL
#Using cbind(), combine dtmTitle and dtmAbstract into a single data frame called dtm:
dtm = cbind(dtmTitle, dtmAbstract)
#As we did in class, add the dependent variable "trial" to dtm, copying it from the original data frame called trials.
#We also have to add back in the outcome variable
dtm$trial<-trials$trial
#How many columns are in this combined data frame?
ncol(dtm)
## [1] 367
#Ans:367
###########################
#PROBLEM 3.3 - BUILDING A MODEL
#Now that we have prepared our data frame, it's time to split it into a training and testing set and to build regression models. Set the random seed to 144 and use the sample.split function from the caTools package to split dtm into data frames named "train" and "test", putting 70% of the data in the training set.
#Split data in training/testing sets
#let's split our data into a training set and a testing set, putting 70% of the data in the training set.
library(caTools)
set.seed(144)
#split the data set using sample.split from the "caTools" package to put 70% in the training set.
spl = sample.split(dtm$trial, 0.7)
train= subset(dtm, spl == TRUE)
test=subset(dtm, spl == FALSE)
#str(train)
#What is the accuracy of the baseline model on the training set? (Remember that the baseline model predicts the most frequent outcome in the training set for all observations.)
#The baseline model predicts the most frequent outcome in the training set for all observations.
cmat_baseline <-table(train$trial)
cmat_baseline
##
## 0 1
## 730 572
#Baseline accuracy
accu_baseline <- max(cmat_baseline)/sum(cmat_baseline)
accu_baseline #730/(730+572)=0.5606759
## [1] 0.5606759
#Ans:0.5606759
#EXPLANATION:Just as in any binary classification problem, the naive baseline always predicts the most common class. From table(train$trial), we see 730 training set results were not trials, and 572 were trials. Therefore, the naive baseline always predicts a result is not a trial, yielding accuracy of 730/(730+572).
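#Equivalently (optional), prop.table() gives the class proportions directly; the larger proportion is the baseline accuracy.
prop.table(table(train$trial))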
#############################
#PROBLEM 3.4 - BUILDING A MODEL
#Build a CART model called trialCART, using all the independent variables in the training set to train the model, and then plot the CART model. Just use the default parameters to build the model (don't add a minbucket or cp value). Remember to add the method="class" argument, since this is a classification problem.
#Now we are ready to build the model, and we will build a simple CART model using the default parameters & all the independent variables in the training set.
library(rpart)
library(rpart.plot)
#Let's use CART to build a predictive model, using the rpart() function to predict trial using all of the other variables as our independent variables and the data set train.
trialCART= rpart(trial~., data=train, method="class") #the method="class" argument as this is a classification problem
#Plotting the CART model
prp(trialCART)
#What is the name of the first variable the model split on?
#Ans:Tphase
#The first split checks whether or not Tphase is less than 0.5
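#An optional programmatic check: for an rpart object, the split variable at the root node is stored in the first row of the model's frame, so this should agree with the answer read off the plot.
as.character(trialCART$frame$var[1])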
################################
#PROBLEM 3.5 - BUILDING A MODEL
#Obtain the training set predictions for the model (do not yet predict on the test set). Extract the predicted probability of a result being a trial (recall that this involves not setting a type argument, and keeping only the second column of the predict output). What is the maximum predicted probability for any result?
#Make predictions on the training set
predTrain= predict(trialCART)
max(predTrain[,2])
## [1] 0.8718861
#or
predTrain= predict(trialCART)[,2]
summary(predTrain)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.05455 0.13640 0.28750 0.43930 0.78230 0.87190
#Ans:0.8718861
###########################################
#PROBLEM 3.6 - BUILDING A MODEL
#Without running the analysis, how do you expect the maximum predicted probability to differ in the testing set?
#Ans:The maximum predicted probability will likely be exactly the same in the testing set.
#EXPLANATION:Because the CART tree assigns the same predicted probability to each leaf node and there are a small number of leaf nodes compared to data points, we expect exactly the same maximum predicted probability.
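#Optional illustration: since each leaf contributes a single fitted probability, the set of distinct predicted probabilities is small, and comparing the distinct values for the training and testing sets shows why the maximum is the same in both.
sort(unique(predTrain))
sort(unique(predict(trialCART, newdata=test)[,2]))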
####################################
#PROBLEM 3.7 - BUILDING A MODEL
#For these questions, use a threshold probability of 0.5 to predict that an observation is a clinical trial.
#What is the training set accuracy of the CART model?
#We are interested in the accuracy of our model on the training set.
#First we compute the confusion matrix:
cmat_CART1<-table(train$trial,predTrain >= 0.5)
cmat_CART1
##
## FALSE TRUE
## 0 631 99
## 1 131 441
#Let's now compute the overall accuracy
accu_CART <- (cmat_CART1[1,1] + cmat_CART1[2,2])/sum(cmat_CART1)
accu_CART #(631+441)/(631+441+99+131) = 0.8233487
## [1] 0.8233487
#Ans:0.8233487
#What is the training set sensitivity of the CART model?
441/(131+441)
## [1] 0.770979
#Ans: 0.770979
#Sensitivity = TP rate=441/(131+441)=0.770979
#What is the training set specificity of the CART model?
631/(631+99)
## [1] 0.8643836
#Ans:0.8643836
#Specificity = (1 - FP rate)=631/(631+99)=0.8643836
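#The same quantities can be read off the confusion matrix programmatically (rows = true outcome, columns = prediction); sens_CART and spec_CART are just new names introduced for this check.
sens_CART <- cmat_CART1[2,2] / sum(cmat_CART1[2,]) #sensitivity = TP/(TP+FN)
spec_CART <- cmat_CART1[1,1] / sum(cmat_CART1[1,]) #specificity = TN/(TN+FP)
sens_CART
spec_CART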
#######################################################
#PROBLEM 4.1 - EVALUATING THE MODEL ON THE TESTING SET
#Evaluate the CART model on the testing set using the predict function, creating a vector of predicted probabilities called predTest.
# Make predictions
predTest = predict(trialCART, newdata=test)[,2]
summary(predTest)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.05455 0.13640 0.28750 0.41730 0.78230 0.87190
#Now let's assess the accuracy of the model through the confusion matrix
cmat_CART<-table(test$trial, predTest>= 0.5) #first arg is the true outcomes and the second is the predicted outcomes
cmat_CART
##
## FALSE TRUE
## 0 261 52
## 1 83 162
#What is the testing set accuracy, assuming a probability threshold of 0.5 for predicting that a result is a clinical trial?
#Let's now compute the overall accuracy
accu_CART <- (cmat_CART[1,1] + cmat_CART[2,2])/sum(cmat_CART)
accu_CART # (261+162)/(261+162+83+52) = 0.7580645
## [1] 0.7580645
#Ans:0.7580645
######################################
PART 5: DECISION-MAKER TRADEOFFS
The decision maker for this problem, a researcher performing a review of the medical literature, would use a model (like the CART one we built here) in the following workflow:
For all of the papers retrieved in the PubMed search, predict which papers are clinical trials using the model. This yields some initial Set A of papers predicted to be trials, and some Set B of papers predicted not to be trials. (See the figure below.)
Then, the decision maker manually reviews all papers in Set A, verifying that each paper meets the study’s detailed inclusion criteria (for the purposes of this analysis, we assume this manual review is 100% accurate at identifying whether a paper in Set A is relevant to the study). This yields a more limited set of papers to be included in the study, which would ideally be all papers in the medical literature meeting the detailed inclusion criteria for the study.
Perform the study-specific analysis, using data extracted from the limited set of papers identified in step 2.
This process is shown in the figure below.
InfoRetrievalFigure2
PROBLEM 5.1 - DECISION-MAKER TRADEOFFS
What is the cost associated with the model in Step 1 making a false negative prediction?
Ans:A paper that should have been included in Set A will be missed, affecting the quality of the results of Step 3.
EXPLANATION:By definition, a false negative is a paper that should have been included in Set A but was missed by the model. This means a study that should have been included in Step 3 was missed, affecting the results.
PROBLEM 5.2 - DECISION-MAKER TRADEOFFS
What is the cost associated with the model in Step 1 making a false positive prediction?
Ans:A paper will be mistakenly added to Set A, yielding additional work in Step 2 of the process but not affecting the quality of the results of Step 3.
EXPLANATION:By definition, a false positive is a paper that should not have been included in Set A but that was actually included. However, because the manual review in Step 2 is assumed to be 100% effective, this extra paper will not make it into the more limited set of papers, and therefore this mistake will not affect the analysis in Step 3.
PROBLEM 5.3 - DECISION-MAKER TRADEOFFS
Given the costs associated with false positives and false negatives, which of the following is most accurate?
Ans:A false negative is more costly than a false positive; the decision maker should use a probability threshold less than 0.5 for the machine learning model.
EXPLANATION:A false negative might negatively affect the results of the literature review and analysis, while a false positive is a nuisance (one additional paper that needs to be manually checked). As a result, the cost of a false negative is much higher than the cost of a false positive, so much so that many studies actually use no machine learning (aka no Step 1) and have two people manually review each search result in Step 2. As always, we prefer a lower threshold in cases where false negatives are more costly than false positives, since we will make fewer negative predictions.
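As an optional illustration of this tradeoff (not part of the graded questions), we can re-score the test set at a lower threshold; the value 0.3 below is an arbitrary choice for demonstration, and cmat_CART_03 is just a new name for this check.
cmat_CART_03 <- table(test$trial, predTest >= 0.3) #confusion matrix at the illustrative 0.3 threshold
cmat_CART_03
cmat_CART_03[2,2]/sum(cmat_CART_03[2,]) #sensitivity at 0.3, which can only be as high as or higher than at 0.5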
SEPARATING SPAM FROM HAM
Nearly every email user has at some point encountered a “spam” email, which is an unsolicited message often advertising a product, containing links to malware, or attempting to scam the recipient. Roughly 80-90% of the more than 100 billion emails sent each day are spam emails, most being sent from botnets of malware-infected computers. The remainder of emails are called “ham” emails.
As a result of the huge number of spam emails being sent across the Internet each day, most email providers offer a spam filter that automatically flags likely spam messages and separates them from the ham. Though these filters use a number of techniques (e.g. looking up the sender in a so-called “Blackhole List” that contains IP addresses of likely spammers), most rely heavily on the analysis of the contents of an email via text analytics.
In this homework problem, we will build and evaluate a spam filter using a publicly available dataset first described in the 2006 conference paper “Spam Filtering with Naive Bayes – Which Naive Bayes?” by V. Metsis, I. Androutsopoulos, and G. Paliouras. The “ham” messages in this dataset come from the inbox of former Enron Managing Director for Research Vincent Kaminski, one of the inboxes in the Enron Corpus. One source of spam messages in this dataset is the SpamAssassin corpus, which contains hand-labeled spam messages contributed by Internet users. The remaining spam was collected by Project Honey Pot, a project that collects spam messages and identifies spammers by publishing email addresses that humans would know not to contact but that bots might target with spam. The full dataset we will use was constructed as roughly a 75/25 mix of the ham and spam messages.
The dataset contains just two fields:
text: the text of the email.
spam: a binary variable indicating whether the email was spam.
#PROBLEM 1.1 - LOADING THE DATASET
emails <- read.csv("emails.csv", stringsAsFactors=FALSE)
str(emails)
## 'data.frame': 5728 obs. of 2 variables:
## $ text: chr "Subject: naturally irresistible your corporate identity lt is really hard to recollect a company : the market is full of suqg"| __truncated__ "Subject: the stock trading gunslinger fanny is merrill but muzo not colza attainder and penultimate like esmark perspicuous ra"| __truncated__ "Subject: unbelievable new homes made easy im wanting to show you this homeowner you have been pre - approved for a $ 454 , 1"| __truncated__ "Subject: 4 color printing special request additional information now ! click here click here for a printable version of our o"| __truncated__ ...
## $ spam: int 1 1 1 1 1 1 1 1 1 1 ...
#How many emails are in the dataset?
nrow(emails)
## [1] 5728
#Ans:5728
######################################
#PROBLEM 1.2 - LOADING THE DATASET
#How many of the emails are spam?
table(emails$spam)
##
## 0 1
## 4360 1368
#or
sum(emails$spam)
## [1] 1368
#Ans:1368
#################################
#PROBLEM 1.3 - LOADING THE DATASET
#Which word appears at the beginning of every email in the dataset? Respond as a lower-case word with punctuation removed.
emails$text[1]
## [1] "Subject: naturally irresistible your corporate identity lt is really hard to recollect a company : the market is full of suqgestions and the information isoverwhelminq ; but a good catchy logo , stylish statlonery and outstanding website will make the task much easier . we do not promise that havinq ordered a iogo your company will automaticaily become a world ieader : it isguite ciear that without good products , effective business organization and practicable aim it will be hotat nowadays market ; but we do promise that your marketing efforts will become much more effective . here is the list of clear benefits : creativeness : hand - made , original logos , specially done to reflect your distinctive company image . convenience : logo and stationery are provided in all formats ; easy - to - use content management system letsyou change your website content and even its structure . promptness : you will see logo drafts within three business days . affordability : your marketing break - through shouldn ' t make gaps in your budget . 100 % satisfaction guaranteed : we provide unlimited amount of changes with no extra fees for you to be surethat you will love the result of this collaboration . have a look at our portfolio _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ not interested . . . _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _"
emails$text[1000]
## [1] "Subject: 70 percent off your life insurance get a free quote instantly . question : are you paying too much for life insurance ? most likely the answer is yes ! here ' s why . fact . . . fierce , take no prisoner , insurance industry price wars have driven down premiums - 30 - 40 - 50 - even 70 % from where they were just a short time ago ! that ' s why your insurance company doesn ' t want you to read this . . . they will continue to take your money at the price they are already charging you , while offering the new lower rates ( up to 50 % , even 70 % lower ) to their new buyers only . but , don ' t take our word for it . . . click hereand request a free online quote . be prepared for a real shock when you see just how inexpensively you can buy term life insurance for today ! removal instructions : this message is sent in compliance with the proposed bill section 301 , paragraph ( a ) ( 2 ) ( c ) of s . 1618 . we obtain our list data from a variety of online sources , including opt - in lists . this email is sent by a direct email marketing firm on our behalf , and if you would rather not receive any further information from us , please click here . in this way , you can instantly opt - out from the list your email address was obtained from , whether this was an opt - in or otherwise . please accept our apologies if this message has reached you in error . please allow 5 - 10 business days for your email address to be removed from all lists in our control . meanwhile , simply delete any duplicate emails that you may receive and rest assured that your request to be taken off this list will be honored . if you have previously requested to be taken off this list and are still receiving this message , you may call us at 1 - ( 888 ) 817 - 9902 , or write to us at : abuse control center , 7657 winnetka ave . , canoga park , ca 91306 "
#Ans:subject
#EXPLANATION:You can review emails with, for instance, emails$text[1] or emails$text[1000]. Every email begins with the word "Subject:".
#######################################
#PROBLEM 1.4 - LOADING THE DATASET
#Could a spam classifier potentially benefit from including the frequency of the word that appears in every email?
#Ans:Yes -- the number of times the word appears might help us differentiate spam from ham.
#EXPLANATION:We know that each email has the word "subject" appear at least once, but the frequency with which it appears might help us differentiate spam from ham. For instance, a long email chain would have the word "subject" appear a number of times, and this higher frequency might be indicative of a ham message.
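#A small optional illustration of this idea (computed on the raw, unprocessed text; subjectCount is just a new name for this sketch): count how often "subject" appears in each email and compare the average count for ham (0) and spam (1).
subjectCount = sapply(gregexpr("subject", tolower(emails$text), fixed=TRUE), function(m) sum(m > 0))
tapply(subjectCount, emails$spam, mean) #average number of occurrences per email, by label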
#############################################################
#PROBLEM 1.5 - LOADING THE DATASET
#The nchar() function counts the number of characters in a piece of text. How many characters are in the longest email in the dataset (where longest is measured in terms of the maximum number of characters)?
max(nchar(emails$text)) #or
## [1] 43952
summary(nchar(emails$text))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.0 508.8 979.0 1557.0 1894.0 43950.0
#Ans:43952
###############################################
#PROBLEM 1.6 - LOADING THE DATASET
#Which row contains the shortest email in the dataset? (Just like in the previous problem, shortest is measured in terms of the fewest number of characters.)
which.min(nchar(emails$text))
## [1] 1992
#or
min(nchar(emails$text)) #determining the min length of the email
## [1] 13
which(nchar(emails$text) == 13) #extracting the row having the shortest email
## [1] 1992
#Ans:1992
########################################
#PROBLEM 2.1 - PREPARING THE CORPUS
#Follow the standard steps to build and pre-process the corpus:
#1) Build a new corpus variable called corpus.
library(tm)
library(SnowballC)
# Create corpus
corpus = Corpus(VectorSource(emails$text))
#To inspect the first doc in our corpus, we select the first element as:
corpus[[1]]$content
## [1] "Subject: naturally irresistible your corporate identity lt is really hard to recollect a company : the market is full of suqgestions and the information isoverwhelminq ; but a good catchy logo , stylish statlonery and outstanding website will make the task much easier . we do not promise that havinq ordered a iogo your company will automaticaily become a world ieader : it isguite ciear that without good products , effective business organization and practicable aim it will be hotat nowadays market ; but we do promise that your marketing efforts will become much more effective . here is the list of clear benefits : creativeness : hand - made , original logos , specially done to reflect your distinctive company image . convenience : logo and stationery are provided in all formats ; easy - to - use content management system letsyou change your website content and even its structure . promptness : you will see logo drafts within three business days . affordability : your marketing break - through shouldn ' t make gaps in your budget . 100 % satisfaction guaranteed : we provide unlimited amount of changes with no extra fees for you to be surethat you will love the result of this collaboration . have a look at our portfolio _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ not interested . . . _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _"
#2) Using tm_map, convert the text to lowercase.
#IMPORTANT NOTE: If you are using the latest version of the tm package, you will need to run the following line before continuing (it converts corpus to a Plain Text Document). This is a recent change having to do with the tolower function that occurred after this video was recorded.
corpus = tm_map(corpus, PlainTextDocument)
#Converting text to lower case
#We use the tm_map() function which takes as its first argument the name of a corpus and as second argument a function performing the transformation that we want to apply to the text.
corpus = tm_map(corpus, content_transformer(tolower))
#Checking the same "document" as before:
corpus[[1]]$content
## [1] "subject: naturally irresistible your corporate identity lt is really hard to recollect a company : the market is full of suqgestions and the information isoverwhelminq ; but a good catchy logo , stylish statlonery and outstanding website will make the task much easier . we do not promise that havinq ordered a iogo your company will automaticaily become a world ieader : it isguite ciear that without good products , effective business organization and practicable aim it will be hotat nowadays market ; but we do promise that your marketing efforts will become much more effective . here is the list of clear benefits : creativeness : hand - made , original logos , specially done to reflect your distinctive company image . convenience : logo and stationery are provided in all formats ; easy - to - use content management system letsyou change your website content and even its structure . promptness : you will see logo drafts within three business days . affordability : your marketing break - through shouldn ' t make gaps in your budget . 100 % satisfaction guaranteed : we provide unlimited amount of changes with no extra fees for you to be surethat you will love the result of this collaboration . have a look at our portfolio _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ not interested . . . _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _"
#3) Using tm_map, remove all punctuation from the corpus.
corpus = tm_map(corpus, removePunctuation)
corpus[[1]]$content #Checking the same "document" as before
## [1] "subject naturally irresistible your corporate identity lt is really hard to recollect a company the market is full of suqgestions and the information isoverwhelminq but a good catchy logo stylish statlonery and outstanding website will make the task much easier we do not promise that havinq ordered a iogo your company will automaticaily become a world ieader it isguite ciear that without good products effective business organization and practicable aim it will be hotat nowadays market but we do promise that your marketing efforts will become much more effective here is the list of clear benefits creativeness hand made original logos specially done to reflect your distinctive company image convenience logo and stationery are provided in all formats easy to use content management system letsyou change your website content and even its structure promptness you will see logo drafts within three business days affordability your marketing break through shouldn t make gaps in your budget 100 satisfaction guaranteed we provide unlimited amount of changes with no extra fees for you to be surethat you will love the result of this collaboration have a look at our portfolio not interested "
#4) Using tm_map, remove all English stopwords from the corpus.
#We will remove all of these English stop words as they probably won't be very useful in our prediction problem.
corpus = tm_map(corpus, removeWords, stopwords("english"))
corpus[[1]]$content #Checking the same "document" as before to see if stop words are removed
## [1] "subject naturally irresistible corporate identity lt really hard recollect company market full suqgestions information isoverwhelminq good catchy logo stylish statlonery outstanding website will make task much easier promise havinq ordered iogo company will automaticaily become world ieader isguite ciear without good products effective business organization practicable aim will hotat nowadays market promise marketing efforts will become much effective list clear benefits creativeness hand made original logos specially done reflect distinctive company image convenience logo stationery provided formats easy use content management system letsyou change website content even structure promptness will see logo drafts within three business days affordability marketing break shouldn t make gaps budget 100 satisfaction guaranteed provide unlimited amount changes extra fees surethat will love result collaboration look portfolio interested "
#5) Using tm_map, stem the words in the corpus.
#Lastly, we want to stem our documents by passing stemDocument to tm_map().
corpus = tm_map(corpus, stemDocument)
# Now that we have gone through the above preprocessing steps, we can take a second look at the first email in the corpus.
corpus[[1]]$content
## [1] "subject natur irresist corpor ident lt realli hard recollect compani market full suqgest inform isoverwhelminq good catchi logo stylish statloneri outstand websit will make task much easier promis havinq order iogo compani will automaticaili becom world ieader isguit ciear without good product effect busi organ practic aim will hotat nowaday market promis market effort will becom much effect list clear benefit creativ hand made origin logo special done reflect distinct compani imag conveni logo stationeri provid format easi use content manag system letsyou chang websit content even structur prompt will see logo draft within three busi day afford market break shouldn t make gap budget 100 satisfact guarante provid unlimit amount chang extra fee surethat will love result collabor look portfolio interest "
#6) Build a document term matrix from the corpus, called dtm.
#Create a Document Term Matrix
corpus = tm_map(corpus, PlainTextDocument)
#The values in the matrix are the number of times that word appears in each document.
dtm = DocumentTermMatrix(corpus)
dtm
## <<DocumentTermMatrix (documents: 5728, terms: 28687)>>
## Non-/sparse entries: 481719/163837417
## Sparsity : 100%
## Maximal term length: 24
## Weighting : term frequency (tf)
#How many terms are in dtm?
ncol(dtm)
## [1] 28687
#Ans:28687
##############################################
#PROBLEM 2.2 - PREPARING THE CORPUS
#To obtain a more reasonable number of terms, limit dtm to contain terms appearing in at least 5% of documents, and store this result as spdtm (don't overwrite dtm, because we will use it in a later step of this homework). How many terms are in spdtm?
# Remove sparse terms that don't appear very often.
spdtm= removeSparseTerms(dtm, 1-0.05)
spdtm
## <<DocumentTermMatrix (documents: 5728, terms: 330)>>
## Non-/sparse entries: 213551/1676689
## Sparsity : 89%
## Maximal term length: 10
## Weighting : term frequency (tf)
ncol(spdtm)
## [1] 330
#Ans:330
######################################
#PROBLEM 2.3 - PREPARING THE CORPUS
#Build a data frame called emailsSparse from spdtm, and use the make.names function to make the variable names of emailsSparse valid.
#Create data frame emailsSparse from spdtm
emailsSparse <- as.data.frame(as.matrix(spdtm)) #convert the sparse matrix into a data frame that we will be able to use for our predictive models.
#To make all variable names R-friendly use:
colnames(emailsSparse)<- make.names(colnames(emailsSparse))
#colSums() is an R function that returns the sum of values for each variable in our data frame. Our data frame contains the number of times each word stem (columns) appeared in each email (rows). Therefore, colSums(emailsSparse) returns the number of times a word stem appeared across all the emails in the dataset. Hint: think about how you can use sort() or which.max() to pick out the maximum frequency.
head(colSums(emailsSparse),20)
## X000 X2000 X2001 X713 X853 abl access account addit
## 1007 4967 3089 1097 462 590 789 829 774
## address allow alreadi also analysi anoth applic appreci approv
## 1154 450 446 1864 495 435 567 541 648
## april area
## 682 489
#What is the word stem that shows up most frequently across all the emails in the dataset?
which.max(colSums(emailsSparse))
## enron
## 92
#or sort(colSums(emailsSparse))
#Ans:enron
#############################################
#PROBLEM 2.4 - PREPARING THE CORPUS
#Add a variable called "spam" to emailsSparse containing the email spam labels. You can do this by copying over the "spam" variable from the original data frame (remember how we did this in the Twitter lecture).
emailsSparse$spam <- emails$spam
#How many word stems appear at least 5000 times in the ham emails in the dataset? Hint: in this and the next question, remember not to count the dependent variable we just added.
sum(colSums(subset(emailsSparse, emailsSparse$spam==0)) >= 5000)
## [1] 6
#or
head(sort(colSums(subset(emailsSparse, spam == 0)), decreasing = T), 10) #We can read the most frequent terms in the ham dataset
## enron ect subject vinc will hou X2000 kaminski
## 13388 11417 8625 8531 6802 5569 4935 4801
## pleas com
## 4494 4444
#Ans:6
#EXPLANATION:"enron", "ect", "subject", "vinc", "will", and "hou" appear at least 5000 times in the ham dataset.
##################################################
#PROBLEM 2.5 - PREPARING THE CORPUS
#How many word stems appear at least 1000 times in the spam emails in the dataset?
sum(colSums(subset(emailsSparse, emailsSparse$spam==1)) >= 1000) - 1
## [1] 3
#or
head(sort(colSums(subset(emailsSparse, spam == 1)),decreasing = T),10)
## subject will spam compani com mail busi email can
## 1577 1450 1368 1065 999 917 897 865 831
## inform
## 818
#Ans:3
#EXPLANATION:"subject", "will", and "compani" are the three stems that appear at least 1000 times. Note that the variable "spam" is the dependent variable and is not the frequency of a word stem.
#######################################
#PROBLEM 2.6 - PREPARING THE CORPUS
#The lists of most common words are significantly different between the spam and ham emails. What does this likely imply?
#Ans:The frequencies of these most common words are likely to help differentiate between spam and ham.
#EXPLANATION:A word stem like "enron", which is extremely common in the ham emails but does not occur in any spam message, will help us correctly identify a large number of ham messages.
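#An optional cross-tabulation illustrating this point: presence of the stem "enron" against the spam label (spam is still stored as 0/1 at this stage).
table(emailsSparse$enron > 0, emailsSparse$spam)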
##############################
#PROBLEM 2.7 - PREPARING THE CORPUS
#Several of the most common word stems from the ham documents, such as "enron", "hou" (short for Houston), "vinc" (the word stem of "Vince") and "kaminski", are likely specific to Vincent Kaminski's inbox. What does this mean about the applicability of the text analytics models we will train for the spam filtering problem?
#Ans:The models we build are personalized, and would need to be further tested before being used as a spam filter for another person.
#EXPLANATION:The ham dataset is certainly personalized to Vincent Kaminski, and therefore it might not generalize well to a general email user. Caution is definitely necessary before applying the filters derived in this problem to other email users.
####################################
#PROBLEM 3.1 - BUILDING MACHINE LEARNING MODELS
#First, convert the dependent variable to a factor with "emailsSparse$spam = as.factor(emailsSparse$spam)".
emailsSparse$spam = as.factor(emailsSparse$spam)
#Next, set the random seed to 123 and use the sample.split function to split emailsSparse 70/30 into a training set called "train" and a testing set called "test". Make sure to perform this step on emailsSparse instead of emails.
#Split data in training/testing sets
library(caTools)
set.seed(123)
#let's split our data into a training set and a testing set, putting 70% of the data in the training set.
spl <- sample.split(emailsSparse$spam, SplitRatio=0.7)
train <- subset(emailsSparse, spl==TRUE)
test <- subset(emailsSparse, spl==FALSE)
#Using the training set, train the following three machine learning models. The models should predict the dependent variable "spam", using all other available variables as independent variables. Please be patient, as these models may take a few minutes to train.
#1) A logistic regression model called spamLog. You may see a warning message here - we'll discuss this more later.
spamLog <- glm(spam ~ ., train, family=binomial)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#2) A CART model called spamCART, using the default parameters to train the model (don't worry about adding minbucket or cp). Remember to add the argument method="class" since this is a binary classification problem.
#Build a CART model
library(rpart)
library(rpart.plot)
#Let's use CART to build a predictive model, using the rpart() function to predict spam using all of the other variables as our independent variables and the data set train.
spamCART <- rpart(spam ~ ., train, method="class")
#3) A random forest model called spamRF, using the default parameters to train the model (don't worry about specifying ntree or nodesize). Directly before training the random forest model, set the random seed to 123 (even though we've already done this earlier in the problem, it's important to set the seed right before training the model so we all obtain the same results. Keep in mind though that on certain operating systems, your results might still be slightly different).
library(randomForest)
set.seed(123)
spamRF <- randomForest(spam ~ ., train)
#For each model, obtain the predicted spam probabilities for the training set. Be careful to obtain probabilities instead of predicted classes, because we will be using these values to compute training set AUC values. Recall that you can obtain probabilities for CART models by not passing any type parameter to the predict() function, and you can obtain probabilities from a random forest by adding the argument type="prob". For CART and random forest, you need to select the second column of the output of the predict() function, corresponding to the probability of a message being spam.
#Get predicted spam probabilities for the training set for each model:
predTrainLog <- predict(spamLog, type="response")
predTrainCART<- predict(spamCART)[,2]
predTrainRF<- predict(spamRF, type="prob")[,2]
#You may have noticed that training the logistic regression model yielded the messages "algorithm did not converge" and "fitted probabilities numerically 0 or 1 occurred". Both of these messages often indicate overfitting and the first indicates particularly severe overfitting, often to the point that the training set observations are fit perfectly by the model. Let's investigate the predicted probabilities from the logistic regression model.
#How many of the training set predicted probabilities from spamLog are less than 0.00001?
a<-sum(predTrainLog< 0.00001)
a
## [1] 3046
#or
table(predTrainLog < 0.00001)
##
## FALSE TRUE
## 964 3046
#Ans:3046
#How many of the training set predicted probabilities from spamLog are more than 0.99999?
b<-sum(predTrainLog> 0.99999)
b
## [1] 954
#or
table(predTrainLog > 0.99999)
##
## FALSE TRUE
## 3056 954
#Ans:954
#How many of the training set predicted probabilities from spamLog are between 0.00001 and 0.99999?
nrow(train) - a - b
## [1] 10
#or
table(predTrainLog >= 0.00001 & predTrainLog <= 0.99999)
##
## FALSE TRUE
## 4000 10
#Ans:10
#################################################
#PROBLEM 3.2 - BUILDING MACHINE LEARNING MODELS
#How many variables are labeled as significant (at the p=0.05 level) in the logistic regression summary output?
summary(spamLog)
##
## Call:
## glm(formula = spam ~ ., family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.011 0.000 0.000 0.000 1.354
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.082e+01 1.055e+04 -0.003 0.998
## X000 1.474e+01 1.058e+04 0.001 0.999
## X2000 -3.631e+01 1.556e+04 -0.002 0.998
## X2001 -3.215e+01 1.318e+04 -0.002 0.998
## X713 -2.427e+01 2.914e+04 -0.001 0.999
## X853 -1.212e+00 5.942e+04 0.000 1.000
## abl -2.049e+00 2.088e+04 0.000 1.000
## access -1.480e+01 1.335e+04 -0.001 0.999
## account 2.488e+01 8.165e+03 0.003 0.998
## addit 1.463e+00 2.703e+04 0.000 1.000
## address -4.613e+00 1.113e+04 0.000 1.000
## allow 1.899e+01 6.436e+03 0.003 0.998
## alreadi -2.407e+01 3.319e+04 -0.001 0.999
## also 2.990e+01 1.378e+04 0.002 0.998
## analysi -2.405e+01 3.860e+04 -0.001 1.000
## anoth -8.744e+00 2.032e+04 0.000 1.000
## applic -2.649e+00 1.674e+04 0.000 1.000
## appreci -2.145e+01 2.762e+04 -0.001 0.999
## approv -1.302e+00 1.589e+04 0.000 1.000
## april -2.620e+01 2.208e+04 -0.001 0.999
## area 2.041e+01 2.266e+04 0.001 0.999
## arrang 1.069e+01 2.135e+04 0.001 1.000
## ask -7.746e+00 1.976e+04 0.000 1.000
## assist -1.128e+01 2.490e+04 0.000 1.000
## associ 9.049e+00 1.909e+04 0.000 1.000
## attach -1.037e+01 1.534e+04 -0.001 0.999
## attend -3.451e+01 3.257e+04 -0.001 0.999
## avail 8.651e+00 1.709e+04 0.001 1.000
## back -1.323e+01 2.272e+04 -0.001 1.000
## base -1.354e+01 2.122e+04 -0.001 0.999
## begin 2.228e+01 2.973e+04 0.001 0.999
## believ 3.233e+01 2.136e+04 0.002 0.999
## best -8.201e+00 1.333e+03 -0.006 0.995
## better 4.263e+01 2.360e+04 0.002 0.999
## book 4.301e+00 2.024e+04 0.000 1.000
## bring 1.607e+01 6.767e+04 0.000 1.000
## busi -4.803e+00 1.000e+04 0.000 1.000
## buy 4.170e+01 3.892e+04 0.001 0.999
## call -1.145e+00 1.111e+04 0.000 1.000
## can 3.762e+00 7.674e+03 0.000 1.000
## case -3.372e+01 2.880e+04 -0.001 0.999
## chang -2.717e+01 2.215e+04 -0.001 0.999
## check 1.425e+00 1.963e+04 0.000 1.000
## click 1.376e+01 7.077e+03 0.002 0.998
## com 1.936e+00 4.039e+03 0.000 1.000
## come -1.166e+00 1.511e+04 0.000 1.000
## comment -3.251e+00 3.387e+04 0.000 1.000
## communic 1.580e+01 8.958e+03 0.002 0.999
## compani 4.781e+00 9.186e+03 0.001 1.000
## complet -1.363e+01 2.024e+04 -0.001 0.999
## confer -7.503e-01 8.557e+03 0.000 1.000
## confirm -1.300e+01 1.514e+04 -0.001 0.999
## contact 1.530e+00 1.262e+04 0.000 1.000
## continu 1.487e+01 1.535e+04 0.001 0.999
## contract -1.295e+01 1.498e+04 -0.001 0.999
## copi -4.274e+01 3.070e+04 -0.001 0.999
## corp 1.606e+01 2.708e+04 0.001 1.000
## corpor -8.286e-01 2.818e+04 0.000 1.000
## cost -1.938e+00 1.833e+04 0.000 1.000
## cours 1.665e+01 1.834e+04 0.001 0.999
## creat 1.338e+01 3.946e+04 0.000 1.000
## credit 2.617e+01 1.314e+04 0.002 0.998
## crenshaw 9.994e+01 6.769e+04 0.001 0.999
## current 3.629e+00 1.707e+04 0.000 1.000
## custom 1.829e+01 1.008e+04 0.002 0.999
## data -2.609e+01 2.271e+04 -0.001 0.999
## date -2.786e+00 1.699e+04 0.000 1.000
## day -6.100e+00 5.866e+03 -0.001 0.999
## deal -1.129e+01 1.448e+04 -0.001 0.999
## dear -2.313e+00 2.306e+04 0.000 1.000
## depart -4.068e+01 2.509e+04 -0.002 0.999
## deriv -4.971e+01 3.587e+04 -0.001 0.999
## design -7.923e+00 2.939e+04 0.000 1.000
## detail 1.197e+01 2.301e+04 0.001 1.000
## develop 5.976e+00 9.455e+03 0.001 0.999
## differ -2.293e+00 1.075e+04 0.000 1.000
## direct -2.051e+01 3.194e+04 -0.001 0.999
## director -1.770e+01 1.793e+04 -0.001 0.999
## discuss -1.051e+01 1.915e+04 -0.001 1.000
## doc -2.597e+01 2.603e+04 -0.001 0.999
## don 2.129e+01 1.456e+04 0.001 0.999
## done 6.828e+00 1.882e+04 0.000 1.000
## due -4.163e+00 3.532e+04 0.000 1.000
## ect 8.685e-01 5.342e+03 0.000 1.000
## edu -2.122e-01 6.917e+02 0.000 1.000
## effect 1.948e+01 2.100e+04 0.001 0.999
## effort 1.606e+01 5.670e+04 0.000 1.000
## either -2.744e+01 4.000e+04 -0.001 0.999
## email 3.833e+00 1.186e+04 0.000 1.000
## end -1.311e+01 2.938e+04 0.000 1.000
## energi -1.620e+01 1.646e+04 -0.001 0.999
## engin 2.664e+01 2.394e+04 0.001 0.999
## enron -8.789e+00 5.719e+03 -0.002 0.999
## etc 9.470e-01 1.569e+04 0.000 1.000
## even -1.654e+01 2.289e+04 -0.001 0.999
## event 1.694e+01 1.851e+04 0.001 0.999
## expect -1.179e+01 1.914e+04 -0.001 1.000
## experi 2.460e+00 2.240e+04 0.000 1.000
## fax 3.537e+00 3.386e+04 0.000 1.000
## feel 2.596e+00 2.348e+04 0.000 1.000
## file -2.943e+01 2.165e+04 -0.001 0.999
## final 8.075e+00 5.008e+04 0.000 1.000
## financ -9.122e+00 7.524e+03 -0.001 0.999
## financi -9.747e+00 1.727e+04 -0.001 1.000
## find -2.623e+00 9.727e+03 0.000 1.000
## first -4.666e-01 2.043e+04 0.000 1.000
## follow 1.766e+01 3.080e+03 0.006 0.995
## form 8.483e+00 1.674e+04 0.001 1.000
## forward -3.484e+00 1.864e+04 0.000 1.000
## free 6.113e+00 8.121e+03 0.001 0.999
## friday -1.146e+01 1.996e+04 -0.001 1.000
## full 2.125e+01 2.190e+04 0.001 0.999
## futur 4.146e+01 1.439e+04 0.003 0.998
## gas -3.901e+00 4.160e+03 -0.001 0.999
## get 5.154e+00 9.737e+03 0.001 1.000
## gibner 2.901e+01 2.460e+04 0.001 0.999
## give -2.518e+01 2.130e+04 -0.001 0.999
## given -2.186e+01 5.426e+04 0.000 1.000
## good 5.399e+00 1.619e+04 0.000 1.000
## great 1.222e+01 1.090e+04 0.001 0.999
## group 5.264e-01 1.037e+04 0.000 1.000
## happi 1.939e-02 1.202e+04 0.000 1.000
## hear 2.887e+01 2.281e+04 0.001 0.999
## hello 2.166e+01 1.361e+04 0.002 0.999
## help 1.731e+01 2.791e+03 0.006 0.995
## high -1.982e+00 2.554e+04 0.000 1.000
## home 5.973e+00 8.965e+03 0.001 0.999
## hope -1.435e+01 2.179e+04 -0.001 0.999
## hou 6.852e+00 6.437e+03 0.001 0.999
## hour 2.478e+00 1.333e+04 0.000 1.000
## houston -1.855e+01 7.305e+03 -0.003 0.998
## howev -3.449e+01 3.562e+04 -0.001 0.999
## http 2.528e+01 2.107e+04 0.001 0.999
## idea -1.845e+01 3.892e+04 0.000 1.000
## immedi 6.285e+01 3.346e+04 0.002 0.999
## import -1.859e+00 2.236e+04 0.000 1.000
## includ -3.454e+00 1.799e+04 0.000 1.000
## increas 6.476e+00 2.329e+04 0.000 1.000
## industri -3.160e+01 2.373e+04 -0.001 0.999
## info -1.255e+00 4.857e+03 0.000 1.000
## inform 2.078e+01 8.549e+03 0.002 0.998
## interest 2.698e+01 1.159e+04 0.002 0.998
## intern -7.991e+00 3.351e+04 0.000 1.000
## internet 8.749e+00 1.100e+04 0.001 0.999
## interview -1.640e+01 1.873e+04 -0.001 0.999
## invest 3.201e+01 2.393e+04 0.001 0.999
## invit 4.304e+00 2.215e+04 0.000 1.000
## involv 3.815e+01 3.315e+04 0.001 0.999
## issu -3.708e+01 3.396e+04 -0.001 0.999
## john -5.326e-01 2.856e+04 0.000 1.000
## join -3.824e+01 2.334e+04 -0.002 0.999
## juli -1.358e+01 3.009e+04 0.000 1.000
## just -1.021e+01 1.114e+04 -0.001 0.999
## kaminski -1.812e+01 6.029e+03 -0.003 0.998
## keep 1.867e+01 2.782e+04 0.001 0.999
## kevin -3.779e+01 4.738e+04 -0.001 0.999
## know 1.277e+01 1.526e+04 0.001 0.999
## last 1.046e+00 1.372e+04 0.000 1.000
## let -2.763e+01 1.462e+04 -0.002 0.998
## life 5.812e+01 3.864e+04 0.002 0.999
## like 5.649e+00 7.660e+03 0.001 0.999
## line 8.743e+00 1.236e+04 0.001 0.999
## link -6.929e+00 1.345e+04 -0.001 1.000
## list -8.692e+00 2.149e+03 -0.004 0.997
## locat 2.073e+01 1.597e+04 0.001 0.999
## london 6.745e+00 1.642e+04 0.000 1.000
## long -1.489e+01 1.934e+04 -0.001 0.999
## look -7.031e+00 1.563e+04 0.000 1.000
## lot -1.964e+01 1.321e+04 -0.001 0.999
## made 2.820e+00 2.743e+04 0.000 1.000
## mail 7.584e+00 1.021e+04 0.001 0.999
## make 2.901e+01 1.528e+04 0.002 0.998
## manag 6.014e+00 1.445e+04 0.000 1.000
## mani 1.885e+01 1.442e+04 0.001 0.999
## mark -3.350e+01 3.208e+04 -0.001 0.999
## market 7.895e+00 8.012e+03 0.001 0.999
## may -9.434e+00 1.397e+04 -0.001 0.999
## mean 6.078e-01 2.952e+04 0.000 1.000
## meet -1.063e+00 1.263e+04 0.000 1.000
## member 1.381e+01 2.343e+04 0.001 1.000
## mention -2.279e+01 2.714e+04 -0.001 0.999
## messag 1.716e+01 2.562e+03 0.007 0.995
## might 1.244e+01 1.753e+04 0.001 0.999
## model -2.292e+01 1.049e+04 -0.002 0.998
## monday -1.034e+00 3.233e+04 0.000 1.000
## money 3.264e+01 1.321e+04 0.002 0.998
## month -3.727e+00 1.112e+04 0.000 1.000
## morn -2.645e+01 3.403e+04 -0.001 0.999
## move -3.834e+01 3.011e+04 -0.001 0.999
## much 3.775e-01 1.392e+04 0.000 1.000
## name 1.672e+01 1.322e+04 0.001 0.999
## need 8.437e-01 1.221e+04 0.000 1.000
## net 1.256e+01 2.197e+04 0.001 1.000
## new 1.003e+00 1.009e+04 0.000 1.000
## next. 1.492e+01 1.724e+04 0.001 0.999
## note 1.446e+01 2.294e+04 0.001 0.999
## now 3.790e+01 1.219e+04 0.003 0.998
## number -9.622e+00 1.591e+04 -0.001 1.000
## offer 1.174e+01 1.084e+04 0.001 0.999
## offic -1.344e+01 2.311e+04 -0.001 1.000
## one 1.241e+01 6.652e+03 0.002 0.999
## onlin 3.589e+01 1.665e+04 0.002 0.998
## open 2.114e+01 2.961e+04 0.001 0.999
## oper -1.696e+01 2.757e+04 -0.001 1.000
## opportun -4.131e+00 1.918e+04 0.000 1.000
## option -1.085e+00 9.325e+03 0.000 1.000
## order 6.533e+00 1.242e+04 0.001 1.000
## origin 3.226e+01 3.818e+04 0.001 0.999
## part 4.594e+00 3.483e+04 0.000 1.000
## particip -1.154e+01 1.738e+04 -0.001 0.999
## peopl -1.864e+01 1.439e+04 -0.001 0.999
## per 1.367e+01 1.273e+04 0.001 0.999
## person 1.870e+01 9.575e+03 0.002 0.998
## phone -6.957e+00 1.172e+04 -0.001 1.000
## place 9.005e+00 3.661e+04 0.000 1.000
## plan -1.830e+01 6.320e+03 -0.003 0.998
## pleas -7.961e+00 9.484e+03 -0.001 0.999
## point 5.498e+00 3.403e+04 0.000 1.000
## posit -1.543e+01 2.316e+04 -0.001 0.999
## possibl -1.366e+01 2.492e+04 -0.001 1.000
## power -5.643e+00 1.173e+04 0.000 1.000
## present -6.163e+00 1.278e+04 0.000 1.000
## price 3.428e+00 7.850e+03 0.000 1.000
## problem 1.262e+01 9.763e+03 0.001 0.999
## process -2.957e-01 1.191e+04 0.000 1.000
## product 1.016e+01 1.345e+04 0.001 0.999
## program 1.444e+00 1.183e+04 0.000 1.000
## project 2.173e+00 1.497e+04 0.000 1.000
## provid 2.422e-01 1.859e+04 0.000 1.000
## public -5.250e+01 2.341e+04 -0.002 0.998
## put -1.052e+01 2.681e+04 0.000 1.000
## question -3.467e+01 1.859e+04 -0.002 0.999
## rate -3.112e+00 1.319e+04 0.000 1.000
## read -1.527e+01 2.145e+04 -0.001 0.999
## real 2.046e+01 2.358e+04 0.001 0.999
## realli -2.667e+01 4.640e+04 -0.001 1.000
## receiv 5.765e-01 1.585e+04 0.000 1.000
## recent -2.067e+00 1.780e+04 0.000 1.000
## regard -3.668e+00 1.511e+04 0.000 1.000
## relat -5.114e+01 1.793e+04 -0.003 0.998
## remov 2.325e+01 2.484e+04 0.001 0.999
## repli 1.538e+01 2.916e+04 0.001 1.000
## report -1.482e+01 1.477e+04 -0.001 0.999
## request -1.232e+01 1.167e+04 -0.001 0.999
## requir 5.004e-01 2.937e+04 0.000 1.000
## research -2.826e+01 1.553e+04 -0.002 0.999
## resourc -2.735e+01 3.522e+04 -0.001 0.999
## respond 2.974e+01 3.888e+04 0.001 0.999
## respons -1.960e+01 3.667e+04 -0.001 1.000
## result -5.002e-01 3.140e+04 0.000 1.000
## resum -9.219e+00 2.100e+04 0.000 1.000
## return 1.745e+01 1.844e+04 0.001 0.999
## review -4.825e+00 1.013e+04 0.000 1.000
## right 2.312e+01 1.590e+04 0.001 0.999
## risk -4.001e+00 1.718e+04 0.000 1.000
## robert -2.096e+01 2.907e+04 -0.001 0.999
## run -5.162e+01 4.434e+04 -0.001 0.999
## say 7.366e+00 2.217e+04 0.000 1.000
## schedul 1.919e+00 3.580e+04 0.000 1.000
## school -3.870e+00 2.882e+04 0.000 1.000
## secur -1.604e+01 2.201e+03 -0.007 0.994
## see -1.120e+01 1.293e+04 -0.001 0.999
## send -2.427e+01 1.222e+04 -0.002 0.998
## sent -1.488e+01 2.195e+04 -0.001 0.999
## servic -7.164e+00 1.235e+04 -0.001 1.000
## set -9.353e+00 2.627e+04 0.000 1.000
## sever 2.041e+01 3.093e+04 0.001 0.999
## shall 1.930e+01 3.075e+04 0.001 0.999
## shirley -7.133e+01 6.329e+04 -0.001 0.999
## short -8.974e+00 1.721e+04 -0.001 1.000
## sinc -3.438e+00 3.546e+04 0.000 1.000
## sincer -2.073e+01 3.515e+04 -0.001 1.000
## site 8.689e+00 1.496e+04 0.001 1.000
## softwar 2.575e+01 1.059e+04 0.002 0.998
## soon 2.350e+01 3.731e+04 0.001 0.999
## sorri 6.036e+00 2.299e+04 0.000 1.000
## special 1.777e+01 2.755e+04 0.001 0.999
## specif -2.337e+01 3.083e+04 -0.001 0.999
## start 1.437e+01 1.897e+04 0.001 0.999
## state 1.221e+01 1.677e+04 0.001 0.999
## still 3.878e+00 2.622e+04 0.000 1.000
## stinson -4.345e+01 2.697e+04 -0.002 0.999
## student -1.815e+01 2.186e+04 -0.001 0.999
## subject 3.041e+01 1.055e+04 0.003 0.998
## success 4.344e+00 2.783e+04 0.000 1.000
## suggest -3.842e+01 4.475e+04 -0.001 0.999
## support -1.539e+01 1.976e+04 -0.001 0.999
## sure -5.503e+00 2.078e+04 0.000 1.000
## system 3.778e+00 9.149e+03 0.000 1.000
## take 5.731e+00 1.716e+04 0.000 1.000
## talk -1.011e+01 2.021e+04 -0.001 1.000
## team 7.940e+00 2.570e+04 0.000 1.000
## term 2.013e+01 2.303e+04 0.001 0.999
## thank -3.890e+01 1.059e+04 -0.004 0.997
## thing 2.579e+01 1.341e+04 0.002 0.998
## think -1.218e+01 2.077e+04 -0.001 1.000
## thought 1.243e+01 3.023e+04 0.000 1.000
## thursday -1.491e+01 3.262e+04 0.000 1.000
## time -5.921e+00 8.335e+03 -0.001 0.999
## today -1.762e+01 1.965e+04 -0.001 0.999
## togeth -2.355e+01 1.869e+04 -0.001 0.999
## trade -1.755e+01 1.483e+04 -0.001 0.999
## tri 9.278e-01 1.282e+04 0.000 1.000
## tuesday -2.808e+01 3.959e+04 -0.001 0.999
## two -2.573e+01 1.844e+04 -0.001 0.999
## type -1.447e+01 2.755e+04 -0.001 1.000
## understand 9.307e+00 2.342e+04 0.000 1.000
## unit -4.020e+00 3.008e+04 0.000 1.000
## univers 1.228e+01 2.197e+04 0.001 1.000
## updat -1.510e+01 1.448e+04 -0.001 0.999
## use -1.385e+01 9.382e+03 -0.001 0.999
## valu 9.024e-01 1.360e+04 0.000 1.000
## version -3.606e+01 2.939e+04 -0.001 0.999
## vinc -3.735e+01 8.647e+03 -0.004 0.997
## visit 2.585e+01 1.170e+04 0.002 0.998
## vkamin -6.649e+01 5.703e+04 -0.001 0.999
## want -2.555e+00 1.106e+04 0.000 1.000
## way 1.339e+01 1.138e+04 0.001 0.999
## web 2.791e+00 1.686e+04 0.000 1.000
## websit -2.563e+01 1.848e+04 -0.001 0.999
## wednesday -1.526e+01 2.642e+04 -0.001 1.000
## week -6.795e+00 1.046e+04 -0.001 0.999
## well -2.222e+01 9.713e+03 -0.002 0.998
## will -1.119e+01 5.980e+03 -0.002 0.999
## wish 1.173e+01 3.175e+04 0.000 1.000
## within 2.900e+01 2.163e+04 0.001 0.999
## without 1.942e+01 1.763e+04 0.001 0.999
## work -1.099e+01 1.160e+04 -0.001 0.999
## write 4.406e+01 2.825e+04 0.002 0.999
## www -7.867e+00 2.224e+04 0.000 1.000
## year -1.010e+01 1.039e+04 -0.001 0.999
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4409.49 on 4009 degrees of freedom
## Residual deviance: 13.46 on 3679 degrees of freedom
## AIC: 675.46
##
## Number of Fisher Scoring iterations: 25
#Ans:0
#EXPLANATION:From summary(spamLog), we see that none of the variables are labeled as significant (a symptom of the logistic regression algorithm not converging).
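#A quick sanity check (not part of the graded question): a fitted glm object stores whether the fitting algorithm converged. Since the model hit the default limit of 25 Fisher scoring iterations (see the summary output above), this should return FALSE.
spamLog$converged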
###########################################
#PROBLEM 3.3 - BUILDING MACHINE LEARNING MODELS
#How many of the word stems "enron", "hou", "vinc", and "kaminski" appear in the CART tree? Recall that we suspect these word stems are specific to Vincent Kaminski and might affect the generalizability of a spam filter built with his ham data.
prp(spamCART)
#Ans:2
#EXPLANATION:From prp(spamCART), we see that "vinc" and "enron" appear in the CART tree as the top two branches, but that "hou" and "kaminski" do not appear.
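#A programmatic double-check of the visual count (rpart stores the splitting variable of every node in the frame component, with "<leaf>" marking terminal nodes):
cartVars <- setdiff(unique(as.character(spamCART$frame$var)), "<leaf>")
cartVars
sum(c("enron", "hou", "vinc", "kaminski") %in% cartVars)   #should agree with the answer of 2 above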
####################################
#PROBLEM 3.4 - BUILDING MACHINE LEARNING MODELS
#What is the training set accuracy of spamLog, using a threshold of 0.5 for predictions?
#We are interested in the overall accuracy of our model
#First we compute the confusion matrix
cmat_log<-table(train$spam, predTrainLog > 0.5)
cmat_log
##
## FALSE TRUE
## 0 3052 0
## 1 4 954
#let's now compute the overall accuracy
accu_log <- (cmat_log[1,1] + cmat_log[2,2])/sum(cmat_log)
accu_log #(3052+954)/nrow(train) = 0.9990025
## [1] 0.9990025
#Ans:0.9990025
###############################################
#PROBLEM 3.5 - BUILDING MACHINE LEARNING MODELS
#What is the training set AUC of spamLog?
library(ROCR)
predictionTrainLog = prediction(predTrainLog, train$spam)
perf <- performance(predictionTrainLog, "tpr", "fpr")
as.numeric(performance(predictionTrainLog, "auc")@y.values)
## [1] 0.9999959
#Ans:0.9999959
#We then plot the ROC curve, with the option that color-codes the different cutoff thresholds.
plot(perf, colorize=TRUE)
#The ROC curve hugs the top-left corner: the training data are separated almost perfectly, which together with the near-1 AUC is a strong hint of overfitting.
##################################
#PROBLEM 3.6 - BUILDING MACHINE LEARNING MODELS
#What is the training set accuracy of spamCART, using a threshold of 0.5 for predictions? (Remember that if you used the type="class" argument when making predictions, you automatically used a threshold of 0.5. If you did not add in the type argument to the predict function, the probabilities are in the second column of the predict output.)
#We are interested in the overall accuracy of our model
#First we compute the confusion matrix
cmat_CART<-table(train$spam, predTrainCART > 0.5)
cmat_CART
##
## FALSE TRUE
## 0 2885 167
## 1 64 894
#let's now compute the overall accuracy
accu_CART <- (cmat_CART[1,1] + cmat_CART[2,2])/sum(cmat_CART)
accu_CART #(2885+894)/nrow(train) = 0.942394
## [1] 0.942394
#Ans:0.942394
############################################
#PROBLEM 3.7 - BUILDING MACHINE LEARNING MODELS
#What is the training set AUC of spamCART? (Remember that you have to pass the prediction function predicted probabilities, so don't include the type argument when making predictions for your CART model.)
library(ROCR)
predictionTrainCART = prediction(predTrainCART, train$spam)
perf <- performance(predictionTrainCART, "tpr", "fpr")
as.numeric(performance(predictionTrainCART, "auc")@y.values)
## [1] 0.9696044
#Ans:0.9696044
#We then plot the ROC curve, with the option that color-codes the different cutoff thresholds.
plot(perf, colorize=TRUE)
#The ROC curve shows a very strong in-sample fit, consistent with the high training AUC.
########################################
#PROBLEM 3.8 - BUILDING MACHINE LEARNING MODELS
#What is the training set accuracy of spamRF, using a threshold of 0.5 for predictions? (Remember that your answer might not match ours exactly, due to random behavior in the random forest algorithm on different operating systems.)
#We are interested in the overall accuracy of our model
#First we compute the confusion matrix
cmat_RF<-table(train$spam, predTrainRF > 0.5)
cmat_RF
##
## FALSE TRUE
## 0 3013 39
## 1 44 914
#let's now compute the overall accuracy
accu_RF <- (cmat_RF[1,1] + cmat_RF[2,2])/sum(cmat_RF)
accu_RF #(3013+914)/nrow(train)=0.9793017
## [1] 0.9793017
#Ans:0.9793017
#####################################################
#PROBLEM 3.9 - BUILDING MACHINE LEARNING MODELS
#What is the training set AUC of spamRF? (Remember to pass the argument type="prob" to the predict function to get predicted probabilities for a random forest model. The probabilities will be the second column of the output.)
library(ROCR)
# Make predictions:
predictionTrainRF = prediction(predTrainRF, train$spam)
perf<-performance(predictionTrainRF, "tpr", "fpr")
as.numeric(performance(predictionTrainRF, "auc")@y.values)
## [1] 0.9979116
#Ans:0.9979116
#We then plot the ROC curve, with the option that color-codes the different cutoff thresholds.
plot(perf, colorize=TRUE)
#The ROC curve again shows a very strong in-sample fit, consistent with the high training AUC.
############################################
#PROBLEM 3.10 - BUILDING MACHINE LEARNING MODELS
#Which model had the best training set performance, in terms of accuracy and AUC?
#Ans: Logistic regression
#EXPLANATION:In terms of both accuracy and AUC, logistic regression is nearly perfect and outperforms the other two models.
######################################
#PROBLEM 4.1 - EVALUATING ON THE TEST SET
#Obtain predicted probabilities for the testing set for each of the models, again ensuring that probabilities instead of classes are obtained.
## Make out-of-sample predictions on the testing set:
predTestLog<- predict(spamLog, newdata=test, type="response")
predTestCART <- predict(spamCART, newdata=test)[,2]
predTestRF <- predict(spamRF, newdata=test, type="prob")[,2]
#What is the testing set accuracy of spamLog, using a threshold of 0.5 for predictions?
#Build a confusion matrix (with a threshold of 0.5) and compute the accuracy of the model.What is the accuracy?
# Out of sample confusion matrix with threshold of 0.5
cmat_log<-table(test$spam, predTestLog> 0.5)
cmat_log
##
## FALSE TRUE
## 0 1257 51
## 1 34 376
#let's now compute the overall accuracy
accu_log <- (cmat_log[1,1] + cmat_log[2,2])/sum(cmat_log)
accu_log #(1257+376)/nrow(test) = 0.9505239
## [1] 0.9505239
#Ans:0.9505239
#######################################
#PROBLEM 4.2 - EVALUATING ON THE TEST SET
#What is the testing set AUC of spamLog?
library(ROCR)
predictionTestLog = prediction(predTestLog,test$spam)
perf <- performance(predictionTestLog, "tpr", "fpr")
as.numeric(performance(predictionTestLog, "auc")@y.values)
## [1] 0.9627517
#Ans:0.9627517
#We then plot the ROC curve, with the option that color-codes the different cutoff thresholds.
plot(perf, colorize=TRUE)
#The test-set ROC curve is still good, but clearly worse than the near-perfect training fit; this gap is where the overfitting of the logistic model shows up.
#############################
#PROBLEM 4.3 - EVALUATING ON THE TEST SET
#What is the testing set accuracy of spamCART, using a threshold of 0.5 for predictions?
#We are interested in the overall accuracy of our model
#First we compute the out of sample confusion matrix with threshold of 0.5
cmat_CART<-table(test$spam, predTestCART > 0.5)
cmat_CART
##
## FALSE TRUE
## 0 1228 80
## 1 24 386
#let's now compute the overall accuracy
accu_CART <- (cmat_CART[1,1] + cmat_CART[2,2])/sum(cmat_CART)
accu_CART #(1228+386)/nrow(test) = 0.9394645
## [1] 0.9394645
#Ans: 0.9394645
######################################
#PROBLEM 4.4 - EVALUATING ON THE TEST SET
#What is the testing set AUC of spamCART?
library(ROCR)
predictionTestCART = prediction(predTestCART,test$spam)
perf <- performance(predictionTestCART, "tpr", "fpr")
as.numeric(performance(predictionTestCART, "auc")@y.values)
## [1] 0.963176
#Ans:0.963176
#We then plot the ROC curve, with the option that color-codes the different cutoff thresholds.
plot(perf, colorize=TRUE)
#The test-set ROC curve confirms the strong out-of-sample performance indicated by the high AUC.
##########################################
#PROBLEM 4.5 - EVALUATING ON THE TEST SET
#What is the testing set accuracy of spamRF, using a threshold of 0.5 for predictions?
#computing the confusion matrix with threshold of 0.5
cmat_RF<-table(test$spam, predTestRF > 0.5)
cmat_RF
##
## FALSE TRUE
## 0 1290 18
## 1 25 385
#let's now compute the overall accuracy
accu_RF <- (cmat_RF[1,1] + cmat_RF[2,2])/sum(cmat_RF)
accu_RF # (1290+385)/nrow(test)=0.9749709
## [1] 0.9749709
#Ans:0.9749709
########################################
#PROBLEM 4.6 - EVALUATING ON THE TEST SET
#What is the testing set AUC of spamRF?
library(ROCR)
predictionTestRF = prediction(predTestRF,test$spam)
perf <- performance(predictionTestRF, "tpr", "fpr")
as.numeric(performance(predictionTestRF, "auc")@y.values)
## [1] 0.9975656
#Ans:0.9975656
#We then plot the ROC curve, with the option that color-codes the different cutoff thresholds.
plot(perf, colorize=TRUE)
#The test-set ROC curve confirms the strong out-of-sample performance indicated by the high AUC.
#########################################
#PROBLEM 4.7 - EVALUATING ON THE TEST SET
#Which model had the best testing set performance, in terms of accuracy and AUC?
#Ans:Random forest
#EXPLANATION:The random forest outperformed logistic regression and CART in both measures, obtaining an impressive AUC of 0.997 on the test set.
################################################
#PROBLEM 4.8 - EVALUATING ON THE TEST SET
#Which model demonstrated the greatest degree of overfitting?
#Ans: Logistic regression
#EXPLANATION:Both CART and random forest had very similar accuracies on the training and testing sets. However, logistic regression obtained nearly perfect accuracy and AUC on the training set and had far-from-perfect performance on the testing set. This is an indicator of overfitting.
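#A compact side-by-side view of the numbers reported in Problems 3.4-4.7 (values copied from the outputs above) makes the overfitting pattern easy to see:
data.frame(model    = c("Logistic", "CART", "Random forest"),
           trainAcc = c(0.9990, 0.9424, 0.9793),
           testAcc  = c(0.9505, 0.9395, 0.9750),
           trainAUC = c(1.0000, 0.9696, 0.9979),
           testAUC  = c(0.9628, 0.9632, 0.9976))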
PROBLEM 5.1 - ASSIGNING WEIGHTS TO DIFFERENT TYPES OF ERRORS
Thus far, we have used a threshold of 0.5 as the cutoff for predicting that an email message is spam, and we have used accuracy as one of our measures of model quality. As we have previously learned, these are good choices when we have no preference for different types of errors (false positives vs. false negatives), but other choices might be better if we assign a higher cost to one type of error.
Consider the case of an email provider using the spam filter we have developed. The email provider moves all of the emails flagged as spam to a separate “Junk Email” folder, meaning those emails are not displayed in the main inbox. The emails not flagged as spam by the algorithm are displayed in the inbox. Many of this provider’s email users never check the spam folder, so they will never see emails delivered there.
Q:In this scenario, what is the cost associated with the model making a false negative error?
Ans:A spam email will be displayed in the main inbox, a nuisance for the email user
EXPLANATION:A false negative means the model labels a spam email as ham. This results in a spam email being displayed in the main inbox.
Q:In this scenario, what is the cost associated with our model making a false positive error?
Ans:A ham email will be sent to the Junk Email folder, potentially resulting in the email user never seeing that message
EXPLANATION:A false positive means the model labels a ham email as spam. This results in a ham email being sent to the Junk Email folder.
PROBLEM 5.2 - ASSIGNING WEIGHTS TO DIFFERENT TYPES OF ERRORS
Q:Which sort of mistake is more costly (less desirable), assuming that the user will never check the Junk Email folder?
Ans:False positive
EXPLANATION:A false negative is largely a nuisance (the user will need to delete the unsolicited email). However a false positive can be very costly, since the user might completely miss an important email due to it being delivered to the spam folder. Therefore, the false positive is more costly.
PROBLEM 5.3 - ASSIGNING WEIGHTS TO DIFFERENT TYPES OF ERRORS
Q:What sort of user might assign a particularly high cost to a false negative result?
Ans:A user who is particularly annoyed by spam email reaching their main inbox
EXPLANATION:A false negative results in spam reaching a user’s main inbox, which is a nuisance. A user who is particularly annoyed by such spam would assign a particularly high cost to a false negative.
PROBLEM 5.4 - ASSIGNING WEIGHTS TO DIFFERENT TYPES OF ERRORS
Q:What sort of user might assign a particularly high cost to a false positive result?
Ans:A user who never checks his/her Junk Email folder
EXPLANATION:A false positive results in ham being sent to a user’s Junk Email folder. While the user might catch the mistake upon checking the Junk Email folder, users who never check this folder will miss the email, incurring a particularly high cost.
PROBLEM 5.5 - ASSIGNING WEIGHTS TO DIFFERENT TYPES OF ERRORS
Q:Consider another use case for the spam filter, in which messages labeled as spam are still delivered to the main inbox but are flagged as “potential spam.” Therefore, there is no risk of the email user missing an email regardless of whether it is flagged as spam. What is the most significant way in which this change in spam filter design affects the costs of false negative and false positive results?
Ans:The cost of false positive results is decreased
EXPLANATION:While before many users would completely miss a ham email labeled as spam (false positive), now users will not miss an email after this sort of mistake. As a result, the cost of a false positive has been decreased.
PROBLEM 5.6 - ASSIGNING WEIGHTS TO DIFFERENT TYPES OF ERRORS
Q:Consider a large-scale email provider with more than 100,000 customers. Which of the following represents an approach for approximating each customer’s preferences between a false positive and false negative that is both practical and personalized?
Ans:Automatically collect information about how often each user accesses his/her Junk Email folder to infer preferences
EXPLANATION:While using expert opinion is practical, it is not personalized (we would use the same cost for all users). Likewise, a random sample of user preferences doesn’t enable personalized costs for each user.
While a survey of all users would enable personalization, it is impractical to obtain survey results from all or most of the users.
While it’s impractical to survey all users, it is easy to automatically collect their usage patterns. This could enable us to select higher prediction thresholds for users who rarely check their Junk Email folder and lower thresholds for users who check it regularly.
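As an illustration of how such user-specific cutoffs would behave (a sketch, not part of the graded problem, reusing the test-set random forest probabilities predTestRF computed in Problem 4.1), we can tabulate the errors at a few thresholds; raising the threshold trades false positives (ham sent to Junk) for false negatives (spam left in the inbox):
for (cutoff in c(0.5, 0.7, 0.9)) {
  cat("threshold =", cutoff, "\n")
  print(table(test$spam, predTestRF > cutoff))
}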
PROBLEM 6.1 - INTEGRATING WORD COUNT INFORMATION
While we have thus far mostly dealt with frequencies of specific words in our analysis, we can extract other information from text. The last two sections of this problem will deal with two other types of information we can extract.
First, we will use the number of words in each email as an independent variable. We can use the original document term matrix called dtm for this task. The document term matrix has documents (in this case, emails) as its rows, terms (in this case word stems) as its columns, and frequencies as its values. As a result, the sum of all the elements in a row of the document term matrix is equal to the number of terms present in the document corresponding to that row. Obtain the word counts for each email with the command:
wordCount = rowSums(as.matrix(dtm))
If the command above runs out of memory (as.matrix() converts the sparse document term matrix into a dense one), the slam package can compute the same row sums while keeping the matrix in sparse form:
library(slam)
wordCount = rollup(dtm, 2, FUN=sum)$v
When you have successfully created wordCount, answer the following question.
Q:What would have occurred if we had instead created wordCount using spdtm instead of dtm?
Ans:wordCount would have only counted some of the words, but would have returned a result for all the emails
EXPLANATION:spdtm has had sparse terms removed, which means we have removed some of the columns but none of the rows from dtm. This means rowSums will still return a sum for each row (one for each email), but it will not have counted the frequencies of any uncommon words in the dataset. As a result, wordCount will only count some of the words.
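A quick way to verify this, assuming dtm and spdtm as created earlier: both matrices have the same number of rows (one per email), but spdtm has far fewer columns (word stems) after the sparse terms were removed.
dim(dtm)
dim(spdtm)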
PROBLEM 6.2 - INTEGRATING WORD COUNT INFORMATION
Use the hist() function to plot the distribution of wordCount in the dataset. What best describes the distribution of the data?
hist(wordCount)
Ans: The data is skew right – there are a large number of small wordCount values and a small number of large values.
EXPLANATION:From hist(wordCount), nearly all the observations are in the very left of the graph, representing small values. Therefore, this distribution is skew right.
PROBLEM 6.3 - INTEGRATING WORD COUNT INFORMATION
Now, use the hist() function to plot the distribution of log(wordCount) in the dataset. What best describes the distribution of the data?
hist(log(wordCount))
Ans:The data is not skewed – there are roughly the same number of unusually large and unusually small log(wordCount) values.
EXPLANATION:From hist(log(wordCount)), the frequencies are quite balanced, suggesting log(wordCount) is not skewed.
PROBLEM 6.4 - INTEGRATING WORD COUNT INFORMATION
Create a variable called logWordCount in emailsSparse that is equal to log(wordCount). Use the boxplot() command to plot logWordCount against whether a message is spam. Which of the following best describes the box plot?
emailsSparse$logWordCount<-log(wordCount)
boxplot(emailsSparse$logWordCount~emailsSparse$spam)
Ans: logWordCount is slightly smaller in spam messages than in ham messages
EXPLANATION:We can see that the 1st quartile, median, and 3rd quartiles are all slightly lower for spam messages than for ham messages.
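The quartiles behind the box plot can also be read off numerically, using the logWordCount column just added to emailsSparse:
tapply(emailsSparse$logWordCount, emailsSparse$spam, summary)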
PROBLEM 6.5 - INTEGRATING WORD COUNT INFORMATION
Because logWordCount differs between spam and ham messages, we hypothesize that it might be useful in predicting whether an email is spam. Take the following steps:
#1) Use the same sample.split output you obtained earlier (do not re-run sample.split) to split emailsSparse into a training and testing set, which you should call train2 and test2.
train2 = subset(emailsSparse, spl == TRUE)
test2 = subset(emailsSparse, spl == FALSE)
#2) Use train2 to train a CART tree with the default parameters, saving the model to the variable spam2CART.
library(rpart)
library(rpart.plot)
spam2CART = rpart(spam~., data=train2, method="class")
#Plotting the CART model
prp(spam2CART)
#3) Use train2 to train a random forest with the default parameters, saving the model to the variable spam2RF. Again, set the random seed to 123 directly before training spam2RF.
set.seed(123)
spam2RF = randomForest(spam~., data=train2)
#Was the new variable used in the new CART tree spam2CART?
#Ans:Yes
#EXPLANATION:From prp(spam2CART), we see that the logWordCount was integrated into the tree (it might only display as "logWord", because prp shortens some of the variable names when it outputs them).
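#The same programmatic check used in Problem 3.3 confirms this without relying on prp()'s shortened labels:
"logWordCount" %in% as.character(spam2CART$frame$var)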
#####################################
#PROBLEM 6.6 - INTEGRATING WORD COUNT INFORMATION
#Perform test-set predictions using the new CART and random forest models.
predTest2CART = predict(spam2CART, newdata=test2)[,2]
predTest2RF = predict(spam2RF, newdata=test2, type="prob")[,2]
#What is the test-set accuracy of spam2CART, using threshold 0.5 for predicting an email is spam?
#Now let's assess the accuracy of the model through the confusion matrix
cmat_CART<-table(test2$spam, predTest2CART > 0.5) #first arg is the true outcomes and the second is the predicted outcomes
cmat_CART
##
## FALSE TRUE
## 0 1214 94
## 1 26 384
#let's now compute the overall accuracy
accu_CART <- (cmat_CART[1,1] + cmat_CART[2,2])/sum(cmat_CART)
accu_CART # (1214+384)/nrow(test2)= 0.9301513
## [1] 0.9301513
#Ans:0.9301513
#######################################
#PROBLEM 6.7 - INTEGRATING WORD COUNT INFORMATION
#What is the test-set AUC of spam2CART?
library(ROCR)
predictionTest2CART= prediction(predTest2CART, test2$spam)
perf <- performance(predictionTest2CART, "tpr", "fpr")
as.numeric(performance(predictionTest2CART, "auc")@y.values)
## [1] 0.9582438
#Ans:0.9582438
#We then plot the ROC curve, with the option that color-codes the different cutoff thresholds.
plot(perf, colorize=TRUE)
######################################
#PROBLEM 6.8 - INTEGRATING WORD COUNT INFORMATION
#What is the test-set accuracy of spam2RF, using a threshold of 0.5 for predicting if an email is spam? (Remember that you might get a different accuracy than us even if you set the seed, due to the random behavior of randomForest on some operating systems.)
#We are interested in the overall out-of-sample accuracy of our model
#First we compute the confusion matrix
cmat_RF<-table(test2$spam, predTest2RF > 0.5)
cmat_RF
##
## FALSE TRUE
## 0 1298 10
## 1 28 382
#let's now compute the overall accuracy
accu_RF <- (cmat_RF[1,1] + cmat_RF[2,2])/sum(cmat_RF)
accu_RF #(1298+382)/nrow(test2)=0.9778813
## [1] 0.9778813
#Ans:0.9778813
################################
#PROBLEM 6.9 - INTEGRATING WORD COUNT INFORMATION
#What is the test-set AUC of spam2RF? (Remember that you might get a different AUC than us even if you set the seed when building your model, due to the random behavior of randomForest on some operating systems.)
library(ROCR)
predictionTest2RF = prediction(predTest2RF, test2$spam)
perf<-performance(predictionTest2RF, "tpr", "fpr")
as.numeric(performance(predictionTest2RF, "auc")@y.values)
#Ans: very close to the spamRF test-set AUC from Problem 4.6 (about 0.998); as the problem statement notes, the exact value can differ slightly across operating systems due to randomForest's random behavior.
#We then plot the ROC curve, with the option that color-codes the different cutoff thresholds.
plot(perf, colorize=TRUE)
#In this case, adding the logWordCounts variable did not result in improved results on the test set for the CART or random forest model.
USING N-GRAMS
Another source of information that might be extracted from text is the frequency of various n-grams. An n-gram is a sequence of n consecutive words in the document. For instance, for the document “Text analytics rocks!”, which we would preprocess to “text analyt rock”, the 1-grams are “text”, “analyt”, and “rock”, the 2-grams are “text analyt” and “analyt rock”, and the only 3-gram is “text analyt rock”. n-grams are order-specific, meaning the 2-grams “text analyt” and “analyt text” are considered two separate n-grams. We can see that so far our analysis has been extracting only 1-grams.
We do not have exercises in this class covering n-grams, but if you are interested in learning more, the “RTextTools”, “tau”, “RWeka”, and “textcat” packages in R are all good resources.
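As a minimal illustration (not covered in the lecture, and assuming the RWeka package is installed and that corpus is the preprocessed tm corpus built earlier in this exercise), a 2-gram document term matrix can be built by passing a custom tokenizer to tm:
library(RWeka)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
dtmBigram <- DocumentTermMatrix(corpus, control = list(tokenize = BigramTokenizer))
dtmBigram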
sessionInfo()
## R version 3.3.0 (2016-05-03)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 10586)
##
## locale:
## [1] C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] slam_0.1-34 ROCR_1.0-7 gplots_3.0.1
## [4] randomForest_4.6-12 rpart.plot_1.5.3 rpart_4.1-10
## [7] caTools_1.17.1 SnowballC_0.5.1 tm_0.6-2
## [10] NLP_0.1-9 DataComputing_0.8.3 curl_0.9.7
## [13] base64enc_0.1-3 manipulate_1.0.1 mosaic_0.13.0
## [16] mosaicData_0.13.0 car_2.1-2 lattice_0.20-33
## [19] knitr_1.13 stringr_1.0.0 tidyr_0.4.1
## [22] lubridate_1.5.6 dplyr_0.4.3 ggplot2_2.1.0
##
## loaded via a namespace (and not attached):
## [1] gtools_3.5.0 reshape2_1.4.1 splines_3.3.0
## [4] colorspace_1.2-6 htmltools_0.3.5 yaml_2.1.13
## [7] mgcv_1.8-12 nloptr_1.0.4 DBI_0.4-1
## [10] plyr_1.8.4 MatrixModels_0.4-1 munsell_0.4.3
## [13] gtable_0.2.0 codetools_0.2-14 evaluate_0.9
## [16] SparseM_1.7 quantreg_5.26 pbkrtest_0.4-6
## [19] parallel_3.3.0 Rcpp_0.12.5 KernSmooth_2.23-15
## [22] scales_0.4.0 formatR_1.4 gdata_2.17.0
## [25] mime_0.4 lme4_1.1-12 gridExtra_2.2.1
## [28] digest_0.6.9 stringi_1.1.1 grid_3.3.0
## [31] tools_3.3.0 bitops_1.0-6 magrittr_1.5
## [34] lazyeval_0.1.10 ggdendro_0.1-20 MASS_7.3-45
## [37] Matrix_1.2-6 assertthat_0.1 minqa_1.2.4
## [40] rmarkdown_0.9.6 R6_2.1.2 nnet_7.3-12
## [43] nlme_3.1-128
#############################That’s All folks….Phew!##########################