Source file ⇒ The_Analytics_Edge_edX_MIT15.071x_June2015_4.rmd

Unit 5: Text Analytics

Preliminaries

NOTE: I have gone ahead and kept some summary outputs that I could have left out, but the main intention was to check that each function was performing as desired, since there were some differences between the version of the tm package used in the lecture and the latest version.

Turning Tweets into Knowledge_An Introduction to Text Analytics_1

INTRODUCTION

We will be trying to understand sentiment of tweets about the company Apple.

While Apple has a large number of fans, they also have a large number of people who don’t like their products. They also have several competitors.
To better understand public perception, Apple wants to monitor how people feel over time and how people receive new announcements.

Our challenge in this lecture is to see if we can correctly classify tweets as being negative, positive, or neither about Apple.

The Data

To collect the data needed for this task, we had to perform two steps.

  • Collect Twitter data

The first was to collect data about tweets from the internet.
Twitter data is publicly available, and it can be collected by scraping the website or via the Twitter API.

The sender of the tweet might be useful to predict sentiment, but we will ignore it to keep our data anonymized.
So we will just be using the text of the tweet.

  • Construct the outcome variable

Then we need to construct the outcome variable for these tweets, which means that we have to label them as positive, negative, or neutral sentiment.

We would like to label thousands of tweets, and we know that two people might disagree over the correct classification of a tweet. To do this efficiently, one option is to use the Amazon Mechanical Turk.

The task that we put on the Amazon Mechanical Turk was to judge the sentiment expressed by the following item toward the software company Apple.
The items we gave them were tweets that we had collected. The workers could pick from the following options as their response:

  • strongly negative,
  • negative,
  • neutral,
  • positive, and
  • strongly positive.

These outcomes were represented as a number on the scale from -2 to 2.

Each tweet was labeled by five workers. For each tweet, we take the average of the five scores given by the five workers, hence the final scores can range from -2 to 2 in increments of 0.2.

The following graph shows the distribution of the number of tweets classified into each of the categories. We can see here that the majority of tweets were classified as neutral, with a small number classified as strongly negative or strongly positive.

distribution of score

So now we have a bunch of tweets that are labeled with their sentiment. But how do we build independent variables from the text of a tweet to be used to predict the sentiment?

A Bag of Words

One of the most commonly used techniques to transform text into independent variables is called Bag of Words.

Fully understanding text is difficult, but Bag of Words provides a very simple approach: it just counts the number of times each word appears in the text and uses these counts as the independent variables.

For example, in the sentence,

"This course is great.  I would recommend this course to my friends,"

the word this is seen twice, the word course is seen twice, the word great is seen once, et cetera.

bag of words

In Bag of Words, there is one feature for each word. This is a very simple approach, but is often very effective, too. It is used as a baseline in text analytics projects and for Natural Language Processing.
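To make the counting concrete, here is a minimal sketch in base R (not part of the lecture script; the object names are just illustrative) that builds the word counts for the example sentence above:

exampleSentence = "This course is great. I would recommend this course to my friends"
# lower-case the text, strip punctuation, split on spaces, and count each word
exampleWords = strsplit(tolower(gsub("[[:punct:]]", "", exampleSentence)), " ")[[1]]
table(exampleWords)
# "this" and "course" each appear twice; every other word appears once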

This is not the whole story, though. Preprocessing the text can dramatically improve the performance of the Bag of Words method.

Cleaning Up Irregularities

One part of preprocessing the text is to clean up irregularities.
Text data often has many inconsistencies that will cause algorithms trouble. Computers are very literal by default.

  • One common irregularity concerns the case of the letters, and it is customary to change all words to either lower-case or upper-case.

  • Punctuation also causes problems, and the basic approach is to remove everything that is not a letter. However some punctuation is meaningful, and therefore the removal of punctuation should be tailored to the specific problem.

There are also unhelpful terms:

  • Stopwords: words that are used frequently but are only meaningful within a sentence, not on their own. Examples are the, is, at, and which. It’s unlikely that these words will improve the machine learning prediction quality, so we want to remove them to reduce the size of the data.
    • There are some potential problems with this approach. Sometimes, two stop words taken together have a very important meaning (e.g. the name of the band “The Who”). By removing the stop words, we remove both of these words, but The Who might actually have a significant meaning for our prediction task.
  • Stemming: This step is motivated by the desire to represent words with different endings as the same word. We probably do not need to draw a distinction between argue, argued, argues, and arguing. They could all be represented by a common stem, argu. The algorithmic process of performing this reduction is called stemming.
    There are many ways to approach the problem.

    1. One approach is to build a database of words and their stems.
      • A pro is that this approach handles exceptions very nicely, since we have defined all of the stems.
      • However, it will not handle new words at all, since they are not in the database.
        This is especially bad for problems where we’re using data from the internet, since we have no idea what words will be used.
    2. A different approach is to write a rule-based algorithm.
      In this approach, if a word ends in things like ed, ing, or ly, we would remove the ending.
      • A pro of this approach is that it handles new or unknown words well.
      • However, there are many exceptions, and this approach would miss all of these.
        Words like child and children would be considered different, but it would get other plurals, like dog and dogs.

    This second approach is widely popular and is called the Porter Stemmer, designed by Martin Porter in 1980, and it’s still used today.
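As a quick illustration (not part of the lecture), the SnowballC package loaded later in this unit exposes the Porter stemmer directly through wordStem():

library(SnowballC)
# all four variants reduce to the common stem "argu"
wordStem(c("argue", "argued", "argues", "arguing"))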

VIDEO 2: TEXT ANALYTICS

QUICK QUESTION

Which of these problems is the LEAST likely to be a good application of natural language processing?

Ans:Judging the winner of a poetry contest

EXPLANATION:Judging the winner of a poetry contest requires a deep level of human understanding and emotion. Perhaps someday a computer will be able to accurately judge the winner of a poetry contest, but currently the other three tasks are much better suited for natural language processing.

VIDEO 3: CREATING THE DATASET

QUICK QUESTION

For each tweet, we computed an overall score by averaging all five scores assigned by the Amazon Mechanical Turk workers. However, Amazon Mechanical Turk workers might make significant mistakes when labeling a tweet. The mean could be highly affected by this.

Which of the three alternative metrics below would best capture the typical opinion of the five Amazon Mechanical Turk workers, would be less affected by mistakes, and is well-defined regardless of the five labels?

Ans:An overall score equal to the median (middle) score

EXPLANATION:The correct answer is the first one - the median would capture the typical opinion of the workers and tends to be less affected by significant mistakes. The majority score might not have given a score to all tweets because they might not all have a majority score (consider a tweet with scores 0, 0, 1, 1, and 2). The minimum score does not necessarily capture the typical opinion and could be highly affected by mistakes (consider a tweet with scores -2, 1, 1, 1, 1).
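A quick numerical check of this reasoning in R (illustrative, not part of the course script), using the mistaken-label example from the explanation:

scores = c(-2, 1, 1, 1, 1)  # one worker made a significant mistake
mean(scores)                # 0.4 - pulled down by the outlier
median(scores)              # 1   - still reflects the typical opinion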

VIDEO 4: BAG OF WORDS

QUICK QUESTION

For each of the following questions, pick the preprocessing task that we discussed in the previous video that would change the sentence “Data is useful AND powerful!” to the new sentence listed in the question.

New sentence: Data useful powerful!

Ans:Removing stop words

New sentence: data is useful and powerful

Ans:Cleaning up irregularities (changing to lowercase and removing punctuation)

New sentence: Data is use AND power!

Ans:Stemming

EXPLANATION:The first new sentence has the stop words “is” and “and” removed. The second new sentence has the irregularities removed (no capital letters or punctuation). The third new sentence has the words stemmed - the “ful” is removed from “useful” and “powerful”.

VIDEO 5: PRE-PROCESSING IN R (R script reproduced here)

Sys.setlocale("LC_ALL", "C")
## [1] "C"
# Unit 5 - Twitter

# VIDEO 5

#LOADING AND PROCESSING DATA IN R
tweets = read.csv("tweets.csv", stringsAsFactors=FALSE) 
#Note: when working on a text analytics problem it is important (necessary!) to add the extra argument stringsAsFactors = FALSE, so that the text is read in properly.

#Let's take a look at the structure of our data:
str(tweets)
## 'data.frame':    1181 obs. of  2 variables:
##  $ Tweet: chr  "I have to say, Apple has by far the best customer care service I have ever received! @Apple @AppStore" "iOS 7 is so fricking smooth & beautiful!! #ThanxApple @Apple" "LOVE U @APPLE" "Thank you @apple, loving my new iPhone 5S!!!!!  #apple #iphone5S pic.twitter.com/XmHJCU4pcb" ...
##  $ Avg  : num  2 2 1.8 1.8 1.8 1.8 1.8 1.6 1.6 1.6 ...
#We have 1181 observations of 2 variables:
##Tweet: the text of the tweet.
##Avg: the average sentiment score.

#The tweet texts are real tweets gathered on the internet directed at Apple, with a few words cleaned up. We are more interested in being able to detect the tweets with clear negative sentiment, so let's define a new variable in our data set called Negative.

#equal to TRUE if the average sentiment score is less than or equal to -1
#equal to FALSE if the average sentiment score is greater than -1.

# Create dependent variable
tweets$Negative = as.factor(tweets$Avg <= -1)

table(tweets$Negative)
## 
## FALSE  TRUE 
##   999   182
#Now, to pre-process our text data so that we can use the 'Bag of Words' approach, we will be using the 'tm' text mining package.

#install.packages("tm")
library(tm)
#install.packages("SnowballC")
library(SnowballC)


#One of the concepts introduced by the tm package is that of a corpus. A corpus is a collection of documents. We need to convert our tweets into a corpus for pre-processing.

#Various functions in the tm package can be used to create a corpus in many different ways. We will create it from the Tweet column of our data frame using two functions, Corpus() and VectorSource(). We feed the latter the Tweet variable of the tweets data frame.

# Create corpus
corpus = Corpus(VectorSource(tweets$Tweet))

# Look at corpus
corpus
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 1181
#We can check that the documents match our tweets by using double brackets [[.
#To inspect the first (or 10th) tweet in our corpus, we select the first (or 10th) element as:
attributes(corpus[[1]])
## $names
## [1] "content" "meta"   
## 
## $class
## [1] "PlainTextDocument" "TextDocument"
corpus[[1]]$content
## [1] "I have to say, Apple has by far the best customer care service I have ever received! @Apple @AppStore"
corpus[[10]]$content
## [1] "Just checked out the specs on the new iOS 7...wow is all I have to say! I can't wait to get the new update ?? Bravo @Apple"
# IMPORTANT NOTE: If you are using the latest version of the tm package, you will need to run the following line before continuing (it converts corpus to a Plain Text Document). This is a recent change having to do with the tolower function that occurred after this video was recorded.
corpus = tm_map(corpus, PlainTextDocument)


#Converting text to lower case

#Pre-processing is easy in tm.
#Each operation, like stemming or removing stop words, can be done with one line in R, where we use the tm_map() function which takes as its first argument the name of a corpus and as second argument a function performing the transformation that we want to apply to the text.

#To transform all text to lower case:
corpus = tm_map(corpus, content_transformer(tolower))

#Checking the same two "documents" as before:
corpus[[1]]$content
## [1] "i have to say, apple has by far the best customer care service i have ever received! @apple @appstore"
corpus[[10]]$content
## [1] "just checked out the specs on the new ios 7...wow is all i have to say! i can't wait to get the new update ?? bravo @apple"
# Removing punctuation
corpus = tm_map(corpus, removePunctuation)
corpus[[1]]$content
## [1] "i have to say apple has by far the best customer care service i have ever received apple appstore"
corpus[[10]]$content
## [1] "just checked out the specs on the new ios 7wow is all i have to say i cant wait to get the new update  bravo apple"
# Look at the stop words provided by the tm package. It is necessary to define a list of words that we regard as stop words, and for this the tm package provides a default list for the English language. We can check it out with:
stopwords("english")[1:10]
##  [1] "i"         "me"        "my"        "myself"    "we"       
##  [6] "our"       "ours"      "ourselves" "you"       "your"
length(stopwords("english"))
## [1] 174
#Next we want to remove the stop words in our tweets.
#Removing words can be done by passing removeWords to the tm_map() function, along with an extra argument specifying which words we want to remove.
#We will remove all of these English stop words, but we will also remove the word "apple" since all of these tweets have the word "apple" and it probably won't be very useful in our prediction problem.

# Removing stopwords and apple
corpus = tm_map(corpus, removeWords, c("apple", stopwords("english")))
corpus[[1]]$content
## [1] "   say    far  best customer care service   ever received  appstore"
corpus[[10]]$content
## [1] "just checked   specs   new ios 7wow      say  cant wait  get  new update  bravo "
#Stemming

#Lastly, we want to stem our document with the stemDocument argument.

# Stem document 
corpus = tm_map(corpus, stemDocument)
corpus[[1]]$content
## [1] "   say    far  best custom care servic   ever receiv  appstor"
corpus[[10]]$content
## [1] "just check   spec   new io 7wow      say  cant wait  get  new updat  bravo"
#We can see that this took off the ending of "customer," "service," "received," and "appstore."

##################################

#QUICK QUESTION  

#Q:Given a corpus in R, how many commands do you need to run in R to clean up the irregularities (removing capital letters and punctuation)?
#Ans:2

#Q:How many commands do you need to run to stem the document?
#Ans:1

#EXPLANATION:In R, you can clean up the irregularities with two lines:
#corpus = tm_map(corpus, tolower)
#corpus = tm_map(corpus, removePunctuation)
#And you can stem the document with one line:
#corpus = tm_map(corpus, stemDocument)

VIDEO 6: BAG OF WORDS IN R (R script reproduced here)

# Video 6

#Create a Document Term Matrix

#We are now ready to extract the word frequencies to be used in our prediction problem. The tm package provides a function called DocumentTermMatrix() that generates a matrix where:
#the rows correspond to documents, in our case tweets, and
#the columns correspond to words in those tweets.
#The values in the matrix are the number of times that word appears in each document.

corpus = tm_map(corpus, PlainTextDocument)

# Create matrix
frequencies=DocumentTermMatrix(corpus)

frequencies
## <<DocumentTermMatrix (documents: 1181, terms: 3289)>>
## Non-/sparse entries: 8980/3875329
## Sparsity           : 100%
## Maximal term length: 115
## Weighting          : term frequency (tf)
#We see that in the corpus there are 3289 unique words.

#Let's see what this matrix looks like using the inspect() function, in particular slicing a block of rows/columns from the Document Term Matrix by calling by their indices:

# Look at matrix 
inspect(frequencies[1000:1005,505:515])
## <<DocumentTermMatrix (documents: 6, terms: 11)>>
## Non-/sparse entries: 1/65
## Sparsity           : 98%
## Maximal term length: 9
## Weighting          : term frequency (tf)
## 
##               Terms
## Docs           cheapen cheaper check cheep cheer cheerio cherylcol chief
##   character(0)       0       0     0     0     0       0         0     0
##   character(0)       0       0     0     0     0       0         0     0
##   character(0)       0       0     0     0     0       0         0     0
##   character(0)       0       0     0     0     0       0         0     0
##   character(0)       0       0     0     0     0       0         0     0
##   character(0)       0       0     0     0     1       0         0     0
##               Terms
## Docs           chiiiiqu child children
##   character(0)        0     0        0
##   character(0)        0     0        0
##   character(0)        0     0        0
##   character(0)        0     0        0
##   character(0)        0     0        0
##   character(0)        0     0        0
#In this range we see that the word "cheer" appears in the tweet 1005, but "cheap" does not appear in any of these tweets. This data is what we call sparse. This means that there are many zeros in our matrix.

#We can look at what the most popular terms are, or words, with the function findFreqTerms(), selecting a minimum number of 20 occurrences over the whole corpus:
# Check for sparsity
findFreqTerms(frequencies, lowfreq=20)
##  [1] "android"              "anyon"                "app"                 
##  [4] "appl"                 "back"                 "batteri"             
##  [7] "better"               "buy"                  "can"                 
## [10] "cant"                 "come"                 "dont"                
## [13] "fingerprint"          "freak"                "get"                 
## [16] "googl"                "ios7"                 "ipad"                
## [19] "iphon"                "iphone5"              "iphone5c"            
## [22] "ipod"                 "ipodplayerpromo"      "itun"                
## [25] "just"                 "like"                 "lol"                 
## [28] "look"                 "love"                 "make"                
## [31] "market"               "microsoft"            "need"                
## [34] "new"                  "now"                  "one"                 
## [37] "phone"                "pleas"                "promo"               
## [40] "promoipodplayerpromo" "realli"               "releas"              
## [43] "samsung"              "say"                  "store"               
## [46] "thank"                "think"                "time"                
## [49] "twitter"              "updat"                "use"                 
## [52] "via"                  "want"                 "well"                
## [55] "will"                 "work"
#Out of the 3289 words in our matrix, only 56 words appear at least 20 times in our tweets.
#This means that we probably have a lot of terms that will be pretty useless for our prediction model. The number of terms is an issue for two main reasons:
#One is computational: more terms means more independent variables, which usually means it takes longer to build our models.
#The other is that in building models the ratio of independent variables to observations will affect how well the model will generalize.

# Remove sparse terms (i.e. remove terms that don't appear very often)
sparse = removeSparseTerms(frequencies, 0.995)

#This function takes a second parameter, the sparsity threshold. The sparsity threshold works as follows.
#If we say 0.98, this means to only keep terms that appear in 2% or more of the tweets.
#If we say 0.99, that means to only keep terms that appear in 1% or more of the tweets.
#If we say 0.995, that means to only keep terms that appear in 0.5% or more of the tweets, about six or more tweets.
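#As a quick sanity check (added here, not in the original script), with 1181 tweets the 0.995
#threshold means a term must appear in more than (1 - 0.995) * nrow(tweets) documents:
(1 - 0.995) * nrow(tweets)
## [1] 5.905
#i.e. in six or more tweets, which matches the "about six or more tweets" figure above.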

#Let's see what the new Document Term Matrix properties look like:
sparse
## <<DocumentTermMatrix (documents: 1181, terms: 309)>>
## Non-/sparse entries: 4669/360260
## Sparsity           : 99%
## Maximal term length: 20
## Weighting          : term frequency (tf)
#It only contains 309 unique terms, i.e. only about 9.4% of the full set.

# Convert sparse to a data frame to use for predictive modeling
tweetsSparse = as.data.frame(as.matrix(sparse))


#Fix variables names in the data frame

#Since R struggles with variable names that start with a number, and we probably have some words here that start with a number, we should run the make.names() function to make sure all of our words are appropriate variable names. It will convert the variable names to make sure they are all appropriate names for R before we build our predictive models. You should do this each time you build a data frame using text analytics.

# Make all variable names R-friendly
colnames(tweetsSparse) = make.names(colnames(tweetsSparse))

# Add dependent variable
#We should add our dependent variable back to this data frame. We'll call it tweetsSparse$Negative and set it equal to the original Negative variable from the tweets data frame.
tweetsSparse$Negative = tweets$Negative


# Split the data in training/testing sets
library(caTools)

set.seed(123)

split = sample.split(tweetsSparse$Negative, SplitRatio = 0.7)

trainSparse = subset(tweetsSparse, split==TRUE)
testSparse = subset(tweetsSparse, split==FALSE)

#QUICK QUESTION  

#In the previous video, we showed a list of all words that appear at least 20 times in our tweets. Which of the following words appear at least 100 times? Select all that apply. (HINT: use the findFreqTerms function)
findFreqTerms(frequencies, lowfreq=100)
## [1] "iphon" "itun"  "new"
#Ans:"iphon", "itun", and "new"

VIDEO 7: PREDICTING SENTIMENT (R script reproduced here)

# Video 7

# Build a CART model

library(rpart)
library(rpart.plot)


#Let's first use CART to build a predictive model, using the rpart() function to predict Negative using all of the other variables as our independent variables and the data set trainSparse.

#We'll add one more argument here, which is method = "class" so that the rpart() function knows to build a classification model. We keep default settings for all other parameters, in particular we are not adding anything for minbucket or cp.
#Building the classification model with all the IVs 
tweetCART = rpart(Negative ~ ., data=trainSparse, method="class")

#plotting the tree
prp(tweetCART)

#The tree says that
#if the word "freak" is in the tweet, then predict TRUE, or negative sentiment.
#If the word "freak" is not in the tweet, but the word "hate" is, again predict TRUE.
#If neither of these two words are in the tweet, but the word "wtf" is, also predict TRUE, or negative sentiment.
#If none of these three words are in the tweet, then predict FALSE, or non-negative sentiment.

#This tree makes sense intuitively since these three words are generally seen as negative words.


# Evaluate the Out-of-Sample numerical performance of the model to get class predictions
#Using the predict() function we compute the predictions of our model tweetCART on the new data set testSparse. Be careful to add the argument type = "class" to make sure we get class predictions.
predictCART = predict(tweetCART, newdata=testSparse, type="class")

#computing the confusion matrix from the predictions
cmat_CART<-table(testSparse$Negative, predictCART)
cmat_CART
##        predictCART
##         FALSE TRUE
##   FALSE   294    6
##   TRUE     37   18
# Compute accuracy
accu_CART <- (cmat_CART[1,1] + cmat_CART[2,2])/sum(cmat_CART) #(294+18)/(294+6+37+18)=0.8788732
#Ans:Overall accuracy=0.8788732
#Sensitivity = 18 / 55 = 0.3273 ( = TP rate)
#Specificity = 294 / 300 = 0.98
#FP rate = 6 / 300 = 0.02
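#The class-level rates quoted above can be computed directly from the confusion matrix
#(added as a check; not part of the original lecture script):
sens_CART <- cmat_CART[2,2] / sum(cmat_CART[2,])  # sensitivity (TP rate) = 18/55  = 0.3273
spec_CART <- cmat_CART[1,1] / sum(cmat_CART[1,])  # specificity (TN rate) = 294/300 = 0.98
fpr_CART  <- cmat_CART[1,2] / sum(cmat_CART[1,])  # FP rate = 6/300 = 0.02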


#Comparison with the baseline accuracy
#Let's compare this to a simple baseline model that always predicts non-negative (i.e. the most common value of the dependent variable).
#To compute the accuracy of the baseline model, let's make a table of just the outcome variable Negative.
cmat_baseline<-table(testSparse$Negative)
cmat_baseline
## 
## FALSE  TRUE 
##   300    55
accu_baseline <- max(cmat_baseline)/sum(cmat_baseline) #300/(300+55)=0.8450704
#Ans:Baseline model accuracy=0.8450704

#So our CART model does better than the baseline model. Let's see how Random Forest does.


#Random forest model

library(randomForest)
set.seed(123)


#Building the Random Forest model with all the IVs (this takes considerably longer since we have a large number of IVs)
#We use the randomForest() function to predict Negative again using all of our other variables as independent variables and the data set trainSparse. Again we use the default parameter settings:
tweetRF = randomForest(Negative ~ ., data=trainSparse)

# Make Out-of-Sample predictions:
predictRF = predict(tweetRF, newdata=testSparse)

#computing the confusion matrix
cmat_RF<-table(testSparse$Negative, predictRF)
cmat_RF
##        predictRF
##         FALSE TRUE
##   FALSE   293    7
##   TRUE     34   21
#Overall model Accuracy:
accu_RF <- (cmat_RF[1,1] + cmat_RF[2,2])/sum(cmat_RF)
accu_RF #(293+21)/(293+7+34+21)=0.884507
## [1] 0.884507
#The overall accuracy of this Random Forest model is 0.884507

#The accuracy is slightly better than the CART model, but the CART model is more interpretable than the Random Forest, so I would probably use the CART model.
#If you were to use cross-validation to pick the cp parameter for the CART model, the accuracy would increase to about the same as the random forest model. So by using a bag-of-words approach and these models, we can reasonably predict sentiment even with a relatively small data set of tweets.
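#A sketch of how that cross-validation could be done with the caret package (not run here;
#the cp grid is just an illustrative choice):
#library(caret)
#set.seed(123)
#numFolds = trainControl(method = "cv", number = 10)
#cpGrid = expand.grid(.cp = seq(0.002, 0.05, 0.002))
#train(Negative ~ ., data = trainSparse, method = "rpart", trControl = numFolds, tuneGrid = cpGrid)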

##################################

#QUICK QUESTION

#Comparison with logistic regression model

#In the previous video, we used CART and Random Forest to predict sentiment. Let's see how well logistic regression does. Build a logistic regression model (using the training set) to predict "Negative" using all of the independent variables. You may get a warning message after building your model - don't worry (we explain what it means in the explanation).

#Build the model, using all independent variables as predictors:
tweetLog<- glm(Negative ~ . , data =trainSparse, family = binomial)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#summary(tweetLog)


#Now, make predictions on the testing set using the logistic regression model:
predictLog= predict(tweetLog, newdata=testSparse, type="response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
#where "tweetLog" should be the name of your logistic regression model. You might also get a warning message after this command, but don't worry - it is due to the same problem as the previous warning message.

#Build a confusion matrix (with a threshold of 0.5) and compute the accuracy of the model.What is the accuracy?

# Confusion matrix with threshold of 0.5
cmat_log<-table(testSparse$Negative, predictLog> 0.5)
cmat_log
##        
##         FALSE TRUE
##   FALSE   253   47
##   TRUE     22   33
#Let's now compute the overall accuracy
accu_log <- (cmat_log[1,1] + cmat_log[2,2])/sum(cmat_log)
accu_log #(253+33)/(253+47+22+33) = 0.8056338
## [1] 0.8056338
#Ans:0.8056338
#EXPLANATION:The accuracy is (253+33)/(253+47+22+33) = 0.8056338, which is worse than the baseline.

#The Perils of Over-fitting:
#If you were to compute the accuracy on the training set instead, you would see that the model does really well on the training set - this is an example of over-fitting. The model fits the training set really well, but does not perform well on the test set. A logistic regression model with a large number of variables is particularly at risk for overfitting.
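#To see the overfitting described above, we can compute the model's accuracy on the training
#set and compare it with the test-set accuracy (added here as an illustration; not part of the
#original quick question):
predictTrainLog = predict(tweetLog, type="response")
cmat_log_train <- table(trainSparse$Negative, predictTrainLog > 0.5)
(cmat_log_train[1,1] + cmat_log_train[2,2])/sum(cmat_log_train)
#This training accuracy should come out much higher than the 0.8056338 test accuracy above.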

#Note that you might have gotten a different answer than us, because the glm function struggles with this many variables. The warning messages that you might have seen in this problem have to do with the number of variables, and the fact that the model is overfitting to the training set. We'll discuss this in more detail in the Homework Assignment.

THE ANALYTICS EDGE

  • Analytical sentiment analysis can replace more labor-intensive methods like polling.
  • Text analytics can deal with the massive amounts of unstructured data being generated on the internet.
  • Computers are becoming more and more capable of interacting with humans and performing human tasks.

Man vs Machine_How IBM Built a Jeopardy Champion_2

INTRODUCTION

How IBM Built a Jeopardy! Champion

A Grand Challenge

  • In 2004, IBM Vice President Charles Lickel and coworkers were having dinner at a restaurant
  • All of a sudden, the restaurant fell silent
  • Everyone was watching the game show Jeopardy! on the television in the bar
  • A contestant, Ken Jennings, was setting the record for the longest winning streak of all time (75 days)

Why was everyone so interested?

  • Jeopardy! is a quiz show that asks complex and clever questions (puns, obscure facts, uncommon words)
  • Originally aired in 1964
  • A huge variety of topics
  • Generally viewed as an impressive feat to do well
  • No computer system had ever been developed that could even come close to competing with humans on Jeopardy!

A Tradition of Challenges

  • IBM Research strives to push the limits of science
    • Have a tradition of inspiring and difficult challenges
  • Deep Blue - a computer to compete against the best human chess players
    • A task that people thought was restricted to human intelligence
  • Blue Gene - a computer to map the human genome
    • A challenge for computer speed and performance

The Challenge Begins

  • In 2005, a team at IBM Research started creating a computer that could compete at Jeopardy!
    • No one knew how to beat humans, or if it was even possible
  • Six years later, a two-game exhibition match aired on television
    • The winner would receive $1,000,000

The Contestants

Ken_Jennings

  • Ken Jennings
    • Longest winning streak of 75 days
Brad_Rutter

  • Brad Rutter
    • Biggest money winner of over $3.5 million
Watson

  • Watson
    • A supercomputer with 3,000 processors and a database of 200 million pages of information

The Match Begins

match begins!

VIDEO 1: IBM WATSON

QUICK QUESTION

What were the goals of IBM when they set out to build Watson? Select all that apply.

Ans:To build a computer that could compete with the best human players at Jeopardy!, and to build a computer that could answer questions that are commonly believed to require human intelligence.

EXPLANATION:The main goals of IBM were to build a computer that could answer questions that are commonly believed to require human intelligence, and to therefore compete with the best human players at Jeopardy!.

VIDEO 2: THE GAME OF JEOPARDY

Overview of the Jeopardy! game

jeopardy

  • Three rounds per game
    • Jeopardy
    • Double Jeopardy (dollar values doubled)
    • Final Jeopardy (wager on response to one question)
  • Each round has five questions in six categories
    • Wide variety of topics (over 2,500 different categories)
  • Each question has a dollar value - the first to buzz in and answer correctly wins the money
    • If they answer incorrectly they lose the money

Example Round

Example_Round

Jeopardy! Questions

  • Cryptic definitions of categories and clues
  • Answer in the form of a question
    • Q: Mozart’s last and perhaps most powerful symphony shares its name with this planet.
    • A: What is Jupiter?
  • Q: Smaller than only Greenland, it’s the world’s second largest island.
    • A: What is New Guinea?

QUICK QUESTION

For which of the following reasons is Jeopardy! challenging? Select all that apply.

Ans:A wide variety of categories; Speed is required - you have to buzz in faster than your competitors; The categories and clues are often cryptic.

EXPLANATION:Jeopardy! is challenging because there are a wide variety of categories, speed is required, and the categories and clues are cryptic. Expert knowledge is not generally required.

VIDEO 3: WATSON’S DATABASE AND TOOLS

Why is Jeopardy Hard?

  • Wide variety of categories, purposely made cryptic
  • Computers can easily answer precise questions
    • What is the square root of (35672-183)/33?
  • Understanding natural language is hard
    • Where was Albert Einstein born?
    • Suppose you have the following information:
    “One day, from his city views of Ulm, Otto chose a water color to send to Albert Einstein as a remembrance of his birthplace.”
    • Ulm? Otto?

A Straightforward Approach

  • Let’s just store answers to all possible questions
  • This would be impossible
    • An analysis of 200,000 previous questions yielded over 2,500 different categories
  • Let’s just search Google
    • No links to the outside world permitted
    • It can take considerable skill to find the right webpage with the right information

Using Analytics

  • Watson received each question in text form
    • Normally, players see and hear the questions
  • IBM used analytics to make Watson a competitive player
  • Used over 100 different techniques for analyzing natural language, finding hypotheses, and ranking hypotheses

Watson’s Database and Tools

  • A massive number of data sources
    • Encyclopedias, texts, manuals, magazines, Wikipedia, etc.
  • Lexicon
    • Describes the relationship between different words
    • Ex: “Water” is a “clear liquid” but not all “clear liquids” are “water”
  • Part of speech tagger and parser
    • Identifies functions of words in text
    • Ex: “Race” can be a verb or a noun
      • He won the race by 10 seconds.
      • Please indicate your race.

How Watson Works

  • Step 1: Question Analysis
    • Figure out what the question is looking for
  • Step 2: Hypothesis Generation
    • Search information sources for possible answers
  • Step 3: Scoring Hypotheses
    • Compute confidence levels for each answer
  • Step 4: Final Ranking
    • Look for a highly supported answer

QUICK QUESTION

Which of the following two questions do you think would be EASIEST for a computer to answer?

Ans:What year was Abraham Lincoln born?

EXPLANATION:The second question would be the easiest, because the answer is a fact. The first question is much more subjective.

VIDEO 4: HOW WATSON WORKS - STEPS 1 AND 2

Step 1: Question Analysis

  • What is the question looking for?
  • Trying to find the Lexical Answer Type (LAT) of the question
    • Word or noun in the question that specifies the type of answer
  • Ex: “Mozart’s last and perhaps most powerful symphony shares its name with this planet.”
  • Ex: “Smaller than only Greenland, it’s the world’s second largest island.”

Step 1: Question Analysis

  • If we know the LAT, we know what to look for
  • In an analysis of 20,000 questions
    • 2,500 distinct LATs were found
    • 12% of the questions do not have an explicit LAT
    • The most frequent 200 explicit LATs cover less than 50% of the questions
  • Also performs relation detection to find relationships among words, and decomposition to split the question into different clues

Step 2: Hypothesis Generation

  • Uses the question analysis from Step 1 to produce candidate answers by searching the databases
  • Several hundred candidate answers are generated
  • Ex: “Mozart’s last and perhaps most powerful symphony shares its name with this planet.”
    • Candidate answers: Mercury, Earth, Jupiter, etc.
  • Then each candidate answer plugged back into the question in place of the LAT is considered a hypothesis
    • Hypothesis 1: “Mozart’s last and perhaps most powerful symphony shares its name with Mercury.”
    • Hypothesis 2: “Mozart’s last and perhaps most powerful symphony shares its name with Jupiter.”
    • Hypothesis 3: “Mozart’s last and perhaps most powerful symphony shares its name with Earth.”
  • If the correct answer is not generated at this stage, Watson has no hope of getting the question right
  • This step errs on the side of generating a lot of hypotheses, and leaves it up to the next step to find the correct answer

QUICK QUESTION

Select the LAT of the following Jeopardy question: NICHOLAS II WAS THE LAST RULING CZAR OF THIS ROYAL FAMILY (Hint: The answer is “The Romanovs”)

Ans:THIS ROYAL FAMILY

Select the LAT of the following Jeopardy question: REGARDING THIS DEVICE, ARCHIMEDES SAID, “GIVE ME A PLACE TO STAND ON, AND I WILL MOVE THE EARTH” (Hint: The answer is “A lever”)

Ans: THIS DEVICE

EXPLANATION:The LAT in the first question is “THIS ROYAL FAMILY” and the LAT in the second question is “THIS DEVICE”. Remember that if you replace the LAT with the correct answer, the sentence should make sense.

VIDEO 5: HOW WATSON WORKS - STEPS 3 AND 4

Step 3: Scoring Hypotheses

  • Compute confidence levels for each possible answer
    • Need to accurately estimate the probability of a proposed answer being correct
    • Watson will only buzz in if a confidence level is above a threshold
  • Combines a large number of different methods

Lightweight Scoring Algorithms

  • Starts with “lightweight scoring algorithms” to prune down the large set of hypotheses
  • Ex: What is the likelihood that a candidate answer is an instance of the LAT?
    • If this likelihood is not very high, throw away the hypothesis
  • Candidate answers that pass this step proceed to the next stage
    • Watson lets about 100 candidates pass into the next stage

Scoring Analytics

  • Need to gather supporting evidence for each candidate answer
  • Passage Search
    • Retrieve passages that contain the hypothesis text
    • Let’s see what happens when we search for our hypotheses on Google
    • Hypothesis 1: “Mozart’s last and perhaps most powerful symphony shares its name with Mercury.”
    • Hypothesis 2: “Mozart’s last and perhaps most powerful symphony shares its name with Jupiter.”

Passage Search

Passage Search

Passage_Search_diff

Scoring Analytics

  • Determine the degree of certainty that the evidence supports the candidate answers
  • More than 50 different scoring components
  • Ex: Temporal relationships
    • “In 1594, he took a job as a tax collector in Andalusia”
    • Two candidate answers: Thoreau and Cervantes
    • Thoreau was not born until 1817, so we are more confident about Cervantes

Step 4: Final Merging and Ranking

  • Selecting the single best supported hypothesis
  • First need to merge similar answers
    • Multiple candidate answers may be equivalent
      • Ex: “Abraham Lincoln” and “Honest Abe”
    • Combine scores
  • Rank the hypotheses and estimate confidence
    • Use predictive analytics

Ranking and Confidence Estimation

  • Training data is a set of historical Jeopardy! questions
  • Each of the scoring algorithms is an independent variable
  • Use logistic regression to predict whether or not a candidate answer is correct, using the scores
  • If the confidence for the best answer is high enough, Watson buzzes in to answer the question
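As a minimal, purely hypothetical sketch of this ranking step (the data frame and column names below are invented for illustration, not IBM's actual implementation): each scoring algorithm contributes one independent variable, and logistic regression turns them into a confidence estimate.

# candidateScores: one row per candidate answer, one column per scoring algorithm,
# plus a 0/1 column "correct" labeled from historical Jeopardy! questions
rankModel = glm(correct ~ ., data = candidateScores, family = binomial)
confidence = predict(rankModel, newdata = newCandidates, type = "response")
# Watson buzzes in only if the highest confidence exceeds its threshold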

The Watson System

Watson System

  • Eight refrigerator-sized cabinets
  • High speed local storage for all information
  • Originally took over two hours to answer one question
  • This had to be reduced to 2-6 seconds

QUICK QUESTION

To predict which candidate answer is correct, we said that Watson uses logistic regression. Which of the other techniques that we have learned could be used instead? Select all that apply.

Ans:CART, Random Forests

EXPLANATION:CART and Random Forests are both techniques that are also used for classification, and could provide confidence probabilities.

VIDEO 6: THE RESULTS

Progress from 2006 - 2010

progress

Let the games begin!

  • The games were scheduled for February 2011
  • Two games were played, and the winner would be the contestant with the highest winnings over the two games

The Results

results

What’s Next for Watson

  • Apply to other domains
    • Watson is ideally suited to answering questions which cover a wide range of material and often have to deal with inconsistent or incomplete information
  • Medicine
    • The amount of medical information available is doubling every 5 years and a lot of the data is unstructured
    • Cancer diagnosis and selecting the best course of treatment
      • MD Anderson and Memorial Sloan-Kettering Cancer Centers

The Analytics Edge

  • Combine many algorithms to increase accuracy and confidence
    • Any one algorithm wouldn’t have worked
  • Approach the problem in a different way than how a human does
    • Hypothesis generation
  • Deal with massive amounts of data, often in unstructured form
    • 90% of data is unstructured

Predictive Coding_Bringing Text Analytics to the Courtroom (Recitation)_3

VIDEO 1: THE STORY OF ENRON

INTRODUCTION

We will be looking into how to use the text of emails in the inboxes of Enron executives to predict if those emails are relevant to an investigation into the company.

We will be extracting word frequencies from the text of the documents, and then integrating those frequencies into predictive models.

We are going to talk about predictive coding – an emerging use of text analytics in the area of criminal justice.

The case we will consider concerns Enron, a US energy company based out of Houston, Texas, that was involved in a number of electricity production and distribution markets and that collapsed in the early 2000s after widespread accounting fraud was exposed. To date Enron remains a stunning symbol of corporate corruption.

While Enron’s collapse stemmed largely from accounting fraud, the firm also faced sanctions for its involvement in the California electricity crisis.
In 2000 and 2001, California experienced a number of power blackouts, despite having sufficient generating capacity.
It later surfaced that Enron played a key role in this energy crisis by artificially reducing power supply to spike prices and then making a profit from this market instability.

The Federal Energy Regulatory Commission, or FERC, investigated Enron’s involvement in the crisis, and its investigation eventually led to a $1.52 billion settlement.
FERC’s investigation into Enron will be the topic of today’s recitation.

The eDiscovery Problem

Enron was a huge company, and its corporate servers contained millions of emails and other electronic files. Sifting through these documents to find the ones relevant to an investigation is no simple task.

In law, this electronic document retrieval process is called the eDiscovery problem, and relevant files are called responsive documents.
Traditionally, the eDiscovery problem has been solved by keyword search (in our case, perhaps searching for phrases like “electricity bid” or “energy schedule”), followed by an expensive and time-consuming manual review process, in which attorneys read through thousands of documents to determine which ones are responsive.
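For comparison, the keyword-search step described above could be sketched in R as follows (illustrative only: the emails data frame is loaded in the next video's script, and the search phrases are just examples):

keywordHits = grepl("electricity bid|energy schedule", emails$email, ignore.case = TRUE)
table(keywordHits)  # documents flagged for the manual attorney review that follows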

Predictive Coding

Predictive coding is a new technique in which attorneys manually label some documents and then use text analytics models trained on the manually labeled documents to predict which of the remaining documents are responsive.

VIDEO 2: THE DATA (R script reproduced here)

The Data

As part of its investigation, the FERC released hundreds of thousands of emails from top executives at Enron, creating the largest publicly available set of emails today.
We will use this data set called the Enron Corpus to perform predictive coding in this recitation.

The data set contains just two fields:

  • email: the text of the email in question,
  • responsive: a binary (0/1) variable telling whether the email relates to energy schedules or bids.

The labels for these emails were made by attorneys as part of the 2010 Text Retrieval Conference (TREC) Legal Track, a predictive coding competition.

# Unit 5 - Recitation


# Video 2

# Load the dataset
emails = read.csv("energy_bids.csv", stringsAsFactors=FALSE)

#Let's see the structure
str(emails)
## 'data.frame':    855 obs. of  2 variables:
##  $ email     : chr  "North America's integrated electricity market requires cooperation on environmental policies Commission for Environmental Coope"| __truncated__ "FYI -----Original Message----- From: \t\"Ginny Feliciano\" <gfeliciano@earthlink.net>@ENRON [mailto:IMCEANOTES-+22Ginny+20Felic"| __truncated__ "14:13:53 Synchronizing Mailbox 'Kean, Steven J.' 14:13:53 Synchronizing Hierarchy 14:13:53 Synchronizing Favorites 14:13:53 Syn"| __truncated__ "^ ----- Forwarded by Steven J Kean/NA/Enron on 03/02/2001 12:27 PM ----- Suzanne_Nimocks@mckinsey.com Sent by: Carol_Benter@mck"| __truncated__ ...
##  $ responsive: int  0 1 0 1 0 0 1 0 0 0 ...
#Let's look at a few examples (using the strwrap() function for easier-to-read formatting):
strwrap(emails$email[1])
##  [1] "North America's integrated electricity market requires cooperation"                         
##  [2] "on environmental policies Commission for Environmental Cooperation"                         
##  [3] "releases working paper on North America's electricity market"                               
##  [4] "Montreal, 27 November 2001 -- The North American Commission for"                            
##  [5] "Environmental Cooperation (CEC) is releasing a working paper"                               
##  [6] "highlighting the trend towards increasing trade, competition and"                           
##  [7] "cross-border investment in electricity between Canada, Mexico and"                          
##  [8] "the United States. It is hoped that the working paper,"                                     
##  [9] "Environmental Challenges and Opportunities in the Evolving North"                           
## [10] "American Electricity Market, will stimulate public discussion"                              
## [11] "around a CEC symposium of the same title about the need to"                                 
## [12] "coordinate environmental policies trinationally as a North"                                 
## [13] "America-wide electricity market develops. The CEC symposium will"                           
## [14] "take place in San Diego on 29-30 November, and will bring together"                         
## [15] "leading experts from industry, academia, NGOs and the governments"                          
## [16] "of Canada, Mexico and the United States to consider the impact of"                          
## [17] "the evolving continental electricity market on human health and"                            
## [18] "the environment. \"Our goal [with the working paper and the"                                
## [19] "symposium] is to highlight key environmental issues that must be"                           
## [20] "addressed as the electricity markets in North America become more"                          
## [21] "and more integrated,\" said Janine Ferretti, executive director of"                         
## [22] "the CEC. \"We want to stimulate discussion around the important"                            
## [23] "policy questions being raised so that countries can cooperate in"                           
## [24] "their approach to energy and the environment.\" The CEC, an"                                
## [25] "international organization created under an environmental side"                             
## [26] "agreement to NAFTA known as the North American Agreement on"                                
## [27] "Environmental Cooperation, was established to address regional"                             
## [28] "environmental concerns, help prevent potential trade and"                                   
## [29] "environmental conflicts, and promote the effective enforcement of"                          
## [30] "environmental law. The CEC Secretariat believes that greater North"                         
## [31] "American cooperation on environmental policies regarding the"                               
## [32] "continental electricity market is necessary to: * protect air"                              
## [33] "quality and mitigate climate change, * minimize the possibility of"                         
## [34] "environment-based trade disputes, * ensure a dependable supply of"                          
## [35] "reasonably priced electricity across North America * avoid"                                 
## [36] "creation of pollution havens, and * ensure local and national"                              
## [37] "environmental measures remain effective. The Changing Market The"                           
## [38] "working paper profiles the rapid changing North American"                                   
## [39] "electricity market. For example, in 2001, the US is projected to"                           
## [40] "export 13.1 thousand gigawatt-hours (GWh) of electricity to Canada"                         
## [41] "and Mexico. By 2007, this number is projected to grow to 16.9"                              
## [42] "thousand GWh of electricity. \"Over the past few decades, the North"                        
## [43] "American electricity market has developed into a complex array of"                          
## [44] "cross-border transactions and relationships,\" said Phil Sharp,"                            
## [45] "former US congressman and chairman of the CEC's Electricity"                                
## [46] "Advisory Board. \"We need to achieve this new level of cooperation"                         
## [47] "in our environmental approaches as well.\" The Environmental"                               
## [48] "Profile of the Electricity Sector The electricity sector is the"                            
## [49] "single largest source of nationally reported toxins in the United"                          
## [50] "States and Canada and a large source in Mexico. In the US, the"                             
## [51] "electricity sector emits approximately 25 percent of all NOx"                               
## [52] "emissions, roughly 35 percent of all CO2 emissions, 25 percent of"                          
## [53] "all mercury emissions and almost 70 percent of SO2 emissions."                              
## [54] "These emissions have a large impact on airsheds, watersheds and"                            
## [55] "migratory species corridors that are often shared between the"                              
## [56] "three North American countries. \"We want to discuss the possible"                          
## [57] "outcomes from greater efforts to coordinate federal, state or"                              
## [58] "provincial environmental laws and policies that relate to the"                              
## [59] "electricity sector,\" said Ferretti. \"How can we develop more"                             
## [60] "compatible environmental approaches to help make domestic"                                  
## [61] "environmental policies more effective?\" The Effects of an"                                 
## [62] "Integrated Electricity Market One key issue raised in the paper is"                         
## [63] "the effect of market integration on the competitiveness of"                                 
## [64] "particular fuels such as coal, natural gas or renewables. Fuel"                             
## [65] "choice largely determines environmental impacts from a specific"                            
## [66] "facility, along with pollution control technologies, performance"                           
## [67] "standards and regulations. The paper highlights other impacts of a"                         
## [68] "highly competitive market as well. For example, concerns about so"                          
## [69] "called \"pollution havens\" arise when significant differences in"                          
## [70] "environmental laws or enforcement practices induce power companies"                         
## [71] "to locate their operations in jurisdictions with lower standards."                          
## [72] "\"The CEC Secretariat is exploring what additional environmental"                           
## [73] "policies will work in this restructured market and how these"                               
## [74] "policies can be adapted to ensure that they enhance"                                        
## [75] "competitiveness and benefit the entire region,\" said Sharp."                               
## [76] "Because trade rules and policy measures directly influence the"                             
## [77] "variables that drive a successfully integrated North American"                              
## [78] "electricity market, the working paper also addresses fuel choice,"                          
## [79] "technology, pollution control strategies and subsidies. The CEC"                            
## [80] "will use the information gathered during the discussion period to"                          
## [81] "develop a final report that will be submitted to the Council in"                            
## [82] "early 2002. For more information or to view the live video webcast"                         
## [83] "of the symposium, please go to: http://www.cec.org/electricity."                            
## [84] "You may download the working paper and other supporting documents"                          
## [85] "from:"                                                                                      
## [86] "http://www.cec.org/programs_projects/other_initiatives/electricity/docs.cfm?varlan=english."
## [87] "Commission for Environmental Cooperation 393, rue St-Jacques"                               
## [88] "Ouest, Bureau 200 Montr<U+00C3><U+00A9>al (Qu<U+00C3><U+00A9>bec) Canada H2Y 1N9 Tel: (514)"                            
## [89] "350-4300; Fax: (514) 350-4314 E-mail: info@ccemtl.org ***********"
# Look at emails
emails$email[1]
## [1] "North America's integrated electricity market requires cooperation on environmental policies Commission for Environmental Cooperation releases working paper on North America's electricity market Montreal, 27 November 2001 -- The North American Commission for Environmental Cooperation (CEC) is releasing a working paper highlighting the trend towards increasing trade, competition and cross-border investment in electricity between Canada, Mexico and the United States. It is hoped that the working paper, Environmental Challenges and Opportunities in the Evolving North American Electricity Market, will stimulate public discussion around a CEC symposium of the same title about the need to coordinate environmental policies trinationally as a North America-wide electricity market develops. The CEC symposium will take place in San Diego on 29-30 November, and will bring together leading experts from industry, academia, NGOs and the governments of Canada, Mexico and the United States to consider the impact of the evolving continental electricity market on human health and the environment. \"Our goal [with the working paper and the symposium] is to highlight key environmental issues that must be addressed as the electricity markets in North America become more and more integrated,\" said Janine Ferretti, executive director of the CEC. \"We want to stimulate discussion around the important policy questions being raised so that countries can cooperate in their approach to energy and the environment.\" The CEC, an international organization created under an environmental side agreement to NAFTA known as the North American Agreement on Environmental Cooperation, was established to address regional environmental concerns, help prevent potential trade and environmental conflicts, and promote the effective enforcement of environmental law. The CEC Secretariat believes that greater North American cooperation on environmental policies regarding the continental electricity market is necessary to: *  protect air quality and mitigate climate change, *  minimize the possibility of environment-based trade disputes, *  ensure a dependable supply of reasonably priced electricity across North America *  avoid creation of pollution havens, and *  ensure local and national environmental measures remain effective. The Changing Market The working paper profiles the rapid changing North American electricity market. For example, in 2001, the US is projected to export 13.1 thousand gigawatt-hours (GWh) of electricity to Canada and Mexico. By 2007, this number is projected to grow to 16.9 thousand GWh of electricity. \"Over the past few decades, the North American electricity market has developed into a complex array of cross-border transactions and relationships,\" said Phil Sharp, former US congressman and chairman of the CEC's Electricity Advisory Board. \"We need to achieve this new level of cooperation in our environmental approaches as well.\" The Environmental Profile of the Electricity Sector The electricity sector is the single largest source of nationally reported toxins in the United States and Canada and a large source in Mexico. In the US, the electricity sector emits approximately 25 percent of all NOx emissions, roughly 35 percent of all CO2 emissions, 25 percent of all mercury emissions and almost 70 percent of SO2 emissions. These emissions have a large impact on airsheds, watersheds and migratory species corridors that are often shared between the three North American countries. 
\"We want to discuss the possible outcomes from greater efforts to coordinate federal, state or provincial environmental laws and policies that relate to the electricity sector,\" said Ferretti. \"How can we develop more compatible environmental approaches to help make domestic environmental policies more effective?\" The Effects of an Integrated Electricity Market One key issue raised in the paper is the effect of market integration on the competitiveness of particular fuels such as coal, natural gas or renewables. Fuel choice largely determines environmental impacts from a specific facility, along with pollution control technologies, performance standards and regulations. The paper highlights other impacts of a highly competitive market as well. For example, concerns about so called \"pollution havens\" arise when significant differences in environmental laws or enforcement practices induce power companies to locate their operations in jurisdictions with lower standards. \"The CEC Secretariat is exploring what additional environmental policies will work in this restructured market and how these policies can be adapted to ensure that they enhance competitiveness and benefit the entire region,\" said Sharp. Because trade rules and policy measures directly influence the variables that drive a successfully integrated North American electricity market, the working paper also addresses fuel choice, technology, pollution control strategies and subsidies. The CEC will use the information gathered during the discussion period to develop a final report that will be submitted to the Council in early 2002. For more information or to view the live video webcast of the symposium, please go to: http://www.cec.org/electricity. You may download the working paper and other supporting documents from: http://www.cec.org/programs_projects/other_initiatives/electricity/docs.cfm?varlan=english. Commission for Environmental Cooperation 393, rue St-Jacques Ouest, Bureau 200 Montr<U+00C3><U+00A9>al (Qu<U+00C3><U+00A9>bec) Canada H2Y 1N9 Tel: (514) 350-4300; Fax: (514) 350-4314 E-mail: info@ccemtl.org ***********"
#We can see just by reading through the first couple of lines that this is an email about a new working paper, "Environmental Challenges and Opportunities in the Evolving North American Electricity Market", released by the Commission for Environmental Cooperation, or CEC. While it certainly deals with electricity markets, it has nothing to do with energy schedules or bids, so it is not responsive to our query.


#If we look at the value in the responsive variable for this email:
emails$responsive[1]
## [1] 0
#we see that its value is 0, as expected.

#Let's check the second email:
emails$email[2]
## [1] "FYI -----Original Message----- From: \t\"Ginny Feliciano\" <gfeliciano@earthlink.net>@ENRON [mailto:IMCEANOTES-+22Ginny+20Feliciano+22+20+3Cgfeliciano+40earthlink+2Enet+3E+40ENRON@ENRON.com] Sent:\tThursday, June 28, 2001 3:40 PM To:\tSilvia Woodard; Paul Runci; Katrin Thomas; John A. Riggs; Kurt E. Yeager; Gregg Ward; Philip K. Verleger; Admiral Richard H. Truly; Susan Tomasky; Tsutomu Toichi; Susan F. Tierney; John A. Strom; Gerald M. Stokes; Kevin Stoffer; Edward M. Stern; Irwin M. Stelzer; Hoff Stauffer; Steven R. Spencer; Robert Smart; Bernie Schroeder; George A. Schreiber, Jr.; Robert N. Schock; James R. Schlesinger; Roger W. Sant; John W. Rowe; James E. Rogers; John F. Riordan; James Ragland; Frank J. Puzio; Tony Prophet; Robert Priddle; Michael Price; John B. Phillips; Robert Perciasepe; D. Louis Peoples; Robert Nordhaus; Walker Nolan; William A. Nitze; Kazutoshi Muramatsu; Ernest J. Moniz; Nancy C. Mohn; Callum McCarthy; Thomas R. Mason; Edward P. Martin; Jan W. Mares; James K. Malernee; S. David Freeman; Edwin Lupberger; Amory B. Lovins; Lynn LeMaster; Hoesung Lee; Lay, Kenneth; Lester Lave; Wilfrid L. Kohl; Soo Kyung Kim; Melanie Kenderdine; Paul L. Joskow; Ira H. Jolles; Frederick E. John; John Jimison; William W. Hogan; Robert A. Hefner, III; James K. Gray; Craig G. Goodman; Charles F. Goff, Jr.; Jerry D. Geist; Fritz Gautschi; Larry G. Garberding; Roger Gale; William Fulkerson; Stephen E. Frank; George Frampton; Juan Eibenschutz; Theodore R. Eck; Congressman John Dingell; Brian N. Dickie; William E. Dickenson; Etienne Deffarges; Wilfried Czernie; Loren C. Cox; Anne Cleary; Bernard H. Cherry; Red Cavaney; Ralph Cavanagh; Thomas R. Casten; Peter Bradford; Peter D. Blair; Ellen Berman; Roger A. Berliner; Michael L. Beatty; Vicky A. Bailey; Merribel S. Ayres; Catherine G. Abbott Subject:\tEnergy Deregulation - California State Auditor Report Attached is my report prepared on behalf of the  California State Auditor. I look forward to seeing you at The Aspen  Institute Energy Policy Forum. Charles J. Cicchetti Pacific Economics Group, LLC - ca report new.pdf ***********"
#The original message itself is very short: it just says "FYI", and most of the text is a forwarded message. We see the long list of recipients, and down at the very bottom is the message itself, "Attached is my report prepared on behalf of the California State Auditor." There is also an attached report.

#Our data set contains just the text of the emails and not the text of the attachments. It turns out, as we might expect, that this attachment had to do with Enron's electricity bids in California, and therefore this email is responsive to our query.
#We can check this in the value of the responsive variable.
emails$responsive[2]
## [1] 1
#We see that it is indeed a 1.

#Let's look at the breakdown of the number of emails that are responsive to our query.
# Responsive emails
table(emails$responsive)
## 
##   0   1 
## 716 139
#We see that the data set is unbalanced, with a relatively small proportion of emails responsive to the query. This is typical in predictive coding problems.
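#As a quick sanity check (not part of the lecture script), we can also express the class balance as proportions; only about 16% of the emails are responsive.
prop.table(table(emails$responsive))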

VIDEO 3: PRE-PROCESSING (R script reproduced here)

# Video 3


# Load tm package
library(tm)


#CREATING A CORPUS

#We will need to convert our emails to a corpus for pre-processing. Various functions in the tm package can be used to create a corpus in many different ways.
#We will create it from the email column of our data frame using two functions, Corpus() and VectorSource(). We feed to the latter the email variable of the emails data frame.

corpus = Corpus(VectorSource(emails$email))

#Let's take a look at corpus:
corpus
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 855
#Let's see the first email in our corpus
corpus[[1]]$content
## [1] "North America's integrated electricity market requires cooperation on environmental policies Commission for Environmental Cooperation releases working paper on North America's electricity market Montreal, 27 November 2001 -- The North American Commission for Environmental Cooperation (CEC) is releasing a working paper highlighting the trend towards increasing trade, competition and cross-border investment in electricity between Canada, Mexico and the United States. It is hoped that the working paper, Environmental Challenges and Opportunities in the Evolving North American Electricity Market, will stimulate public discussion around a CEC symposium of the same title about the need to coordinate environmental policies trinationally as a North America-wide electricity market develops. The CEC symposium will take place in San Diego on 29-30 November, and will bring together leading experts from industry, academia, NGOs and the governments of Canada, Mexico and the United States to consider the impact of the evolving continental electricity market on human health and the environment. \"Our goal [with the working paper and the symposium] is to highlight key environmental issues that must be addressed as the electricity markets in North America become more and more integrated,\" said Janine Ferretti, executive director of the CEC. \"We want to stimulate discussion around the important policy questions being raised so that countries can cooperate in their approach to energy and the environment.\" The CEC, an international organization created under an environmental side agreement to NAFTA known as the North American Agreement on Environmental Cooperation, was established to address regional environmental concerns, help prevent potential trade and environmental conflicts, and promote the effective enforcement of environmental law. The CEC Secretariat believes that greater North American cooperation on environmental policies regarding the continental electricity market is necessary to: *  protect air quality and mitigate climate change, *  minimize the possibility of environment-based trade disputes, *  ensure a dependable supply of reasonably priced electricity across North America *  avoid creation of pollution havens, and *  ensure local and national environmental measures remain effective. The Changing Market The working paper profiles the rapid changing North American electricity market. For example, in 2001, the US is projected to export 13.1 thousand gigawatt-hours (GWh) of electricity to Canada and Mexico. By 2007, this number is projected to grow to 16.9 thousand GWh of electricity. \"Over the past few decades, the North American electricity market has developed into a complex array of cross-border transactions and relationships,\" said Phil Sharp, former US congressman and chairman of the CEC's Electricity Advisory Board. \"We need to achieve this new level of cooperation in our environmental approaches as well.\" The Environmental Profile of the Electricity Sector The electricity sector is the single largest source of nationally reported toxins in the United States and Canada and a large source in Mexico. In the US, the electricity sector emits approximately 25 percent of all NOx emissions, roughly 35 percent of all CO2 emissions, 25 percent of all mercury emissions and almost 70 percent of SO2 emissions. These emissions have a large impact on airsheds, watersheds and migratory species corridors that are often shared between the three North American countries. 
\"We want to discuss the possible outcomes from greater efforts to coordinate federal, state or provincial environmental laws and policies that relate to the electricity sector,\" said Ferretti. \"How can we develop more compatible environmental approaches to help make domestic environmental policies more effective?\" The Effects of an Integrated Electricity Market One key issue raised in the paper is the effect of market integration on the competitiveness of particular fuels such as coal, natural gas or renewables. Fuel choice largely determines environmental impacts from a specific facility, along with pollution control technologies, performance standards and regulations. The paper highlights other impacts of a highly competitive market as well. For example, concerns about so called \"pollution havens\" arise when significant differences in environmental laws or enforcement practices induce power companies to locate their operations in jurisdictions with lower standards. \"The CEC Secretariat is exploring what additional environmental policies will work in this restructured market and how these policies can be adapted to ensure that they enhance competitiveness and benefit the entire region,\" said Sharp. Because trade rules and policy measures directly influence the variables that drive a successfully integrated North American electricity market, the working paper also addresses fuel choice, technology, pollution control strategies and subsidies. The CEC will use the information gathered during the discussion period to develop a final report that will be submitted to the Council in early 2002. For more information or to view the live video webcast of the symposium, please go to: http://www.cec.org/electricity. You may download the working paper and other supporting documents from: http://www.cec.org/programs_projects/other_initiatives/electricity/docs.cfm?varlan=english. Commission for Environmental Cooperation 393, rue St-Jacques Ouest, Bureau 200 Montr<U+00C3><U+00A9>al (Qu<U+00C3><U+00A9>bec) Canada H2Y 1N9 Tel: (514) 350-4300; Fax: (514) 350-4314 E-mail: info@ccemtl.org ***********"
#or
strwrap(corpus[[1]])
##  [1] "North America's integrated electricity market requires cooperation"                         
##  [2] "on environmental policies Commission for Environmental Cooperation"                         
##  [3] "releases working paper on North America's electricity market"                               
##  [4] "Montreal, 27 November 2001 -- The North American Commission for"                            
##  [5] "Environmental Cooperation (CEC) is releasing a working paper"                               
##  [6] "highlighting the trend towards increasing trade, competition and"                           
##  [7] "cross-border investment in electricity between Canada, Mexico and"                          
##  [8] "the United States. It is hoped that the working paper,"                                     
##  [9] "Environmental Challenges and Opportunities in the Evolving North"                           
## [10] "American Electricity Market, will stimulate public discussion"                              
## [11] "around a CEC symposium of the same title about the need to"                                 
## [12] "coordinate environmental policies trinationally as a North"                                 
## [13] "America-wide electricity market develops. The CEC symposium will"                           
## [14] "take place in San Diego on 29-30 November, and will bring together"                         
## [15] "leading experts from industry, academia, NGOs and the governments"                          
## [16] "of Canada, Mexico and the United States to consider the impact of"                          
## [17] "the evolving continental electricity market on human health and"                            
## [18] "the environment. \"Our goal [with the working paper and the"                                
## [19] "symposium] is to highlight key environmental issues that must be"                           
## [20] "addressed as the electricity markets in North America become more"                          
## [21] "and more integrated,\" said Janine Ferretti, executive director of"                         
## [22] "the CEC. \"We want to stimulate discussion around the important"                            
## [23] "policy questions being raised so that countries can cooperate in"                           
## [24] "their approach to energy and the environment.\" The CEC, an"                                
## [25] "international organization created under an environmental side"                             
## [26] "agreement to NAFTA known as the North American Agreement on"                                
## [27] "Environmental Cooperation, was established to address regional"                             
## [28] "environmental concerns, help prevent potential trade and"                                   
## [29] "environmental conflicts, and promote the effective enforcement of"                          
## [30] "environmental law. The CEC Secretariat believes that greater North"                         
## [31] "American cooperation on environmental policies regarding the"                               
## [32] "continental electricity market is necessary to: * protect air"                              
## [33] "quality and mitigate climate change, * minimize the possibility of"                         
## [34] "environment-based trade disputes, * ensure a dependable supply of"                          
## [35] "reasonably priced electricity across North America * avoid"                                 
## [36] "creation of pollution havens, and * ensure local and national"                              
## [37] "environmental measures remain effective. The Changing Market The"                           
## [38] "working paper profiles the rapid changing North American"                                   
## [39] "electricity market. For example, in 2001, the US is projected to"                           
## [40] "export 13.1 thousand gigawatt-hours (GWh) of electricity to Canada"                         
## [41] "and Mexico. By 2007, this number is projected to grow to 16.9"                              
## [42] "thousand GWh of electricity. \"Over the past few decades, the North"                        
## [43] "American electricity market has developed into a complex array of"                          
## [44] "cross-border transactions and relationships,\" said Phil Sharp,"                            
## [45] "former US congressman and chairman of the CEC's Electricity"                                
## [46] "Advisory Board. \"We need to achieve this new level of cooperation"                         
## [47] "in our environmental approaches as well.\" The Environmental"                               
## [48] "Profile of the Electricity Sector The electricity sector is the"                            
## [49] "single largest source of nationally reported toxins in the United"                          
## [50] "States and Canada and a large source in Mexico. In the US, the"                             
## [51] "electricity sector emits approximately 25 percent of all NOx"                               
## [52] "emissions, roughly 35 percent of all CO2 emissions, 25 percent of"                          
## [53] "all mercury emissions and almost 70 percent of SO2 emissions."                              
## [54] "These emissions have a large impact on airsheds, watersheds and"                            
## [55] "migratory species corridors that are often shared between the"                              
## [56] "three North American countries. \"We want to discuss the possible"                          
## [57] "outcomes from greater efforts to coordinate federal, state or"                              
## [58] "provincial environmental laws and policies that relate to the"                              
## [59] "electricity sector,\" said Ferretti. \"How can we develop more"                             
## [60] "compatible environmental approaches to help make domestic"                                  
## [61] "environmental policies more effective?\" The Effects of an"                                 
## [62] "Integrated Electricity Market One key issue raised in the paper is"                         
## [63] "the effect of market integration on the competitiveness of"                                 
## [64] "particular fuels such as coal, natural gas or renewables. Fuel"                             
## [65] "choice largely determines environmental impacts from a specific"                            
## [66] "facility, along with pollution control technologies, performance"                           
## [67] "standards and regulations. The paper highlights other impacts of a"                         
## [68] "highly competitive market as well. For example, concerns about so"                          
## [69] "called \"pollution havens\" arise when significant differences in"                          
## [70] "environmental laws or enforcement practices induce power companies"                         
## [71] "to locate their operations in jurisdictions with lower standards."                          
## [72] "\"The CEC Secretariat is exploring what additional environmental"                           
## [73] "policies will work in this restructured market and how these"                               
## [74] "policies can be adapted to ensure that they enhance"                                        
## [75] "competitiveness and benefit the entire region,\" said Sharp."                               
## [76] "Because trade rules and policy measures directly influence the"                             
## [77] "variables that drive a successfully integrated North American"                              
## [78] "electricity market, the working paper also addresses fuel choice,"                          
## [79] "technology, pollution control strategies and subsidies. The CEC"                            
## [80] "will use the information gathered during the discussion period to"                          
## [81] "develop a final report that will be submitted to the Council in"                            
## [82] "early 2002. For more information or to view the live video webcast"                         
## [83] "of the symposium, please go to: http://www.cec.org/electricity."                            
## [84] "You may download the working paper and other supporting documents"                          
## [85] "from:"                                                                                      
## [86] "http://www.cec.org/programs_projects/other_initiatives/electricity/docs.cfm?varlan=english."
## [87] "Commission for Environmental Cooperation 393, rue St-Jacques"                               
## [88] "Ouest, Bureau 200 Montr<U+00C3><U+00A9>al (Qu<U+00C3><U+00A9>bec) Canada H2Y 1N9 Tel: (514)"                            
## [89] "350-4300; Fax: (514) 350-4314 E-mail: info@ccemtl.org ***********"
# Pre-process data

#IMPORTANT NOTE: If you are using the latest version of the tm package, you will need to run the following line before continuing (it converts corpus to a Plain Text Document). This is a recent change having to do with the tolower function that occurred after this video was recorded.
corpus = tm_map(corpus, PlainTextDocument)


#Converting text to lower case

#We use the tm_map() function which takes as its first argument the name of a corpus and as second argument a function performing the transformation that we want to apply to the text.

#To transform all text to lower case:
corpus = tm_map(corpus, content_transformer(tolower))

#Checking the same  "documents" as before:
corpus[[1]]$content
## [1] "north america's integrated electricity market requires cooperation on environmental policies commission for environmental cooperation releases working paper on north america's electricity market montreal, 27 november 2001 -- the north american commission for environmental cooperation (cec) is releasing a working paper highlighting the trend towards increasing trade, competition and cross-border investment in electricity between canada, mexico and the united states. it is hoped that the working paper, environmental challenges and opportunities in the evolving north american electricity market, will stimulate public discussion around a cec symposium of the same title about the need to coordinate environmental policies trinationally as a north america-wide electricity market develops. the cec symposium will take place in san diego on 29-30 november, and will bring together leading experts from industry, academia, ngos and the governments of canada, mexico and the united states to consider the impact of the evolving continental electricity market on human health and the environment. \"our goal [with the working paper and the symposium] is to highlight key environmental issues that must be addressed as the electricity markets in north america become more and more integrated,\" said janine ferretti, executive director of the cec. \"we want to stimulate discussion around the important policy questions being raised so that countries can cooperate in their approach to energy and the environment.\" the cec, an international organization created under an environmental side agreement to nafta known as the north american agreement on environmental cooperation, was established to address regional environmental concerns, help prevent potential trade and environmental conflicts, and promote the effective enforcement of environmental law. the cec secretariat believes that greater north american cooperation on environmental policies regarding the continental electricity market is necessary to: *  protect air quality and mitigate climate change, *  minimize the possibility of environment-based trade disputes, *  ensure a dependable supply of reasonably priced electricity across north america *  avoid creation of pollution havens, and *  ensure local and national environmental measures remain effective. the changing market the working paper profiles the rapid changing north american electricity market. for example, in 2001, the us is projected to export 13.1 thousand gigawatt-hours (gwh) of electricity to canada and mexico. by 2007, this number is projected to grow to 16.9 thousand gwh of electricity. \"over the past few decades, the north american electricity market has developed into a complex array of cross-border transactions and relationships,\" said phil sharp, former us congressman and chairman of the cec's electricity advisory board. \"we need to achieve this new level of cooperation in our environmental approaches as well.\" the environmental profile of the electricity sector the electricity sector is the single largest source of nationally reported toxins in the united states and canada and a large source in mexico. in the us, the electricity sector emits approximately 25 percent of all nox emissions, roughly 35 percent of all co2 emissions, 25 percent of all mercury emissions and almost 70 percent of so2 emissions. these emissions have a large impact on airsheds, watersheds and migratory species corridors that are often shared between the three north american countries. 
\"we want to discuss the possible outcomes from greater efforts to coordinate federal, state or provincial environmental laws and policies that relate to the electricity sector,\" said ferretti. \"how can we develop more compatible environmental approaches to help make domestic environmental policies more effective?\" the effects of an integrated electricity market one key issue raised in the paper is the effect of market integration on the competitiveness of particular fuels such as coal, natural gas or renewables. fuel choice largely determines environmental impacts from a specific facility, along with pollution control technologies, performance standards and regulations. the paper highlights other impacts of a highly competitive market as well. for example, concerns about so called \"pollution havens\" arise when significant differences in environmental laws or enforcement practices induce power companies to locate their operations in jurisdictions with lower standards. \"the cec secretariat is exploring what additional environmental policies will work in this restructured market and how these policies can be adapted to ensure that they enhance competitiveness and benefit the entire region,\" said sharp. because trade rules and policy measures directly influence the variables that drive a successfully integrated north american electricity market, the working paper also addresses fuel choice, technology, pollution control strategies and subsidies. the cec will use the information gathered during the discussion period to develop a final report that will be submitted to the council in early 2002. for more information or to view the live video webcast of the symposium, please go to: http://www.cec.org/electricity. you may download the working paper and other supporting documents from: http://www.cec.org/programs_projects/other_initiatives/electricity/docs.cfm?varlan=english. commission for environmental cooperation 393, rue st-jacques ouest, bureau 200 montr<U+00E3><U+00A9>al (qu<U+00E3><U+00A9>bec) canada h2y 1n9 tel: (514) 350-4300; fax: (514) 350-4314 e-mail: info@ccemtl.org ***********"
#Removing punctuation
corpus = tm_map(corpus, removePunctuation)
corpus[[1]]$content
## [1] "north americas integrated electricity market requires cooperation on environmental policies commission for environmental cooperation releases working paper on north americas electricity market montreal 27 november 2001  the north american commission for environmental cooperation cec is releasing a working paper highlighting the trend towards increasing trade competition and crossborder investment in electricity between canada mexico and the united states it is hoped that the working paper environmental challenges and opportunities in the evolving north american electricity market will stimulate public discussion around a cec symposium of the same title about the need to coordinate environmental policies trinationally as a north americawide electricity market develops the cec symposium will take place in san diego on 2930 november and will bring together leading experts from industry academia ngos and the governments of canada mexico and the united states to consider the impact of the evolving continental electricity market on human health and the environment our goal with the working paper and the symposium is to highlight key environmental issues that must be addressed as the electricity markets in north america become more and more integrated said janine ferretti executive director of the cec we want to stimulate discussion around the important policy questions being raised so that countries can cooperate in their approach to energy and the environment the cec an international organization created under an environmental side agreement to nafta known as the north american agreement on environmental cooperation was established to address regional environmental concerns help prevent potential trade and environmental conflicts and promote the effective enforcement of environmental law the cec secretariat believes that greater north american cooperation on environmental policies regarding the continental electricity market is necessary to   protect air quality and mitigate climate change   minimize the possibility of environmentbased trade disputes   ensure a dependable supply of reasonably priced electricity across north america   avoid creation of pollution havens and   ensure local and national environmental measures remain effective the changing market the working paper profiles the rapid changing north american electricity market for example in 2001 the us is projected to export 131 thousand gigawatthours gwh of electricity to canada and mexico by 2007 this number is projected to grow to 169 thousand gwh of electricity over the past few decades the north american electricity market has developed into a complex array of crossborder transactions and relationships said phil sharp former us congressman and chairman of the cecs electricity advisory board we need to achieve this new level of cooperation in our environmental approaches as well the environmental profile of the electricity sector the electricity sector is the single largest source of nationally reported toxins in the united states and canada and a large source in mexico in the us the electricity sector emits approximately 25 percent of all nox emissions roughly 35 percent of all co2 emissions 25 percent of all mercury emissions and almost 70 percent of so2 emissions these emissions have a large impact on airsheds watersheds and migratory species corridors that are often shared between the three north american countries we want to discuss the possible outcomes from greater efforts to coordinate federal state or provincial 
environmental laws and policies that relate to the electricity sector said ferretti how can we develop more compatible environmental approaches to help make domestic environmental policies more effective the effects of an integrated electricity market one key issue raised in the paper is the effect of market integration on the competitiveness of particular fuels such as coal natural gas or renewables fuel choice largely determines environmental impacts from a specific facility along with pollution control technologies performance standards and regulations the paper highlights other impacts of a highly competitive market as well for example concerns about so called pollution havens arise when significant differences in environmental laws or enforcement practices induce power companies to locate their operations in jurisdictions with lower standards the cec secretariat is exploring what additional environmental policies will work in this restructured market and how these policies can be adapted to ensure that they enhance competitiveness and benefit the entire region said sharp because trade rules and policy measures directly influence the variables that drive a successfully integrated north american electricity market the working paper also addresses fuel choice technology pollution control strategies and subsidies the cec will use the information gathered during the discussion period to develop a final report that will be submitted to the council in early 2002 for more information or to view the live video webcast of the symposium please go to httpwwwcecorgelectricity you may download the working paper and other supporting documents from httpwwwcecorgprogramsprojectsotherinitiativeselectricitydocscfmvarlanenglish commission for environmental cooperation 393 rue stjacques ouest bureau 200 montr<U+00E3>al qu<U+00E3>bec canada h2y 1n9 tel 514 3504300 fax 514 3504314 email infoccemtlorg "
#Removing stop words

#Removing stop words is done by passing removeWords to the tm_map() function, together with an extra argument specifying which stop words to remove; here we simply use the English list provided by the tm package.
#We remove all of these English stop words because they are so common that they are unlikely to be useful in our prediction problem.
corpus = tm_map(corpus, removeWords, stopwords("english"))
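#For reference (not shown in the lecture), stopwords("english") is just a character vector of very common English words, so we can inspect it directly:
length(stopwords("english"))
head(stopwords("english"))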

#Stemming

#Lastly, we want to stem our documents, which we do by passing stemDocument to tm_map().
corpus = tm_map(corpus, stemDocument)
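#To see what stemming does on a few isolated words (a small illustration, not part of the lecture script), we can call wordStem() from the SnowballC package, which stemDocument uses under the hood; all four words below reduce to the common stem "argu".
library(SnowballC)
wordStem(c("argue", "argued", "argues", "arguing"))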

# Now that we have gone through those four preprocessing steps, we can take a second look at the first email in the corpus.
corpus[[1]]$content
## [1] "north america integr electr market requir cooper  environment polici commiss  environment cooper releas work paper  north america electr market montreal 27 novemb 2001   north american commiss  environment cooper cec  releas  work paper highlight  trend toward increas trade competit  crossbord invest  electr  canada mexico   unit state   hope   work paper environment challeng  opportun   evolv north american electr market will stimul public discuss around  cec symposium    titl   need  coordin environment polici trinat   north americawid electr market develop  cec symposium will take place  san diego  2930 novemb  will bring togeth lead expert  industri academia ngos   govern  canada mexico   unit state  consid  impact   evolv continent electr market  human health   environ  goal   work paper   symposium   highlight key environment issu  must  address   electr market  north america becom    integr said janin ferretti execut director   cec  want  stimul discuss around  import polici question  rais   countri can cooper   approach  energi   environ  cec  intern organ creat   environment side agreement  nafta known   north american agreement  environment cooper  establish  address region environment concern help prevent potenti trade  environment conflict  promot  effect enforc  environment law  cec secretariat believ  greater north american cooper  environment polici regard  continent electr market  necessari    protect air qualiti  mitig climat chang   minim  possibl  environmentbas trade disput   ensur  depend suppli  reason price electr across north america   avoid creation  pollut haven    ensur local  nation environment measur remain effect  chang market  work paper profil  rapid chang north american electr market  exampl  2001  us  project  export 131 thousand gigawatthour gwh  electr  canada  mexico  2007  number  project  grow  169 thousand gwh  electr   past  decad  north american electr market  develop   complex array  crossbord transact  relationship said phil sharp former us congressman  chairman   cec electr advisori board  need  achiev  new level  cooper   environment approach  well  environment profil   electr sector  electr sector   singl largest sourc  nation report toxin   unit state  canada   larg sourc  mexico   us  electr sector emit approxim 25 percent   nox emiss rough 35 percent   co2 emiss 25 percent   mercuri emiss  almost 70 percent  so2 emiss  emiss   larg impact  airsh watersh  migratori speci corridor   often share   three north american countri  want  discuss  possibl outcom  greater effort  coordin feder state  provinci environment law  polici  relat   electr sector said ferretti  can  develop  compat environment approach  help make domest environment polici  effect  effect   integr electr market one key issu rais   paper   effect  market integr   competit  particular fuel   coal natur gas  renew fuel choic larg determin environment impact   specif facil along  pollut control technolog perform standard  regul  paper highlight  impact   high competit market  well  exampl concern   call pollut haven aris  signific differ  environment law  enforc practic induc power compani  locat  oper  jurisdict  lower standard  cec secretariat  explor  addit environment polici will work   restructur market    polici can  adapt  ensur   enhanc competit  benefit  entir region said sharp  trade rule  polici measur direct influenc  variabl  drive  success integr north american electr market  work paper also address fuel choic technolog pollut control strategi  subsidi  cec 
will use  inform gather   discuss period  develop  final report  will  submit   council  earli 2002   inform   view  live video webcast   symposium pleas go  httpwwwcecorgelectr  may download  work paper   support document  httpwwwcecorgprogramsprojectsotherinitiativeselectricitydocscfmvarlanenglish commiss  environment cooper 393 rue stjacqu ouest bureau 200 montrcal qucbec canada h2i 1n9 tel 514 3504300 fax 514 3504314 email infoccemtlorg"
#or
strwrap(corpus[[1]])
##  [1] "north america integr electr market requir cooper environment"                
##  [2] "polici commiss environment cooper releas work paper north america"           
##  [3] "electr market montreal 27 novemb 2001 north american commiss"                
##  [4] "environment cooper cec releas work paper highlight trend toward"             
##  [5] "increas trade competit crossbord invest electr canada mexico unit"           
##  [6] "state hope work paper environment challeng opportun evolv north"             
##  [7] "american electr market will stimul public discuss around cec"                
##  [8] "symposium titl need coordin environment polici trinat north"                 
##  [9] "americawid electr market develop cec symposium will take place san"          
## [10] "diego 2930 novemb will bring togeth lead expert industri academia"           
## [11] "ngos govern canada mexico unit state consid impact evolv continent"          
## [12] "electr market human health environ goal work paper symposium"                
## [13] "highlight key environment issu must address electr market north"             
## [14] "america becom integr said janin ferretti execut director cec want"           
## [15] "stimul discuss around import polici question rais countri can"               
## [16] "cooper approach energi environ cec intern organ creat environment"           
## [17] "side agreement nafta known north american agreement environment"             
## [18] "cooper establish address region environment concern help prevent"            
## [19] "potenti trade environment conflict promot effect enforc"                     
## [20] "environment law cec secretariat believ greater north american"               
## [21] "cooper environment polici regard continent electr market necessari"          
## [22] "protect air qualiti mitig climat chang minim possibl"                        
## [23] "environmentbas trade disput ensur depend suppli reason price"                
## [24] "electr across north america avoid creation pollut haven ensur"               
## [25] "local nation environment measur remain effect chang market work"             
## [26] "paper profil rapid chang north american electr market exampl 2001"           
## [27] "us project export 131 thousand gigawatthour gwh electr canada"               
## [28] "mexico 2007 number project grow 169 thousand gwh electr past decad"          
## [29] "north american electr market develop complex array crossbord"                
## [30] "transact relationship said phil sharp former us congressman"                 
## [31] "chairman cec electr advisori board need achiev new level cooper"             
## [32] "environment approach well environment profil electr sector electr"           
## [33] "sector singl largest sourc nation report toxin unit state canada"            
## [34] "larg sourc mexico us electr sector emit approxim 25 percent nox"             
## [35] "emiss rough 35 percent co2 emiss 25 percent mercuri emiss almost"            
## [36] "70 percent so2 emiss emiss larg impact airsh watersh migratori"              
## [37] "speci corridor often share three north american countri want"                
## [38] "discuss possibl outcom greater effort coordin feder state provinci"          
## [39] "environment law polici relat electr sector said ferretti can"                
## [40] "develop compat environment approach help make domest environment"            
## [41] "polici effect effect integr electr market one key issu rais paper"           
## [42] "effect market integr competit particular fuel coal natur gas renew"          
## [43] "fuel choic larg determin environment impact specif facil along"              
## [44] "pollut control technolog perform standard regul paper highlight"             
## [45] "impact high competit market well exampl concern call pollut haven"           
## [46] "aris signific differ environment law enforc practic induc power"             
## [47] "compani locat oper jurisdict lower standard cec secretariat explor"          
## [48] "addit environment polici will work restructur market polici can"             
## [49] "adapt ensur enhanc competit benefit entir region said sharp trade"           
## [50] "rule polici measur direct influenc variabl drive success integr"             
## [51] "north american electr market work paper also address fuel choic"             
## [52] "technolog pollut control strategi subsidi cec will use inform"               
## [53] "gather discuss period develop final report will submit council"              
## [54] "earli 2002 inform view live video webcast symposium pleas go"                
## [55] "httpwwwcecorgelectr may download work paper support document"                
## [56] "httpwwwcecorgprogramsprojectsotherinitiativeselectricitydocscfmvarlanenglish"
## [57] "commiss environment cooper 393 rue stjacqu ouest bureau 200"                 
## [58] "montrcal qucbec canada h2i 1n9 tel 514 3504300 fax 514 3504314"              
## [59] "email infoccemtlorg"
#It looks quite a bit different now. It is a lot harder to read now that we have removed the stop words and punctuation and stemmed the remaining words, but the emails in this corpus are now ready for our machine learning algorithms.

VIDEO 4: BAG OF WORDS (R script reproduced here)

# Video 4

#Create a Document Term Matrix

corpus = tm_map(corpus, PlainTextDocument)
#We are now ready to extract the word frequencies to be used in our prediction problem. The tm package provides a function called DocumentTermMatrix() that generates a matrix where:
#the rows correspond to documents, in our case emails, and
#the columns correspond to the words appearing in those emails.
#The values in the matrix are the number of times each word appears in each document.
dtm = DocumentTermMatrix(corpus)
dtm
## <<DocumentTermMatrix (documents: 855, terms: 22162)>>
## Non-/sparse entries: 102858/18845652
## Sparsity           : 99%
## Maximal term length: 156
## Weighting          : term frequency (tf)
#What we can see is that even though we have only 855 emails in the corpus, we have 22,162 terms that showed up at least once, which is clearly too many variables for the number of observations we have.
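#One quick way to get a feel for the vocabulary (not shown in the lecture) is findFreqTerms(), which lists the terms appearing at least a given number of times across the corpus; the lowfreq value of 200 is just an illustrative choice.
findFreqTerms(dtm, lowfreq = 200)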

#So we want to remove the terms that don't appear very often in our data set.
# Remove sparse terms
dtm = removeSparseTerms(dtm, 0.97)
dtm
## <<DocumentTermMatrix (documents: 855, terms: 788)>>
## Non-/sparse entries: 51612/622128
## Sparsity           : 92%
## Maximal term length: 19
## Weighting          : term frequency (tf)
#We can see that we have decreased the number of terms to 788, which is a much more reasonable number.
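#The 0.97 argument means that a term is kept only if it appears in at least roughly 3% of the documents; with 855 emails that works out to about 26 of them (a quick back-of-the-envelope check, not in the lecture).
ceiling((1 - 0.97) * 855)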

# Create data frame from dtm

#Let's convert the sparse matrix into a data frame that we will be able to use for our predictive models.
labeledTerms = as.data.frame(as.matrix(dtm))

#To make all variable names R-friendly use:
colnames(labeledTerms) <- make.names(colnames(labeledTerms))

# We also have to add back in the outcome variable
labeledTerms$responsive = emails$responsive

#str(labeledTerms)

#The data frame contains an awful lot of variables, 789 in total, of which 788 are the frequencies of various words in the emails, and the last one is responsive, i.e. the outcome variable

VIDEO 5: BUILDING MODELS (R script reproduced here)

# Video 5


#Split data in training/testing sets
#Let's split our data into a training set and a testing set, putting 70% of the data in the training set.

library(caTools)
set.seed(144)

spl = sample.split(labeledTerms$responsive, 0.7)

train = subset(labeledTerms, spl == TRUE)
test = subset(labeledTerms, spl == FALSE)
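#sample.split() preserves the relative ratio of the outcome classes in the two sets, so both should keep roughly the same ~16% proportion of responsive emails; a quick check (not in the lecture):
prop.table(table(train$responsive))
prop.table(table(test$responsive))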

# Build a CART model
#Now we are ready to build the model, and we will build a simple CART model using the default parameters. A random forest would be another good choice from our toolset.

library(rpart)
library(rpart.plot)

#Let's  use CART to build a predictive model, using the rpart() function to predict responsive using all of the other variables as our independent variables and the data set train.
emailCART = rpart(responsive~., data=train, method="class")
prp(emailCART)

#We see that at the very top of the tree is the word californ (the stem of California).
#If californ appears at least twice in an email, we take the right branch and predict that the document is responsive.
#It is somewhat unsurprising that California shows up, because we know that Enron had a heavy involvement in the California energy markets.
#Further down the tree, we see a number of other terms that we could plausibly expect to be related to energy bids and energy scheduling, like system, demand, bid, and gas.
#Down at the bottom is jeff, which is perhaps a reference to Enron's CEO, Jeff Skilling, who was eventually jailed for his involvement in the fraud at the company.
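#If the plot is hard to read, we can also list the variables that the fitted tree actually uses in its splits (a quick check, not in the lecture); rows marked "<leaf>" are terminal nodes.
unique(as.character(emailCART$frame$var))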

VIDEO 6: EVALUATING THE MODEL (R script reproduced here)

# Video 6

# Out-of-Sample Performance of the Model (Make predictions on the test set)
#Now that we have trained a model, we need to evaluate it on the test set.
#We build an object pred that holds the predicted probabilities for each class from our CART model, by using the predict() function on the model emailCART with the test data, newdata = test.

pred = predict(emailCART, newdata=test)

#This new object gives us the predicted probabilities on the test set. We can look at the first 10 rows with
pred[1:10,]
##                        0          1
## character(0)   0.2156863 0.78431373
## character(0).1 0.9557522 0.04424779
## character(0).2 0.9557522 0.04424779
## character(0).3 0.8125000 0.18750000
## character(0).4 0.4000000 0.60000000
## character(0).5 0.9557522 0.04424779
## character(0).6 0.9557522 0.04424779
## character(0).7 0.9557522 0.04424779
## character(0).8 0.1250000 0.87500000
## character(0).9 0.1250000 0.87500000
#The first column is the predicted probability of the document being non-responsive.
#The second column is the predicted probability of the document being responsive.
#They sum to 1.

#In our case we are interested in the predicted probability of the document being responsive, and it is convenient to handle that as a separate variable.
pred.prob = pred[,2]
#This new object contains our test set predicted probabilities.

#We are interested in the accuracy of our model on the test set, i.e. out-of-sample. 
#First we compute the confusion matrix:

cmat_CART<-table(test$responsive, pred.prob >= 0.5)
cmat_CART
##    
##     FALSE TRUE
##   0   195   20
##   1    17   25
#Compute accuracy
accu_CART <- (cmat_CART[1,1] + cmat_CART[2,2])/sum(cmat_CART)
accu_CART #(195+25)/(195+25+17+20)=0.8560311
## [1] 0.8560311
#Overall accuracy of this CART model is 0.856
#Sensitivity = TP rate = 25 / 42 = 0.595
#Specificity = (1 - FP rate) = 195 / 215 = 0.907
#FP rate = 20 / 215 = 0.093
#FN rate = 17 / 42 = 0.405
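#The same rates can be computed directly from the confusion matrix (a small helper block, not in the lecture); rows of cmat_CART are the true classes and columns are the predictions.
sens_CART <- cmat_CART[2,2] / sum(cmat_CART[2,])   # TP / (TP + FN) = 25/42
spec_CART <- cmat_CART[1,1] / sum(cmat_CART[1,])   # TN / (TN + FP) = 195/215
sens_CART
spec_CART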

#Comparison with the baseline model

#Let's compare this to a simple baseline model that always predicts non-responsive (i.e. the most common value of the dependent variable).
#To compute the accuracy of the baseline model, let's make a table of just the outcome variable responsive.

cmat_baseline <-table(test$responsive)
cmat_baseline 
## 
##   0   1 
## 215  42
#Baseline accuracy
accu_baseline <- max(cmat_baseline)/sum(cmat_baseline)
accu_baseline #215/(215+42)=0.8365759
## [1] 0.8365759
#The accuracy of the baseline model is then 0.8366.

#We see just a small improvement in accuracy using the CART model, which is a common case in unbalanced data sets.
#However, as in most document retrieval applications, there are uneven costs for different types of errors here.
#Typically, a human will still have to manually review all of the predicted responsive documents to make sure they are actually responsive.
#Therefore:
#If we have a false positive, i.e. a non-responsive document labeled as responsive, the mistake translates to a bit of additional work in the manual review process but no further harm, since the manual review process will remove this erroneous result.
#On the other hand, if we have a false negative, i.e. a responsive document labeled as non-responsive by our model, we will miss the document entirely in our predictive coding process.

#Therefore, we are going to assign a higher cost to false negatives than to false positives, which makes this a good time to look at other cut-offs on our ROC curve.
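#An alternative way to act on these asymmetric costs at training time (a sketch only, not what the lecture does; the lecture instead adjusts the cutoff) is rpart's loss matrix, whose rows are the true class (0, 1) and columns the predicted class. Here a false negative is penalized five times as much as a false positive; the factor of 5 is purely illustrative.
emailCARTcost = rpart(responsive ~ ., data = train, method = "class",
                      parms = list(loss = matrix(c(0, 5, 1, 0), nrow = 2)))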

VIDEO 7: THE ROC CURVE

# Video 7

# ROC curve

library(ROCR)

##Let's evaluate our model using the ROC curve.
#Looking at the ROC curve lets us understand the performance of our model at different cutoffs.

#To plot the ROC curve we use the performance() function to extract the true positive rate and false positive rate.

##First we use the prediction() function, whose first argument is pred.prob (the second column of pred) and whose second argument is the vector of true outcome values, test$responsive.
predROCR = prediction(pred.prob, test$responsive)

##We pass the output of prediction() to performance() to which we give also two arguments for what we want on the X and Y axes of our ROC curve, true positive rate and false positive rate.
perfROCR = performance(predROCR, "tpr", "fpr")

#We then plot the ROC curve, with the option that color-codes the different cutoff thresholds.
plot(perfROCR, colorize=TRUE)

#The best cutoff to select is entirely dependent on the costs assigned by the decision maker to false positives and true positives.
#However, we do favor cutoffs that give us a high sensitivity, i.e. we want to identify a large number of the responsive documents.

#Therefore a choice that might look promising could be in the part of the curve where it becomes flatter (going towards the right), where we have a true positive rate of around 70% (meaning that we're getting about 70% of all the responsive documents), and a false positive rate of about 20% (meaning that we are making mistakes and accidentally identifying as responsive 20% of the non-responsive documents.)

#Since, typically, the vast majority of documents are non-responsive, operating at this cutoff would result in a large decrease in the amount of manual effort needed in the eDiscovery process.

#From the blue color of the plot at this particular location we can infer that we are looking at a threshold of around 0.15 or so, significantly lower than 0.5, which is what we would expect since we prefer false positives over false negatives.
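#To read the thresholds directly off the curve instead of inferring them from the colour scale, ROCR's plot method can also print selected cutoff values along the curve (an optional extra, not in the lecture script):
plot(perfROCR, colorize=TRUE, print.cutoffs.at=seq(0,1,0.1), text.adj=c(-0.2,1.7))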


#Compute Area Under the Curve (AUC)
performance(predROCR, "auc")@y.values
## [[1]]
## [1] 0.7936323
#The AUC of the CART models is 0.7936, which means that our model can differentiate between a randomly selected responsive and non-responsive document about 79.4% of the time.

Assignment 5

DETECTING VANDALISM ON WIKIPEDIA

Wikipedia is a free online encyclopedia that anyone can edit and contribute to. It is available in many languages and is growing all the time. On the English language version of Wikipedia:

  • There are currently 4.7 million pages.
  • There have been a total of over 760 million edits (also called revisions) over its lifetime.
  • There are approximately 130,000 edits per day.

One of the consequences of being editable by anyone is that some people vandalize pages. This can take the form of removing content, adding promotional or inappropriate content, or more subtle shifts that change the meaning of the article. With this many articles and edits per day it is difficult for humans to detect all instances of vandalism and revert (undo) them. As a result, Wikipedia uses bots - computer programs that automatically revert edits that look like vandalism. In this assignment we will attempt to develop a vandalism detector that uses machine learning to distinguish between a valid edit and vandalism.

The data for this problem is based on the revision history of the page Language. Wikipedia provides a history for each page that consists of the state of the page at each revision. Rather than manually considering each revision, a script was run that checked whether edits stayed or were reverted. If a change was eventually reverted then that revision is marked as vandalism. This may result in some misclassifications, but the script performs well enough for our needs.

As a result of this preprocessing, some common processing tasks have already been done, including lower-casing and punctuation removal. The columns in the dataset are:

  • Vandal = 1 if this edit was vandalism, 0 if not.
  • Minor = 1 if the user marked this edit as a “minor edit”, 0 if not.
  • Loggedin = 1 if the user made this edit while using a Wikipedia account, 0 if they did not.
  • Added = The unique words added.
  • Removed = The unique words removed.

Notice the repeated use of unique. The data we have available is not the traditional bag of words - rather it is the set of words that were removed or added. For example, if a word was removed multiple times in a revision it will only appear one time in the “Removed” column.

#PROBLEM 1.1 - BAGS OF WORDS  

#Load the data wiki.csv with the option stringsAsFactors=FALSE, calling the data frame "wiki". Convert the "Vandal" column to a factor using the command wiki$Vandal = as.factor(wiki$Vandal).

#LOADING AND PROCESSING DATA IN R
wiki<-read.csv("wiki.csv",stringsAsFactors=FALSE)
str(wiki)
## 'data.frame':    3876 obs. of  7 variables:
##  $ X.1     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ X       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Vandal  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Minor   : int  1 1 0 1 1 0 0 0 1 0 ...
##  $ Loggedin: int  1 1 1 0 1 1 1 1 1 0 ...
##  $ Added   : chr  "  represent psycholinguisticspsycholinguistics orthographyorthography help text all actions through human ethnologue relationsh"| __truncated__ " website external links" " " " afghanistan used iran mostly that farsiis is countries some xmlspacepreservepersian parts tajikestan region" ...
##  $ Removed : chr  " " " talklanguagetalk" " regarded as technologytechnologies human first" "  represent psycholinguisticspsycholinguistics orthographyorthography help all actions through ethnologue relationships linguis"| __truncated__ ...
#We have 3876 observations of 7 variables

#Convert the "Vandal" column to a factor
wiki$Vandal = as.factor(wiki$Vandal)
str(wiki)
## 'data.frame':    3876 obs. of  7 variables:
##  $ X.1     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ X       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Vandal  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Minor   : int  1 1 0 1 1 0 0 0 1 0 ...
##  $ Loggedin: int  1 1 1 0 1 1 1 1 1 0 ...
##  $ Added   : chr  "  represent psycholinguisticspsycholinguistics orthographyorthography help text all actions through human ethnologue relationsh"| __truncated__ " website external links" " " " afghanistan used iran mostly that farsiis is countries some xmlspacepreservepersian parts tajikestan region" ...
##  $ Removed : chr  " " " talklanguagetalk" " regarded as technologytechnologies human first" "  represent psycholinguisticspsycholinguistics orthographyorthography help all actions through ethnologue relationships linguis"| __truncated__ ...
#How many cases of vandalism were detected in the history of this page?
table(wiki$Vandal)
## 
##    0    1 
## 2061 1815
#Ans:1815
#EXPLANATION:There are 1815 observations with value 1, which denotes vandalism.

##############################

#PROBLEM 1.2 - BAGS OF WORDS  

#We will now use the bag of words approach to build a model. We have two columns of textual data, with different meanings. For example, adding rude words has a different meaning to removing rude words. We'll start like we did in class by building a document term matrix from the Added column. The text already is lowercase and stripped of punctuation. So to pre-process the data, just complete the following four steps:

#1) Create the corpus for the Added column, and call it "corpusAdded".
library(tm)
library(SnowballC)

##Various functions in the tm package can be used to create a corpus in many different ways. We will create it from the Added column of our data frame using two functions, Corpus() and VectorSource(). We feed the latter the Added variable of the wiki data frame.

# Create corpus
corpusAdded= Corpus(VectorSource(wiki$Added))

##We can check that the documents match our revisions by using double brackets [[.
#To inspect the first document in our corpus, we select the first element:
corpusAdded[[1]]$content
## [1] "  represent psycholinguisticspsycholinguistics orthographyorthography help text all actions through human ethnologue relationships linguistics regarded writing languages to other listing xmlspacepreservelanguages metaverse formal term philology common each including phonologyphonology often ten list humans affiliation see computer are speechpathologyspeech our what for ways dialects please artificial written body be of quite hypothesis found alone refers by about language profanity study programming priorities rosenfelders technologytechnologies makes or first among useful languagephilosophy one sounds use area create phrases mark their genetic basic families complete but sapirwhorfhypothesissapirwhorf with talklanguagetalk population animals this science up vocal can concepts called at and topics locations as numbers have in pathology different develop 4000 things ideas grouped complex animal mathematics fairly literature httpwwwzompistcom philosophy most important meaningful a historicallinguisticsorphilologyhistorical semanticssemantics patterns the oral"
corpusAdded= tm_map(corpusAdded, PlainTextDocument)

#2) Remove the English-language stopwords.
##Removing stop words

#Words can be removed with the removeWords argument to the tm_map() function, together with an extra argument specifying which stop words to remove; here we simply use the English list provided by the tm package.
#We will remove all of these English stop words, as they probably won't be very useful in our prediction problem.
corpusAdded= tm_map(corpusAdded, removeWords, stopwords("english"))

#3) Stem the words.
#Lastly, we want to stem our document with the stemDocument argument.
corpusAdded= tm_map(corpusAdded, stemDocument)

# Now that we have gone through these preprocessing steps, we can take a second look at the first document in the corpus.
corpusAdded[[1]]$content
## [1] "  repres psycholinguisticspsycholinguist orthographyorthographi help text  action  human ethnologu relationship linguist regard write languag   list xmlspacepreservelanguag metavers formal term philolog common  includ phonologyphonolog often ten list human affili see comput  speechpathologyspeech    way dialect pleas artifici written bodi   quit hypothesi found alon refer   languag profan studi program prioriti rosenfeld technologytechnolog make  first among use languagephilosophi one sound use area creat phrase mark  genet basic famili complet  sapirwhorfhypothesissapirwhorf  talklanguagetalk popul anim  scienc  vocal can concept call   topic locat  number   patholog differ develop 4000 thing idea group complex anim mathemat fair literatur httpwwwzompistcom philosophi  import meaning  historicallinguisticsorphilologyhistor semanticssemant pattern  oral"
#BAG OF WORDS

#4) Build the DocumentTermMatrix, and call it dtmAdded.

#Create a Document Term Matrix
corpusAdded= tm_map(corpusAdded, PlainTextDocument)
dtmAdded= DocumentTermMatrix(corpusAdded)
dtmAdded
## <<DocumentTermMatrix (documents: 3876, terms: 6675)>>
## Non-/sparse entries: 15368/25856932
## Sparsity           : 100%
## Maximal term length: 784
## Weighting          : term frequency (tf)
#How many terms appear in dtmAdded?
#Ans:6675 

#####################################

#PROBLEM 1.3 - BAGS OF WORDS  


#Filter out sparse terms by keeping only terms that appear in 0.3% or more of the revisions, and call the new matrix sparseAdded. How many terms appear in sparseAdded?

#What we can see is that even though we have only 3876 documents in the corpus, we have 6675 terms that showed up at least once, which is clearly too many variables for the number of observations we have.
#So we want to remove the terms that don't appear too often in our data set.
# Remove sparse terms, i.e. let's remove some terms that don't appear very often.
sparseAdded= removeSparseTerms(dtmAdded, 0.997)
sparseAdded
## <<DocumentTermMatrix (documents: 3876, terms: 166)>>
## Non-/sparse entries: 2681/640735
## Sparsity           : 100%
## Maximal term length: 28
## Weighting          : term frequency (tf)
#Ans:166

######################################

#PROBLEM 1.4 - BAGS OF WORDS  

#Convert sparseAdded to a data frame called wordsAdded, and then prepend all the words with the letter A, by using the command:

# Create data frame from sparseAdded
#Let's convert the sparse matrix into a data frame that we will be able to use for our predictive models.
wordsAdded= as.data.frame(as.matrix(sparseAdded))

#prepend all the words with the letter A
colnames(wordsAdded) = paste("A", colnames(wordsAdded))
#str(wordsAdded)
dim(wordsAdded)
## [1] 3876  166
########################

#Now repeat all of the steps we've done so far (create a corpus, remove stop words, stem the document, create a sparse document term matrix, and convert it to a data frame) to create a Removed bag-of-words dataframe, called wordsRemoved, except this time, prepend all of the words with the letter R:

# Create corpus
corpusRemoved= Corpus(VectorSource(wiki$Removed))

##We can check that the documents match our revisions by using double brackets [[.
#To inspect the first document in our corpus, we select the first element:
corpusRemoved[[1]]$content
## [1] " "
corpusRemoved= tm_map(corpusRemoved, PlainTextDocument)

#2) Remove the English-language stopwords.
##Removing stop words

#Words can be removed with the removeWords argument to the tm_map() function, together with an extra argument specifying which stop words to remove; here we simply use the English list provided by the tm package.
#We will remove all of these English stop words, as they probably won't be very useful in our prediction problem.
corpusRemoved= tm_map(corpusRemoved, removeWords, stopwords("english"))

#3) Stem the words.
#Lastly, we want to stem our document with the stemDocument argument.
corpusRemoved= tm_map(corpusRemoved, stemDocument)

# Now that we have gone through the preprocessing steps, we can take a second look at the second document in the corpus.
corpusRemoved[[2]]$content
## [1] " talklanguagetalk"
#BAG OF WORDS

#4) Build the DocumentTermMatrix, and call it dtmRemoved.
#Create a Document Term Matrix
corpusRemoved= tm_map(corpusRemoved, PlainTextDocument)
dtmRemoved= DocumentTermMatrix(corpusRemoved)
dtmRemoved
## <<DocumentTermMatrix (documents: 3876, terms: 5403)>>
## Non-/sparse entries: 13293/20928735
## Sparsity           : 100%
## Maximal term length: 784
## Weighting          : term frequency (tf)
#BAGS OF WORDS 

#So we want to remove the terms that don't appear too often in our data set.
# Remove sparse terms, i.e. let's remove some terms that don't appear very often.
sparseRemoved= removeSparseTerms(dtmRemoved, 0.997)
sparseRemoved
## <<DocumentTermMatrix (documents: 3876, terms: 162)>>
## Non-/sparse entries: 2552/625360
## Sparsity           : 100%
## Maximal term length: 28
## Weighting          : term frequency (tf)
#Create data frame from sparseRemoved
#Let's convert the sparse matrix into a data frame that we will be able to use for our predictive models.
wordsRemoved= as.data.frame(as.matrix(sparseRemoved))

#prepend all the words with the letter R
colnames(wordsRemoved) = paste("R", colnames(wordsRemoved))

#How many words are in the wordsRemoved data frame?
#str(wordsRemoved)
#or
ncol(wordsRemoved)
## [1] 162
#Ans:162

#############################################

#PROBLEM 1.5 - BAGS OF WORDS  

#Combine the two data frames into a data frame called wikiWords with the following line of code:

wikiWords = cbind(wordsAdded, wordsRemoved)

#The cbind function combines two sets of variables for the same observations into one data frame. 
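#As a side note on why the "A"/"R" prefixes matter: both bags of words can contain the same stems, and duplicate column names after cbind() would be ambiguous. A quick, optional sanity check that the prefixes keep the two sets of names disjoint:
intersect(colnames(wordsAdded), colnames(wordsRemoved))  # expect character(0)
sum(duplicated(colnames(wikiWords)))                     # expect 0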

# We also have to add back-in the vandal variable
wikiWords$vandal= wiki$Vandal


#Split data in training/testing sets
#let's split our data into a training set and a testing set, putting 70% of the data in the training set.

library(caTools)
set.seed(123)

#split the data set using sample.split from the "caTools" package to put 70% in the training set.
spl = sample.split(wikiWords$vandal, 0.7)

wikiTrain = subset(wikiWords, spl == TRUE)
wikiTest=subset(wikiWords, spl == FALSE)
#str(wikiTest)

#What is the accuracy on the test set of a baseline method that always predicts "not vandalism" (the most frequent outcome)?
baseline<-table(wikiTest$vandal)
baseline
## 
##   0   1 
## 618 545
#Baseline accuracy
accu_baseline <- max(baseline)/sum(baseline)
accu_baseline #618/(618+545) = 0.531
## [1] 0.5313844
#Ans:0.5313844

######################################

#PROBLEM 1.6 - BAGS OF WORDS  

#Build a CART model to predict Vandal, using all of the other variables as independent variables. Use the training set to build the model and the default parameters (don't set values for minbucket or cp).

#Now we are ready to build the model, and we will build a simple CART model using the default parameters.

library(rpart)
library(rpart.plot)

#Let's use CART to build a predictive model, using the rpart() function to predict vandal using all of the other variables as our independent variables and the data set wikiTrain.
wikiCART = rpart(vandal~., data=wikiTrain, method="class")

#What is the accuracy of the model on the test set, using a threshold of 0.5? (Remember that if you add the argument type="class" when making predictions, the output of predict will automatically use a threshold of 0.5.)

# Out-of-Sample Performance of the Model (Make predictions on the test set wikiTest)
testPredictCART= predict(wikiCART, newdata=wikiTest,type="class") #if you add the argument type="class" when making predictions, the output of predict will automatically use a threshold of 0.5.

#We are interested in the accuracy of our model on the test set, i.e. out-of-sample. 
#First we compute the confusion matrix

cmat_CART<-table(wikiTest$vandal,testPredictCART)
cmat_CART
##    testPredictCART
##       0   1
##   0 618   0
##   1 533  12
#lets now compute the overall accuracy
accu_CART <- (cmat_CART[1,1] + cmat_CART[2,2])/sum(cmat_CART)
accu_CART  #(618+12)/(618+533+12) = 0.5417
## [1] 0.5417025
#Ans:0.5417025

##################################################

#PROBLEM 1.7 - BAGS OF WORDS  

#Plot the CART tree. How many word stems does the CART model use?
prp(wikiCART)

#Ans:2
#EXPLANATION:If you plot the tree with prp(wikiCART), you can see that the tree uses two words: "R arbitr" and "R thousa".

############################################

#PROBLEM 1.8 - BAGS OF WORDS  

#Given the performance of the CART model relative to the baseline, what is the best explanation of these results?
#Ans:Although it beats the baseline, bag of words is not very predictive for this problem.
#EXPLANATION:There is no reason to think there was anything wrong with the split. CART did not overfit, which you can check by computing the accuracy of the model on the training set. Over-sparsification is plausible but unlikely, since we selected a very high sparsity parameter. The only conclusion left is simply that bag of words didn't work very well in this case.
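#As a quick, optional check of the "CART did not overfit" claim above, we can compute the training-set accuracy of the same model; this sketch reuses the wikiCART and wikiTrain objects from Problem 1.6.
trainPredictCART <- predict(wikiCART, type="class")   #no newdata, so predictions are made on the training set
cmat_train <- table(wikiTrain$vandal, trainPredictCART)
sum(diag(cmat_train))/sum(cmat_train)   #if this is close to the test-set accuracy (~0.54), the model is not overfitting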

##########################################

#PROBLEM 2.1 - PROBLEM-SPECIFIC KNOWLEDGE  

#We weren't able to improve on the baseline using the raw textual information. More specifically, the words themselves were not useful. There are other options though, and in this section we will try two techniques - identifying a key class of words, and counting words.

#The key class of words we will use are website addresses. "Website addresses" (also known as URLs - Uniform Resource Locators) are comprised of two main parts. An example would be "http://www.google.com". The first part is the protocol, which is usually "http" (HyperText Transfer Protocol). The second part is the address of the site, e.g. "www.google.com". We have stripped all punctuation so links to websites appear in the data as one word, e.g. "httpwwwgooglecom". We hypothesize that given that a lot of vandalism seems to be adding links to promotional or irrelevant websites, the presence of a web address is a sign of vandalism.

#We can search for the presence of a web address in the words added by searching for "http" in the Added column. The grepl function returns TRUE if a string is found in another string, e.g.

grepl("cat","dogs and cats",fixed=TRUE) # TRUE
## [1] TRUE
grepl("cat","dogs and rats",fixed=TRUE) # FALSE
## [1] FALSE
#Create a copy of your dataframe from the previous question:

wikiWords2 = wikiWords

#Make a new column in wikiWords2 that is 1 if "http" was in Added:
wikiWords2$HTTP = ifelse(grepl("http",wiki$Added,fixed=TRUE), 1, 0)

#Based on this new column, how many revisions added a link?
table(wikiWords2$HTTP)
## 
##    0    1 
## 3659  217
#Ans: 217
#EXPLANATION:You can find this number by typing table(wikiWords2$HTTP), and seeing that there are 217 observations with value 1.

###############################

#PROBLEM 2.2 - PROBLEM-SPECIFIC KNOWLEDGE  

#In problem 1.5, you computed a vector called "spl" that identified the observations to put in the training and testing sets. Use that variable (do not recompute it with sample.split) to make new training and testing sets:

wikiTrain2 = subset(wikiWords2, spl==TRUE)
wikiTest2 = subset(wikiWords2, spl==FALSE)

#Then create a new CART model using this new variable as one of the independent variables.
#Now we are ready to build the model, and we will build a simple CART model using the default parameters. 

library(rpart)
library(rpart.plot)

#Let's use CART to build a predictive model, using the rpart() function to predict vandal using all of the other variables as our independent variables and the data set wikiTrain2.
wikiCART2 = rpart(vandal~., data=wikiTrain2, method="class")

# Out-of-Sample Performance of the Model (Make predictions on the test set)
testPredictCART2= predict(wikiCART2, newdata=wikiTest2,type="class") #if you add the argument type="class" when making predictions, the output of predict will automatically use a threshold of 0.5.

#We are interested in the accuracy of our model on the test set, i.e. out-of-sample. 
#First we compute the confusion matrix
cmat_CART<-table(wikiTest2$vandal,testPredictCART2)
cmat_CART
##    testPredictCART2
##       0   1
##   0 609   9
##   1 488  57
#What is the new accuracy of the CART model on the test set, using a threshold of 0.5?

#lets now compute the overall accuracy
accu_CART <- (cmat_CART[1,1] + cmat_CART[2,2])/sum(cmat_CART)
accu_CART  #(609+57)/(609+9+488+57) = 0.5726569
## [1] 0.5726569
#Ans:0.5726569

###############################

#PROBLEM 2.3 - PROBLEM-SPECIFIC KNOWLEDGE 

#Another possibility is that the number of words added and removed is predictive, perhaps more so than the actual words themselves. We already have a word count available in the form of the document-term matrices (DTMs).

#Sum the rows of dtmAdded and dtmRemoved and add them as new variables in your data frame wikiWords2 (called NumWordsAdded and NumWordsRemoved) by using the following commands:
wikiWords2$NumWordsAdded = rowSums(as.matrix(dtmAdded))
wikiWords2$NumWordsRemoved = rowSums(as.matrix(dtmRemoved))

#What is the average number of words added?
mean(wikiWords2$NumWordsAdded)
## [1] 4.050052
#Ans:4.050052

#################################################

#PROBLEM 2.4 - PROBLEM-SPECIFIC KNOWLEDGE  

#In problem 1.5, you computed a vector called "spl" that identified the observations to put in the training and testing sets. Use that variable (do not recompute it with sample.split) to make new training and testing sets with wikiWords2. Create the CART model again (using the training set and the default parameters).

wikiTrain3 = subset(wikiWords2, spl==TRUE)
wikiTest3 = subset(wikiWords2, spl==FALSE)

#Then create a new CART model using this new variable as one of the independent variables.
#Now we are ready to build the model, and we will build a simple CART model using the default parameters. 

library(rpart)
library(rpart.plot)

#Let's use CART to build a predictive model, using the rpart() function to predict vandal using all of the other variables as our independent variables and the data set wikiTrain3.
wikiCART3 = rpart(vandal~., data=wikiTrain3, method="class")

# Out-of-Sample Performance of the Model (Make predictions on the wikiTest3 set)
testPredictCART3= predict(wikiCART3, newdata=wikiTest3,type="class") #if you add the argument type="class" when making predictions, the output of predict will automatically use a threshold of 0.5.

#We are interested in the accuracy of our model on the test set, i.e. out-of-sample. 
#First we compute the confusion matrix
cmat_CART<-table(wikiTest3$vandal,testPredictCART3)
cmat_CART
##    testPredictCART3
##       0   1
##   0 514 104
##   1 297 248
#What is the new accuracy of the CART model on the test set, using a threshold of 0.5?

#lets now compute the overall accuracy
accu_CART <- (cmat_CART[1,1] + cmat_CART[2,2])/sum(cmat_CART)
accu_CART  #(514+248)/(514+104+297+248) = 0.6552021
## [1] 0.6552021
#Ans:0.6552021

##########################################

#PROBLEM 3.1 - USING NON-TEXTUAL DATA  

#We have two pieces of "metadata" (data about data) that we haven't yet used. Make a copy of wikiWords2, and call it wikiWords3:
wikiWords3 = wikiWords2

#Then add the two original variables Minor and Loggedin to this new data frame:
wikiWords3$Minor = wiki$Minor
wikiWords3$Loggedin = wiki$Loggedin

#In problem 1.5, you computed a vector called "spl" that identified the observations to put in the training and testing sets. Use that variable (do not recompute it with sample.split) to make new training and testing sets with wikiWords3.

wikiTrain4 = subset(wikiWords3, spl==TRUE)
wikiTest4 = subset(wikiWords3, spl==FALSE)

#Build a CART model using all the training data. What is the accuracy of the model on the test set?

#Now we are ready to build the model, and we will build a simple CART model using the default parameters. 
library(rpart)
library(rpart.plot)

#Let's use CART to build a predictive model, using the rpart() function to predict vandal using all of the other variables as our independent variables and the data set wikiTrain4.
wikiCART4 = rpart(vandal~., data=wikiTrain4, method="class")

# Out-of-Sample Performance of the Model (Make predictions on the test set wikiTest4)
testPredictCART4= predict(wikiCART4, newdata=wikiTest4,type="class") #if you add the argument type="class" when making predictions, the output of predict will automatically use a threshold of 0.5.

#We are interested in the accuracy of our model on the test set, i.e. out-of-sample. 
#First we compute the confusion matrix
cmat_CART<-table(wikiTest4$vandal,testPredictCART4)
cmat_CART
##    testPredictCART4
##       0   1
##   0 595  23
##   1 304 241
#What is the new accuracy of the CART model on the test set, using a threshold of 0.5?

#lets now compute the overall accuracy
accu_CART <- (cmat_CART[1,1] + cmat_CART[2,2])/sum(cmat_CART)
accu_CART  #(595+241)/(595+23+304+241) = 0.7188306.
## [1] 0.7188306
#Ans:0.7188306

######################################

#PROBLEM 3.2 - USING NON-TEXTUAL DATA  

#There is a substantial difference in the accuracy of the model using the meta data. Is this because we made a more complicated model?

#Plot the CART tree. How many splits are there in the tree?
prp(wikiCART4)

#Ans:3
#EXPLANATION:You can plot the tree with prp(wikiCART4). The first split is on the variable "Loggedin", the second split is on the number of words added, and the third split is on the number of words removed. By adding new independent variables, we were able to significantly improve our accuracy without making the model more complicated!
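#If plotting is inconvenient, the split variables can also be read, as an optional rough check, directly from the frame component of the rpart object; the expected names below are based on the plot described above.
splitVars <- as.character(wikiCART4$frame$var)
unique(splitVars[splitVars != "<leaf>"])   #rows that are not "<leaf>" are internal nodes, i.e. splits
#Expected: "Loggedin", "NumWordsAdded", "NumWordsRemoved"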

AUTOMATING REVIEWS IN MEDICINE

The medical literature is enormous. Pubmed, a database of medical publications maintained by the U.S. National Library of Medicine, has indexed over 23 million medical publications. Further, the rate of medical publication has increased over time, and now there are nearly 1 million new publications in the field each year, or more than one per minute.

The large size and fast-changing nature of the medical literature has increased the need for reviews, which search databases like Pubmed for papers on a particular topic and then report results from the papers found. While such reviews are often performed manually, with multiple people reviewing each search result, this is tedious and time consuming. In this problem, we will see how text analytics can be used to automate the process of information retrieval.

The dataset consists of the titles (variable title) and abstracts (variable abstract) of papers retrieved in a Pubmed search. Each search result is labeled with whether the paper is a clinical trial testing a drug therapy for cancer (variable trial). These labels were obtained by two people reviewing each search result and accessing the actual paper if necessary, as part of a literature review of clinical trials testing drug therapies for advanced and metastatic breast cancer.

#PROBLEM 1.1 - LOADING THE DATA  

trials<-read.csv("clinical_trial.csv", stringsAsFactors=FALSE)
str(trials)
## 'data.frame':    1860 obs. of  3 variables:
##  $ title   : chr  "Treatment of Hodgkin's disease and other cancers with 1,3-bis(2-chloroethyl)-1-nitrosourea (BCNU; NSC-409962)." "Cell mediated immune status in malignancy--pretherapy and post-therapy assessment." "Neoadjuvant vinorelbine-capecitabine versus docetaxel-doxorubicin-cyclophosphamide in early nonresponsive breast cancer: phase "| __truncated__ "Randomized phase 3 trial of fluorouracil, epirubicin, and cyclophosphamide alone or followed by Paclitaxel for early breast can"| __truncated__ ...
##  $ abstract: chr  "" "Twenty-eight cases of malignancies of different kinds were studied to assess T-cell activity and population before and after in"| __truncated__ "BACKGROUND: Among breast cancer patients, nonresponse to initial neoadjuvant chemotherapy is associated with unfavorable outcom"| __truncated__ "BACKGROUND: Taxanes are among the most active drugs for the treatment of metastatic breast cancer, and, as a consequence, they "| __truncated__ ...
##  $ trial   : int  1 0 1 1 1 0 1 0 0 0 ...
summary(trials)
##     title             abstract             trial       
##  Length:1860        Length:1860        Min.   :0.0000  
##  Class :character   Class :character   1st Qu.:0.0000  
##  Mode  :character   Mode  :character   Median :0.0000  
##                                        Mean   :0.4392  
##                                        3rd Qu.:1.0000  
##                                        Max.   :1.0000
#We can use R's string functions to learn more about the titles and abstracts of the located papers. The nchar() function counts the number of characters in a piece of text. Using the nchar() function on the variables in the data frame, answer the following questions:

#How many characters are there in the longest abstract? (Longest here is defined as the abstract with the largest number of characters.)
max(nchar(trials$abstract)) 
## [1] 3708
#or
summary(nchar(trials$abstract))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    1196    1583    1481    1821    3708
#Ans:3708

##############################

#PROBLEM 1.2 - LOADING THE DATA 

#How many search results provided no abstract? (HINT: A search result provided no abstract if the number of characters in the abstract field is zero.)
table(nchar(trials$abstract) == 0)
## 
## FALSE  TRUE 
##  1748   112
#or
sum(nchar(trials$abstract)==0)
## [1] 112
#Ans:112

################################

#PROBLEM 1.3 - LOADING THE DATA  

#Find the observation with the minimum number of characters in the title (the variable "title") out of all of the observations in this dataset. What is the text of the title of this article? Include capitalization and punctuation in your response, but don't include the quotes.
trials$title[which.min(nchar(trials$title))]
## [1] "A decade of letrozole: FACE."
#Ans:A decade of letrozole: FACE.

##########################################

#PROBLEM 2.1 - PREPARING THE CORPUS 

#Because we have both title and abstract information for trials, we need to build two corpora instead of one. Name them corpusTitle and corpusAbstract.

# Load tm package
library(tm)

#1) Convert the title variable to corpusTitle and the abstract variable to corpusAbstract.

#CREATING the Corpora

#We will need to convert our titles & abstracts to corpora for pre-processing. Various functions in the tm package can be used to create a corpus in many different ways.
#We will create them using two functions, Corpus() and VectorSource(). We feed the latter the title & abstract variables of the trials data frame.

corpusTitle= Corpus(VectorSource(trials$title))
corpusAbstract= Corpus(VectorSource(trials$abstract))

#Let's see the first title & abstract in our corpora:
corpusTitle[[1]]$content
## [1] "Treatment of Hodgkin's disease and other cancers with 1,3-bis(2-chloroethyl)-1-nitrosourea (BCNU; NSC-409962)."
corpusAbstract[[1]]$content
## [1] ""
#2) Convert corpusTitle and corpusAbstract to lowercase. After performing this step, remember to run the lines:

#corpusTitle = tm_map(corpusTitle, PlainTextDocument)
#corpusAbstract = tm_map(corpusAbstract, PlainTextDocument)

#To transform all text to lower case:
corpusTitle= tm_map(corpusTitle, content_transformer(tolower))
corpusAbstract = tm_map(corpusAbstract, content_transformer(tolower))

#Checking the same  "documents" as before:
corpusTitle[[1]]$content
## [1] "treatment of hodgkin's disease and other cancers with 1,3-bis(2-chloroethyl)-1-nitrosourea (bcnu; nsc-409962)."
corpusAbstract[[1]]$content
## [1] ""
#3) Remove the punctuation in corpusTitle and corpusAbstract.

#Removing punctuation
corpusTitle= tm_map(corpusTitle, removePunctuation)
corpusAbstract= tm_map(corpusAbstract, removePunctuation)

#Checking the same  "documents" as before:
corpusTitle[[1]]$content
## [1] "treatment of hodgkins disease and other cancers with 13bis2chloroethyl1nitrosourea bcnu nsc409962"
corpusAbstract[[1]]$content
## [1] ""
#4) Remove the English language stop words from corpusTitle and corpusAbstract.

#Removing stop words
#Words can be removed with the removeWords argument to the tm_map() function, together with an extra argument specifying which stop words to remove; here we simply use the English list provided by the tm package.
#We will remove all of these English stop words, as they probably won't be very useful in our prediction problem.
corpusTitle= tm_map(corpusTitle, removeWords, stopwords("english"))
corpusAbstract= tm_map(corpusAbstract, removeWords, stopwords("english"))

#Checking the same  "documents" as before:
corpusTitle[[1]]$content
## [1] "treatment  hodgkins disease   cancers  13bis2chloroethyl1nitrosourea bcnu nsc409962"
corpusAbstract[[1]]$content
## [1] ""
#5) Stem the words in corpusTitle and corpusAbstract (each stemming might take a few minutes).

#Stemming

#Lastly, we want to stem our document with the stemDocument argument.
corpusTitle= tm_map(corpusTitle, stemDocument)
corpusAbstract= tm_map(corpusAbstract, stemDocument)

#Checking the same  "documents" as before:
corpusTitle[[1]]$content
## [1] "treatment  hodgkin diseas   cancer  13bis2chloroethyl1nitrosourea bcnu nsc409962"
corpusAbstract[[1]]$content
## [1] ""
#6) Build a document term matrix called dtmTitle from corpusTitle and dtmAbstract from corpusAbstract.

#Create a Document Term Matrix
#Convert the corpora to PlainTextDocument (assigning back to corpusTitle and corpusAbstract)
corpusTitle = tm_map(corpusTitle, PlainTextDocument)
corpusAbstract = tm_map(corpusAbstract, PlainTextDocument)

#The tm package provides a function called DocumentTermMatrix() that generates a matrix
#The values in the matrix are the number of times that word appears in each document.
dtmTitle= DocumentTermMatrix(corpusTitle)
dtmTitle
## <<DocumentTermMatrix (documents: 1860, terms: 2834)>>
## Non-/sparse entries: 23416/5247824
## Sparsity           : 100%
## Maximal term length: 49
## Weighting          : term frequency (tf)
dtmAbstract= DocumentTermMatrix(corpusAbstract)
dtmAbstract
## <<DocumentTermMatrix (documents: 1860, terms: 12343)>>
## Non-/sparse entries: 153241/22804739
## Sparsity           : 99%
## Maximal term length: 67
## Weighting          : term frequency (tf)
#7) Limit dtmTitle and dtmAbstract to terms with sparseness of at most 95% (aka terms that appear in at least 5% of documents).
# Remove sparse terms which don't appear very often.
dtmTitle= removeSparseTerms(dtmTitle, 0.95)
dtmTitle
## <<DocumentTermMatrix (documents: 1860, terms: 31)>>
## Non-/sparse entries: 10683/46977
## Sparsity           : 81%
## Maximal term length: 15
## Weighting          : term frequency (tf)
dtmAbstract= removeSparseTerms(dtmAbstract, 0.95)
dtmAbstract
## <<DocumentTermMatrix (documents: 1860, terms: 335)>>
## Non-/sparse entries: 92007/531093
## Sparsity           : 85%
## Maximal term length: 15
## Weighting          : term frequency (tf)
#8) Convert dtmTitle and dtmAbstract to data frames (keep the names dtmTitle and dtmAbstract).

#Let's convert the sparse matrix into a data frame that we will be able to use for our predictive models.
dtmTitle= as.data.frame(as.matrix(dtmTitle))
dtmAbstract = as.data.frame(as.matrix(dtmAbstract))


#How many terms remain in dtmTitle after removing sparse terms (aka how many columns does it have)?
ncol(dtmTitle)
## [1] 31
#or 
dim(dtmTitle)
## [1] 1860   31
#Ans:31

#How many terms remain in dtmAbstract?
ncol(dtmAbstract)
## [1] 335
#or
dim(dtmAbstract)
## [1] 1860  335
#Ans:335

############################################

#PROBLEM 2.2 - PREPARING THE CORPUS

#What is the most likely reason why dtmAbstract has so many more terms than dtmTitle?
#Ans:Abstracts tend to have many more words than titles
#EXPLANATION:Because titles are so short, a word needs to be very common to appear in 5% of titles. Because abstracts have many more words, a word can be much less common and still appear in 5% of abstracts. While abstracts may have wider vocabulary, this is a secondary effect. As we saw in the previous subsection, all papers have titles, but not all have abstracts.
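#To see the size difference the explanation refers to, here is a rough sketch comparing word counts of titles and abstracts; simple whitespace tokenization is used, which only approximates the tm preprocessing.
titleWords    <- sapply(strsplit(trials$title, "\\s+"), length)
abstractWords <- sapply(strsplit(trials$abstract, "\\s+"), length)
summary(titleWords)
summary(abstractWords)
#Abstracts contain many more words per document, so a stem can be much rarer overall and still appear in 5% of abstracts.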

###########################################

#PROBLEM 2.3 - PREPARING THE CORPUS  

#What is the most frequent word stem across all the abstracts? Hint: you can use colSums() to compute the frequency of a word across all the abstracts
which.max(colSums(dtmAbstract))
## patient 
##     212
#Ans:patient

###############################

#PROBLEM 3.1 - BUILDING A MODEL  

#We want to combine dtmTitle and dtmAbstract into a single data frame to make predictions. However, some of the variables in these data frames have the same names. To fix this issue, run the following commands:

colnames(dtmTitle) = paste0("T", colnames(dtmTitle))
colnames(dtmAbstract) = paste0("A", colnames(dtmAbstract))

#What was the effect of these functions?
#Ans:Adding the letter T in front of all the title variable names and adding the letter A in front of all the abstract variable names.
#EXPLANATION:The first line pastes a T at the beginning of each column name for dtmTitle, which are the variable names. The second line does something similar for the Abstract variables - it pastes an A at the beginning of each column name for dtmAbstract, which are the variable names.

#############################

#PROBLEM 3.2 - BUILDING A MODEL

#Using cbind(), combine dtmTitle and dtmAbstract into a single data frame called dtm:
dtm = cbind(dtmTitle, dtmAbstract)

#As we did in class, add the dependent variable "trial" to dtm, copying it from the original data frame called trials. 

#We also have to add back-in the outcome variable
dtm$trial<-trials$trial

#How many columns are in this combined data frame?
ncol(dtm)
## [1] 367
#Ans:367

###########################

#PROBLEM 3.3 - BUILDING A MODEL

#Now that we have prepared our data frame, it's time to split it into a training and testing set and to build regression models. Set the random seed to 144 and use the sample.split function from the caTools package to split dtm into data frames named "train" and "test", putting 70% of the data in the training set.

#Split data in training/testing sets
#let's split our data into a training set and a testing set, putting 70% of the data in the training set.

library(caTools)
set.seed(144)

#split the data set using sample.split from the "caTools" package to put 70% in the training set.
spl = sample.split(dtm$trial, 0.7)

train= subset(dtm, spl == TRUE)
test=subset(dtm, spl == FALSE)
#str(train)

#What is the accuracy of the baseline model on the training set? (Remember that the baseline model predicts the most frequent outcome in the training set for all observations.)

#The baseline model predicts the most frequent outcome in the training set for all observations.
cmat_baseline <-table(train$trial) 
cmat_baseline 
## 
##   0   1 
## 730 572
#Baseline accuracy
accu_baseline <- max(cmat_baseline)/sum(cmat_baseline)
accu_baseline #730/(730+572)=0.5606759
## [1] 0.5606759
#Ans:0.5606759
#EXPLANATION:Just as in any binary classification problem, the naive baseline always predicts the most common class. From table(train$trial), we see 730 training set results were not trials, and 572 were trials. Therefore, the naive baseline always predicts a result is not a trial, yielding accuracy of 730/(730+572).
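#A generic, optional way to identify the naive-baseline class and its training-set accuracy, equivalent to the max()/sum() computation above:
mostFrequent <- names(which.max(table(train$trial)))
mostFrequent                         # "0", i.e. "not a trial"
mean(train$trial == mostFrequent)    # 730/1302 = 0.5606759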

#############################

#PROBLEM 3.4 - BUILDING A MODEL

#Build a CART model called trialCART, using all the independent variables in the training set to train the model, and then plot the CART model. Just use the default parameters to build the model (don't add a minbucket or cp value). Remember to add the method="class" argument, since this is a classification problem.

#Now we are ready to build the model, and we will build a simple CART model using the default parameters & all the independent variables in the training set.

library(rpart)
library(rpart.plot)

#Let's use CART to build a predictive model, using the rpart() function to predict trial using all of the other variables as our independent variables and the data set train.
trialCART= rpart(trial~., data=train, method="class") #the method="class" argument as this is a classification problem

#Plotting the CART model
prp(trialCART)

#What is the name of the first variable the model split on?
#Ans:Tphase
#The first split checks whether or not Tphase is less than 0.5

################################

#PROBLEM 3.5 - BUILDING A MODEL

#Obtain the training set predictions for the model (do not yet predict on the test set). Extract the predicted probability of a result being a trial (recall that this involves not setting a type argument, and keeping only the second column of the predict output). What is the maximum predicted probability for any result?

#Make predictions on the train set)
predTrain= predict(trialCART)
max(predTrain[,2])
## [1] 0.8718861
#or
predTrain= predict(trialCART)[,2]
summary(predTrain)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.05455 0.13640 0.28750 0.43930 0.78230 0.87190
#Ans:0.8718861

###########################################

#PROBLEM 3.6 - BUILDING A MODEL

#Without running the analysis, how do you expect the maximum predicted probability to differ in the testing set?
#Ans:The maximum predicted probability will likely be exactly the same in the testing set.
#EXPLANATION:Because the CART tree assigns the same predicted probability to each leaf node and there are a small number of leaf nodes compared to data points, we expect exactly the same maximum predicted probability.

####################################

#PROBLEM 3.7 - BUILDING A MODEL  

#For these questions, use a threshold probability of 0.5 to predict that an observation is a clinical trial.

#What is the training set accuracy of the CART model?

#We are interested in the accuracy of our model on the train set. 
#First we compute the confusion matrix:

cmat_CART1<-table(train$trial,predTrain >= 0.5)
cmat_CART1
##    
##     FALSE TRUE
##   0   631   99
##   1   131  441
#lets now compute the overall accuracy

accu_CART <- (cmat_CART1[1,1] + cmat_CART1[2,2])/sum(cmat_CART1)
accu_CART  #(631+441)/(631+441+99+131) = 0.8233487
## [1] 0.8233487
#Ans:0.8233487

#What is the training set sensitivity of the CART model?
441/(131+441)
## [1] 0.770979
#Ans: 0.770979
#Sensitivity = TP rate=441/(131+441)=0.770979

#What is the training set specificity of the CART model?
631/(631+99)
## [1] 0.8643836
#Ans:0.8643836
#Specificity = (1 - FP rate)=631/(631+99)=0.8643836
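#The same two quantities can be read directly off the confusion matrix computed above (rows are the true outcomes, columns the predictions), which avoids typing the counts by hand:
sens <- cmat_CART1[2,2]/sum(cmat_CART1[2,])   #TP/(TP+FN) = 441/(131+441)
spec <- cmat_CART1[1,1]/sum(cmat_CART1[1,])   #TN/(TN+FP) = 631/(631+99)
c(sensitivity=sens, specificity=spec)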

#######################################################

#PROBLEM 4.1 - EVALUATING THE MODEL ON THE TESTING SET  

#Evaluate the CART model on the testing set using the predict function and creating a vector of predicted probabilities predTest.

# Make predictions
predTest = predict(trialCART, newdata=test)[,2] 
summary(predTest)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.05455 0.13640 0.28750 0.41730 0.78230 0.87190
#Now lets assess the accuracy of the model through confusion matrix
cmat_CART<-table(test$trial, predTest>= 0.5)  #first arg is the true outcomes and the second is the predicted outcomes
cmat_CART
##    
##     FALSE TRUE
##   0   261   52
##   1    83  162
#What is the testing set accuracy, assuming a probability threshold of 0.5 for predicting that a result is a clinical trial?

#lets now compute the overall accuracy
accu_CART <- (cmat_CART[1,1] + cmat_CART[2,2])/sum(cmat_CART)
accu_CART  # (261+162)/(261+162+83+52) = 0.7580645
## [1] 0.7580645
#Ans:0.7580645

######################################

PART 5: DECISION-MAKER TRADEOFFS

The decision maker for this problem, a researcher performing a review of the medical literature, would use a model (like the CART one we built here) in the following workflow:

  1. For all of the papers retrieved in the PubMed search, predict which papers are clinical trials using the model. This yields some initial Set A of papers predicted to be trials, and some Set B of papers predicted not to be trials. (See the figure below.)

  2. Then, the decision maker manually reviews all papers in Set A, verifying that each paper meets the study’s detailed inclusion criteria (for the purposes of this analysis, we assume this manual review is 100% accurate at identifying whether a paper in Set A is relevant to the study). This yields a more limited set of papers to be included in the study, which would ideally be all papers in the medical literature meeting the detailed inclusion criteria for the study.

  3. Perform the study-specific analysis, using data extracted from the limited set of papers identified in step 2.

This process is shown in the figure below.

InfoRetrievalFigure2

PROBLEM 5.1 - DECISION-MAKER TRADEOFFS

What is the cost associated with the model in Step 1 making a false negative prediction?

Ans:A paper that should have been included in Set A will be missed, affecting the quality of the results of Step 3.

EXPLANATION:By definition, a false negative is a paper that should have been included in Set A but was missed by the model. This means a study that should have been included in Step 3 was missed, affecting the results.

PROBLEM 5.2 - DECISION-MAKER TRADEOFFS

What is the cost associated with the model in Step 1 making a false positive prediction?

Ans:A paper will be mistakenly added to Set A, yielding additional work in Step 2 of the process but not affecting the quality of the results of Step 3.

EXPLANATION:By definition, a false positive is a paper that should not have been included in Set A but that was actually included. However, because the manual review in Step 2 is assumed to be 100% effective, this extra paper will not make it into the more limited set of papers, and therefore this mistake will not affect the analysis in Step 3.

PROBLEM 5.3 - DECISION-MAKER TRADEOFFS

Given the costs associated with false positives and false negatives, which of the following is most accurate?

Ans:A false negative is more costly than a false positive; the decision maker should use a probability threshold less than 0.5 for the machine learning model.

EXPLANATION:A false negative might negatively affect the results of the literature review and analysis, while a false positive is a nuisance (one additional paper that needs to be manually checked). As a result, the cost of a false negative is much higher than the cost of a false positive, so much so that many studies actually use no machine learning (aka no Step 1) and have two people manually review each search result in Step 2. As always, we prefer a lower threshold in cases where false negatives are more costly than false positives, since we will make fewer negative predictions.
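As an optional illustration of this threshold argument (not part of the assignment), the test-set probabilities predTest computed earlier can simply be re-thresholded; 0.2 below is an arbitrary example value, and because CART assigns one probability per leaf, only thresholds that cross a leaf probability actually change the predictions.

cmat_low <- table(test$trial, predTest >= 0.2)   # confusion matrix at the lower example threshold
cmat_low
sum(diag(cmat_low))/sum(cmat_low)                # overall accuracy at this threshold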

SEPARATING SPAM FROM HAM (PART 1)

Nearly every email user has at some point encountered a “spam” email, which is an unsolicited message often advertising a product, containing links to malware, or attempting to scam the recipient. Roughly 80-90% of more than 100 billion emails sent each day are spam emails, most being sent from botnets of malware-infected computers. The remainder of emails are called “ham” emails.

As a result of the huge number of spam emails being sent across the Internet each day, most email providers offer a spam filter that automatically flags likely spam messages and separates them from the ham. Though these filters use a number of techniques (e.g. looking up the sender in a so-called “Blackhole List” that contains IP addresses of likely spammers), most rely heavily on the analysis of the contents of an email via text analytics.

In this homework problem, we will build and evaluate a spam filter using a publicly available dataset first described in the 2006 conference paper “Spam Filtering with Naive Bayes – Which Naive Bayes?” by V. Metsis, I. Androutsopoulos, and G. Paliouras. The “ham” messages in this dataset come from the inbox of former Enron Managing Director for Research Vincent Kaminski, one of the inboxes in the Enron Corpus. One source of spam messages in this dataset is the SpamAssassin corpus, which contains hand-labeled spam messages contributed by Internet users. The remaining spam was collected by Project Honey Pot, a project that collects spam messages and identifies spammers by publishing email addresses that humans would know not to contact but that bots might target with spam. The full dataset we will use was constructed as roughly a 75/25 mix of the ham and spam messages.

The dataset contains just two fields:

  • text: The text of the email.
  • spam: A binary variable indicating if the email was spam.
#PROBLEM 1.1 - LOADING THE DATASET  

emails <- read.csv("emails.csv", stringsAsFactors=FALSE)
str(emails)
## 'data.frame':    5728 obs. of  2 variables:
##  $ text: chr  "Subject: naturally irresistible your corporate identity  lt is really hard to recollect a company : the  market is full of suqg"| __truncated__ "Subject: the stock trading gunslinger  fanny is merrill but muzo not colza attainder and penultimate like esmark perspicuous ra"| __truncated__ "Subject: unbelievable new homes made easy  im wanting to show you this  homeowner  you have been pre - approved for a $ 454 , 1"| __truncated__ "Subject: 4 color printing special  request additional information now ! click here  click here for a printable version of our o"| __truncated__ ...
##  $ spam: int  1 1 1 1 1 1 1 1 1 1 ...
#How many emails are in the dataset?
nrow(emails)
## [1] 5728
#Ans:5728

######################################

#PROBLEM 1.2 - LOADING THE DATASET  

#How many of the emails are spam?
table(emails$spam)
## 
##    0    1 
## 4360 1368
#or
sum(emails$spam)
## [1] 1368
#Ans:1368

#################################

#PROBLEM 1.3 - LOADING THE DATASET  

#Which word appears at the beginning of every email in the dataset? Respond as a lower-case word with punctuation removed.
emails$text[1]
## [1] "Subject: naturally irresistible your corporate identity  lt is really hard to recollect a company : the  market is full of suqgestions and the information isoverwhelminq ; but a good  catchy logo , stylish statlonery and outstanding website  will make the task much easier .  we do not promise that havinq ordered a iogo your  company will automaticaily become a world ieader : it isguite ciear that  without good products , effective business organization and practicable aim it  will be hotat nowadays market ; but we do promise that your marketing efforts  will become much more effective . here is the list of clear  benefits : creativeness : hand - made , original logos , specially done  to reflect your distinctive company image . convenience : logo and stationery  are provided in all formats ; easy - to - use content management system letsyou  change your website content and even its structure . promptness : you  will see logo drafts within three business days . affordability : your  marketing break - through shouldn ' t make gaps in your budget . 100 % satisfaction  guaranteed : we provide unlimited amount of changes with no extra fees for you to  be surethat you will love the result of this collaboration . have a look at our  portfolio _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ not interested . . . _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _"
emails$text[1000]
## [1] "Subject: 70 percent off your life insurance get a free quote instantly .  question :  are you paying too much for life insurance ?  most  likely the answer is yes !  here ' s why . fact . . . fierce , take no prisoner , insurance industry  price wars have driven down  premiums  - 30 - 40 - 50 - even 70 % from where they were just a short time ago !  that ' s why your insurance company doesn ' t want you to read this . . .  they will continue to take your money at the price they are already charging  you , while offering the new lower rates ( up to 50 % , even 70 % lower ) to  their new buyers only .  but , don ' t take our word for it . . . click  hereand request a free online quote . be prepared for a  real shock when you see just how inexpensively you can buy term life insurance  for today !  removal  instructions : this message is sent in compliance with the proposed bill  section 301 , paragraph ( a ) ( 2 ) ( c ) of s . 1618 . we obtain our list data from  a variety of online sources , including opt - in lists . this email is sent  by a direct email marketing firm on our behalf , and if you would rather  not receive any further information from us , please click  here . in this way , you can instantly opt - out from the list  your email address was obtained from , whether this was an opt - in  or otherwise . please accept our apologies if this message has reached you  in error . please allow 5 - 10 business days for your email address to be removed  from all lists in our control . meanwhile , simply delete any duplicate emails  that you may receive and rest assured that your request to be taken off  this list will be honored . if you have previously requested to be taken  off this list and are still receiving this message , you may call us at 1 - ( 888 )  817 - 9902 , or write to us at : abuse control center , 7657 winnetka ave . , canoga  park , ca 91306 "
#Ans:subject
#EXPLANATION:You can review emails with, for instance, emails$text[1] or emails$text[1000]. Every email begins with the word "Subject:".
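#A quick, optional programmatic check of this observation, using a start-of-string anchor:
all(grepl("^Subject:", emails$text))   #expected TRUE for this dataset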

#######################################

#PROBLEM 1.4 - LOADING THE DATASET  

#Could a spam classifier potentially benefit from including the frequency of the word that appears in every email?
#Ans:Yes -- the number of times the word appears might help us differentiate spam from ham.
#EXPLANATION:We know that each email has the word "subject" appear at least once, but the frequency with which it appears might help us differentiate spam from ham. For instance, a long email chain would have the word "subject" appear a number of times, and this higher frequency might be indicative of a ham message.
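#To make this concrete, here is an optional sketch (the variable name is illustrative) that counts how often "subject" appears in each lower-cased email and compares the averages for ham (0) and spam (1):
subjectCount <- sapply(gregexpr("subject", tolower(emails$text), fixed=TRUE), function(m) sum(m > 0))
tapply(subjectCount, emails$spam, mean)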

#############################################################

#PROBLEM 1.5 - LOADING THE DATASET  

#The nchar() function counts the number of characters in a piece of text. How many characters are in the longest email in the dataset (where longest is measured in terms of the maximum number of characters)?
max(nchar(emails$text)) #or
## [1] 43952
summary(nchar(emails$text))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    13.0   508.8   979.0  1557.0  1894.0 43950.0
#Ans:43952

###############################################

#PROBLEM 1.6 - LOADING THE DATASET  

#Which row contains the shortest email in the dataset? (Just like in the previous problem, shortest is measured in terms of the fewest number of characters.)
which.min(nchar(emails$text))
## [1] 1992
#or
min(nchar(emails$text)) #determining the min length of the email
## [1] 13
which(nchar(emails$text) == 13) #extracting the row having the shortest email
## [1] 1992
#Ans:1992

########################################

#PROBLEM 2.1 - PREPARING THE CORPUS  

#Follow the standard steps to build and pre-process the corpus:

#1) Build a new corpus variable called corpus.
library(tm)
library(SnowballC)

# Create corpus
corpus = Corpus(VectorSource(emails$text))

#To inspect the first doc in our corpus, we select the first element as:
corpus[[1]]$content
## [1] "Subject: naturally irresistible your corporate identity  lt is really hard to recollect a company : the  market is full of suqgestions and the information isoverwhelminq ; but a good  catchy logo , stylish statlonery and outstanding website  will make the task much easier .  we do not promise that havinq ordered a iogo your  company will automaticaily become a world ieader : it isguite ciear that  without good products , effective business organization and practicable aim it  will be hotat nowadays market ; but we do promise that your marketing efforts  will become much more effective . here is the list of clear  benefits : creativeness : hand - made , original logos , specially done  to reflect your distinctive company image . convenience : logo and stationery  are provided in all formats ; easy - to - use content management system letsyou  change your website content and even its structure . promptness : you  will see logo drafts within three business days . affordability : your  marketing break - through shouldn ' t make gaps in your budget . 100 % satisfaction  guaranteed : we provide unlimited amount of changes with no extra fees for you to  be surethat you will love the result of this collaboration . have a look at our  portfolio _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ not interested . . . _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _"
#2) Using tm_map, convert the text to lowercase.

#IMPORTANT NOTE: If you are using the latest version of the tm package, you will need to run the following line before continuing (it converts corpus to a Plain Text Document). This is a recent change having to do with the tolower function that occurred after this video was recorded.
corpus = tm_map(corpus, PlainTextDocument)

#Converting text to lower case

#We use the tm_map() function which takes as its first argument the name of a corpus and as second argument a function performing the transformation that we want to apply to the text.
corpus = tm_map(corpus, content_transformer(tolower))

#Checking the same "document" as before:
corpus[[1]]$content
## [1] "subject: naturally irresistible your corporate identity  lt is really hard to recollect a company : the  market is full of suqgestions and the information isoverwhelminq ; but a good  catchy logo , stylish statlonery and outstanding website  will make the task much easier .  we do not promise that havinq ordered a iogo your  company will automaticaily become a world ieader : it isguite ciear that  without good products , effective business organization and practicable aim it  will be hotat nowadays market ; but we do promise that your marketing efforts  will become much more effective . here is the list of clear  benefits : creativeness : hand - made , original logos , specially done  to reflect your distinctive company image . convenience : logo and stationery  are provided in all formats ; easy - to - use content management system letsyou  change your website content and even its structure . promptness : you  will see logo drafts within three business days . affordability : your  marketing break - through shouldn ' t make gaps in your budget . 100 % satisfaction  guaranteed : we provide unlimited amount of changes with no extra fees for you to  be surethat you will love the result of this collaboration . have a look at our  portfolio _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ not interested . . . _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _"
#3) Using tm_map, remove all punctuation from the corpus.
corpus = tm_map(corpus, removePunctuation)
corpus[[1]]$content #Checking the same "document" as before
## [1] "subject naturally irresistible your corporate identity  lt is really hard to recollect a company  the  market is full of suqgestions and the information isoverwhelminq  but a good  catchy logo  stylish statlonery and outstanding website  will make the task much easier   we do not promise that havinq ordered a iogo your  company will automaticaily become a world ieader  it isguite ciear that  without good products  effective business organization and practicable aim it  will be hotat nowadays market  but we do promise that your marketing efforts  will become much more effective  here is the list of clear  benefits  creativeness  hand  made  original logos  specially done  to reflect your distinctive company image  convenience  logo and stationery  are provided in all formats  easy  to  use content management system letsyou  change your website content and even its structure  promptness  you  will see logo drafts within three business days  affordability  your  marketing break  through shouldn  t make gaps in your budget  100  satisfaction  guaranteed  we provide unlimited amount of changes with no extra fees for you to  be surethat you will love the result of this collaboration  have a look at our  portfolio                                                     not interested                                                       "
#4) Using tm_map, remove all English stopwords from the corpus.

#We will remove all of these English stop words, as they probably won't be very useful in our prediction problem.
corpus = tm_map(corpus, removeWords, stopwords("english"))
corpus[[1]]$content #Checking the same "document" as before to see if stop words are removed
## [1] "subject naturally irresistible  corporate identity  lt  really hard  recollect  company    market  full  suqgestions   information isoverwhelminq    good  catchy logo  stylish statlonery  outstanding website  will make  task much easier      promise  havinq ordered  iogo   company will automaticaily become  world ieader   isguite ciear   without good products  effective business organization  practicable aim   will  hotat nowadays market     promise   marketing efforts  will become much  effective     list  clear  benefits  creativeness  hand  made  original logos  specially done   reflect  distinctive company image  convenience  logo  stationery   provided   formats  easy    use content management system letsyou  change  website content  even  structure  promptness    will see logo drafts within three business days  affordability    marketing break   shouldn  t make gaps   budget  100  satisfaction  guaranteed   provide unlimited amount  changes   extra fees      surethat  will love  result   collaboration    look    portfolio                                                      interested                                                       "
#5) Using tm_map, stem the words in the corpus.

#Lastly, we stem the words in each document with the stemDocument transformation.
corpus = tm_map(corpus, stemDocument)

# Now that we have gone through the above preprocessing steps, we can take a second look at the first email in the corpus.
corpus[[1]]$content
## [1] "subject natur irresist  corpor ident  lt  realli hard  recollect  compani    market  full  suqgest   inform isoverwhelminq    good  catchi logo  stylish statloneri  outstand websit  will make  task much easier      promis  havinq order  iogo   compani will automaticaili becom  world ieader   isguit ciear   without good product  effect busi organ  practic aim   will  hotat nowaday market     promis   market effort  will becom much  effect     list  clear  benefit  creativ  hand  made  origin logo  special done   reflect  distinct compani imag  conveni  logo  stationeri   provid   format  easi    use content manag system letsyou  chang  websit content  even  structur  prompt    will see logo draft within three busi day  afford    market break   shouldn  t make gap   budget  100  satisfact  guarante   provid unlimit amount  chang   extra fee      surethat  will love  result   collabor    look    portfolio                                                      interest                                                      "
#6) Build a document term matrix from the corpus, called dtm.

#Create a Document Term Matrix
corpus = tm_map(corpus, PlainTextDocument) #convert back to PlainTextDocument (a workaround for differences between tm package versions, as noted in the preliminaries)

#The values in the matrix are the number of times each word appears in each document.
dtm = DocumentTermMatrix(corpus)
dtm
## <<DocumentTermMatrix (documents: 5728, terms: 28687)>>
## Non-/sparse entries: 481719/163837417
## Sparsity           : 100%
## Maximal term length: 24
## Weighting          : term frequency (tf)
#How many terms are in dtm?
ncol(dtm)
## [1] 28687
#Ans:28687

##############################################

#PROBLEM 2.2 - PREPARING THE CORPUS  

#To obtain a more reasonable number of terms, limit dtm to contain terms appearing in at least 5% of documents, and store this result as spdtm (don't overwrite dtm, because we will use it in a later step of this homework). How many terms are in spdtm?

#Remove sparse terms, keeping only the terms that appear in at least 5% of the documents (sparsity threshold 1 - 0.05 = 0.95)
spdtm = removeSparseTerms(dtm, 1-0.05)
spdtm 
## <<DocumentTermMatrix (documents: 5728, terms: 330)>>
## Non-/sparse entries: 213551/1676689
## Sparsity           : 89%
## Maximal term length: 10
## Weighting          : term frequency (tf)
ncol(spdtm)
## [1] 330
#Ans:330
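#A quick sanity check of the 5% rule (a sketch, not part of the graded answers, reusing the dtm and spdtm objects above; docFreq is just an illustrative name).
#A DocumentTermMatrix stores only its non-zero entries as (document, term, value) triplets, so tabulating the term index j gives each term's document frequency.
docFreq <- tabulate(dtm$j, nbins = ncol(dtm))
sum(docFreq >= 0.05 * nrow(dtm)) #number of terms appearing in at least 5% of the 5728 emails; should match ncol(spdtm), i.e. 330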

######################################

#PROBLEM 2.3 - PREPARING THE CORPUS  

#Build a data frame called emailsSparse from spdtm, and use the make.names function to make the variable names of emailsSparse valid.

#Create data frame emailsSparse from spdtm
emailsSparse <- as.data.frame(as.matrix(spdtm)) #convert the sparse matrix into a data frame that we will be able to use for our predictive models.

#To make all variable names R-friendly use:
colnames(emailsSparse)<- make.names(colnames(emailsSparse))

#colSums() is an R function that returns the sum of values for each variable in our data frame. Our data frame contains the number of times each word stem (columns) appeared in each email (rows). Therefore, colSums(emailsSparse) returns the number of times a word stem appeared across all the emails in the dataset.  Hint: think about how you can use sort() or which.max() to pick out the maximum frequency.
head(colSums(emailsSparse),20)
##    X000   X2000   X2001    X713    X853     abl  access account   addit 
##    1007    4967    3089    1097     462     590     789     829     774 
## address   allow alreadi    also analysi   anoth  applic appreci  approv 
##    1154     450     446    1864     495     435     567     541     648 
##   april    area 
##     682     489
#What is the word stem that shows up most frequently across all the emails in the dataset?
which.max(colSums(emailsSparse))
## enron 
##    92
#or sort(colSums(emailsSparse))
#Ans:enron 

#############################################

#PROBLEM 2.4 - PREPARING THE CORPUS  

#Add a variable called "spam" to emailsSparse containing the email spam labels. You can do this by copying over the "spam" variable from the original data frame (remember how we did this in the Twitter lecture).
emailsSparse$spam <- emails$spam

#How many word stems appear at least 5000 times in the ham emails in the dataset? Hint: in this and the next question, remember not to count the dependent variable we just added.
sum(colSums(subset(emailsSparse, emailsSparse$spam==0)) >= 5000) #the "spam" column sums to 0 in the ham subset, so it does not affect this count
## [1] 6
#or
head(sort(colSums(subset(emailsSparse, spam == 0)), decreasing = T), 10) #We can read off the most frequent terms in the ham dataset
##    enron      ect  subject     vinc     will      hou    X2000 kaminski 
##    13388    11417     8625     8531     6802     5569     4935     4801 
##    pleas      com 
##     4494     4444
#Ans:6
#EXPLANATION:"enron", "ect", "subject", "vinc", "will", and "hou" appear at least 5000 times in the ham dataset.

##################################################

#PROBLEM 2.5 - PREPARING THE CORPUS  

#How many word stems appear at least 1000 times in the spam emails in the dataset?
sum(colSums(subset(emailsSparse, emailsSparse$spam==1)) >= 1000) - 1 #subtract 1: the "spam" column itself sums to 1368 (>= 1000) in the spam subset and is not a word stem
## [1] 3
#or
head(sort(colSums(subset(emailsSparse, spam == 1)),decreasing = T),10)
## subject    will    spam compani     com    mail    busi   email     can 
##    1577    1450    1368    1065     999     917     897     865     831 
##  inform 
##     818
#Ans:3
#EXPLANATION:"subject", "will", and "compani" are the three stems that appear at least 1000 times. Note that the variable "spam" is the dependent variable and is not the frequency of a word stem.

#######################################

#PROBLEM 2.6 - PREPARING THE CORPUS  

#The lists of most common words are significantly different between the spam and ham emails. What does this likely imply
#Ans:The frequencies of these most common words are likely to help differentiate between spam and ham. 
#EXPLANATION:A word stem like "enron", which is extremely common in the ham emails but does not occur in any spam message, will help us correctly identify a large number of ham messages.
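#A quick way to check this claim directly (a sketch, reusing the emailsSparse data frame with the spam labels added above):
sum(emailsSparse$enron[emailsSparse$spam == 1]) #total occurrences of the stem "enron" in the spam emails
sum(emailsSparse$enron[emailsSparse$spam == 0]) #total occurrences of the stem "enron" in the ham emails (13388, as seen above)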

##############################

#PROBLEM 2.7 - PREPARING THE CORPUS  

#Several of the most common word stems from the ham documents, such as "enron", "hou" (short for Houston), "vinc" (the word stem of "Vince") and "kaminski", are likely specific to Vincent Kaminski's inbox. What does this mean about the applicability of the text analytics models we will train for the spam filtering problem?
#Ans:The models we build are personalized, and would need to be further tested before being used as a spam filter for another person.
#EXPLANATION:The ham dataset is certainly personalized to Vincent Kaminski, and therefore it might not generalize well to a general email user. Caution is definitely necessary before applying the filters derived in this problem to other email users.

####################################

#PROBLEM 3.1 - BUILDING MACHINE LEARNING MODELS  

#First, convert the dependent variable to a factor with "emailsSparse$spam = as.factor(emailsSparse$spam)".
emailsSparse$spam = as.factor(emailsSparse$spam)

#Next, set the random seed to 123 and use the sample.split function to split emailsSparse 70/30 into a training set called "train" and a testing set called "test". Make sure to perform this step on emailsSparse instead of emails.

#Split data in training/testing sets
library(caTools)
set.seed(123)

#let's split our data into a training set and a testing set, putting 70% of the data in the training set.
spl <- sample.split(emailsSparse$spam, SplitRatio=0.7)
train <- subset(emailsSparse, spl==TRUE)
test <- subset(emailsSparse, spl==FALSE)

#Using the training set, train the following three machine learning models. The models should predict the dependent variable "spam", using all other available variables as independent variables. Please be patient, as these models may take a few minutes to train.

#1) A logistic regression model called spamLog. You may see a warning message here - we'll discuss this more later.
spamLog <- glm(spam ~ ., train, family=binomial)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#2) A CART model called spamCART, using the default parameters to train the model (don't worry about adding minbucket or cp). Remember to add the argument method="class" since this is a binary classification problem.

#Build a CART model
library(rpart)
library(rpart.plot)

#Let's use CART to build a predictive model, using the rpart() function to predict spam using all of the other variables as our independent variables and the data set train.
spamCART <- rpart(spam ~ ., train, method="class")


#3) A random forest model called spamRF, using the default parameters to train the model (don't worry about specifying ntree or nodesize). Directly before training the random forest model, set the random seed to 123 (even though we've already done this earlier in the problem, it's important to set the seed right before training the model so we all obtain the same results. Keep in mind though that on certain operating systems, your results might still be slightly different).
library(randomForest)
set.seed(123)
spamRF <- randomForest(spam ~ ., train)

#For each model, obtain the predicted spam probabilities for the training set. Be careful to obtain probabilities instead of predicted classes, because we will be using these values to compute training set AUC values. Recall that you can obtain probabilities for CART models by not passing any type parameter to the predict() function, and you can obtain probabilities from a random forest by adding the argument type="prob". For CART and random forest, you need to select the second column of the output of the predict() function, corresponding to the probability of a message being spam.

#Get predicted spam probabilities for the training set for each model:
predTrainLog <- predict(spamLog, type="response")
predTrainCART<- predict(spamCART)[,2]
predTrainRF<- predict(spamRF, type="prob")[,2]

#You may have noticed that training the logistic regression model yielded the messages "algorithm did not converge" and "fitted probabilities numerically 0 or 1 occurred". Both of these messages often indicate overfitting and the first indicates particularly severe overfitting, often to the point that the training set observations are fit perfectly by the model. Let's investigate the predicted probabilities from the logistic regression model.

#How many of the training set predicted probabilities from spamLog are less than 0.00001?
a<-sum(predTrainLog< 0.00001)
a
## [1] 3046
#or
table(predTrainLog < 0.00001)
## 
## FALSE  TRUE 
##   964  3046
#Ans:3046

#How many of the training set predicted probabilities from spamLog are more than 0.99999?
b<-sum(predTrainLog> 0.99999)
b
## [1] 954
#or
table(predTrainLog > 0.99999)
## 
## FALSE  TRUE 
##  3056   954
#Ans:954

#How many of the training set predicted probabilities from spamLog are between 0.00001 and 0.99999?
nrow(train) - a - b
## [1] 10
#or
table(predTrainLog >= 0.00001 & predTrainLog <= 0.99999)
## 
## FALSE  TRUE 
##  4000    10
#Ans:10

#################################################

#PROBLEM 3.2 - BUILDING MACHINE LEARNING MODELS  (1 point possible)

#How many variables are labeled as significant (at the p=0.05 level) in the logistic regression summary output?
summary(spamLog)
## 
## Call:
## glm(formula = spam ~ ., family = binomial, data = train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.011   0.000   0.000   0.000   1.354  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.082e+01  1.055e+04  -0.003    0.998
## X000         1.474e+01  1.058e+04   0.001    0.999
## X2000       -3.631e+01  1.556e+04  -0.002    0.998
## X2001       -3.215e+01  1.318e+04  -0.002    0.998
## X713        -2.427e+01  2.914e+04  -0.001    0.999
## X853        -1.212e+00  5.942e+04   0.000    1.000
## abl         -2.049e+00  2.088e+04   0.000    1.000
## access      -1.480e+01  1.335e+04  -0.001    0.999
## account      2.488e+01  8.165e+03   0.003    0.998
## addit        1.463e+00  2.703e+04   0.000    1.000
## address     -4.613e+00  1.113e+04   0.000    1.000
## allow        1.899e+01  6.436e+03   0.003    0.998
## alreadi     -2.407e+01  3.319e+04  -0.001    0.999
## also         2.990e+01  1.378e+04   0.002    0.998
## analysi     -2.405e+01  3.860e+04  -0.001    1.000
## anoth       -8.744e+00  2.032e+04   0.000    1.000
## applic      -2.649e+00  1.674e+04   0.000    1.000
## appreci     -2.145e+01  2.762e+04  -0.001    0.999
## approv      -1.302e+00  1.589e+04   0.000    1.000
## april       -2.620e+01  2.208e+04  -0.001    0.999
## area         2.041e+01  2.266e+04   0.001    0.999
## arrang       1.069e+01  2.135e+04   0.001    1.000
## ask         -7.746e+00  1.976e+04   0.000    1.000
## assist      -1.128e+01  2.490e+04   0.000    1.000
## associ       9.049e+00  1.909e+04   0.000    1.000
## attach      -1.037e+01  1.534e+04  -0.001    0.999
## attend      -3.451e+01  3.257e+04  -0.001    0.999
## avail        8.651e+00  1.709e+04   0.001    1.000
## back        -1.323e+01  2.272e+04  -0.001    1.000
## base        -1.354e+01  2.122e+04  -0.001    0.999
## begin        2.228e+01  2.973e+04   0.001    0.999
## believ       3.233e+01  2.136e+04   0.002    0.999
## best        -8.201e+00  1.333e+03  -0.006    0.995
## better       4.263e+01  2.360e+04   0.002    0.999
## book         4.301e+00  2.024e+04   0.000    1.000
## bring        1.607e+01  6.767e+04   0.000    1.000
## busi        -4.803e+00  1.000e+04   0.000    1.000
## buy          4.170e+01  3.892e+04   0.001    0.999
## call        -1.145e+00  1.111e+04   0.000    1.000
## can          3.762e+00  7.674e+03   0.000    1.000
## case        -3.372e+01  2.880e+04  -0.001    0.999
## chang       -2.717e+01  2.215e+04  -0.001    0.999
## check        1.425e+00  1.963e+04   0.000    1.000
## click        1.376e+01  7.077e+03   0.002    0.998
## com          1.936e+00  4.039e+03   0.000    1.000
## come        -1.166e+00  1.511e+04   0.000    1.000
## comment     -3.251e+00  3.387e+04   0.000    1.000
## communic     1.580e+01  8.958e+03   0.002    0.999
## compani      4.781e+00  9.186e+03   0.001    1.000
## complet     -1.363e+01  2.024e+04  -0.001    0.999
## confer      -7.503e-01  8.557e+03   0.000    1.000
## confirm     -1.300e+01  1.514e+04  -0.001    0.999
## contact      1.530e+00  1.262e+04   0.000    1.000
## continu      1.487e+01  1.535e+04   0.001    0.999
## contract    -1.295e+01  1.498e+04  -0.001    0.999
## copi        -4.274e+01  3.070e+04  -0.001    0.999
## corp         1.606e+01  2.708e+04   0.001    1.000
## corpor      -8.286e-01  2.818e+04   0.000    1.000
## cost        -1.938e+00  1.833e+04   0.000    1.000
## cours        1.665e+01  1.834e+04   0.001    0.999
## creat        1.338e+01  3.946e+04   0.000    1.000
## credit       2.617e+01  1.314e+04   0.002    0.998
## crenshaw     9.994e+01  6.769e+04   0.001    0.999
## current      3.629e+00  1.707e+04   0.000    1.000
## custom       1.829e+01  1.008e+04   0.002    0.999
## data        -2.609e+01  2.271e+04  -0.001    0.999
## date        -2.786e+00  1.699e+04   0.000    1.000
## day         -6.100e+00  5.866e+03  -0.001    0.999
## deal        -1.129e+01  1.448e+04  -0.001    0.999
## dear        -2.313e+00  2.306e+04   0.000    1.000
## depart      -4.068e+01  2.509e+04  -0.002    0.999
## deriv       -4.971e+01  3.587e+04  -0.001    0.999
## design      -7.923e+00  2.939e+04   0.000    1.000
## detail       1.197e+01  2.301e+04   0.001    1.000
## develop      5.976e+00  9.455e+03   0.001    0.999
## differ      -2.293e+00  1.075e+04   0.000    1.000
## direct      -2.051e+01  3.194e+04  -0.001    0.999
## director    -1.770e+01  1.793e+04  -0.001    0.999
## discuss     -1.051e+01  1.915e+04  -0.001    1.000
## doc         -2.597e+01  2.603e+04  -0.001    0.999
## don          2.129e+01  1.456e+04   0.001    0.999
## done         6.828e+00  1.882e+04   0.000    1.000
## due         -4.163e+00  3.532e+04   0.000    1.000
## ect          8.685e-01  5.342e+03   0.000    1.000
## edu         -2.122e-01  6.917e+02   0.000    1.000
## effect       1.948e+01  2.100e+04   0.001    0.999
## effort       1.606e+01  5.670e+04   0.000    1.000
## either      -2.744e+01  4.000e+04  -0.001    0.999
## email        3.833e+00  1.186e+04   0.000    1.000
## end         -1.311e+01  2.938e+04   0.000    1.000
## energi      -1.620e+01  1.646e+04  -0.001    0.999
## engin        2.664e+01  2.394e+04   0.001    0.999
## enron       -8.789e+00  5.719e+03  -0.002    0.999
## etc          9.470e-01  1.569e+04   0.000    1.000
## even        -1.654e+01  2.289e+04  -0.001    0.999
## event        1.694e+01  1.851e+04   0.001    0.999
## expect      -1.179e+01  1.914e+04  -0.001    1.000
## experi       2.460e+00  2.240e+04   0.000    1.000
## fax          3.537e+00  3.386e+04   0.000    1.000
## feel         2.596e+00  2.348e+04   0.000    1.000
## file        -2.943e+01  2.165e+04  -0.001    0.999
## final        8.075e+00  5.008e+04   0.000    1.000
## financ      -9.122e+00  7.524e+03  -0.001    0.999
## financi     -9.747e+00  1.727e+04  -0.001    1.000
## find        -2.623e+00  9.727e+03   0.000    1.000
## first       -4.666e-01  2.043e+04   0.000    1.000
## follow       1.766e+01  3.080e+03   0.006    0.995
## form         8.483e+00  1.674e+04   0.001    1.000
## forward     -3.484e+00  1.864e+04   0.000    1.000
## free         6.113e+00  8.121e+03   0.001    0.999
## friday      -1.146e+01  1.996e+04  -0.001    1.000
## full         2.125e+01  2.190e+04   0.001    0.999
## futur        4.146e+01  1.439e+04   0.003    0.998
## gas         -3.901e+00  4.160e+03  -0.001    0.999
## get          5.154e+00  9.737e+03   0.001    1.000
## gibner       2.901e+01  2.460e+04   0.001    0.999
## give        -2.518e+01  2.130e+04  -0.001    0.999
## given       -2.186e+01  5.426e+04   0.000    1.000
## good         5.399e+00  1.619e+04   0.000    1.000
## great        1.222e+01  1.090e+04   0.001    0.999
## group        5.264e-01  1.037e+04   0.000    1.000
## happi        1.939e-02  1.202e+04   0.000    1.000
## hear         2.887e+01  2.281e+04   0.001    0.999
## hello        2.166e+01  1.361e+04   0.002    0.999
## help         1.731e+01  2.791e+03   0.006    0.995
## high        -1.982e+00  2.554e+04   0.000    1.000
## home         5.973e+00  8.965e+03   0.001    0.999
## hope        -1.435e+01  2.179e+04  -0.001    0.999
## hou          6.852e+00  6.437e+03   0.001    0.999
## hour         2.478e+00  1.333e+04   0.000    1.000
## houston     -1.855e+01  7.305e+03  -0.003    0.998
## howev       -3.449e+01  3.562e+04  -0.001    0.999
## http         2.528e+01  2.107e+04   0.001    0.999
## idea        -1.845e+01  3.892e+04   0.000    1.000
## immedi       6.285e+01  3.346e+04   0.002    0.999
## import      -1.859e+00  2.236e+04   0.000    1.000
## includ      -3.454e+00  1.799e+04   0.000    1.000
## increas      6.476e+00  2.329e+04   0.000    1.000
## industri    -3.160e+01  2.373e+04  -0.001    0.999
## info        -1.255e+00  4.857e+03   0.000    1.000
## inform       2.078e+01  8.549e+03   0.002    0.998
## interest     2.698e+01  1.159e+04   0.002    0.998
## intern      -7.991e+00  3.351e+04   0.000    1.000
## internet     8.749e+00  1.100e+04   0.001    0.999
## interview   -1.640e+01  1.873e+04  -0.001    0.999
## invest       3.201e+01  2.393e+04   0.001    0.999
## invit        4.304e+00  2.215e+04   0.000    1.000
## involv       3.815e+01  3.315e+04   0.001    0.999
## issu        -3.708e+01  3.396e+04  -0.001    0.999
## john        -5.326e-01  2.856e+04   0.000    1.000
## join        -3.824e+01  2.334e+04  -0.002    0.999
## juli        -1.358e+01  3.009e+04   0.000    1.000
## just        -1.021e+01  1.114e+04  -0.001    0.999
## kaminski    -1.812e+01  6.029e+03  -0.003    0.998
## keep         1.867e+01  2.782e+04   0.001    0.999
## kevin       -3.779e+01  4.738e+04  -0.001    0.999
## know         1.277e+01  1.526e+04   0.001    0.999
## last         1.046e+00  1.372e+04   0.000    1.000
## let         -2.763e+01  1.462e+04  -0.002    0.998
## life         5.812e+01  3.864e+04   0.002    0.999
## like         5.649e+00  7.660e+03   0.001    0.999
## line         8.743e+00  1.236e+04   0.001    0.999
## link        -6.929e+00  1.345e+04  -0.001    1.000
## list        -8.692e+00  2.149e+03  -0.004    0.997
## locat        2.073e+01  1.597e+04   0.001    0.999
## london       6.745e+00  1.642e+04   0.000    1.000
## long        -1.489e+01  1.934e+04  -0.001    0.999
## look        -7.031e+00  1.563e+04   0.000    1.000
## lot         -1.964e+01  1.321e+04  -0.001    0.999
## made         2.820e+00  2.743e+04   0.000    1.000
## mail         7.584e+00  1.021e+04   0.001    0.999
## make         2.901e+01  1.528e+04   0.002    0.998
## manag        6.014e+00  1.445e+04   0.000    1.000
## mani         1.885e+01  1.442e+04   0.001    0.999
## mark        -3.350e+01  3.208e+04  -0.001    0.999
## market       7.895e+00  8.012e+03   0.001    0.999
## may         -9.434e+00  1.397e+04  -0.001    0.999
## mean         6.078e-01  2.952e+04   0.000    1.000
## meet        -1.063e+00  1.263e+04   0.000    1.000
## member       1.381e+01  2.343e+04   0.001    1.000
## mention     -2.279e+01  2.714e+04  -0.001    0.999
## messag       1.716e+01  2.562e+03   0.007    0.995
## might        1.244e+01  1.753e+04   0.001    0.999
## model       -2.292e+01  1.049e+04  -0.002    0.998
## monday      -1.034e+00  3.233e+04   0.000    1.000
## money        3.264e+01  1.321e+04   0.002    0.998
## month       -3.727e+00  1.112e+04   0.000    1.000
## morn        -2.645e+01  3.403e+04  -0.001    0.999
## move        -3.834e+01  3.011e+04  -0.001    0.999
## much         3.775e-01  1.392e+04   0.000    1.000
## name         1.672e+01  1.322e+04   0.001    0.999
## need         8.437e-01  1.221e+04   0.000    1.000
## net          1.256e+01  2.197e+04   0.001    1.000
## new          1.003e+00  1.009e+04   0.000    1.000
## next.        1.492e+01  1.724e+04   0.001    0.999
## note         1.446e+01  2.294e+04   0.001    0.999
## now          3.790e+01  1.219e+04   0.003    0.998
## number      -9.622e+00  1.591e+04  -0.001    1.000
## offer        1.174e+01  1.084e+04   0.001    0.999
## offic       -1.344e+01  2.311e+04  -0.001    1.000
## one          1.241e+01  6.652e+03   0.002    0.999
## onlin        3.589e+01  1.665e+04   0.002    0.998
## open         2.114e+01  2.961e+04   0.001    0.999
## oper        -1.696e+01  2.757e+04  -0.001    1.000
## opportun    -4.131e+00  1.918e+04   0.000    1.000
## option      -1.085e+00  9.325e+03   0.000    1.000
## order        6.533e+00  1.242e+04   0.001    1.000
## origin       3.226e+01  3.818e+04   0.001    0.999
## part         4.594e+00  3.483e+04   0.000    1.000
## particip    -1.154e+01  1.738e+04  -0.001    0.999
## peopl       -1.864e+01  1.439e+04  -0.001    0.999
## per          1.367e+01  1.273e+04   0.001    0.999
## person       1.870e+01  9.575e+03   0.002    0.998
## phone       -6.957e+00  1.172e+04  -0.001    1.000
## place        9.005e+00  3.661e+04   0.000    1.000
## plan        -1.830e+01  6.320e+03  -0.003    0.998
## pleas       -7.961e+00  9.484e+03  -0.001    0.999
## point        5.498e+00  3.403e+04   0.000    1.000
## posit       -1.543e+01  2.316e+04  -0.001    0.999
## possibl     -1.366e+01  2.492e+04  -0.001    1.000
## power       -5.643e+00  1.173e+04   0.000    1.000
## present     -6.163e+00  1.278e+04   0.000    1.000
## price        3.428e+00  7.850e+03   0.000    1.000
## problem      1.262e+01  9.763e+03   0.001    0.999
## process     -2.957e-01  1.191e+04   0.000    1.000
## product      1.016e+01  1.345e+04   0.001    0.999
## program      1.444e+00  1.183e+04   0.000    1.000
## project      2.173e+00  1.497e+04   0.000    1.000
## provid       2.422e-01  1.859e+04   0.000    1.000
## public      -5.250e+01  2.341e+04  -0.002    0.998
## put         -1.052e+01  2.681e+04   0.000    1.000
## question    -3.467e+01  1.859e+04  -0.002    0.999
## rate        -3.112e+00  1.319e+04   0.000    1.000
## read        -1.527e+01  2.145e+04  -0.001    0.999
## real         2.046e+01  2.358e+04   0.001    0.999
## realli      -2.667e+01  4.640e+04  -0.001    1.000
## receiv       5.765e-01  1.585e+04   0.000    1.000
## recent      -2.067e+00  1.780e+04   0.000    1.000
## regard      -3.668e+00  1.511e+04   0.000    1.000
## relat       -5.114e+01  1.793e+04  -0.003    0.998
## remov        2.325e+01  2.484e+04   0.001    0.999
## repli        1.538e+01  2.916e+04   0.001    1.000
## report      -1.482e+01  1.477e+04  -0.001    0.999
## request     -1.232e+01  1.167e+04  -0.001    0.999
## requir       5.004e-01  2.937e+04   0.000    1.000
## research    -2.826e+01  1.553e+04  -0.002    0.999
## resourc     -2.735e+01  3.522e+04  -0.001    0.999
## respond      2.974e+01  3.888e+04   0.001    0.999
## respons     -1.960e+01  3.667e+04  -0.001    1.000
## result      -5.002e-01  3.140e+04   0.000    1.000
## resum       -9.219e+00  2.100e+04   0.000    1.000
## return       1.745e+01  1.844e+04   0.001    0.999
## review      -4.825e+00  1.013e+04   0.000    1.000
## right        2.312e+01  1.590e+04   0.001    0.999
## risk        -4.001e+00  1.718e+04   0.000    1.000
## robert      -2.096e+01  2.907e+04  -0.001    0.999
## run         -5.162e+01  4.434e+04  -0.001    0.999
## say          7.366e+00  2.217e+04   0.000    1.000
## schedul      1.919e+00  3.580e+04   0.000    1.000
## school      -3.870e+00  2.882e+04   0.000    1.000
## secur       -1.604e+01  2.201e+03  -0.007    0.994
## see         -1.120e+01  1.293e+04  -0.001    0.999
## send        -2.427e+01  1.222e+04  -0.002    0.998
## sent        -1.488e+01  2.195e+04  -0.001    0.999
## servic      -7.164e+00  1.235e+04  -0.001    1.000
## set         -9.353e+00  2.627e+04   0.000    1.000
## sever        2.041e+01  3.093e+04   0.001    0.999
## shall        1.930e+01  3.075e+04   0.001    0.999
## shirley     -7.133e+01  6.329e+04  -0.001    0.999
## short       -8.974e+00  1.721e+04  -0.001    1.000
## sinc        -3.438e+00  3.546e+04   0.000    1.000
## sincer      -2.073e+01  3.515e+04  -0.001    1.000
## site         8.689e+00  1.496e+04   0.001    1.000
## softwar      2.575e+01  1.059e+04   0.002    0.998
## soon         2.350e+01  3.731e+04   0.001    0.999
## sorri        6.036e+00  2.299e+04   0.000    1.000
## special      1.777e+01  2.755e+04   0.001    0.999
## specif      -2.337e+01  3.083e+04  -0.001    0.999
## start        1.437e+01  1.897e+04   0.001    0.999
## state        1.221e+01  1.677e+04   0.001    0.999
## still        3.878e+00  2.622e+04   0.000    1.000
## stinson     -4.345e+01  2.697e+04  -0.002    0.999
## student     -1.815e+01  2.186e+04  -0.001    0.999
## subject      3.041e+01  1.055e+04   0.003    0.998
## success      4.344e+00  2.783e+04   0.000    1.000
## suggest     -3.842e+01  4.475e+04  -0.001    0.999
## support     -1.539e+01  1.976e+04  -0.001    0.999
## sure        -5.503e+00  2.078e+04   0.000    1.000
## system       3.778e+00  9.149e+03   0.000    1.000
## take         5.731e+00  1.716e+04   0.000    1.000
## talk        -1.011e+01  2.021e+04  -0.001    1.000
## team         7.940e+00  2.570e+04   0.000    1.000
## term         2.013e+01  2.303e+04   0.001    0.999
## thank       -3.890e+01  1.059e+04  -0.004    0.997
## thing        2.579e+01  1.341e+04   0.002    0.998
## think       -1.218e+01  2.077e+04  -0.001    1.000
## thought      1.243e+01  3.023e+04   0.000    1.000
## thursday    -1.491e+01  3.262e+04   0.000    1.000
## time        -5.921e+00  8.335e+03  -0.001    0.999
## today       -1.762e+01  1.965e+04  -0.001    0.999
## togeth      -2.355e+01  1.869e+04  -0.001    0.999
## trade       -1.755e+01  1.483e+04  -0.001    0.999
## tri          9.278e-01  1.282e+04   0.000    1.000
## tuesday     -2.808e+01  3.959e+04  -0.001    0.999
## two         -2.573e+01  1.844e+04  -0.001    0.999
## type        -1.447e+01  2.755e+04  -0.001    1.000
## understand   9.307e+00  2.342e+04   0.000    1.000
## unit        -4.020e+00  3.008e+04   0.000    1.000
## univers      1.228e+01  2.197e+04   0.001    1.000
## updat       -1.510e+01  1.448e+04  -0.001    0.999
## use         -1.385e+01  9.382e+03  -0.001    0.999
## valu         9.024e-01  1.360e+04   0.000    1.000
## version     -3.606e+01  2.939e+04  -0.001    0.999
## vinc        -3.735e+01  8.647e+03  -0.004    0.997
## visit        2.585e+01  1.170e+04   0.002    0.998
## vkamin      -6.649e+01  5.703e+04  -0.001    0.999
## want        -2.555e+00  1.106e+04   0.000    1.000
## way          1.339e+01  1.138e+04   0.001    0.999
## web          2.791e+00  1.686e+04   0.000    1.000
## websit      -2.563e+01  1.848e+04  -0.001    0.999
## wednesday   -1.526e+01  2.642e+04  -0.001    1.000
## week        -6.795e+00  1.046e+04  -0.001    0.999
## well        -2.222e+01  9.713e+03  -0.002    0.998
## will        -1.119e+01  5.980e+03  -0.002    0.999
## wish         1.173e+01  3.175e+04   0.000    1.000
## within       2.900e+01  2.163e+04   0.001    0.999
## without      1.942e+01  1.763e+04   0.001    0.999
## work        -1.099e+01  1.160e+04  -0.001    0.999
## write        4.406e+01  2.825e+04   0.002    0.999
## www         -7.867e+00  2.224e+04   0.000    1.000
## year        -1.010e+01  1.039e+04  -0.001    0.999
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4409.49  on 4009  degrees of freedom
## Residual deviance:   13.46  on 3679  degrees of freedom
## AIC: 675.46
## 
## Number of Fisher Scoring iterations: 25
#Ans:0
#EXPLANATION:From summary(spamLog), we see that none of the variables are labeled as significant (a symptom of the logistic regression algorithm not converging).
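#The same count can be obtained programmatically (a sketch, reusing the spamLog model; logCoefs is just an illustrative name):
logCoefs <- summary(spamLog)$coefficients
sum(logCoefs[, "Pr(>|z|)"] < 0.05) #number of coefficients significant at the 0.05 level; 0 here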


###########################################

#PROBLEM 3.3 - BUILDING MACHINE LEARNING MODELS 

#How many of the word stems "enron", "hou", "vinc", and "kaminski" appear in the CART tree? Recall that we suspect these word stems are specific to Vincent Kaminski and might affect the generalizability of a spam filter built with his ham data.
prp(spamCART)

#Ans:2
#EXPLANATION:From prp(spamCART), we see that "vinc" and "enron" appear in the CART tree as the top two branches, but that "hou" and "kaminski" do not appear.
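#The same question can be answered programmatically (a sketch, reusing spamCART; rpart stores the split variable of every node in frame$var, with "<leaf>" marking terminal nodes):
cartVars <- setdiff(as.character(unique(spamCART$frame$var)), "<leaf>")
intersect(c("enron", "hou", "vinc", "kaminski"), cartVars) #should return "enron" and "vinc"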

####################################

#PROBLEM 3.4 - BUILDING MACHINE LEARNING MODELS 

#What is the training set accuracy of spamLog, using a threshold of 0.5 for predictions?

#We are interested in the overall accuracy of our model
#First we compute the confusion matrix
cmat_log<-table(train$spam, predTrainLog > 0.5)
cmat_log
##    
##     FALSE TRUE
##   0  3052    0
##   1     4  954
#let's now compute the overall accuracy
accu_log <- (cmat_log[1,1] + cmat_log[2,2])/sum(cmat_log)
accu_log #(3052+954)/nrow(train) = 0.9990025
## [1] 0.9990025
#Ans:0.9990025

###############################################

#PROBLEM 3.5 - BUILDING MACHINE LEARNING MODELS 

#What is the training set AUC of spamLog?

library(ROCR)

predictionTrainLog = prediction(predTrainLog, train$spam)
perf <- performance(predictionTrainLog, "tpr", "fpr")
as.numeric(performance(predictionTrainLog, "auc")@y.values)
## [1] 0.9999959
#Ans:0.9999959

#We then plot the ROC curve, with the option that color-codes the different cutoff thresholds.
plot(perf, colorize=TRUE) 

#The ROC curve hugs the top-left corner: the training-set fit is essentially perfect, which together with the very high AUC is another sign of overfitting

##################################

#PROBLEM 3.6 - BUILDING MACHINE LEARNING MODELS 

#What is the training set accuracy of spamCART, using a threshold of 0.5 for predictions? (Remember that if you used the type="class" argument when making predictions, you automatically used a threshold of 0.5. If you did not add in the type argument to the predict function, the probabilities are in the second column of the predict output.)

#We are interested in the overall accuracy of our model
#First we compute the confusion matrix
cmat_CART<-table(train$spam, predTrainCART > 0.5)
cmat_CART
##    
##     FALSE TRUE
##   0  2885  167
##   1    64  894
#let's now compute the overall accuracy
accu_CART <- (cmat_CART[1,1] + cmat_CART[2,2])/sum(cmat_CART)
accu_CART  #(2885+894)/nrow(train) = 0.942394
## [1] 0.942394
#Ans:0.942394

############################################

#PROBLEM 3.7 - BUILDING MACHINE LEARNING MODELS  

#What is the training set AUC of spamCART? (Remember that you have to pass the prediction function predicted probabilities, so don't include the type argument when making predictions for your CART model.)

library(ROCR)

predictionTrainCART = prediction(predTrainCART, train$spam)
perf <- performance(predictionTrainCART, "tpr", "fpr")
as.numeric(performance(predictionTrainCART, "auc")@y.values)
## [1] 0.9696044
#Ans:0.9696044

#We then plot the ROC curve, with the option that color-codes the different cutoff thresholds.
plot(perf, colorize=TRUE) 

#The ROC curve shows an excellent training-set fit, consistent with the very high AUC value

########################################

#PROBLEM 3.8 - BUILDING MACHINE LEARNING MODELS  

#What is the training set accuracy of spamRF, using a threshold of 0.5 for predictions? (Remember that your answer might not match ours exactly, due to random behavior in the random forest algorithm on different operating systems.)

#We are interested in the overall accuracy of our model
#First we compute the confusion matrix
cmat_RF<-table(train$spam, predTrainRF > 0.5)
cmat_RF
##    
##     FALSE TRUE
##   0  3013   39
##   1    44  914
#let's now compute the overall accuracy
accu_RF <- (cmat_RF[1,1] + cmat_RF[2,2])/sum(cmat_RF)
accu_RF #(3013+914)/nrow(train)=0.9793017
## [1] 0.9793017
#Ans:0.9793017

#####################################################

#PROBLEM 3.9 - BUILDING MACHINE LEARNING MODELS  

#What is the training set AUC of spamRF? (Remember to pass the argument type="prob" to the predict function to get predicted probabilities for a random forest model. The probabilities will be the second column of the output.)

library(ROCR)

# Make  predictions:
predictionTrainRF = prediction(predTrainRF, train$spam)
perf<-performance(predictionTrainRF, "tpr", "fpr")
as.numeric(performance(predictionTrainRF, "auc")@y.values)
## [1] 0.9979116
#Ans:0.9979116

#We then plot the ROC curve, with the option that color-codes the different cutoff thresholds.
plot(perf, colorize=TRUE) 

#The ROC curve again shows an excellent training-set fit, consistent with the very high AUC value

############################################

#PROBLEM 3.10 - BUILDING MACHINE LEARNING MODELS

#Which model had the best training set performance, in terms of accuracy and AUC?
#Ans: Logistic regression 
#EXPLANATION:In terms of both accuracy and AUC, logistic regression is nearly perfect and outperforms the other two models.

######################################

#PROBLEM 4.1 - EVALUATING ON THE TEST SET  

#Obtain predicted probabilities for the testing set for each of the models, again ensuring that probabilities instead of classes are obtained.

## Make out-of-sample predictions on the testing set:
predTestLog<- predict(spamLog, newdata=test, type="response")
predTestCART <- predict(spamCART, newdata=test)[,2]
predTestRF <- predict(spamRF, newdata=test, type="prob")[,2]

#What is the testing set accuracy of spamLog, using a threshold of 0.5 for predictions?

#Build a confusion matrix (with a threshold of 0.5) and compute the accuracy of the model.What is the accuracy?

# Out of sample confusion matrix with threshold of 0.5
cmat_log<-table(test$spam, predTestLog> 0.5)
cmat_log
##    
##     FALSE TRUE
##   0  1257   51
##   1    34  376
#let's now compute the overall accuracy
accu_log <- (cmat_log[1,1] + cmat_log[2,2])/sum(cmat_log)
accu_log #(1257+376)/nrow(test) = 0.9505239
## [1] 0.9505239
#Ans:0.9505239

#######################################

#PROBLEM 4.2 - EVALUATING ON THE TEST SET 

#What is the testing set AUC of spamLog?

library(ROCR)

predictionTestLog = prediction(predTestLog,test$spam)
perf <- performance(predictionTestLog, "tpr", "fpr")
as.numeric(performance(predictionTestLog, "auc")@y.values)
## [1] 0.9627517
#Ans:0.9627517

#We then plot the ROC curve, with the option that color-codes the different cutoff thresholds.
plot(perf, colorize=TRUE) 

#The ROC curve shows a good out-of-sample fit, consistent with the high AUC value

#############################

#PROBLEM 4.3 - EVALUATING ON THE TEST SET  

#What is the testing set accuracy of spamCART, using a threshold of 0.5 for predictions?

#We are interested in the overall accuracy of our model
#First we compute the out of sample confusion matrix with threshold of 0.5
cmat_CART<-table(test$spam, predTestCART > 0.5)
cmat_CART
##    
##     FALSE TRUE
##   0  1228   80
##   1    24  386
#let's now compute the overall accuracy
accu_CART <- (cmat_CART[1,1] + cmat_CART[2,2])/sum(cmat_CART)
accu_CART  #(1228+386)/nrow(test) =  0.9394645
## [1] 0.9394645
#Ans: 0.9394645

######################################

#PROBLEM 4.4 - EVALUATING ON THE TEST SET  

#What is the testing set AUC of spamCART?

library(ROCR)

predictionTestCART = prediction(predTestCART,test$spam)
perf <- performance(predictionTestCART, "tpr", "fpr")
as.numeric(performance(predictionTestCART, "auc")@y.values)
## [1] 0.963176
#Ans:0.963176

#We then plot the ROC curve, with the option that color-codes the different cutoff thresholds.
plot(perf, colorize=TRUE) 

#The ROC curve shows a good out-of-sample fit, consistent with the high AUC value

##########################################

#PROBLEM 4.5 - EVALUATING ON THE TEST SET  

#What is the testing set accuracy of spamRF, using a threshold of 0.5 for predictions?

#computing the confusion matrix with threshold of 0.5
cmat_RF<-table(test$spam, predTestRF > 0.5)
cmat_RF
##    
##     FALSE TRUE
##   0  1290   18
##   1    25  385
#let's now compute the overall accuracy
accu_RF <- (cmat_RF[1,1] + cmat_RF[2,2])/sum(cmat_RF)
accu_RF # (1290+385)/nrow(test)=0.9749709
## [1] 0.9749709
#Ans:0.9749709

########################################

#PROBLEM 4.6 - EVALUATING ON THE TEST SET

#What is the testing set AUC of spamRF?

library(ROCR)

predictionTestRF = prediction(predTestRF,test$spam)
perf <- performance(predictionTestRF, "tpr", "fpr")
as.numeric(performance(predictionTestRF, "auc")@y.values)
## [1] 0.9975656
#Ans:0.9975656

#We then plot the ROC curve, with the option that color-codes the different cutoff thresholds.
plot(perf, colorize=TRUE) 

#The ROC curve shows an excellent out-of-sample fit, consistent with the very high AUC value

#########################################

#PROBLEM 4.7 - EVALUATING ON THE TEST SET  

#Which model had the best testing set performance, in terms of accuracy and AUC?
#Ans:Random forest 
#EXPLANATION:The random forest outperformed logistic regression and CART in both measures, obtaining an impressive AUC of 0.997 on the test set.

################################################

#PROBLEM 4.8 - EVALUATING ON THE TEST SET  

#Which model demonstrated the greatest degree of overfitting?
#Ans: Logistic regression
#EXPLANATION:Both CART and random forest had very similar accuracies on the training and testing sets. However, logistic regression obtained nearly perfect accuracy and AUC on the training set and had far-from-perfect performance on the testing set. This is an indicator of overfitting.
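#A compact side-by-side recap of the training and testing performance reported above (a sketch; the numbers are simply the values computed earlier in this script, collected into one data frame):
perfSummary <- data.frame(
  model    = c("Logistic regression", "CART", "Random forest"),
  trainAcc = c(0.9990025, 0.9423940, 0.9793017),
  trainAUC = c(0.9999959, 0.9696044, 0.9979116),
  testAcc  = c(0.9505239, 0.9394645, 0.9749709),
  testAUC  = c(0.9627517, 0.9631760, 0.9975656)
)
perfSummary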

SEPARATING SPAM FROM HAM (PART 2 - OPTIONAL)

PROBLEM 5.1 - ASSIGNING WEIGHTS TO DIFFERENT TYPES OF ERRORS
Thus far, we have used a threshold of 0.5 as the cutoff for predicting that an email message is spam, and we have used accuracy as one of our measures of model quality. As we have previously learned, these are good choices when we have no preference for different types of errors (false positives vs. false negatives), but other choices might be better if we assign a higher cost to one type of error.

Consider the case of an email provider using the spam filter we have developed. The email provider moves all of the emails flagged as spam to a separate “Junk Email” folder, meaning those emails are not displayed in the main inbox. The emails not flagged as spam by the algorithm are displayed in the inbox. Many of this provider’s email users never check the spam folder, so they will never see emails delivered there.

Q:In this scenario, what is the cost associated with the model making a false negative error?

Ans:A spam email will be displayed in the main inbox, a nuisance for the email user

EXPLANATION:A false negative means the model labels a spam email as ham. This results in a spam email being displayed in the main inbox.

Q:In this scenario, what is the cost associated with our model making a false positive error?

Ans:A ham email will be sent to the Junk Email folder, potentially resulting in the email user never seeing that message

EXPLANATION:A false positive means the model labels a ham email as spam. This results in a ham email being sent to the Junk Email folder.

PROBLEM 5.2 - ASSIGNING WEIGHTS TO DIFFERENT TYPES OF ERRORS

Q:Which sort of mistake is more costly (less desirable), assuming that the user will never check the Junk Email folder?

Ans:False positive

EXPLANATION:A false negative is largely a nuisance (the user will need to delete the unsolicited email). However a false positive can be very costly, since the user might completely miss an important email due to it being delivered to the spam folder. Therefore, the false positive is more costly.

PROBLEM 5.3 - ASSIGNING WEIGHTS TO DIFFERENT TYPES OF ERRORS

Q:What sort of user might assign a particularly high cost to a false negative result?

Ans:A user who is particularly annoyed by spam email reaching their main inbox

EXPLANATION:A false negative results in spam reaching a user’s main inbox, which is a nuisance. A user who is particularly annoyed by such spam would assign a particularly high cost to a false negative.

PROBLEM 5.4 - ASSIGNING WEIGHTS TO DIFFERENT TYPES OF ERRORS

Q:What sort of user might assign a particularly high cost to a false positive result?

Ans:A user who never checks his/her Junk Email folder

EXPLANATION:A false positive results in ham being sent to a user’s Junk Email folder. While the user might catch the mistake upon checking the Junk Email folder, users who never check this folder will miss the email, incurring a particularly high cost.

PROBLEM 5.5 - ASSIGNING WEIGHTS TO DIFFERENT TYPES OF ERRORS

Q:Consider another use case for the spam filter, in which messages labeled as spam are still delivered to the main inbox but are flagged as “potential spam.” Therefore, there is no risk of the email user missing an email regardless of whether it is flagged as spam. What is the largest way in which this change in spam filter design affects the costs of false negative and false positive results?

Ans:The cost of false positive results is decreased

EXPLANATION:While before many users would completely miss a ham email labeled as spam (false positive), now users will not miss an email after this sort of mistake. As a result, the cost of a false positive has been decreased.

PROBLEM 5.6 - ASSIGNING WEIGHTS TO DIFFERENT TYPES OF ERRORS

Q:Consider a large-scale email provider with more than 100,000 customers. Which of the following represents an approach for approximating each customer’s preferences between a false positive and false negative that is both practical and personalized?

Ans:Automatically collect information about how often each user accesses his/her Junk Email folder to infer preferences

EXPLANATION:While using expert opinion is practical, it is not personalized (we would use the same cost for all users). Likewise, a random sample of user preferences doesn’t enable personalized costs for each user.

While a survey of all users would enable personalization, it is impractical to obtain survey results from all or most of the users.

While it’s impractical to survey all users, it is easy to automatically collect their usage patterns. This could enable us to select higher prediction thresholds (so fewer messages are flagged as spam) for users who rarely check their Junk Email folder, and lower thresholds for users who check the folder regularly.
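As a small, ungraded illustration of this idea, the sketch below reuses the random forest test-set predictions from Part 1 (predTestRF and test) and counts both error types at a few cutoffs; raising the cutoff reduces false positives at the cost of more false negatives.

for (cutoff in c(0.3, 0.5, 0.7, 0.9)) {
  flaggedSpam <- predTestRF > cutoff
  fp <- sum(flaggedSpam & test$spam == 0)   # ham wrongly sent to the Junk Email folder
  fn <- sum(!flaggedSpam & test$spam == 1)  # spam left in the main inbox
  cat("cutoff =", cutoff, "false positives =", fp, "false negatives =", fn, "\n")
}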

PROBLEM 6.1 - INTEGRATING WORD COUNT INFORMATION

While we have thus far mostly dealt with frequencies of specific words in our analysis, we can extract other information from text. The last two sections of this problem will deal with two other types of information we can extract.

First, we will use the number of words in each email as an independent variable. We can use the original document term matrix called dtm for this task. The document term matrix has documents (in this case, emails) as its rows, terms (in this case word stems) as its columns, and frequencies as its values. As a result, the sum of all the elements in a row of the document term matrix is equal to the number of terms present in the document corresponding to the row. Obtain the word counts for each email with the command:

wordCount = rowSums(as.matrix(dtm))

If the command above causes an error or runs out of memory on your machine (as.matrix() expands the full 5728 x 28687 matrix into dense form), the slam package can compute the same row sums directly on the sparse representation:

library(slam)

wordCount = rollup(dtm, 2, FUN=sum)$v

When you have successfully created wordCount, answer the following question.

Q:What would have occurred if we had instead created wordCount using spdtm instead of dtm?

Ans:wordCount would have only counted some of the words, but would have returned a result for all the emails

EXPLANATION:spdtm has had sparse terms removed, which means we have removed some of the columns but none of the rows from dtm. This means rowSums will still return a sum for each row (one for each email), but it will not have counted the frequencies of any uncommon words in the dataset. As a result, wordCount will only count some of the words.
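To see this concretely, the short sketch below (reusing dtm, spdtm and the wordCount vector from above; wordCountSparse is just an illustrative name) recomputes the counts from spdtm. It still returns one value per email, but the totals confirm that only some of the word occurrences are counted.

wordCountSparse = rowSums(as.matrix(spdtm))

length(wordCountSparse)   # still 5728: one value per email
sum(wordCountSparse)      # total occurrences of the 330 frequent stems only
sum(wordCount)            # total occurrences over all 28687 stems, a larger number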

PROBLEM 6.2 - INTEGRATING WORD COUNT INFORMATION

Use the hist() function to plot the distribution of wordCount in the dataset. What best describes the distribution of the data?

hist(wordCount)

Ans: The data is skew right – there are a large number of small wordCount values and a small number of large values.

EXPLANATION:From hist(wordCount), nearly all the observations are in the very left of the graph, representing small values. Therefore, this distribution is skew right.

PROBLEM 6.3 - INTEGRATING WORD COUNT INFORMATION

Now, use the hist() function to plot the distribution of log(wordCount) in the dataset. What best describes the distribution of the data?

hist(log(wordCount))

Ans:The data is not skewed – there are roughly the same number of unusually large and unusually small log(wordCount) values.

EXPLANATION:From hist(log(wordCount)), the frequencies are quite balanced, suggesting log(wordCount) is not skewed.

PROBLEM 6.4 - INTEGRATING WORD COUNT INFORMATION

Create a variable called logWordCount in emailsSparse that is equal to log(wordCount). Use the boxplot() command to plot logWordCount against whether a message is spam. Which of the following best describes the box plot?

emailsSparse$logWordCount<-log(wordCount)
boxplot(emailsSparse$logWordCount~emailsSparse$spam)

Ans: logWordCount is slightly smaller in spam messages than in ham messages

EXPLANATION:We can see that the 1st quartile, median, and 3rd quartiles are all slightly lower for spam messages than for ham messages.
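The same comparison can be read off numerically with a grouped summary (a small sketch, assuming the logWordCount variable created above):

tapply(emailsSparse$logWordCount, emailsSparse$spam, summary)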

PROBLEM 6.5 - INTEGRATING WORD COUNT INFORMATION

Because logWordCount differs between spam and ham messages, we hypothesize that it might be useful in predicting whether an email is spam. Take the following steps:

#1) Use the same sample.split output you obtained earlier (do not re-run sample.split) to split emailsSparse into a training and testing set, which you should call train2 and test2.

train2 = subset(emailsSparse, spl == TRUE)
test2 = subset(emailsSparse, spl == FALSE)

#2) Use train2 to train a CART tree with the default parameters, saving the model to the variable spam2CART.

library(rpart)
library(rpart.plot)

spam2CART = rpart(spam~., data=train2, method="class")

#Plotting the CART model
prp(spam2CART)

#3) Use train2 to train a random forest with the default parameters, saving the model to the variable spam2RF. Again, set the random seed to 123 directly before training spam2RF.

set.seed(123)
spam2RF = randomForest(spam~., data=train2)

#Was the new variable used in the new CART tree spam2CART?
#Ans:Yes 
#EXPLANATION:From prp(spam2CART), we see that the logWordCount was integrated into the tree (it might only display as "logWord", because prp shortens some of the variable names when it outputs them).

#####################################

#PROBLEM 6.6 - INTEGRATING WORD COUNT INFORMATION  

#Perform test-set predictions using the new CART and random forest models.

predTest2CART = predict(spam2CART, newdata=test2)[,2]
predTest2RF = predict(spam2RF, newdata=test2, type="prob")[,2]
#What is the test-set accuracy of spam2CART, using threshold 0.5 for predicting an email is spam?

#Now let's assess the accuracy of the model through the confusion matrix
cmat_CART<-table(test2$spam, predTest2CART > 0.5)  #first arg is the true outcomes and the second is the predicted outcomes
cmat_CART
##    
##     FALSE TRUE
##   0  1214   94
##   1    26  384
#let's now compute the overall accuracy
accu_CART <- (cmat_CART[1,1] + cmat_CART[2,2])/sum(cmat_CART)
accu_CART  #  (1214+384)/nrow(test2)= 0.9301513
## [1] 0.9301513
#Ans:0.9301513

#######################################

#PROBLEM 6.7 - INTEGRATING WORD COUNT INFORMATION  

#What is the test-set AUC of spam2CART?

library(ROCR)

predictionTest2CART= prediction(predTest2CART, test2$spam)
perf <- performance(predictionTest2CART, "tpr", "fpr")
as.numeric(performance(predictionTest2CART, "auc")@y.values)
## [1] 0.9582438
#Ans:0.9582438

#We then plot the ROC curve, with the option that color-codes the different cutoff thresholds.
plot(perf, colorize=TRUE) 

######################################

#PROBLEM 6.8 - INTEGRATING WORD COUNT INFORMATION 

#What is the test-set accuracy of spam2RF, using a threshold of 0.5 for predicting if an email is spam? (Remember that you might get a different accuracy than us even if you set the seed, due to the random behavior of randomForest on some operating systems.)

#We are interested in the out-of-sample (test set) accuracy of our model
#First we compute the confusion matrix
cmat_RF<-table(test2$spam, predTest2RF > 0.5)
cmat_RF
##    
##     FALSE TRUE
##   0  1298   10
##   1    28  382
#let's now compute the overall accuracy
accu_RF <- (cmat_RF[1,1] + cmat_RF[2,2])/sum(cmat_RF)
accu_RF #(1298+382)/nrow(test2)=0.9778813
## [1] 0.9778813
#Ans:0.9778813

################################

#PROBLEM 6.9 - INTEGRATING WORD COUNT INFORMATION 

#What is the test-set AUC of spam2RF? (Remember that you might get a different AUC than us even if you set the seed when building your model, due to the random behavior of randomForest on some operating systems.)

library(ROCR)

predictionTest2RF = prediction(predTest2RF, test2$spam)
perf<-performance(predictionTest2RF, "tpr", "fpr")
as.numeric(performance(predictionTest2RF, "auc")@y.values)
#Ans: approximately 0.998 (the exact value may differ slightly across operating systems because of the random behavior of randomForest)

#We then plot the ROC curve, with the option that color-codes the different cutoff thresholds.
plot(perf, colorize=TRUE) 

#In this case, adding the logWordCount variable did not improve the test-set results for either the CART model or the random forest model.

USING N-GRAMS

Another source of information that might be extracted from text is the frequency of various n-grams. An n-gram is a sequence of n consecutive words in the document. For instance, for the document “Text analytics rocks!”, which we would preprocess to “text analyt rock”, the 1-grams are “text”, “analyt”, and “rock”, the 2-grams are “text analyt” and “analyt rock”, and the only 3-gram is “text analyt rock”. n-grams are order-specific, meaning the 2-grams “text analyt” and “analyt text” are considered two separate n-grams. We can see that so far our analysis has been extracting only 1-grams.

We do not have exercises in this class covering n-grams, but if you are interested in learning more, the “RTextTools”, “tau”, “RWeka”, and “textcat” packages in R are all good resources.
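As a tiny illustration of the idea in base R (not using any of the packages above), the 2-grams of the preprocessed document “text analyt rock” can be built by pasting each word to the word that follows it:

words = strsplit("text analyt rock", " ")[[1]]
paste(head(words, -1), tail(words, -1))   # "text analyt" and "analyt rock"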

sessionInfo()
## R version 3.3.0 (2016-05-03)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 10586)
## 
## locale:
## [1] C
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] slam_0.1-34         ROCR_1.0-7          gplots_3.0.1       
##  [4] randomForest_4.6-12 rpart.plot_1.5.3    rpart_4.1-10       
##  [7] caTools_1.17.1      SnowballC_0.5.1     tm_0.6-2           
## [10] NLP_0.1-9           DataComputing_0.8.3 curl_0.9.7         
## [13] base64enc_0.1-3     manipulate_1.0.1    mosaic_0.13.0      
## [16] mosaicData_0.13.0   car_2.1-2           lattice_0.20-33    
## [19] knitr_1.13          stringr_1.0.0       tidyr_0.4.1        
## [22] lubridate_1.5.6     dplyr_0.4.3         ggplot2_2.1.0      
## 
## loaded via a namespace (and not attached):
##  [1] gtools_3.5.0       reshape2_1.4.1     splines_3.3.0     
##  [4] colorspace_1.2-6   htmltools_0.3.5    yaml_2.1.13       
##  [7] mgcv_1.8-12        nloptr_1.0.4       DBI_0.4-1         
## [10] plyr_1.8.4         MatrixModels_0.4-1 munsell_0.4.3     
## [13] gtable_0.2.0       codetools_0.2-14   evaluate_0.9      
## [16] SparseM_1.7        quantreg_5.26      pbkrtest_0.4-6    
## [19] parallel_3.3.0     Rcpp_0.12.5        KernSmooth_2.23-15
## [22] scales_0.4.0       formatR_1.4        gdata_2.17.0      
## [25] mime_0.4           lme4_1.1-12        gridExtra_2.2.1   
## [28] digest_0.6.9       stringi_1.1.1      grid_3.3.0        
## [31] tools_3.3.0        bitops_1.0-6       magrittr_1.5      
## [34] lazyeval_0.1.10    ggdendro_0.1-20    MASS_7.3-45       
## [37] Matrix_1.2-6       assertthat_0.1     minqa_1.2.4       
## [40] rmarkdown_0.9.6    R6_2.1.2           nnet_7.3-12       
## [43] nlme_3.1-128

#############################That’s All folks….Phew!##########################