[ source files available on GitHub ]
Libraries needed for data processing and plotting:
library("tm")
library("SnowballC")
library("caTools")
library("rpart")
library("rpart.plot")
library("randomForest")
# May need to execute this
Sys.setlocale("LC_ALL", "C")
We will be trying to understand the sentiment of tweets about the company Apple.
While Apple has a large number of fans, it also has a large number of people who don't like its products, as well as several competitors.
To better understand public perception, Apple wants to monitor how people feel over time and how people receive new announcements.
Our challenge in this lecture is to see if we can correctly classify tweets as being negative, positive, or neither about Apple.
To collect the data needed for this task, we had to perform two steps.
The first was to collect data about tweets from the internet.
Twitter data is publicly available, and it can be collected either by scraping the website or via the Twitter API.
The sender of the tweet might be useful to predict sentiment, but we will ignore it to keep our data anonymized.
So we will just be using the text of the tweet.
Then we need to construct the outcome variable for these tweets, which means that we have to label them as positive, negative, or neutral sentiment.
We would like to label thousands of tweets, and we know that two people might disagree over the correct classification of a tweet. To do this efficiently, one option is to use the Amazon Mechanical Turk.
The task that we put on the Amazon Mechanical Turk was to judge the sentiment expressed by the following item toward the software company Apple.
The items we gave them were tweets that we had collected, and the workers could pick one of five responses: strongly negative, negative, neutral, positive, or strongly positive sentiment.
These outcomes were represented as numbers on a scale from -2 (strongly negative) to 2 (strongly positive).
Each tweet was labeled by five workers. For each tweet, we take the average of the five scores given by the five workers, hence the final scores can range from -2 to 2 in increments of 0.2.
The following graph shows the distribution of the number of tweets classified into each of the categories. We can see here that the majority of tweets were classified as neutral, with a small number classified as strongly negative or strongly positive.
So now we have a bunch of tweets that are labeled with their sentiment. But how do we build independent variables from the text of a tweet to be used to predict the sentiment?
One of the most commonly used techniques for transforming text into independent variables is called Bag of Words.
Fully understanding text is difficult, but Bag of Words provides a very simple approach: it just counts the number of times each word appears in the text and uses these counts as the independent variables.
For example, in the sentence,
"This course is great. I would recommend this course to my friends,"
the word this is seen twice, the word course is seen twice, the word great is seen once, et cetera.
In Bag of Words, there is one feature for each word. This is a very simple approach, but it is often very effective, too. It is used as a baseline in text analytics projects and in Natural Language Processing.
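As a quick illustration (a minimal sketch, not part of the lecture code), we can reproduce these counts for the example sentence using base R alone:
# Strip punctuation, lower-case the text, split it on spaces, and count the words
sentence <- "This course is great. I would recommend this course to my friends"
words <- strsplit(tolower(gsub("[[:punct:]]", "", sentence)), " ")[[1]]
table(words)
# "this" and "course" each get a count of 2; every other word gets a count of 1.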
This is not the whole story, though. Preprocessing the text can dramatically improve the performance of the Bag of Words method.
One part of preprocessing the text is to clean up irregularities.
Text data often has many inconsistencies that will cause algorithms trouble; computers are very literal by default.
One common irregularity concerns the case of the letters, and it is customary to change all words to either lower-case or upper-case.
Punctuation also causes problems, and the basic approach is to remove everything that is not a letter. However some punctuation is meaningful, and therefore the removal of punctuation should be tailored to the specific problem.
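For instance, with Twitter data we might want to keep the @ and # characters that mark handles and hashtags. A hypothetical sketch of such a tailored clean-up (the helper name keep_handles is ours; the pipeline used below simply removes all punctuation):
# Remove everything that is not a letter, digit, space, @ or #
keep_handles <- function(x) gsub("[^[:alnum:]@#[:space:]]", "", x)
keep_handles("LOVE U @APPLE!! #ThanxApple")
## [1] "LOVE U @APPLE #ThanxApple"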
There are also unhelpful terms. Many words are used very frequently but are only meaningful within a sentence, such as "the", "is", "at", and "which". These words, called stop words, are unlikely to improve the prediction model, and removing them also reduces the size of the data.
Stemming: This step is motivated by the desire to represent words with different endings as the same word. We probably do not need to draw a distinction between argue, argued, argues, and arguing. They could all be represented by a common stem, argu. The algorithmic process of performing this reduction is called stemming.
There are several ways to approach this problem. One is to build a database of words and their stems; another is to write a rule-based algorithm that strips common suffixes.
This second approach is widely popular and is called the Porter Stemmer, designed by Martin Porter in 1980, and it’s still used today.
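The SnowballC package loaded at the top exposes the Porter stemmer directly, so we can check the argue example ourselves (a quick aside, not part of the main pipeline):
wordStem(c("argue", "argued", "argues", "arguing"))
## [1] "argu" "argu" "argu" "argu"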
tweets <- read.csv("data/tweets.csv", stringsAsFactors = FALSE)
Note: when working on a text analytics problem it is important (necessary!) to add the extra argument stringsAsFactors = FALSE, so that the text is read in properly.
Let’s take a look at the structure of our data:
str(tweets)
## 'data.frame': 1181 obs. of 2 variables:
## $ Tweet: chr "I have to say, Apple has by far the best customer care service I have ever received! @Apple @AppStore" "iOS 7 is so fricking smooth & beautiful!! #ThanxApple @Apple" "LOVE U @APPLE" "Thank you @apple, loving my new iPhone 5S!!!!! #apple #iphone5S pic.twitter.com/XmHJCU4pcb" ...
## $ Avg : num 2 2 1.8 1.8 1.8 1.8 1.8 1.6 1.6 1.6 ...
We have 1181 observations of 2 variables: Tweet, the text of the tweet, and Avg, the average sentiment score.
The tweet texts are real tweets gathered on the internet and directed at Apple, with a few words cleaned up.
We are more interested in being able to detect the tweets with clear negative sentiment, so let’s define a new variable in our data set called Negative.
tweets$Negative <- as.factor(tweets$Avg <= -1)
table(tweets$Negative)
##
## FALSE TRUE
## 999 182
One of the fundamental concepts in text analysis, implemented in the tm package as well, is that of a corpus.
A corpus is a collection of documents.
We will need to convert our tweets to a corpus for pre-processing. Various functions in the tm package can be used to create a corpus in many different ways.
We will create it from the Tweet column of our data frame using two functions, Corpus() and VectorSource(): we feed the Tweet variable of the tweets data frame to VectorSource(), and pass the result to Corpus().
corpus <- Corpus(VectorSource(tweets$Tweet))
Let’s take a look at corpus:
corpus
## <<VCorpus (documents: 1181, metadata (corpus/indexed): 0/0)>>
We can check that the documents match our tweets by using double brackets [[.
To inspect the first (or 10th) tweet in our corpus, we select the first (or 10th) element as:
corpus[[1]]
## <<PlainTextDocument (metadata: 7)>>
## I have to say, Apple has by far the best customer care service I have ever received! @Apple @AppStore
corpus[[10]]
## <<PlainTextDocument (metadata: 7)>>
## Just checked out the specs on the new iOS 7...wow is all I have to say! I can't wait to get the new update ?? Bravo @Apple
Pre-processing is easy in tm.
Each operation, like stemming or removing stop words, can be done with one line in R, using the tm_map() function, which takes as arguments the corpus we want to modify and the transformation we want to apply.
To transform all text to lower case:
corpus <- tm_map(corpus, tolower)
Checking the same two “documents” as before:
corpus[[1]]
## [1] "i have to say, apple has by far the best customer care service i have ever received! @apple @appstore"
corpus[[10]]
## [1] "just checked out the specs on the new ios 7...wow is all i have to say! i can't wait to get the new update ?? bravo @apple"
# Convert the documents back to PlainTextDocument (needed after applying a base R
# function such as tolower() to the corpus with recent versions of tm)
corpus <- tm_map(corpus, PlainTextDocument)
# Remove punctuation
corpus <- tm_map(corpus, removePunctuation)
Check the first document:
corpus[[1]]
## <<PlainTextDocument (metadata: 7)>>
## i have to say apple has by far the best customer care service i have ever received apple appstore
corpus[[10]]
## <<PlainTextDocument (metadata: 7)>>
## just checked out the specs on the new ios 7wow is all i have to say i cant wait to get the new update bravo apple
Next we want to remove the stop words in our tweets.
It is necessary to define a list of words that we regard as being stop words, and for this the tm package provides a default list for the English language. We can check it out with:
stopwords("english")[1:10]
## [1] "i" "me" "my" "myself" "we" "our" "ours" "ourselves"
## [9] "you" "your"
Removing words can be done by passing removeWords to the tm_map() function, together with an extra argument specifying which words we want to remove.
We will remove all of these English stop words, but we will also remove the word “apple” since all of these tweets have the word “apple” and it probably won’t be very useful in our prediction problem.
corpus <- tm_map(corpus, removeWords, c("apple", stopwords("english")))
corpus[[1]]
## <<PlainTextDocument (metadata: 7)>>
## say far best customer care service ever received appstore
corpus[[10]]
## <<PlainTextDocument (metadata: 7)>>
## just checked specs new ios 7wow say cant wait get new update bravo
Lastly, we want to stem our document with the stemDocument argument.
corpus <- tm_map(corpus, stemDocument)
corpus[[1]]
## <<PlainTextDocument (metadata: 7)>>
## say far best custom care servic ever receiv appstor
corpus[[10]]
## <<PlainTextDocument (metadata: 7)>>
## just check spec new io 7wow say cant wait get new updat bravo
We can see that this took off the endings of "customer," "service," "received," and "appstore."
We are now ready to extract the word frequencies to be used in our prediction problem. The tm package provides a function called DocumentTermMatrix() that generates a matrix where the rows correspond to documents (in our case tweets), the columns correspond to the words appearing in those tweets, and the values are the number of times each word appears in each document.
DTM <- DocumentTermMatrix(corpus)
DTM
## <<DocumentTermMatrix (documents: 1181, terms: 3289)>>
## Non-/sparse entries: 8980/3875329
## Sparsity : 100%
## Maximal term length: 115
## Weighting : term frequency (tf)
We see that in the corpus there are 3289 unique words.
Let’s see what this matrix looks like using the inspect() function, in particular slicing a block of rows and columns from the Document Term Matrix by their indices:
inspect(DTM[1000:1005, 505:515])
## <<DocumentTermMatrix (documents: 6, terms: 11)>>
## Non-/sparse entries: 1/65
## Sparsity : 98%
## Maximal term length: 9
## Weighting : term frequency (tf)
##
## Terms
## Docs cheapen cheaper check cheep cheer cheerio cherylcol chief chiiiiqu child children
## character(0) 0 0 0 0 0 0 0 0 0 0 0
## character(0) 0 0 0 0 0 0 0 0 0 0 0
## character(0) 0 0 0 0 0 0 0 0 0 0 0
## character(0) 0 0 0 0 0 0 0 0 0 0 0
## character(0) 0 0 0 0 0 0 0 0 0 0 0
## character(0) 0 0 0 0 1 0 0 0 0 0 0
In this range we see that the word “cheer” appears in tweet 1005, but “cheap” does not appear in any of these tweets. This data is what we call sparse, meaning that there are many zeros in our matrix.
We can look at the most frequent terms with the function findFreqTerms(), selecting a minimum of 20 occurrences over the whole corpus:
frequent_ge_20 <- findFreqTerms(DTM, lowfreq = 20)
frequent_ge_20
## [1] "android" "anyon" "app" "appl"
## [5] "back" "batteri" "better" "buy"
## [9] "can" "cant" "come" "dont"
## [13] "fingerprint" "freak" "get" "googl"
## [17] "ios7" "ipad" "iphon" "iphone5"
## [21] "iphone5c" "ipod" "ipodplayerpromo" "itun"
## [25] "just" "like" "lol" "look"
## [29] "love" "make" "market" "microsoft"
## [33] "need" "new" "now" "one"
## [37] "phone" "pleas" "promo" "promoipodplayerpromo"
## [41] "realli" "releas" "samsung" "say"
## [45] "store" "thank" "think" "time"
## [49] "twitter" "updat" "use" "via"
## [53] "want" "well" "will" "work"
Out of the 3289 words in our matrix, only 56 words appear at least 20 times in our tweets.
This means that we probably have a lot of terms that will be pretty useless for our prediction model. The number of terms is an issue for two main reasons: computation (more terms means more independent variables, which usually means it takes longer to build the models) and generalization (the more independent variables we have relative to observations, the more likely the model is to overfit rather than generalize).
Therefore let’s remove some terms that don’t appear very often.
sparse_DTM <- removeSparseTerms(DTM, 0.995)
This function takes a second parameter, the sparsity threshold. A threshold of 0.995 means that, roughly, we only keep terms that appear in at least 0.5% of the tweets, i.e. in about six or more tweets here.
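As a rough sanity check (a sketch of ours, not from the lecture), we can count by hand how many terms appear in more than 0.5% of the tweets; the result should be close to the number of terms kept below:
doc_freq <- colSums(as.matrix(DTM) > 0)   # number of tweets each term appears in
sum(doc_freq / nrow(DTM) > 1 - 0.995)     # terms appearing in more than 0.5% of the tweets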
Let’s see what the new Document Term Matrix properties look like:
sparse_DTM
## <<DocumentTermMatrix (documents: 1181, terms: 309)>>
## Non-/sparse entries: 4669/360260
## Sparsity : 99%
## Maximal term length: 20
## Weighting : term frequency (tf)
It only contains 309 unique terms, i.e. only about 9.4% of the full set.
Now let’s convert the sparse matrix into a data frame that we will be able to use for our predictive models.
tweetsSparse <- as.data.frame(as.matrix(sparse_DTM))
Since R struggles with variable names that start with a number, and some of our words start with a number, we run the make.names() function to convert all the column names into valid R variable names before we build our predictive models. You should do this each time you build a data frame from text analytics.
To make all variable names R-friendly use:
colnames(tweetsSparse) <- make.names(colnames(tweetsSparse))
We should now add our dependent variable back to this data frame. We’ll call it Negative and set it equal to the original Negative variable from the tweets data frame.
tweetsSparse$Negative <- tweets$Negative
Lastly, let’s split our data into a training set and a testing set, putting 70% of the data in the training set.
set.seed(123)
split <- sample.split(tweetsSparse$Negative, SplitRatio = 0.7)
trainSparse <- subset(tweetsSparse, split == TRUE)
testSparse <- subset(tweetsSparse, split == FALSE)
Let’s first use CART to build a predictive model, using the rpart() function to predict Negative using all of the other variables as our independent variables and the data set trainSparse.
We’ll add one more argument here, which is method = "class" so that the rpart() function knows to build a classification model. We keep default settings for all other parameters, in particular we are not adding anything for minbucket or cp.
tweetCART <- rpart(Negative ~ . , data = trainSparse, method = "class")
prp(tweetCART)
The tree splits on the presence of three words. If the first word is in the tweet, predict TRUE, or negative sentiment; otherwise, if the second word is present, again predict TRUE; otherwise, if the third word is present, predict TRUE as well. If none of the three words are in the tweet, predict FALSE, or non-negative sentiment. This tree makes sense intuitively, since these three words are generally seen as negative words.
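Since the words on the plotted tree are hard to reproduce in text, a quick way to list the words the fitted tree actually splits on (a small sketch, not part of the lecture code) is to look at the frame of the rpart object:
# Split variables used by the tree, excluding the "<leaf>" placeholder rows
setdiff(unique(as.character(tweetCART$frame$var)), "<leaf>")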
Using the predict() function we compute the predictions of our model tweetCART on the new data set testSparse. Be careful to add the argument type = "class" to make sure we get class predictions.
predictCART <- predict(tweetCART, newdata = testSparse, type = "class")
And from the predictions we can compute the confusion matrix:
cmat_CART <- table(testSparse$Negative, predictCART)
cmat_CART
## predictCART
## FALSE TRUE
## FALSE 294 6
## TRUE 37 18
accu_CART <- (cmat_CART[1,1] + cmat_CART[2,2])/sum(cmat_CART)
The overall accuracy of this CART model is then (294 + 18)/355 ≈ 0.8789.
Let’s compare this to a simple baseline model that always predicts non-negative (i.e. the most common value of the dependent variable).
To compute the accuracy of the baseline model, let’s make a table of just the outcome variable Negative.
cmat_baseline <- table(testSparse$Negative)
cmat_baseline
##
## FALSE TRUE
## 300 55
accu_baseline <- max(cmat_baseline)/sum(cmat_baseline)
The accuracy of the baseline model is then 0.8451.
So the CART model does better than the simple baseline model.
How well would a Random Forest model do?
We use the randomForest() function to predict Negative again using all of our other variables as independent variables and the data set trainSparse. Again we use the default parameter settings:
set.seed(123)
tweetRF <- randomForest(Negative ~ . , data = trainSparse)
tweetRF
##
## Call:
## randomForest(formula = Negative ~ ., data = trainSparse)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 17
##
## OOB estimate of error rate: 11.26%
## Confusion matrix:
## FALSE TRUE class.error
## FALSE 685 14 0.02002861
## TRUE 79 48 0.62204724
And then compute the Out-of-Sample predictions:
predictRF <- predict(tweetRF, newdata = testSparse)
and compute the confusion matrix:
cmat_RF <- table(testSparse$Negative, predictRF)
cmat_RF
## predictRF
## FALSE TRUE
## FALSE 293 7
## TRUE 34 21
accu_RF <- (cmat_RF[1,1] + cmat_RF[2,2])/sum(cmat_RF)
The overall accuracy of this Random Forest model is 0.8845.
This is a little better than the CART model, but given the interpretability of the CART model, the latter would probably be preferred over the random forest model.
If you were to use cross-validation to pick the cp parameter for the CART model, the accuracy would increase to about the same as the random forest model.
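A minimal sketch of that cross-validation step, assuming the caret and e1071 packages are installed (they are not among the libraries loaded above):
library(caret)
library(e1071)
set.seed(123)
numFolds <- trainControl(method = "cv", number = 10)       # 10-fold cross-validation
cpGrid <- expand.grid(cp = seq(0.002, 0.1, by = 0.002))    # candidate cp values
train(Negative ~ . , data = trainSparse, method = "rpart", trControl = numFolds, tuneGrid = cpGrid)
# The cp value selected by train() can then be passed to rpart(..., cp = <best value>)
# and the resulting tree evaluated on testSparse as before.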
So by using a bag-of-words approach and these models, we can reasonably predict sentiment even with a relatively small data set of tweets.
Finally, let’s also try a logistic regression model, using all independent variables as predictors:
tweetLog <- glm(Negative ~ . , data = trainSparse, family = "binomial")
# summary(tweetLog)
Prediction on the testing set:
tweetLog_predict_test <- predict(tweetLog, type = "response", newdata = testSparse)
Confusion matrix:
cmat_logRegr <- table(testSparse$Negative, tweetLog_predict_test > 0.5)
cmat_logRegr
##
## FALSE TRUE
## FALSE 257 43
## TRUE 21 34
accu_logRegr <- (cmat_logRegr[1,1] + cmat_logRegr[2,2])/sum(cmat_logRegr)
The overall accuracy of this logistic regression model is 0.8197, which is worse than the baseline (?!).
If you were to compute the accuracy on the training set instead, you would see that the model does really well on the training set.
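A quick sketch of that check, reusing the objects defined above:
tweetLog_predict_train <- predict(tweetLog, type = "response")   # fitted values on trainSparse
cmat_train <- table(trainSparse$Negative, tweetLog_predict_train > 0.5)
sum(diag(cmat_train)) / sum(cmat_train)                          # training-set accuracy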
This is an example of over-fitting. The model fits the training set really well, but does not perform well on the test set. A logistic regression model with a large number of variables is particularly at risk for overfitting.