[ source files available on GitHub ]

PRELIMINARIES

Libraries needed for data processing and plotting:

library("tm")
library("SnowballC")

library("caTools")
library("rpart")
library("rpart.plot")
library("randomForest")
# May need to execute this on some systems to avoid locale-related issues in the text transformations below
Sys.setlocale("LC_ALL", "C")

INTRODUCTION

We will be trying to understand the sentiment of tweets about the company Apple.

While Apple has a large number of fans, it also has many people who dislike its products, as well as several competitors.
To better understand public perception, Apple wants to monitor how people feel over time and how people receive new announcements.

Our challenge in this lecture is to see if we can correctly classify tweets as being negative, positive, or neither about Apple.

The Data

To collect the data needed for this task, we had to perform two steps.

Collect Twitter data

The first was to collect data about tweets from the internet.
Twitter data is publicly available, and it can be collected by scraping the website or via the Twitter API.

The sender of the tweet might be useful to predict sentiment, but we will ignore it to keep our data anonymized.
So we will just be using the text of the tweet.

Construct the outcome variable

Then we need to construct the outcome variable for these tweets, which means that we have to label them as positive, negative, or neutral sentiment.

We would like to label thousands of tweets, and we know that two people might disagree over the correct classification of a tweet. To do this efficiently, one option is to use the Amazon Mechanical Turk.

The task that we put on the Amazon Mechanical Turk was to judge the sentiment expressed by the following item toward the software company Apple.
The items we gave them were tweets that we had collected. The workers could pick from the following options as their response:

  • strongly negative,
  • negative,
  • neutral,
  • positive, and
  • strongly positive.

These outcomes were represented as a number on the scale from -2 to 2.

Each tweet was labeled by five workers. For each tweet, we take the average of the five scores given by the five workers, hence the final scores can range from -2 to 2 in increments of 0.2.
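
For a single tweet this is just the mean of its five labels. A minimal sketch with made-up worker scores (not the actual data):

# hypothetical labels from five Mechanical Turk workers for one tweet
worker_scores <- c(-1, 0, -1, -2, -1)
mean(worker_scores)
## [1] -1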

The following graph shows the distribution of the number of tweets classified into each of the categories. We can see here that the majority of tweets were classified as neutral, with a small number classified as strongly negative or strongly positive.

[ figure: distribution of the average sentiment scores ]

So now we have a bunch of tweets that are labeled with their sentiment. But how do we build independent variables from the text of a tweet to be used to predict the sentiment?

A Bag of Words

One of the most commonly used techniques to transform text into independent variables is called Bag of Words.

Fully understanding text is difficult, but Bag of Words provides a very simple approach: it just counts the number of times each word appears in the text and uses these counts as the independent variables.

For example, in the sentence,

"This course is great.  I would recommend this course to my friends,"

the word this is seen twice, the word course is seen twice, the word great is seen once, et cetera.

[ figure: bag of words counts for the example sentence ]

In Bag of Words, there is one feature for each word. This is a very simple approach, but is often very effective, too. It is used as a baseline in text analytics projects and for Natural Language Processing.
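
We can reproduce the counts for the example sentence above with a couple of base R functions (a minimal sketch, independent of the tm pipeline used later):

sentence <- "This course is great. I would recommend this course to my friends"
# strip punctuation, lower-case, split on whitespace, then count
words <- strsplit(tolower(gsub("[[:punct:]]", "", sentence)), "\\s+")[[1]]
sort(table(words), decreasing = TRUE)
# 'this' and 'course' appear twice, every other word once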

This is not the whole story, though. Preprocessing the text can dramatically improve the performance of the Bag of Words method.

Cleaning Up Irregularities

One part of preprocessing the text is to clean up irregularities.
Text data often has many inconsistencies that will cause algorithms trouble, since computers are very literal by default.

  • One common irregularity concerns the case of the letters, and it is customary to change all words to either lower-case or upper-case.

  • Punctuation also causes problems, and the basic approach is to remove everything that is not a letter. However some punctuation is meaningful, and therefore the removal of punctuation should be tailored to the specific problem.

There are also unhelpful terms:

  • Stop words: words that are used frequently but carry little meaning on their own. Examples are the, is, at, and which. It’s unlikely that these words will improve the machine learning prediction quality, so we want to remove them to reduce the size of the data.
    • There are some potential problems with this approach. Sometimes, two stop words taken together have a very important meaning, e.g. the name of the band “The Who”. By removing the stop words we would remove both of them, even though together they might carry significant meaning for our prediction task.
  • Stemming: This step is motivated by the desire to represent words with different endings as the same word. We probably do not need to draw a distinction between argue, argued, argues, and arguing. They could all be represented by a common stem, argu. The algorithmic process of performing this reduction is called stemming.
    There are many ways to approach the problem.

    1. One approach is to build a database of words and their stems.
      • A pro is that this approach handles exceptions very nicely, since we have defined all of the stems.
      • However, it will not handle new words at all, since they are not in the database.
        This is especially bad for problems where we’re using data from the internet, since we have no idea what words will be used.
    2. A different approach is to write a rule-based algorithm.
      In this approach, if a word ends in things like ed, ing, or ly, we would remove the ending.
      • A pro of this approach is that it handles new or unknown words well.
      • However, there are many exceptions, and this approach would miss all of these.
        Words like child and children would be considered different, but it would get other plurals, like dog and dogs.

    This second approach is the widely popular Porter Stemmer, designed by Martin Porter in 1980 and still in use today.
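
The SnowballC package loaded in the preliminaries exposes the Porter stemmer directly, so we can check the argue example by hand (the tm pipeline below will call it for us via stemDocument):

wordStem(c("argue", "argued", "argues", "arguing"))
## [1] "argu" "argu" "argu" "argu"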

LOADING AND PROCESSING DATA IN R

tweets <- read.csv("data/tweets.csv", stringsAsFactors = FALSE)

Note: when working on a text analytics problem it is important (necessary!) to add the extra argument stringsAsFactors = FALSE, so that the text is read in properly.

Let’s take a look at the structure of our data:

str(tweets)
## 'data.frame':    1181 obs. of  2 variables:
##  $ Tweet: chr  "I have to say, Apple has by far the best customer care service I have ever received! @Apple @AppStore" "iOS 7 is so fricking smooth & beautiful!! #ThanxApple @Apple" "LOVE U @APPLE" "Thank you @apple, loving my new iPhone 5S!!!!!  #apple #iphone5S pic.twitter.com/XmHJCU4pcb" ...
##  $ Avg  : num  2 2 1.8 1.8 1.8 1.8 1.8 1.6 1.6 1.6 ...

We have 1181 observations of 2 variables: Tweet, the text of the tweet, and Avg, the average sentiment score.

The tweet texts are real tweets that we gathered on the internet, directed at Apple, with a few words cleaned up.

We are more interested in being able to detect the tweets with clear negative sentiment, so let’s define a new variable in our data set called Negative, equal to TRUE when the average sentiment score is less than or equal to -1, and FALSE otherwise.

tweets$Negative <- as.factor(tweets$Avg <= -1)
table(tweets$Negative)
## 
## FALSE  TRUE 
##   999   182

CREATING A CORPUS

One of the fundamental concepts in text analysis, also implemented in the tm package, is that of a corpus.
A corpus is a collection of documents.

We will need to convert our tweets to a corpus for pre-processing. Various functions in the tm package can be used to create a corpus in many different ways.
We will create it from the Tweet column of our data frame using two functions, Corpus() and VectorSource(): we pass the Tweet variable of the tweets data frame to VectorSource(), and feed the result to Corpus().

corpus <- Corpus(VectorSource(tweets$Tweet))

Let’s take a look at corpus:

corpus
## <<VCorpus (documents: 1181, metadata (corpus/indexed): 0/0)>>

We can check that the documents match our tweets by using double brackets [[.
To inspect the first (or 10th) tweet in our corpus, we select the first (or 10th) element as:

corpus[[1]]
## <<PlainTextDocument (metadata: 7)>>
## I have to say, Apple has by far the best customer care service I have ever received! @Apple @AppStore

corpus[[10]]
## <<PlainTextDocument (metadata: 7)>>
## Just checked out the specs on the new iOS 7...wow is all I have to say! I can't wait to get the new update ?? Bravo @Apple

Converting text to lower case

Pre-processing is easy in tm.
Each operation, like stemming or removing stop words, can be done with one line in R, where we use the tm_map() function which takes as

  • its first argument the name of a corpus and
  • as second argument a function performing the transformation that we want to apply to the text.

To transform all text to lower case:

corpus <- tm_map(corpus, tolower)

Checking the same two “documents” as before:

corpus[[1]]
## [1] "i have to say, apple has by far the best customer care service i have ever received! @apple @appstore"

corpus[[10]]
## [1] "just checked out the specs on the new ios 7...wow is all i have to say! i can't wait to get the new update ?? bravo @apple"

Note: tolower() is a general R function rather than a tm transformation, so after applying it the documents are plain character vectors. We convert them back to the document type tm expects before continuing (with newer versions of tm, one could instead use tm_map(corpus, content_transformer(tolower)) in the first place):

corpus <- tm_map(corpus, PlainTextDocument)

Removing punctuation

corpus <- tm_map(corpus, removePunctuation)

Check the same two documents as before:

corpus[[1]]
## <<PlainTextDocument (metadata: 7)>>
## i have to say apple has by far the best customer care service i have ever received apple appstore

corpus[[10]]
## <<PlainTextDocument (metadata: 7)>>
## just checked out the specs on the new ios 7wow is all i have to say i cant wait to get the new update  bravo apple

Removing stop words (and apple)

Next we want to remove the stop words in our tweets.
It is necessary to define a list of words that we regard as being stop words, and for this the tm package provides a default list for the English language. We can check it out with:

stopwords("english")[1:10]
##  [1] "i"         "me"        "my"        "myself"    "we"        "our"       "ours"      "ourselves"
##  [9] "you"       "your"

Removing words can be done by passing removeWords as the transformation to tm_map(), together with an extra argument listing the words that we want to remove.

We will remove all of these English stop words, but we will also remove the word “apple” since all of these tweets have the word “apple” and it probably won’t be very useful in our prediction problem.

corpus <- tm_map(corpus, removeWords, c("apple", stopwords("english")))
corpus[[1]]
## <<PlainTextDocument (metadata: 7)>>
##    say    far  best customer care service   ever received  appstore

corpus[[10]]
## <<PlainTextDocument (metadata: 7)>>
## just checked   specs   new ios 7wow      say  cant wait  get  new update  bravo

Stemming

Lastly, we want to stem our documents, using the stemDocument transformation.

corpus <- tm_map(corpus, stemDocument)
corpus[[1]]
## <<PlainTextDocument (metadata: 7)>>
##    say    far  best custom care servic   ever receiv  appstor

corpus[[10]]
## <<PlainTextDocument (metadata: 7)>>
## just check   spec   new io 7wow      say  cant wait  get  new updat  bravo

We can see that this took the endings off “customer,” “service,” “received,” and “appstore.”

BAG OF WORDS IN R

Create a Document Term Matrix

We are now ready to extract the word frequencies to be used in our prediction problem. The tm package provides a function called DocumentTermMatrix() that generates a matrix where:

  • the rows correspond to documents, in our case tweets, and
  • the columns correspond to words in those tweets.

The values in the matrix are the number of times that word appears in each document.

DTM <- DocumentTermMatrix(corpus)
DTM
## <<DocumentTermMatrix (documents: 1181, terms: 3289)>>
## Non-/sparse entries: 8980/3875329
## Sparsity           : 100%
## Maximal term length: 115
## Weighting          : term frequency (tf)

We see that in the corpus there are 3289 unique words.

Let’s see what this matrix looks like using the inspect() function, in particular slicing out a block of rows and columns from the Document Term Matrix by their indices:

inspect(DTM[1000:1005, 505:515])
## <<DocumentTermMatrix (documents: 6, terms: 11)>>
## Non-/sparse entries: 1/65
## Sparsity           : 98%
## Maximal term length: 9
## Weighting          : term frequency (tf)
## 
##               Terms
## Docs           cheapen cheaper check cheep cheer cheerio cherylcol chief chiiiiqu child children
##   character(0)       0       0     0     0     0       0         0     0        0     0        0
##   character(0)       0       0     0     0     0       0         0     0        0     0        0
##   character(0)       0       0     0     0     0       0         0     0        0     0        0
##   character(0)       0       0     0     0     0       0         0     0        0     0        0
##   character(0)       0       0     0     0     0       0         0     0        0     0        0
##   character(0)       0       0     0     0     1       0         0     0        0     0        0

In this range we see that the word “cheer” appears in tweet 1005, while none of the other terms shown appear in these six tweets. This data is what we call sparse: there are many zeros in our matrix.

We can look at the most frequent terms with the function findFreqTerms(), selecting those that appear at least 20 times over the whole corpus:

frequent_ge_20 <- findFreqTerms(DTM, lowfreq = 20)

frequent_ge_20 
##  [1] "android"              "anyon"                "app"                  "appl"                
##  [5] "back"                 "batteri"              "better"               "buy"                 
##  [9] "can"                  "cant"                 "come"                 "dont"                
## [13] "fingerprint"          "freak"                "get"                  "googl"               
## [17] "ios7"                 "ipad"                 "iphon"                "iphone5"             
## [21] "iphone5c"             "ipod"                 "ipodplayerpromo"      "itun"                
## [25] "just"                 "like"                 "lol"                  "look"                
## [29] "love"                 "make"                 "market"               "microsoft"           
## [33] "need"                 "new"                  "now"                  "one"                 
## [37] "phone"                "pleas"                "promo"                "promoipodplayerpromo"
## [41] "realli"               "releas"               "samsung"              "say"                 
## [45] "store"                "thank"                "think"                "time"                
## [49] "twitter"              "updat"                "use"                  "via"                 
## [53] "want"                 "well"                 "will"                 "work"

Out of the 3289 words in our matrix, only 56 words appear at least 20 times in our tweets.
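
We can confirm that count directly:

length(frequent_ge_20)
## [1] 56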

This means that we probably have a lot of terms that will be pretty useless for our prediction model. The number of terms is an issue for two main reasons:

  • One is computational: more terms means more independent variables, which usually means it takes longer to build our models.
  • The other is that in building models the ratio of independent variables to observations will affect how well the model will generalize.

Remove sparse terms

Therefore let’s remove some terms that don’t appear very often.

sparse_DTM <- removeSparseTerms(DTM, 0.995)

This function takes a second parameter, the sparsity threshold, which works as follows.

  • If we say 0.98, this means to only keep terms that appear in 2% or more of the tweets.
  • If we say 0.99, that means to only keep terms that appear in 1% or more of the tweets.
  • If we say 0.995, that means to only keep terms that appear in 0.5% or more of the tweets, about six or more tweets.
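
As a quick sanity check on that last figure: a term survives removeSparseTerms() roughly when it appears in more than nDocs * (1 - threshold) documents, which we can compute directly for our corpus:

ndocs <- 1181
threshold <- 0.995
ndocs * (1 - threshold)   # a term must appear in more than ~5.9, i.e. at least 6, tweets
## [1] 5.905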

Let’s see what the new Document Term Matrix properties look like:

sparse_DTM
## <<DocumentTermMatrix (documents: 1181, terms: 309)>>
## Non-/sparse entries: 4669/360260
## Sparsity           : 99%
## Maximal term length: 20
## Weighting          : term frequency (tf)

It only contains 309 unique terms, i.e. only about 9.4% of the full set.

Convert the DTM to a data frame

Now let’s convert the sparse matrix into a data frame that we will be able to use for our predictive models.

tweetsSparse <- as.data.frame(as.matrix(sparse_DTM))

Fix variables names in the data frame

Since R struggles with variable names that start with a number, and some of our words probably do start with a number, we should run the make.names() function to convert all column names into valid R variable names before we build our predictive models. You should do this each time you build a data frame using text analytics.

To make all variable names R-friendly use:

colnames(tweetsSparse) <- make.names(colnames(tweetsSparse))

Add the dependent variable

We should now add our dependent variable back to this data frame. We’ll call it tweetsSparse$Negative and set it equal to the original Negative variable from the tweets data frame.

tweetsSparse$Negative <- tweets$Negative

Split data in training/testing sets

Lastly, let’s split our data into a training set and a testing set, putting 70% of the data in the training set.

set.seed(123)

split <- sample.split(tweetsSparse$Negative, SplitRatio = 0.7)

trainSparse <- subset(tweetsSparse, split == TRUE)
testSparse <- subset(tweetsSparse, split == FALSE)

PREDICTING SENTIMENT

Let’s first use CART to build a predictive model, using the rpart() function to predict Negative using all of the other variables as our independent variables and the data set trainSparse.

We’ll add one more argument here, which is method = "class" so that the rpart() function knows to build a classification model. We keep default settings for all other parameters, in particular we are not adding anything for minbucket or cp.

tweetCART <- rpart(Negative ~ . , data = trainSparse, method = "class")
prp(tweetCART)

The tree splits on the presence of three words: broadly, if a tweet contains one of them it is predicted to be negative, and otherwise it is predicted to be non-negative.

This tree makes sense intuitively, since these three words are generally seen as negative words.

Out-of-Sample performance of the model

Using the predict() function we compute the predictions of our model tweetCART on the new data set testSparse. Be careful to add the argument type = "class" to make sure we get class predictions.

predictCART <- predict(tweetCART, newdata = testSparse, type = "class")

And from the predictions we can compute the confusion matrix:

cmat_CART <- table(testSparse$Negative, predictCART)
cmat_CART 
##        predictCART
##         FALSE TRUE
##   FALSE   294    6
##   TRUE     37   18

accu_CART <- (cmat_CART[1,1] + cmat_CART[2,2])/sum(cmat_CART)

  • Overall accuracy = 0.8789
  • Sensitivity = 18 / 55 = 0.3273 ( = TP rate)
  • Specificity = 294 / 300 = 0.98
  • FP rate = 6 / 300 = 0.02
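
These rates can also be computed programmatically from the confusion matrix built above (the variable names here are just for illustration):

sensitivity_CART <- cmat_CART[2, 2] / sum(cmat_CART[2, ])   # TP / (TP + FN) = 18 / 55
specificity_CART <- cmat_CART[1, 1] / sum(cmat_CART[1, ])   # TN / (TN + FP) = 294 / 300
fp_rate_CART     <- cmat_CART[1, 2] / sum(cmat_CART[1, ])   # FP / (TN + FP) = 6 / 300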

Comparison with the baseline model

Let’s compare this to a simple baseline model that always predicts non-negative (i.e. the most common value of the dependent variable).

To compute the accuracy of the baseline model, let’s make a table of just the outcome variable Negative.

cmat_baseline <- table(testSparse$Negative)
cmat_baseline
## 
## FALSE  TRUE 
##   300    55

accu_baseline <- max(cmat_baseline)/sum(cmat_baseline)

The accuracy of the baseline model is then 0.8451.
So the CART model does better than the simple baseline model.

Comparison with a Random Forest model

How well would a Random Forest model do?

We use the randomForest() function to predict Negative again using all of our other variables as independent variables and the data set trainSparse. Again we use the default parameter settings:

set.seed(123)
tweetRF <- randomForest(Negative ~ . , data = trainSparse)

tweetRF
## 
## Call:
##  randomForest(formula = Negative ~ ., data = trainSparse) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 17
## 
##         OOB estimate of  error rate: 11.26%
## Confusion matrix:
##       FALSE TRUE class.error
## FALSE   685   14  0.02002861
## TRUE     79   48  0.62204724

And then compute the Out-of-Sample predictions:

predictRF <- predict(tweetRF, newdata = testSparse)

and compute the confusion matrix:

cmat_RF <- table(testSparse$Negative, predictRF)
cmat_RF 
##        predictRF
##         FALSE TRUE
##   FALSE   293    7
##   TRUE     34   21

accu_RF <- (cmat_RF[1,1] + cmat_RF[2,2])/sum(cmat_RF)

The overall accuracy of this Random Forest model is 0.8845.
This is a little better than the CART model, but given the CART model's interpretability, the CART model would probably be preferred over the random forest model.

If you were to use cross-validation to pick the cp parameter for the CART model, the accuracy would increase to about the same as the random forest model.
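
A sketch of how that cross-validation could be done with the caret and e1071 packages (not loaded in the preliminaries, and the cp grid below is an arbitrary choice rather than a tuned one):

library("caret")
library("e1071")

# 10-fold cross-validation over a grid of cp values
numFolds <- trainControl(method = "cv", number = 10)
cpGrid <- expand.grid(.cp = seq(0.001, 0.05, by = 0.001))

set.seed(123)
train(Negative ~ . , data = trainSparse, method = "rpart",
      trControl = numFolds, tuneGrid = cpGrid)

The printed results indicate which cp value gives the best cross-validated accuracy; refitting rpart() with that cp and re-evaluating on testSparse would give the improved figure mentioned above.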

So by using a bag-of-words approach and these models, we can reasonably predict sentiment even with a relatively small data set of tweets.


Comparison with logistic regression model

Build the model, using all independent variables as predictors:

tweetLog <- glm(Negative ~ . , data = trainSparse, family = "binomial")

# summary(tweetLog)

Prediction on the testing set:

tweetLog_predict_test <- predict(tweetLog, type = "response", newdata = testSparse)

Confusion matrix:

cmat_logRegr <- table(testSparse$Negative, tweetLog_predict_test > 0.5)
cmat_logRegr 
##        
##         FALSE TRUE
##   FALSE   257   43
##   TRUE     21   34

accu_logRegr <- (cmat_logRegr[1,1] + cmat_logRegr[2,2])/sum(cmat_logRegr)

The Perils of Over-fitting

The overall accuracy of this logistic regression model is 0.8197, which is worse than the baseline (?!).

If you were to compute the accuracy on the training set instead, you would see that the model does really well on the training set.
This is an example of over-fitting. The model fits the training set really well, but does not perform well on the test set. A logistic regression model with a large number of variables is particularly at risk for overfitting.
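
To see this directly, we could compute the same accuracy on the training set, reusing the objects built above (the resulting number would be much higher than the 0.8197 obtained on the test set):

# in-sample predictions: no newdata argument means predict on trainSparse
tweetLog_predict_train <- predict(tweetLog, type = "response")
cmat_logRegr_train <- table(trainSparse$Negative, tweetLog_predict_train > 0.5)
sum(diag(cmat_logRegr_train)) / sum(cmat_logRegr_train)   # training-set accuracy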


THE ANALYTICS EDGE