Classify Apple Tweets

Apple is a computer company known for its laptops, phones, and tablets. While Apple has a large number of fans, it also has a large number of people who don’t like its products. In this exercise, our challenge is to classify tweets about Apple as negative, positive, or neither.

During this exercise, I will perform the following:
  1. Create a corpus
  2. Clean the corpus: remove stop words and punctuation, stem the documents, etc.
  3. Find frequent words and associations
  4. Create a frequency plot and word cloud
  5. Split the data
  6. Build Models
  7. Make Predictions

1. Load the Data

1.1 Let’s load the required packages

# Load packages
library(tm)
library(wordcloud)
library(SnowballC)
library(RColorBrewer)
library(ggplot2)
library(caTools)
library(rpart)
library(rpart.plot)

1.2 Let’s read the data and examine the dataset

tweets <- read.csv('tweets.csv', stringsAsFactors = FALSE)
str(tweets)
## 'data.frame':    1181 obs. of  2 variables:
##  $ Tweet: chr  "I have to say, Apple has by far the best customer care service I have ever received! @Apple @AppStore" "iOS 7 is so fricking smooth & beautiful!! #ThanxApple @Apple" "LOVE U @APPLE" "Thank you @apple, loving my new iPhone 5S!!!!!  #apple #iphone5S pic.twitter.com/XmHJCU4pcb" ...
##  $ Avg  : num  2 2 1.8 1.8 1.8 1.8 1.8 1.6 1.6 1.6 ...

We can see that the dataset has 1181 observations and 2 variables. One variable is the tweet text itself and the other is the average sentiment score for the tweet.

1.3 Let’s create a new variable ‘Negative’

As we are interested in detecting tweets with negative sentiment, let’s create a new factor variable ‘Negative’ that is TRUE when the average sentiment value is <= -1.

tweets$Negative <- as.factor(tweets$Avg <= -1)

# Look at the number and proportion of negative tweets
table(tweets$Negative)
## 
## FALSE  TRUE 
##   999   182
prop.table(table(tweets$Negative))
## 
##     FALSE      TRUE 
## 0.8458933 0.1541067

We can see that about 15% of the tweets are negative.

2. Preprocess the Data

2.1 Create a corpus and examine tweets

# Create Corpus
tweet_corpus <- Corpus(VectorSource(tweets$Tweet))

# Review Corpus
tweet_corpus
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 1181
# View tweets
tweet_corpus[[1]][1]
## $content
## [1] "I have to say, Apple has by far the best customer care service I have ever received! @Apple @AppStore"
tweet_corpus[[5]][1]
## $content
## [1] ".@apple has the best customer service. In and out with a new phone in under 10min!"

We can see that the corpus has 1181 documents in it. We can also look at the 1st and the 5th tweets to examine their contents.

2.2 Clean the corpus

Let’s create a function that cleans the corpus: convert all characters to lowercase, remove punctuation, remove stop words as well as the word ‘apple’ itself, and stem the documents.

# Create function to clean corpus
clean_corpus <- function(corp){
  corp <- tm_map(corp, content_transformer(tolower))
  corp <- tm_map(corp, removePunctuation)
  corp <- tm_map(corp, removeWords, c('apple', stopwords('en')))
  corp <- tm_map(corp, stemDocument)
  corp
}

# Apply function on corpus
clean_corp <- clean_corpus(tweet_corpus)

# Check a tweet from cleaned corpus
clean_corp[[1]][1]
## $content
## [1] "   say    far  best custom care servic   ever receiv  appstor"

Looking at this tweet, we can see that the cleaning worked: everything is lowercase, punctuation and stop words are gone, and the remaining words have been stemmed.
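
As a quick sanity check, we can compare any other tweet against its cleaned counterpart, for example the 5th tweet we inspected earlier (output not shown):

# Compare the 5th tweet before and after cleaning
tweet_corpus[[5]][1]
clean_corp[[5]][1]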

3. Find frequent words and associations

3.1 Create a Term Document Matrix

Let’s create a Term Document Matrix (TDM) to count the number of times each word appears in each document.

frequencies <- TermDocumentMatrix(clean_corp)

# Look at the TDM
frequencies
## <<TermDocumentMatrix (terms: 3289, documents: 1181)>>
## Non-/sparse entries: 8980/3875329
## Sparsity           : 100%
## Maximal term length: 115
## Weighting          : term frequency (tf)
# Inspect in detail
inspect(frequencies[505:515, 1000:1005])
## <<TermDocumentMatrix (terms: 11, documents: 6)>>
## Non-/sparse entries: 1/65
## Sparsity           : 98%
## Maximal term length: 9
## Weighting          : term frequency (tf)
## 
##            Docs
## Terms       1000 1001 1002 1003 1004 1005
##   cheapen      0    0    0    0    0    0
##   cheaper      0    0    0    0    0    0
##   check        0    0    0    0    0    0
##   cheep        0    0    0    0    0    0
##   cheer        0    0    0    0    0    1
##   cheerio      0    0    0    0    0    0
##   cherylcol    0    0    0    0    0    0
##   chief        0    0    0    0    0    0
##   chiiiiqu     0    0    0    0    0    0
##   child        0    0    0    0    0    0
##   children     0    0    0    0    0    0

We can see that:
  1. There are 1181 documents and 3289 terms.
  2. The word ‘cheer’ appears in 1 of these tweets, while the word ‘cheep’ appears in none. Most entries in the matrix are zeros; this is what makes the data sparse (a quick check below confirms it).
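
We can verify the sparsity directly by counting the proportion of zero entries; it should agree with the non-/sparse entry counts reported above (3875329 zeros out of 3884309 entries, about 99.8%):

# Proportion of zero entries in the TDM
m <- as.matrix(frequencies)
sum(m == 0) / length(m)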

3.2 Let’s look at frequent terms

Let’s look at frequent terms with lowfreq = 20 (the minimum number of times a term must appear in order to be listed).

findFreqTerms(frequencies, lowfreq = 20)
##  [1] "android"              "anyon"                "app"                 
##  [4] "appl"                 "back"                 "batteri"             
##  [7] "better"               "buy"                  "can"                 
## [10] "cant"                 "come"                 "dont"                
## [13] "fingerprint"          "freak"                "get"                 
## [16] "googl"                "ios7"                 "ipad"                
## [19] "iphon"                "iphone5"              "iphone5c"            
## [22] "ipod"                 "ipodplayerpromo"      "itun"                
## [25] "just"                 "like"                 "lol"                 
## [28] "look"                 "love"                 "make"                
## [31] "market"               "microsoft"            "need"                
## [34] "new"                  "now"                  "one"                 
## [37] "phone"                "pleas"                "promo"               
## [40] "promoipodplayerpromo" "realli"               "releas"              
## [43] "samsung"              "say"                  "store"               
## [46] "thank"                "think"                "time"                
## [49] "twitter"              "updat"                "use"                 
## [52] "via"                  "want"                 "well"                
## [55] "will"                 "work"

We can see that, out of 3289 terms, only 56 appear at least 20 times in our tweets. This means there are a lot of terms that will be useless for our model.

3.3 Let’s look at associations

Let’s look at words that are associated with ‘ipad’, ‘android’, and ‘ios7’.

findAssocs(frequencies, c('ipad','android','ios7'), corlimit = 0.4)
## $ipad
##      ipodplayerpromo                 itun                 ipod 
##                 0.83                 0.80                 0.78 
## promoipodplayerpromo                promo 
##                 0.78                 0.68 
## 
## $android
## dougrtequan 
##        0.53 
## 
## $ios7
## numeric(0)

We can see that the word ‘ipad’ has a few correlated words, whereas ‘ios7’ has none above the 0.4 threshold.
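
If we wanted to surface weaker associations for ‘ios7’, one option is to lower the correlation threshold (output not shown; any terms returned at this level may be noisy):

# Relax the correlation limit to look for weaker associations
findAssocs(frequencies, 'ios7', corlimit = 0.2)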

4. Create a frequency plot and word cloud

4.1 Create a matrix from TDM

The Term Document Matrix is a special sparse object, not a regular R matrix. In this step, let’s convert it into one.

freq.matrix <- as.matrix(frequencies)

4.2 Get the Word Count

Let’s get the word counts, sorted in decreasing order of frequency.

term.freq <- sort(rowSums(freq.matrix), decreasing = TRUE)
head(term.freq)
## iphon  itun   new  ipad phone   get 
##   287   121   113    91    86    75

We can see that the stemmed terms ‘iphon’ and ‘itun’ are the most frequent.

4.3 Create Data Frame

Let’s create a data frame of the terms and their frequencies.

term.df <- data.frame(term=names(term.freq), freq=term.freq)
str(term.df)
## 'data.frame':    3289 obs. of  2 variables:
##  $ term: Factor w/ 3289 levels "000","075","0909",..: 1542 1580 1989 1540 2178 1117 1553 1631 1808 1554 ...
##  $ freq: num  287 121 113 91 86 75 73 61 61 60 ...

4.4 Create a Bar Plot

Let’s create a bar plot of words with frequencies > 35 and < 200.

plot <- ggplot(subset(term.df, freq > 35 & freq < 200),
               aes(term, freq, fill = freq)) +
  geom_bar(stat = 'identity') +
  labs(x = 'Terms', y = 'Count', title = 'Term Frequencies')
plot + coord_flip()
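
One possible refinement: by default the bars are ordered alphabetically by term. A sketch that ranks them by frequency instead, using reorder:

# Same plot, with bars ordered by frequency instead of alphabetically
plot2 <- ggplot(subset(term.df, freq > 35 & freq < 200),
                aes(reorder(term, freq), freq, fill = freq)) +
  geom_bar(stat = 'identity') +
  labs(x = 'Terms', y = 'Count', title = 'Term Frequencies')
plot2 + coord_flip()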

4.5 Create a Word Cloud

Let’s create a word cloud of at most 200 words, each with a minimum frequency of 15.

wordcloud(term.df$term, term.df$freq, min.freq=15, max.words = 200, random.order = FALSE, colors= brewer.pal(8, 'Dark2'))

We can see that the words ‘iphon’, ‘itun’, ‘new’, and ‘phone’ are larger, which reflects the high frequency of these words in our tweets.

5. Split the Data

5.1 Create a Document Term Matrix

Let’s create a Document Term Matrix (DTM) from the cleaned corpus and view its contents.

dtm <- DocumentTermMatrix(clean_corp)
dtm
## <<DocumentTermMatrix (documents: 1181, terms: 3289)>>
## Non-/sparse entries: 8980/3875329
## Sparsity           : 100%
## Maximal term length: 115
## Weighting          : term frequency (tf)

We can see that the DTM has 1181 documents and 3289 terms. As we saw earlier, it also contains a lot of sparse terms. Let’s try removing them.

5.2 Remove Sparse Terms

Let’s remove terms that don’t appear often (keep only terms that appear in 0.5% or more of the tweets).

sparseData <- removeSparseTerms(dtm, sparse=0.995)
sparseData
## <<DocumentTermMatrix (documents: 1181, terms: 309)>>
## Non-/sparse entries: 4669/360260
## Sparsity           : 99%
## Maximal term length: 20
## Weighting          : term frequency (tf)

We can see that only 309 terms remain in the matrix, about 9% of the previous count of 3289 (the quick check below confirms the ratio).
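
We can confirm this ratio directly from the dimensions of the two matrices:

# Fraction of terms retained after removing sparse terms (roughly 0.09)
ncol(sparseData) / ncol(dtm)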

5.3 Create a Data Frame

Let’s convert sparseData to a data frame for further analysis.

# Convert to data frame
sparsedf <- as.data.frame(as.matrix(sparseData))

# Look at the head of first 6 columns of the data frame
head(sparsedf[,1:6])
##   244tsuyoponzu 7evenstarz actual add alreadi alway
## 1             0          0      0   0       0     0
## 2             0          0      0   0       0     0
## 3             0          0      0   0       0     0
## 4             0          0      0   0       0     0
## 5             0          0      0   0       0     0
## 6             0          0      0   0       0     0

We can see that some of the column names start with a number.

5.4 Add Column Names

Since variable names that start with a number are not syntactically valid in R, let’s run the make.names function to make all variable names valid.

# Make variable names
colnames(sparsedf) <- make.names(colnames(sparsedf))

# Look at the head of first 6 columns of the data frame
head(sparsedf[,1:6])
##   X244tsuyoponzu X7evenstarz actual add alreadi alway
## 1              0           0      0   0       0     0
## 2              0           0      0   0       0     0
## 3              0           0      0   0       0     0
## 4              0           0      0   0       0     0
## 5              0           0      0   0       0     0
## 6              0           0      0   0       0     0

We can see that the character “X” has been prepended to variable names that started with a number.

5.5 Add the Dependent Variable

Let’s add the dependent variable from the original dataset.

sparsedf$Negative <- tweets$Negative

5.6 Split the Data

Let’s split the data into training and test sets.

set.seed(101)
split <- sample.split(sparsedf$Negative, SplitRatio = 0.7)
trainSparse <- subset(sparsedf, split==TRUE)
testSparse <- subset(sparsedf, split==FALSE)
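
Since sample.split stratifies on the outcome variable, both sets should preserve the roughly 85/15 class ratio we saw earlier; a quick check:

# Verify that the class ratio is preserved in both sets
prop.table(table(trainSparse$Negative))
prop.table(table(testSparse$Negative))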

6. Build the Models

6.1 Create a Baseline Model

Let’s create a baseline CART classification model and plot the resulting tree.

# Build Baseline Model
tweet.CART <- rpart(Negative~., trainSparse, method='class')

# Plot Model
prp(tweet.CART)

Our tree suggests the following (a quick check after the list examines these words directly):
  1. If the word ‘freak’ is in the tweet, predict TRUE (negative sentiment).
  2. If ‘freak’ is not in the tweet but ‘hate’ is, predict TRUE.
  3. If neither is in the tweet but ‘wtf’ is, predict TRUE.
  4. If none of these words are present, predict FALSE (not negative).
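
As a rough sanity check on this interpretation, we can count how often each splitting word appears in negative versus non-negative training tweets. This assumes the columns ‘freak’, ‘hate’, and ‘wtf’ are present in trainSparse, which they must be since the tree splits on them (output not shown; the counts should be skewed towards the TRUE class):

# Occurrences of each splitting word by sentiment class (rows: FALSE/TRUE)
sapply(c('freak', 'hate', 'wtf'), function(w)
  tapply(trainSparse[[w]], trainSparse$Negative, sum))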

6.1.1 Compute Performance of Baseline Model

Let’s compute the numerical performance of our baseline model.

library(caret)
## Loading required package: lattice
# Make Prediction
predictCART <- predict(tweet.CART, testSparse, type='class')

# Compute Accuracy
table(predictCART, testSparse$Negative)
##            
## predictCART FALSE TRUE
##       FALSE   298   39
##       TRUE      2   16
# Accuracy
postResample(predictCART, testSparse$Negative)
##  Accuracy     Kappa 
## 0.8845070 0.3918947

Looking at the results, we can see that the accuracy is 0.8845.
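
For context, it helps to compare this against a naive baseline that always predicts FALSE. From the confusion matrix, 300 of the 355 test tweets are actually non-negative, so such a model would already score about 0.845:

# Accuracy of always predicting the majority class (FALSE)
max(prop.table(table(testSparse$Negative)))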

6.2 Build Models

Let’s evaluate several models and pick the best one. We will use a mixture of a simple linear method (LDA), nonlinear methods (CART, KNN), and more complex nonlinear methods (SVM, RF, C5.0).

control <- trainControl(method='cv', number=10)
metric <- 'Accuracy'

# LDA
set.seed(101)
tweet.lda <- train(Negative ~ ., data=trainSparse, method='lda',
                   trControl=control, metric=metric)
## Loading required package: MASS
# CART
set.seed(101)
tweet.cart <- train(Negative~., data=trainSparse, method='rpart',
                    trControl=control, metric=metric)

# KNN
set.seed(101)
tweet.knn <- train(Negative ~ ., data=trainSparse, method='knn',
                   trControl=control, metric=metric)

# SVM
set.seed(101)
tweet.svm <- train(Negative~., data=trainSparse, method='svmRadial',
                    trControl=control, metric=metric)
## Loading required package: kernlab
## 
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
## 
##     alpha
# RF
set.seed(101)
tweet.rf <- train(Negative~., data=trainSparse, method='ranger',
                   trControl=control, metric=metric)
## Loading required package: e1071
## Loading required package: ranger
# C 5.0
set.seed(101)
tweet.C50 <- train(Negative~., data=trainSparse, method='C5.0',
                   trControl=control, metric=metric)
## Loading required package: C50
## Loading required package: plyr

6.2.1 Compute Performance

Let’s compute and compare the performance of these models.

tweet.results <- resamples(list(lda=tweet.lda, cart=tweet.cart,
                                knn=tweet.knn, svm=tweet.svm, rf=tweet.rf,
                                c50=tweet.C50))
# Summarize the results
summary(tweet.results)
## 
## Call:
## summary.resamples(object = tweet.results)
## 
## Models: lda, cart, knn, svm, rf, c50 
## Number of resamples: 10 
## 
## Accuracy 
##        Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
## lda  0.8072  0.8308 0.8545 0.8593  0.8855 0.9146    2
## cart 0.8434  0.8675 0.8841 0.8899  0.9154 0.9398    0
## knn  0.8434  0.8675 0.8795 0.8753  0.8876 0.9024    0
## svm  0.8415  0.8580 0.8728 0.8705  0.8795 0.9036    0
## rf   0.8193  0.8434 0.8537 0.8657  0.8912 0.9268    0
## c50  0.8537  0.8675 0.8789 0.8862  0.9033 0.9268    0
## 
## Kappa 
##        Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
## lda  0.3516  0.3896 0.4524 0.4882  0.5754 0.6758    2
## cart 0.2407  0.3575 0.3735 0.4657  0.6330 0.7296    0
## knn  0.1055  0.2347 0.3360 0.3109  0.3843 0.5126    0
## svm  0.0000  0.1260 0.2780 0.2417  0.3560 0.5132    0
## rf   0.2450  0.3256 0.3708 0.4339  0.5683 0.6854    0
## c50  0.1908  0.3155 0.3601 0.4246  0.5716 0.6602    0
# Visualize the results
dotplot(tweet.results)

bwplot(tweet.results)

Looking at the summary and the plots, we can see that the CART model has the highest mean accuracy at 0.8899.
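
caret can also test whether these accuracy differences are larger than the resampling noise. A quick sketch using the diff method on the resamples object (output not shown):

# Pairwise differences between models across the resamples
diffs <- diff(tweet.results)
summary(diffs)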

7. Make Predictions

Let’s make predictions on the test data using the best model (CART).

# Make Prediction
tweet.pred <- predict(tweet.cart, testSparse)

# Create Confusion Matrix
confusionMatrix(tweet.pred, testSparse$Negative)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction FALSE TRUE
##      FALSE   298   39
##      TRUE      2   16
##                                           
##                Accuracy : 0.8845          
##                  95% CI : (0.8466, 0.9158)
##     No Information Rate : 0.8451          
##     P-Value [Acc > NIR] : 0.02084         
##                                           
##                   Kappa : 0.3919          
##  Mcnemar's Test P-Value : 1.885e-08       
##                                           
##             Sensitivity : 0.9933          
##             Specificity : 0.2909          
##          Pos Pred Value : 0.8843          
##          Neg Pred Value : 0.8889          
##              Prevalence : 0.8451          
##          Detection Rate : 0.8394          
##    Detection Prevalence : 0.9493          
##       Balanced Accuracy : 0.6421          
##                                           
##        'Positive' Class : FALSE           
## 

Looking at the results, we can see that the accuracy is 0.8845, the same as our baseline CART model. Note, however, that the specificity is only 0.2909: the model identifies most non-negative tweets correctly but misses the majority of truly negative ones, something the headline accuracy (only modestly above the No Information Rate of 0.8451) tends to hide.
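
If we want the class-level metrics programmatically rather than reading them off the printout, the caret confusion-matrix object exposes them through its byClass element:

# Extract selected per-class metrics from the confusion matrix object
cm <- confusionMatrix(tweet.pred, testSparse$Negative)
cm$byClass[c('Sensitivity', 'Specificity', 'Balanced Accuracy')]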