Apple is a technology company known for its laptops, phones, tablets, and other devices. While Apple has a large number of fans, it also has plenty of people who don’t like its products. In this exercise, our challenge is to classify tweets about Apple as negative, positive, or neither.
# Load packages
library(tm)
library(wordcloud)
library(SnowballC)
library(RColorBrewer)
library(ggplot2)
library(caTools)
library(rpart)
library(rpart.plot)
tweets <- read.csv('tweets.csv', stringsAsFactors = FALSE)
str(tweets)
## 'data.frame': 1181 obs. of 2 variables:
## $ Tweet: chr "I have to say, Apple has by far the best customer care service I have ever received! @Apple @AppStore" "iOS 7 is so fricking smooth & beautiful!! #ThanxApple @Apple" "LOVE U @APPLE" "Thank you @apple, loving my new iPhone 5S!!!!! #apple #iphone5S pic.twitter.com/XmHJCU4pcb" ...
## $ Avg : num 2 2 1.8 1.8 1.8 1.8 1.8 1.6 1.6 1.6 ...
We can see that the dataset has 1181 observations and 2 variables: the text of the tweet itself and the average sentiment score for the tweet.
As we are interested in detecting tweets with negative sentiment, let’s create a new binary variable that is TRUE when the average sentiment score is <= -1.
tweets$Negative <- as.factor(tweets$Avg <= -1)
# Look at the number and proportion of negative tweets
table(tweets$Negative)
##
## FALSE TRUE
## 999 182
prop.table(table(tweets$Negative))
##
## FALSE TRUE
## 0.8458933 0.1541067
We can see that about 15% of the tweets are negative.
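This imbalance also gives us a natural benchmark: a classifier that always predicts “not negative” would already be right about 85% of the time, so any model we build should beat that. A quick check:
# Baseline accuracy: always predict the majority class
max(prop.table(table(tweets$Negative)))
## [1] 0.8458933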
# Create Corpus
tweet_corpus <- Corpus(VectorSource(tweets$Tweet))
# Review Corpus
tweet_corpus
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 1181
# View tweets
tweet_corpus[[1]][1]
## $content
## [1] "I have to say, Apple has by far the best customer care service I have ever received! @Apple @AppStore"
tweet_corpus[[5]][1]
## $content
## [1] ".@apple has the best customer service. In and out with a new phone in under 10min!"
We can see that the corpus has 1181 documents in it. We can also inspect individual tweets, such as the first and the fifth, to examine their contents.
Let’s create a function that cleans the corpus: convert all characters to lowercase, remove punctuation, remove the word ‘apple’ along with English stop words, and stem each document.
# Create function to clean corpus
clean_corpus <- function(corp){
  corp <- tm_map(corp, content_transformer(tolower))              # lowercase everything
  corp <- tm_map(corp, removePunctuation)                         # strip punctuation
  corp <- tm_map(corp, removeWords, c('apple', stopwords('en')))  # drop 'apple' and English stop words
  corp <- tm_map(corp, stemDocument)                              # reduce words to their stems
  corp                                                            # return the cleaned corpus
}
# Apply function on corpus
clean_corp <- clean_corpus(tweet_corpus)
# Check a tweet from cleaned corpus
clean_corp[[1]][1]
## $content
## [1] " say far best custom care servic ever receiv appstor"
Looking at the first tweet, we can see that the text has been cleaned.
Let’s create a term-document matrix (TDM) to count the number of times each word appears in each document.
frequencies <- TermDocumentMatrix(clean_corp)
# Look at the TDM
frequencies
## <<TermDocumentMatrix (terms: 3289, documents: 1181)>>
## Non-/sparse entries: 8980/3875329
## Sparsity : 100%
## Maximal term length: 115
## Weighting : term frequency (tf)
# Inspect in detail
inspect(frequencies[505:515, 1000:1005])
## <<TermDocumentMatrix (terms: 11, documents: 6)>>
## Non-/sparse entries: 1/65
## Sparsity : 98%
## Maximal term length: 9
## Weighting : term frequency (tf)
##
## Docs
## Terms 1000 1001 1002 1003 1004 1005
## cheapen 0 0 0 0 0 0
## cheaper 0 0 0 0 0 0
## check 0 0 0 0 0 0
## cheep 0 0 0 0 0 0
## cheer 0 0 0 0 0 1
## cheerio 0 0 0 0 0 0
## cherylcol 0 0 0 0 0 0
## chief 0 0 0 0 0 0
## chiiiiqu 0 0 0 0 0 0
## child 0 0 0 0 0 0
## children 0 0 0 0 0 0
Let’s look at the frequent terms, using lowfreq = 20 (the minimum number of times a term must appear to be listed).
findFreqTerms(frequencies, lowfreq = 20)
## [1] "android" "anyon" "app"
## [4] "appl" "back" "batteri"
## [7] "better" "buy" "can"
## [10] "cant" "come" "dont"
## [13] "fingerprint" "freak" "get"
## [16] "googl" "ios7" "ipad"
## [19] "iphon" "iphone5" "iphone5c"
## [22] "ipod" "ipodplayerpromo" "itun"
## [25] "just" "like" "lol"
## [28] "look" "love" "make"
## [31] "market" "microsoft" "need"
## [34] "new" "now" "one"
## [37] "phone" "pleas" "promo"
## [40] "promoipodplayerpromo" "realli" "releas"
## [43] "samsung" "say" "store"
## [46] "thank" "think" "time"
## [49] "twitter" "updat" "use"
## [52] "via" "want" "well"
## [55] "will" "work"
We can see that, out of 3289 terms, only 56 appear at least 20 times in our tweets. This means many terms will be useless for our model.
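We can confirm that count directly:
length(findFreqTerms(frequencies, lowfreq = 20))
## [1] 56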
Let’s look at words that are associated with ‘ipad’, ‘android’, and ‘ios7’, using a correlation threshold of 0.4.
findAssocs(frequencies, c('ipad','android','ios7'), corlimit = 0.4)
## $ipad
## ipodplayerpromo itun ipod
## 0.83 0.80 0.78
## promoipodplayerpromo promo
## 0.78 0.68
##
## $android
## dougrtequan
## 0.53
##
## $ios7
## numeric(0)
We can see that the word ‘ipad’ has a few correlated words, whereas ‘ios7’ has none above the 0.4 threshold.
The term-document matrix is not an ordinary R matrix; it is stored as a sparse object. Let’s convert it into a regular matrix.
freq.matrix <- as.matrix(frequencies)
Let’s get the word counts in decreasing order of frequency.
term.freq <- sort(rowSums(freq.matrix), decreasing = TRUE)
head(term.freq)
## iphon itun new ipad phone get
## 287 121 113 91 86 75
We can see that the stemmed terms ‘iphon’ and ‘itun’ are the most frequent.
Let’s create a data frame of the words and their frequencies
term.df <- data.frame(term=names(term.freq), freq=term.freq)
str(term.df)
## 'data.frame': 3289 obs. of 2 variables:
## $ term: Factor w/ 3289 levels "000","075","0909",..: 1542 1580 1989 1540 2178 1117 1553 1631 1808 1554 ...
## $ freq: num 287 121 113 91 86 75 73 61 61 60 ...
Let’s create a bar plot of words with frequencies greater than 35 and less than 200.
plot <- ggplot(subset(term.df, freq > 35 & freq < 200),
               aes(term, freq, fill = freq)) +
  geom_bar(stat = 'identity') +
  labs(x = 'Terms', y = 'Count', title = 'Term Frequencies')
plot + coord_flip()
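Note that ggplot orders the bars alphabetically by default. If we would rather sort them by frequency, a small variant of the same plot (same data, with the factor levels reordered inline) does the trick:
# Same bar plot, with terms reordered by frequency
ggplot(subset(term.df, freq > 35 & freq < 200),
       aes(reorder(term, freq), freq, fill = freq)) +
  geom_bar(stat = 'identity') +
  labs(x = 'Terms', y = 'Count', title = 'Term Frequencies') +
  coord_flip()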
Let’s create a word cloud of up to 200 words, each with a minimum frequency of 15.
wordcloud(term.df$term, term.df$freq, min.freq = 15, max.words = 200,
          random.order = FALSE, colors = brewer.pal(8, 'Dark2'))
We can see that words like ‘iphon’, ‘itun’, ‘new’, and ‘phone’ appear larger, reflecting their high frequency in our tweets.
Let’s create a document-term matrix (DTM) from the cleaned corpus and view its contents.
dtm <- DocumentTermMatrix(clean_corp)
dtm
## <<DocumentTermMatrix (documents: 1181, terms: 3289)>>
## Non-/sparse entries: 8980/3875329
## Sparsity : 100%
## Maximal term length: 115
## Weighting : term frequency (tf)
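The DTM is simply the transpose of the TDM we built earlier, as a quick sanity check confirms:
identical(dim(dtm), rev(dim(frequencies)))
## [1] TRUE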
We can see that the DTM has 1181 documents and 3289 terms. As we saw earlier, it contains a lot of sparse terms, so let’s try removing them.
Let’s remove the terms that don’t appear often, keeping only terms that appear in 0.5% or more of the tweets.
sparseData <- removeSparseTerms(dtm, sparse=0.995)
sparseData
## <<DocumentTermMatrix (documents: 1181, terms: 309)>>
## Non-/sparse entries: 4669/360260
## Sparsity : 99%
## Maximal term length: 20
## Weighting : term frequency (tf)
We can see that only 309 terms remain in our sparse matrix, about 9% of the previous count of 3289.
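We can compute the exact fraction:
ncol(sparseData) / ncol(dtm)
## [1] 0.09394953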
Let’s convert sparseData to a data frame for further analysis
# Convert to data frame
sparsedf <- as.data.frame(as.matrix(sparseData))
# Look at the head of first 6 columns of the data frame
head(sparsedf[,1:6])
## 244tsuyoponzu 7evenstarz actual add alreadi alway
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
We can see that some of the column names start with a number.
Since R does not allow syntactically valid variable names to start with a number, let’s run the make.names function to ensure all variable names are valid.
# Make variable names
colnames(sparsedf) <- make.names(colnames(sparsedf))
# Look at the head of first 6 columns of the data frame
head(sparsedf[,1:6])
## X244tsuyoponzu X7evenstarz actual add alreadi alway
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
We can see that the character “X” has been prepended to variable names that started with a number.
Let’s add the dependent variable to the data frame.
sparsedf$Negative <- tweets$Negative
Let’s split the data into training and test sets.
set.seed(101)
split <- sample.split(sparsedf$Negative, SplitRatio = 0.7)
trainSparse <- subset(sparsedf, split==TRUE)
testSparse <- subset(sparsedf, split==FALSE)
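Note that sample.split stratifies on the outcome, so both sets should preserve the roughly 15% negative rate of the full data. We can verify:
# Class balance should be preserved in both sets (~15% negative)
prop.table(table(trainSparse$Negative))
prop.table(table(testSparse$Negative))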
Let’s build a baseline CART classification model and plot the resulting tree.
# Build Baseline Model
tweet.CART <- rpart(Negative~., trainSparse, method='class')
# Plot Model
prp(tweet.CART)
Let’s evaluate the predictive performance of our baseline model on the test set.
library(caret)
## Loading required package: lattice
# Make Prediction
predictCART <- predict(tweet.CART, testSparse, type='class')
# Compute Accuracy
table(predictCART, testSparse$Negative)
##
## predictCART FALSE TRUE
## FALSE 298 39
## TRUE 2 16
# Accuracy
postResample(predictCART, testSparse$Negative)
## Accuracy Kappa
## 0.8845070 0.3918947
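The accuracy reported here is simply the proportion of correct predictions, i.e. the diagonal of the confusion matrix divided by the number of test tweets:
conf <- table(predictCART, testSparse$Negative)
sum(diag(conf)) / sum(conf)
## [1] 0.884507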
Looking at the results, we can see that the accuracy is 0.8845, a modest improvement over the 0.8451 we would get by always predicting the majority (non-negative) class.
Let’s evaluate several models and pick the best one. We will use a mixture of a simple linear method (LDA), simple nonlinear methods (CART, kNN), and more complex nonlinear methods (SVM, random forest, C5.0).
control <- trainControl(method='cv', number=10)
metric <- 'Accuracy'
# LDA
set.seed(101)
tweet.lda <- train(Negative ~ ., data=trainSparse, method='lda',
trControl=control, metric=metric)
## Loading required package: MASS
# CART
set.seed(101)
tweet.cart <- train(Negative~., data=trainSparse, method='rpart',
trControl=control, metric=metric)
# KNN
set.seed(101)
tweet.knn <- train(Negative ~ ., data=trainSparse, method='knn',
trControl=control, metric=metric)
# SVM
set.seed(101)
tweet.svm <- train(Negative~., data=trainSparse, method='svmRadial',
trControl=control, metric=metric)
## Loading required package: kernlab
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
##
## alpha
# RF
set.seed(101)
tweet.rf <- train(Negative~., data=trainSparse, method='ranger',
trControl=control, metric=metric)
## Loading required package: e1071
## Loading required package: ranger
# C 5.0
set.seed(101)
tweet.C50 <- train(Negative~., data=trainSparse, method='C5.0',
trControl=control, metric=metric)
## Loading required package: C50
## Loading required package: plyr
Let’s compute and compare the performance of these models.
tweet.results <- resamples(list(lda=tweet.lda, cart=tweet.cart,
knn=tweet.knn, svm=tweet.svm, rf=tweet.rf,
c50=tweet.C50))
# Summarize the results
summary(tweet.results)
##
## Call:
## summary.resamples(object = tweet.results)
##
## Models: lda, cart, knn, svm, rf, c50
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## lda 0.8072 0.8308 0.8545 0.8593 0.8855 0.9146 2
## cart 0.8434 0.8675 0.8841 0.8899 0.9154 0.9398 0
## knn 0.8434 0.8675 0.8795 0.8753 0.8876 0.9024 0
## svm 0.8415 0.8580 0.8728 0.8705 0.8795 0.9036 0
## rf 0.8193 0.8434 0.8537 0.8657 0.8912 0.9268 0
## c50 0.8537 0.8675 0.8789 0.8862 0.9033 0.9268 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## lda 0.3516 0.3896 0.4524 0.4882 0.5754 0.6758 2
## cart 0.2407 0.3575 0.3735 0.4657 0.6330 0.7296 0
## knn 0.1055 0.2347 0.3360 0.3109 0.3843 0.5126 0
## svm 0.0000 0.1260 0.2780 0.2417 0.3560 0.5132 0
## rf 0.2450 0.3256 0.3708 0.4339 0.5683 0.6854 0
## c50 0.1908 0.3155 0.3601 0.4246 0.5716 0.6602 0
# Visualize the results
dotplot(tweet.results)
bwplot(tweet.results)
Looking at the summary and the plots, we can see that the CART model still comes out on top, with the highest mean cross-validation accuracy of 0.8899.
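Under the hood, caret tuned the CART complexity parameter (cp) by cross-validation; if we are curious, the winning value is stored on the fitted object:
tweet.cart$bestTune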
Let’s make predictions on the test data.
# Make Prediction
tweet.pred <- predict(tweet.cart, testSparse)
# Create Confusion Matrix
confusionMatrix(tweet.pred, testSparse$Negative)
## Confusion Matrix and Statistics
##
## Reference
## Prediction FALSE TRUE
## FALSE 298 39
## TRUE 2 16
##
## Accuracy : 0.8845
## 95% CI : (0.8466, 0.9158)
## No Information Rate : 0.8451
## P-Value [Acc > NIR] : 0.02084
##
## Kappa : 0.3919
## Mcnemar's Test P-Value : 1.885e-08
##
## Sensitivity : 0.9933
## Specificity : 0.2909
## Pos Pred Value : 0.8843
## Neg Pred Value : 0.8889
## Prevalence : 0.8451
## Detection Rate : 0.8394
## Detection Prevalence : 0.9493
## Balanced Accuracy : 0.6421
##
## 'Positive' Class : FALSE
##
Looking at the results, we can see that the cross-validated CART model achieves the same test-set accuracy of 0.8845 as our baseline model. Note the low specificity of 0.2909: the model is very good at recognizing non-negative tweets but catches only a minority of the truly negative ones.
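Finally, to score a brand-new tweet we would push it through the same pipeline: clean it, build a DTM restricted to the 309 terms the model was trained on, and fix up the column names. A minimal sketch (the tweet text here is a made-up example):
# Hypothetical new tweet to classify
new_tweet <- "My iPhone battery dies in an hour. I hate it!"
new_corpus <- clean_corpus(Corpus(VectorSource(new_tweet)))
# Restrict the new DTM to the terms the model knows about
new_dtm <- DocumentTermMatrix(new_corpus,
                              control = list(dictionary = Terms(sparseData)))
new_df <- as.data.frame(as.matrix(new_dtm))
colnames(new_df) <- make.names(colnames(new_df))
predict(tweet.cart, new_df)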