This presentation, written in R Markdown, walks through a text analysis of Twitter data. The running example is a sentiment analysis of tweets about Starbucks.
First, load the required packages:
library(twitteR) # Twitter API access and account setup
library(sentiment) # old package; must be installed from source
## Loading required package: tm
## Loading required package: Rstem
library(plyr)
##
## Attaching package: 'plyr'
##
## The following object is masked from 'package:twitteR':
##
## id
library(ggplot2) # plotting
library(wordcloud) # word cloud visualization
## Loading required package: RColorBrewer
library(RColorBrewer) # colors for wordcloud
library(Rstem) # needed for the sentiment package
After loading the packages, we search for tweets that mention “starbucks” and keep their text in some_txt.
Each tweet is a single character string. The first five look like this:
some_txt[1:5]
## [1] "Almost June :::-----))))) so that means I can finally get starbucks after 2 months :-)"
## [2] "RT @twerkteam_anna: I want Starbucks"
## [3] "Read the Green sign. Attitude? Go to #Starbucks. Cry Baby? Go to Starbucks. Jason the coffee man puts… https://t.co/9eiwejEWW4"
## [4] "I love the kindness in people. #niceday #people #kind #starbucks #icedcoffee https://t.co/3s3MpfvXcZ"
## [5] "RT @AshleyPosts: must try strabucks secret menu \xed\xa0\xbd\xed\xb2\x95\u2615️ #3 is my favorite! ❤️ http://t.co/h0q5vWdcMU http://t.co/fHIQKXOx6I"
Next, we clean the text by removing retweet markers, @-mentions, punctuation, numbers, links, and unnecessary spaces, and by converting everything to lower case. Tweets that cannot be lower-cased come back as NA, and we drop them afterwards.
# remove retweet entities
some_txt = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", some_txt)
# remove @-mentions
some_txt = gsub("@\\w+", "", some_txt)
# remove punctuation
some_txt = gsub("[[:punct:]]", "", some_txt)
# remove numbers
some_txt = gsub("[[:digit:]]", "", some_txt)
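# remove links (punctuation is already gone, so each URL is now a single "http..." token)
some_txt = gsub("http\\w+", "", some_txt)
# collapse repeated spaces or tabs into a single space, so neighbouring words are not glued together
some_txt = gsub("[ \t]{2,}", " ", some_txt)
# trim leading and trailing whitespace
some_txt = gsub("^\\s+|\\s+$", "", some_txt)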
# define a "tolower" wrapper that returns NA instead of raising an error
try.error = function(x)
{
  # default result: missing value
  y = NA
  # attempt the conversion, capturing any error
  try_error = tryCatch(tolower(x), error=function(e) e)
  # keep the lower-cased text only if no error occurred
  if (!inherits(try_error, "error"))
    y = tolower(x)
  # result
  return(y)
}
# lower case using try.error with sapply
some_txt = sapply(some_txt, try.error)
# remove NAs in some_txt
some_txt = some_txt[!is.na(some_txt)]
names(some_txt) = NULL
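As a rough illustration of the cleaning steps, the sketch below bundles the same substitutions into one function and runs it on two of the sample tweets. The names `clean_tweet` and `safe_tolower` are made up for this example, and `perl = TRUE` is an addition so that the `(?:...)` group in the first pattern is handled by a well-defined regex engine. The order matters: punctuation is stripped before links, so a URL has already collapsed into one token starting with `http` by the time the link pattern runs. Repeated spaces are collapsed to a single space so that neighbouring words stay separate.

```r
# Hypothetical helper: lower-case x, or return NA if tolower() raises an error
safe_tolower <- function(x) {
  tryCatch(tolower(x), error = function(e) NA_character_)
}

# Hypothetical wrapper around the cleaning steps; the order of the steps matters
clean_tweet <- function(x) {
  x <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", x, perl = TRUE)  # retweet entities
  x <- gsub("@\\w+", "", x)          # remaining @-mentions
  x <- gsub("[[:punct:]]", "", x)    # punctuation (turns a URL into one token)
  x <- gsub("[[:digit:]]", "", x)    # numbers
  x <- gsub("http\\w+", "", x)       # what is left of the links
  x <- gsub("[ \t]{2,}", " ", x)     # repeated spaces/tabs -> one space
  x <- gsub("^\\s+|\\s+$", "", x)    # leading/trailing whitespace
  out <- vapply(x, safe_tolower, character(1), USE.NAMES = FALSE)
  out[!is.na(out)]                   # drop tweets that could not be lower-cased
}

sample_tweets <- c(
  "RT @twerkteam_anna: I want Starbucks",
  "I love the kindness in people. #niceday #people #kind #starbucks #icedcoffee https://t.co/3s3MpfvXcZ"
)
cleaned <- clean_tweet(sample_tweets)
print(cleaned)
```

Wrapping the steps in a function like this makes it easy to try the cleaning on a handful of tweets before applying it to the whole vector.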
Next, we classify each tweet by emotion. The sentiment package defines six emotion categories (anger, disgust, fear, joy, sadness, and surprise), which show up again later in the comparison word cloud. Tweets that do not fall into any of these categories are labelled “unknown”.
# classify emotion
class_emo = classify_emotion(some_txt, algorithm="bayes", prior=1.0)
# get emotion best fit
emotion = class_emo[,7]
# substitute NA's by "unknown"
emotion[is.na(emotion)] = "unknown"
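The positional index 7 is easier to follow once the shape of the result is visible. The sketch below uses a hand-made matrix, assuming the result has six score columns followed by a best-fit label column. The column names and all the scores are made up for illustration and are not real output from classify_emotion().

```r
# Made-up scores in the assumed layout of a classify_emotion() result
mock_emo <- matrix(
  c("1.47", "3.09", "2.07", "7.34", "1.73", "2.79", "joy",
    "1.47", "3.09", "2.07", "1.03", "1.73", "2.79", NA),
  nrow = 2, byrow = TRUE,
  dimnames = list(NULL, c("ANGER", "DISGUST", "FEAR", "JOY",
                          "SADNESS", "SURPRISE", "BEST_FIT"))
)

# Column 7 holds the best-fit label; NA means no emotion was assigned
best <- mock_emo[, 7]
best[is.na(best)] <- "unknown"
print(best)
```

Because everything sits in one character matrix, the scores would need as.numeric() before any arithmetic.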
At the same time, we classify the polarity of each tweet, again with the naive Bayes algorithm.
# classify polarity
class_pol = classify_polarity(some_txt, algorithm="bayes")
# get polarity best fit
polarity = class_pol[,4]
# data frame with results
sent_df = data.frame(text=some_txt, emotion=emotion,
                     polarity=polarity, stringsAsFactors=FALSE)
# order the emotion levels by frequency, most common first (this sets the bar order in the plots)
sent_df = within(sent_df,
                 emotion <- factor(emotion, levels=names(sort(table(emotion), decreasing=TRUE))))
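The factor() call does not reorder the rows of the data frame. It changes the order of the factor levels, which is what ggplot2 uses to arrange the bars. A small base-R sketch with made-up labels (the names `emo_labels` and `emo_factor` are only for this example) shows the effect:

```r
# Made-up emotion labels, standing in for sent_df$emotion
emo_labels <- c("joy", "unknown", "joy", "sadness", "unknown", "unknown")

# Count each label, sort the counts in decreasing order,
# and use that order as the factor levels
freq <- sort(table(emo_labels), decreasing = TRUE)
emo_factor <- factor(emo_labels, levels = names(freq))

print(levels(emo_factor))
print(table(emo_factor))
```

Without this step the levels would be alphabetical, and the bars would follow that order instead.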
Now we can plot how the tweets were classified.
# plot distribution of emotions
ggplot(sent_df, aes(x=emotion)) +
  geom_bar(aes(y=..count.., fill=emotion)) + scale_fill_brewer(palette="Dark2")
# plot distribution of polarity
ggplot(sent_df, aes(x=polarity)) +
  geom_bar(aes(y=..count.., fill=polarity)) + scale_fill_brewer(palette="RdGy")
Finally, we build a comparison word cloud to see which words appear under each emotion.
# separating text by emotion
emos = levels(factor(sent_df$emotion))
nemo = length(emos)
emo.docs = rep("", nemo)
for (i in 1:nemo)
{
  tmp = some_txt[emotion == emos[i]]
  emo.docs[i] = paste(tmp, collapse=" ")
}
# remove stopwords
emo.docs = removeWords(emo.docs, stopwords("english"))
# create corpus
corpus = Corpus(VectorSource(emo.docs))
tdm = TermDocumentMatrix(corpus)
tdm = as.matrix(tdm)
colnames(tdm) = emos
# comparison word cloud
comparison.cloud(tdm, colors = brewer.pal(nemo, "Dark2"),
                 random.order = FALSE, title.size = 1.5)
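comparison.cloud() takes a matrix with one row per term and one column per document, which is the shape the TermDocumentMatrix step produces. To make that shape concrete, here is a base-R sketch that builds such a matrix by hand from two made-up documents. It only illustrates the data structure and is not how the tm package does it internally; all names in it are hypothetical.

```r
# Two made-up "documents", one per emotion (stand-ins for emo.docs)
toy_docs <- c(joy = "love coffee love", sadness = "cry coffee")

# Split each document into words and collect the overall vocabulary
toy_terms <- strsplit(toy_docs, " +")
vocab <- sort(unique(unlist(toy_terms, use.names = FALSE)))

# Count every vocabulary word in every document:
# rows = terms, columns = documents
toy_tdm <- vapply(
  toy_terms,
  function(w) as.integer(table(factor(w, levels = vocab))),
  integer(length(vocab))
)
rownames(toy_tdm) <- vocab
print(toy_tdm)
```

In the real pipeline the columns are the emotion categories, which is why colnames(tdm) is set to emos before plotting.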