In this lecture we run a sentiment analysis on Indonesian tweets containing the keyword "vaksin" to find out how people in Indonesia feel about the vaccination program run by their government. We collect the data directly from Twitter through the Twitter API.
To pull data directly from Twitter, we first need to apply for a Twitter developer account. Once the account has been approved, there are a few things to set up.
# import library
library(rtweet)
## Warning: package 'rtweet' was built under R version 4.0.5
Next, we set the keys that connect R to our Twitter application.
# set keys (replace the placeholders with the credentials of your own app)
apiKey <- "YOUR_API_KEY"
apiSecret <- "YOUR_API_SECRET"
accessKey <- "YOUR_ACCESS_TOKEN"
accessSecret <- "YOUR_ACCESS_TOKEN_SECRET"
All of these keys can be found in the keys and tokens tab of your application. All you need to do is copy them and paste them into your code. Next, we create a token that connects to our app using the app name and the keys above.
# create token
token <- create_token(
app = "CovSentimentAnalysis",
consumer_key = apiKey,
consumer_secret = apiSecret,
access_token = accessKey,
access_secret = accessSecret
)
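Keys pasted into a script are easy to leak when the file is shared. A minimal sketch of one alternative, assuming the credentials are stored in environment variables: the variable name TWITTER_API_KEY and the helper read_key() are hypothetical and not part of rtweet.

```r
# Hypothetical helper: read one credential from an environment variable
# and fail early with a clear message when it is missing.
read_key <- function(name) {
  value <- Sys.getenv(name, unset = "")
  if (!nzchar(value)) {
    stop("Environment variable ", name, " is not set", call. = FALSE)
  }
  value
}

# Demo only: set a dummy value so the sketch runs on its own.
# In practice the variable would be defined outside the script,
# for example in ~/.Renviron.
Sys.setenv(TWITTER_API_KEY = "dummy-key-for-demo")

apiKey <- read_key("TWITTER_API_KEY")
```

The same helper could be called once per credential, so the script itself never contains a secret.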
If the connection is successful, you can pull data directly from Twitter using the search_tweets() function from the rtweet package.
# load data
data <- search_tweets(
"vaksin",
n = 1000,
include_rts = FALSE,
lang = "id"
)
We call search_tweets() with two main arguments. The first is the search query; in this case we want tweets that contain the word "vaksin". The second, n, is the number of tweets to request (rtweet falls back to 100 when it is omitted); we choose to pull 1000 tweets about "vaksin". Other arguments narrow the results further: in this lecture we exclude retweets and restrict the language to Indonesian.
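Because we pull the data directly from Twitter, the data set is not labelled yet, and we need labels before we can train a classifier. We can create them with the sentimentr package. Keep in mind that sentimentr's default polarity lexicon is English, so most Indonesian words are not matched and many tweets score exactly 0.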
# import library
library(sentimentr)
## Warning: package 'sentimentr' was built under R version 4.0.5
sentiment <- sentiment_by(data$text)
summary(sentiment$ave_sentiment)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.447214 0.000000 0.000000 -0.004966 0.000000 0.490153
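Lexicon-based scoring helps explain why so many scores are exactly 0. The sketch below is a heavily simplified illustration of the idea, not sentimentr's actual algorithm (which also handles negators, amplifiers and other valence shifters). The three-word lexicon and the function toy_score() are made up for this example.

```r
# Made-up polarity lexicon: word -> polarity value
toy_lexicon <- c(good = 1, safe = 1, bad = -1)

# Sum the polarity of the matched words and divide by sqrt(word count);
# words that are not in the lexicon contribute 0.
toy_score <- function(text) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  polarity <- toy_lexicon[words]        # NA for unmatched words
  polarity[is.na(polarity)] <- 0
  sum(polarity) / sqrt(length(words))
}

toy_score("vaksin ini aman dan gratis")   # no lexicon match: score 0
toy_score("vaksin ini bad dan gratis")    # one negative word in five words
```

The second call returns -1/sqrt(5), roughly -0.447, while the purely Indonesian sentence scores 0 because none of its words appear in the lexicon.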
To make the data easier to understand, we can plot a histogram of the sentiment scores.
hist(sentiment$ave_sentiment, breaks = 5)
Now we can split the data based on the sentiment score. Here we group the tweets into just two categories, positive and negative. A tweet is positive when its sentiment score is >= 0 and negative when its score is < 0. The main purpose of splitting the data is to reduce the time our models need to predict the sentiment later.
# attach the average sentiment score to each tweet
data$ave <- sentiment$ave_sentiment
# keep only the columns needed for modelling
keeps <- c("text", "ave")
data <- data[keeps]
# flag negative tweets (score below 0)
data$Negative <- as.factor(data$ave < 0)
table(data$Negative)
##
## FALSE TRUE
## 874 126
# flag positive tweets (score of 0 or above)
data$Positive <- as.factor(data$ave >= 0)
table(data$Positive)
##
## FALSE TRUE
## 126 874
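The labelling rule can be tried on a handful of values. The scores below are made up for illustration; the point of the sketch is that a score of exactly 0 (no sentiment detected) lands in the positive class under the >= 0 rule.

```r
# Made-up sentiment scores; 0 means no polarity was detected
ave <- c(-0.45, -0.10, 0, 0, 0, 0.25, 0.49)

negative <- as.factor(ave < 0)
positive <- as.factor(ave >= 0)

table(negative)   # FALSE: 5, TRUE: 2
table(positive)   # FALSE: 2, TRUE: 5

# Share of the "positive" class that is really a neutral score of 0
neutral_share <- mean(ave[ave >= 0] == 0)
neutral_share     # 3 of 5 = 0.6
```

Because the two factors are exact complements, the second table is always the first one with FALSE and TRUE swapped, as in the output above.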
In this step we clean and transform the data.
3.1 Data Cleaning
# remove HTML-encoded ampersands
data$text <- gsub("&amp;", "", data$text)
# remove retweet markers (RT @user, via @user)
data$text <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", data$text)
# remove mentions
data$text <- gsub("@\\w+", "", data$text)
# remove punctuation
data$text <- gsub("[[:punct:]]", "", data$text)
# remove digits
data$text <- gsub("[[:digit:]]", "", data$text)
# remove URLs (their punctuation is already gone at this point)
data$text <- gsub("http\\w+", "", data$text)
# replace line breaks with a space so neighbouring words are not joined
data$text <- gsub("[\r\n]", " ", data$text)
# collapse runs of spaces and tabs into a single space
data$text <- gsub("[ \t]{2,}", " ", data$text)
# trim leading and trailing whitespace
data$text <- gsub("^\\s+|\\s+$", "", data$text)
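To check what the cleaning steps do, it helps to run them on a single string. The sketch below chains them into one function; clean_tweet() and the sample tweet are hypothetical. It assumes that ampersands arrive HTML-encoded as &amp; and that whitespace runs should become one space rather than be deleted.

```r
# Hypothetical helper that chains the cleaning steps into one function
clean_tweet <- function(x) {
  # HTML-encoded ampersands
  x <- gsub("&amp;", "", x, fixed = TRUE)
  # retweet markers such as "RT @user"
  x <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", x, perl = TRUE)
  # remaining mentions
  x <- gsub("@\\w+", "", x)
  # punctuation, then digits
  x <- gsub("[[:punct:]]", "", x)
  x <- gsub("[[:digit:]]", "", x)
  # URLs, which have lost their punctuation by now
  x <- gsub("http\\w+", "", x)
  # line breaks become spaces, whitespace runs collapse to one space
  x <- gsub("[\r\n]", " ", x)
  x <- gsub("[ \t]{2,}", " ", x)
  # trim both ends
  gsub("^\\s+|\\s+$", "", x)
}

tweet <- "RT @user: Vaksin   covid-19 aman &amp; gratis!\nhttps://t.co/abc123"
cleaned <- clean_tweet(tweet)
cleaned
```

For this sample the result is "Vaksin covid aman gratis". Wrapping the steps in a function also makes it easy to reuse the same cleaning on new tweets later.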
3.2 Corpus
library(tm)
## Loading required package: NLP
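# build a corpus with one document per tweet
corpus <- Corpus(VectorSource(data$text))
# convert everything to lower case
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, PlainTextDocument)
# drop characters that cannot be represented in ASCII
corpus <- tm_map(corpus, function(x) iconv(x, "latin1", "ASCII", sub=""))
# collapse repeated whitespace
corpus <- tm_map(corpus, stripWhitespace)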
3.3 Stopwords
library(readr)
# load our own Indonesian stopword list, one word per line
stopwords <- read_lines("stopwords-id.txt")
# remove the search keyword "vaksin" together with the stopwords
corpus <- tm_map(corpus, removeWords, c("vaksin", stopwords))
The stopword lists that ship with tm do not cover Indonesian, so we need to supply our own list of Indonesian stopwords.
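Stopword removal works on whole words, which is why a term like "vaksinasi" survives even though "vaksin" is removed. A rough base R sketch of that behaviour, not tm's implementation, using a made-up three-word stopword list in place of stopwords-id.txt:

```r
# Hypothetical mini stopword list (the real one comes from stopwords-id.txt)
stop_id <- c("yang", "dan", "di")
remove_words <- c("vaksin", stop_id)

# One alternation wrapped in word boundaries, so only whole words match
pattern <- paste0("\\b(", paste(remove_words, collapse = "|"), ")\\b")

text <- "vaksin yang aman dan vaksinasi gratis"
kept <- gsub(pattern, "", text, perl = TRUE)
# tidy up the gaps left behind by the removed words
kept <- trimws(gsub("\\s+", " ", kept))
kept
```

The result is "aman vaksinasi gratis": the standalone words are gone, while the longer word that merely starts with "vaksin" is kept.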
3.4 Stemming
# note: stemDocument() applies an English stemmer unless another language is set
corpus <- tm_map(corpus, stemDocument)
3.5 Document Term Matrix (DTM)
dtm <- DocumentTermMatrix(corpus)
inspect(dtm)
## <<DocumentTermMatrix (documents: 1000, terms: 4571)>>
## Non-/sparse entries: 9988/4561012
## Sparsity : 100%
## Maximal term length: 60
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs aman covid cucuk dah kena kinerja nak orang pemerintah vaksinasi
## 14 0 0 0 0 0 0 0 0 0 0
## 157 0 0 0 1 0 0 1 1 0 0
## 236 0 0 0 0 0 0 0 1 0 0
## 288 0 0 0 0 0 0 0 0 0 0
## 343 0 0 0 0 0 0 1 0 0 0
## 51 0 0 0 0 0 0 0 0 0 0
## 547 0 0 0 0 0 0 0 1 0 0
## 826 0 0 0 0 0 0 0 0 0 0
## 881 0 0 0 0 0 0 0 0 0 0
## 984 0 0 0 0 0 0 0 0 0 0
final <- as.data.frame(as.matrix(dtm))
colnames(final) <- make.names(colnames(final))
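A document term matrix has one row per document and one column per term, with each cell holding how often the term occurs in that document. The sketch below builds a tiny one in base R from three made-up documents; it only illustrates the structure and is not how tm builds its matrix internally.

```r
# Three made-up, already cleaned documents
docs <- c("aman vaksinasi", "orang kena covid", "orang aman")

tokens <- strsplit(docs, " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens)))

# Count every vocabulary term in every document
counts <- vapply(
  tokens,
  function(tk) as.integer(table(factor(tk, levels = vocab))),
  integer(length(vocab))
)

# Rows = documents, columns = terms
toy_dtm <- t(counts)
dimnames(toy_dtm) <- list(Docs = as.character(seq_along(docs)), Terms = vocab)

toy_dtm
mean(toy_dtm == 0)   # share of zero cells, i.e. the sparsity
```

Even in this toy case more than half of the cells are zero. With 1000 tweets and thousands of terms almost every cell is zero, which is why the output above reports a sparsity of 100% after rounding.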
3.6 Split Data
final$Negative <- data$Negative
final$Positive <- data$Positive
library(caTools)
## Warning: package 'caTools' was built under R version 4.0.5
set.seed(2021)
# negative train and test
splitNegative <- sample.split(final$Negative, SplitRatio = 0.7)
trainSparseNegative <- subset(final, splitNegative == TRUE)
testSparseNegative <- subset(final, splitNegative == FALSE)
# positive train and test
splitPositive <- sample.split(final$Positive, SplitRatio = 0.7)
trainSparsePositive <- subset(final, splitPositive == TRUE)
testSparsePositive <- subset(final, splitPositive == FALSE)
actualN <- testSparseNegative$Negative
actualP <- testSparsePositive$Positive
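sample.split() splits on the label vector so that both classes keep roughly the same proportions in the training and test sets. A rough base R sketch of that idea (not caTools' implementation), using the class counts 874 and 126 from the tables earlier:

```r
set.seed(2021)

# Label vector with the same class counts as data$Negative
labels <- rep(c(FALSE, TRUE), times = c(874, 126))

# Draw 70% of the rows of each class for training
in_train <- logical(length(labels))
for (cls in unique(labels)) {
  idx <- which(labels == cls)
  n_train <- round(0.7 * length(idx))
  in_train[sample(idx, n_train)] <- TRUE
}

# Class counts in the held-out 30%
table(labels[!in_train])
```

The held-out set contains 262 FALSE and 38 TRUE rows, the same class sizes that appear in the confusion matrices of the test data.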
library(randomForest)
set.seed(2021)
# build random forest model for negative terms
RFN <- randomForest(Negative ~ ., data = trainSparseNegative, na.action = na.roughfix)
set.seed(2021)
# build random forest model for positive terms
RFP <- randomForest(Positive ~ ., data = trainSparsePositive, na.action = na.roughfix)
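With a formula such as Negative ~ ., every other column of the training data becomes a predictor. Since the training data here carries both label columns, the Negative model also sees Positive (and vice versa), and the two are exact complements. One possible way to avoid this is to drop the other label before fitting. The sketch uses a made-up toy data frame; the names toy_final, negativeData and positiveData are hypothetical.

```r
# Toy stand-in for the modelling data: two term columns plus both labels
neg <- c(FALSE, TRUE, TRUE, FALSE, FALSE)
toy_final <- data.frame(
  aman = c(1, 0, 0, 1, 0),
  kena = c(0, 1, 1, 0, 0),
  Negative = as.factor(neg),
  Positive = as.factor(!neg)
)

# Positive is the exact complement of Negative, so it gives the answer away
all(as.character(toy_final$Positive) != as.character(toy_final$Negative))

# Keep only the term columns and the one label each model should predict
negativeData <- toy_final[, setdiff(names(toy_final), "Positive")]
positiveData <- toy_final[, setdiff(names(toy_final), "Negative")]

names(negativeData)
names(positiveData)
```

A model fitted on negativeData with Negative ~ . would then learn from the term columns only.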
5.1 Predict Test Data
pred.forestN <- predict(RFN, testSparseNegative, type = "response")
cm.forestN <- table(actualN, pred.forestN)
cm.forestN
## pred.forestN
## actualN FALSE TRUE
## FALSE 262 0
## TRUE 0 38
pred.forestP <- predict(RFP, testSparsePositive, type = "response")
cm.forestP <- table(actualP, pred.forestP)
cm.forestP
## pred.forestP
## actualP FALSE TRUE
## FALSE 38 0
## TRUE 0 262
5.2 Compute Accuracy
perf_rfn <- sum(diag(cm.forestN)) / sum(cm.forestN)
round(perf_rfn, 3)
## [1] 1
perf_rfp <- sum(diag(cm.forestP)) / sum(cm.forestP)
round(perf_rfp, 3)
## [1] 1
acc_total <- (perf_rfn + perf_rfp) / 2
round(acc_total, 3)
## [1] 1
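Accuracy is easier to judge next to a simple baseline. The sketch below re-enters the confusion matrix for the negative model from the output above and compares its accuracy with what a model that always predicts the majority class would achieve. The variable names are made up for the example.

```r
# Confusion matrix of the negative model, copied from the output above
cm <- matrix(
  c(262, 0,
    0, 38),
  nrow = 2, byrow = TRUE,
  dimnames = list(actual = c("FALSE", "TRUE"), predicted = c("FALSE", "TRUE"))
)

# Correct predictions sit on the diagonal
accuracy <- sum(diag(cm)) / sum(cm)

# Always predicting the larger class is right for every row of that class
baseline <- max(rowSums(cm)) / sum(cm)

# Share of truly negative tweets that the model finds
recall_negative <- cm["TRUE", "TRUE"] / sum(cm["TRUE", ])

round(c(accuracy = accuracy, baseline = baseline, recall = recall_negative), 3)
```

With 262 of the 300 test tweets in one class, the majority baseline already reaches about 0.873, so per-class measures such as recall tell more about the model than accuracy alone.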
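6.1 The random forest models reach a total accuracy of 100% on the test set. Note that the data frame final contains both the Negative and the Positive column, so each model sees the other label among its predictors, which by itself is enough to explain the perfect score.
6.2 Try a different algorithm.
6.3 Do further data cleaning.
6.4 Explore the data further.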