Twitter Sentiment Analysis

Description

In this lecture we run sentiment analysis on Indonesian-language tweets containing "vaksin" (vaccine) to learn how people in Indonesia feel about the vaccination program run by their government. We collect the data directly from Twitter using the Twitter API.

Report Outline

1. Data Extraction
2. Exploratory Data Analysis
3. Data Preparation
4. Modeling
5. Evaluation
6. Recommendation

1. Data Extraction

To pull data directly from Twitter, we first need to sign up for a Twitter developer account. Once the account has been approved, a few things need to be set up.

# import library
library(rtweet)

## Warning: package 'rtweet' was built under R version 4.0.5

Next, we store the keys used to connect to our application.

# set keys (placeholders: paste in the values for your own application)
apiKey <- "YOUR_API_KEY"
apiSecret <- "YOUR_API_SECRET"
accessKey <- "YOUR_ACCESS_TOKEN"
accessSecret <- "YOUR_ACCESS_TOKEN_SECRET"

All of these keys can be found in the application's keys and tokens tab. All you need to do is copy them and paste them into your code. Next, we create a token to connect to our app, using the app name and the keys above.

# create token
token <- create_token(
  app = "CovSentimentAnalysis",
  consumer_key = apiKey,
  consumer_secret = apiSecret,
  access_token = accessKey,
  access_secret = accessSecret
)
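Pasting keys into a script makes them easy to leak when the report is shared. A minimal sketch of one alternative, assuming the values have been stored in environment variables (for example in ~/.Renviron). The helper read_key() and the variable name TWITTER_API_KEY are hypothetical and not part of rtweet:

```r
# Read a required key from an environment variable and fail early if it is missing.
# read_key() and the variable name used below are illustrative only.
read_key <- function(name) {
  value <- Sys.getenv(name, unset = NA)
  if (is.na(value) || !nzchar(value)) {
    stop("Environment variable ", name, " is not set", call. = FALSE)
  }
  value
}

# Demo with a dummy value so the sketch runs on its own
Sys.setenv(TWITTER_API_KEY = "dummy-key")
demo_key <- read_key("TWITTER_API_KEY")
demo_key
```

In the real script, each of the four keys could be read the same way before calling create_token(), so that no secret appears in the source file.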

If the connection is successful, you can pull data directly from Twitter using the search_tweets() function from the rtweet library.

# load data
data <- search_tweets(
  "vaksin",
  n = 1000,
  include_rts = FALSE,
  lang = "id"
)

We call search_tweets() with at least two arguments. The first is the search query; in this case, we want tweets that contain the word “vaksin”. The second, n, is the number of tweets to return; we ask for 1000 (if n is omitted, rtweet falls back to its default of 100). Other parameters narrow the results further: in this lecture we exclude retweets and restrict the language to Indonesian.

2. Exploratory Data Analysis

Because we pull the data directly from Twitter, the data set is not labelled yet, and a supervised model needs labels to learn from. We can generate them with the sentimentr library.

# import library
library(sentimentr)

## Warning: package 'sentimentr' was built under R version 4.0.5

sentiment <- sentiment_by(data$text)

summary(sentiment$ave_sentiment)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.447214  0.000000  0.000000 -0.004966  0.000000  0.490153

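To make the data easier to understand, we can plot a histogram of the sentiment scores. The summary above shows that most scores are exactly 0 (both quartiles and the median), likely because sentimentr's default lexicon is English and matches few Indonesian words.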

hist(sentiment$ave_sentiment, breaks = 5)

Now we can split the data by sentiment score. Here we group the tweets into just two categories, positive and negative. A tweet is positive if its score is >= 0, so neutral tweets with a score of exactly 0 count as positive, and negative if its score is < 0. The main purpose of this split is to reduce the time our models need to predict the sentiment later.

# take only ave column
data$ave <- sentiment$ave_sentiment

# build new dataframe for predicting
keeps <- c("text", "ave")
data <- data[keeps]

# split negative
data$Negative <- as.factor(data$ave < 0)
table(data$Negative)

## 
## FALSE  TRUE 
##   874   126

# split positive
data$Positive <- as.factor(data$ave >= 0 )
table(data$Positive)

## 
## FALSE  TRUE 
##   126   874
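The two tables above are mirror images (874/126 and 126/874), so the same information also fits in a single factor column. A minimal sketch on made-up scores; the vector scores and the name label are illustrative and not part of the original analysis:

```r
# Made-up sentiment scores standing in for sentiment$ave_sentiment
scores <- c(-0.45, 0, 0, 0.12, -0.05, 0.49)

# Same rule as above: score < 0 is negative, score >= 0 is positive
label <- factor(
  ifelse(scores < 0, "negative", "positive"),
  levels = c("negative", "positive")
)

table(label)
```

With one label column, a single model and a single train/test split would cover both classes.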

3. Data Preparation

In this step, we clean the text and transform it into a form the models can use.

3.1 Data Cleaning

# remove ampersand
data$text <- gsub("&amp", "", data$text)
# remove retweeted
data$text <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", data$text)
# remove mention
data$text <- gsub("@\\w+", "", data$text)
# remove punctuation
data$text <- gsub("[[:punct:]]", "", data$text)
# remove number
data$text <- gsub("[[:digit:]]", "", data$text)
# remove URL
data$text <- gsub("http\\w+", "", data$text)
# replace line breaks with a space
data$text <- gsub("[\r\n]+", " ", data$text)
# collapse runs of spaces and tabs into a single space
data$text <- gsub("[ \t]{2,}", " ", data$text)
# trim leading and trailing whitespace
data$text <- gsub("^\\s+|\\s+$", "", data$text)
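The cleaning steps are easier to check when they are wrapped in one function and tried on a sample string. A minimal sketch in base R; the function clean_tweet() and the sample tweet are made up for illustration. It differs from the code above in two ways: URLs are removed before punctuation, so the pattern can match the whole address, and matches are replaced with a space so that neighbouring words are not glued together:

```r
# Illustrative helper, not part of the original pipeline
clean_tweet <- function(x) {
  x <- gsub("&amp;?", " ", x)                  # HTML-escaped ampersands
  x <- gsub("https?://[^[:space:]]+", " ", x)  # URLs, before punctuation is stripped
  x <- gsub("@\\w+", " ", x)                   # mentions
  x <- gsub("[[:punct:]]", " ", x)             # punctuation
  x <- gsub("[[:digit:]]", " ", x)             # digits
  x <- gsub("[[:space:]]+", " ", x)            # spaces, tabs, and line breaks
  trimws(x)
}

sample_tweet <- "@user Vaksin  sudah 2x &amp; aman!\nhttps://t.co/abc123"
cleaned <- clean_tweet(sample_tweet)
cleaned
# "Vaksin sudah x aman"
```

Lower-casing is left out here because the corpus step handles it.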

3.2 Corpus

library(tm)

## Loading required package: NLP

corpus <- Corpus(VectorSource(data$text))
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, PlainTextDocument)
corpus <- tm_map(corpus, function(x) iconv(x, "latin1", "ASCII", sub=""))
corpus <- tm_map(corpus, stripWhitespace)

3.3 Stopwords

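library(readr)
# load an Indonesian stopword list, one word per line
stopwords <- read_lines("stopwords-id.txt")
# remove the search term itself together with the stopwords
corpus <- tm_map(corpus, removeWords, c("vaksin", stopwords))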

Because the stopword lists built into tm do not include Indonesian, we need to supply our own Indonesian stopword list.
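To see what stopword removal does to a single tweet, here is a minimal base R sketch. The four-word list stop_id is a tiny made-up sample and not the contents of stopwords-id.txt, and the token filter only approximates what removeWords does:

```r
# Tiny illustrative stopword list (the real file is much longer)
stop_id <- c("yang", "dan", "di", "itu")

tweet <- "vaksin itu aman dan gratis di puskesmas"

# Split into words, then drop the search term and the stopwords
tokens <- strsplit(tweet, " ", fixed = TRUE)[[1]]
kept <- tokens[!tokens %in% c("vaksin", stop_id)]
result <- paste(kept, collapse = " ")
result
# "aman gratis puskesmas"
```

Dropping "vaksin" makes sense because every tweet in the sample contains it, so it cannot help separate the classes.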

3.4 Stemming

# reduce words to their stems; stemDocument uses the English stemmer unless a language is given
corpus <- tm_map(corpus, stemDocument)

3.5 Document Term Matrix (DTM)

dtm <- DocumentTermMatrix(corpus)
inspect(dtm)

## <<DocumentTermMatrix (documents: 1000, terms: 4571)>>
## Non-/sparse entries: 9988/4561012
## Sparsity           : 100%
## Maximal term length: 60
## Weighting          : term frequency (tf)
## Sample             :
##      Terms
## Docs  aman covid cucuk dah kena kinerja nak orang pemerintah vaksinasi
##   14     0     0     0   0    0       0   0     0          0         0
##   157    0     0     0   1    0       0   1     1          0         0
##   236    0     0     0   0    0       0   0     1          0         0
##   288    0     0     0   0    0       0   0     0          0         0
##   343    0     0     0   0    0       0   1     0          0         0
##   51     0     0     0   0    0       0   0     0          0         0
##   547    0     0     0   0    0       0   0     1          0         0
##   826    0     0     0   0    0       0   0     0          0         0
##   881    0     0     0   0    0       0   0     0          0         0
##   984    0     0     0   0    0       0   0     0          0         0

final <- as.data.frame(as.matrix(dtm))

colnames(final) <- make.names(colnames(final))
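A document term matrix has one row per document and one column per term, and each cell counts how often the term occurs in the document. A minimal base R sketch on two made-up documents, which also shows where the sparsity figure in the output above comes from (the share of cells that are zero); the names docs, vocab, and dtm_toy are illustrative:

```r
docs <- c("vaksin aman", "vaksin gratis aman aman")

# Tokenise and collect the vocabulary
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens)))

# Count each vocabulary term in each document (documents in rows)
dtm_toy <- t(vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = vocab))),
  integer(length(vocab))
))
colnames(dtm_toy) <- vocab
dtm_toy

# Share of zero cells, the quantity tm reports as sparsity
sparsity <- mean(dtm_toy == 0)
sparsity
```

In the real matrix, 4,561,012 of the 1000 x 4571 cells are zero, which is why the sparsity rounds to 100%.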

3.6 Split Data

final$Negative <- data$Negative
final$Positive <- data$Positive

library(caTools)

## Warning: package 'caTools' was built under R version 4.0.5

set.seed(2021)

# negative train and test
splitNegative <- sample.split(final$Negative, SplitRatio = 0.7)
trainSparseNegative <- subset(final, splitNegative == TRUE)
testSparseNegative <- subset(final, splitNegative == FALSE)

# positive train and test
splitPositive <- sample.split(final$Positive, SplitRatio = 0.7)
trainSparsePositive <- subset(final, splitPositive == TRUE)
testSparsePositive <- subset(final, splitPositive == FALSE)

actualN <- testSparseNegative$Negative
actualP <- testSparsePositive$Positive
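sample.split() keeps the class ratio roughly the same in the training and test sets. The idea can be sketched in base R by sampling 70% of the row indices within each class. The labels below are synthetic and only reuse the class counts from the tables in section 2; the names labels and is_train are illustrative:

```r
set.seed(2021)

# Synthetic labels with the same class counts as the real data
labels <- factor(c(rep(TRUE, 126), rep(FALSE, 874)))

# Draw 70% of the indices from each class separately
train_idx <- unlist(lapply(
  split(seq_along(labels), labels),
  function(idx) sample(idx, round(0.7 * length(idx)))
))
is_train <- seq_along(labels) %in% train_idx

table(label = labels, train = is_train)
```

This leaves 262 FALSE and 38 TRUE rows for testing, the same row totals that appear in the confusion matrices in section 5.1.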

4. Modeling

library(randomForest)

set.seed(2021)
# build random forest model for negative terms
RFN <- randomForest(Negative ~ ., data = trainSparseNegative, na.action = na.roughfix)

set.seed(2021)
# build random forest model for positive terms
RFP <- randomForest(Positive ~ ., data = trainSparsePositive, na.action = na.roughfix)
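One thing to check in the formulas above: Negative ~ . uses every other column as a predictor, and that includes Positive, which is the logical complement of Negative (the same holds the other way round). A minimal sketch, on a made-up four-row frame, of testing for this and dropping the other label column before training; the names toy, leaks, and train_negative are illustrative:

```r
toy <- data.frame(
  aman     = c(1, 0, 0, 1),
  kena     = c(0, 1, 1, 0),
  Negative = factor(c(FALSE, TRUE, TRUE, FALSE)),
  Positive = factor(c(TRUE, FALSE, FALSE, TRUE))
)

# Positive is TRUE exactly when Negative is FALSE, so it gives the answer away
leaks <- all((toy$Positive == "TRUE") == (toy$Negative == "FALSE"))
leaks

# Keep only the term columns and the target
train_negative <- toy[, setdiff(names(toy), "Positive")]
names(train_negative)
```

With a frame prepared like this, a call such as randomForest(Negative ~ ., data = train_negative) would see only the term columns.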

5. Evaluation

5.1 Predict Test Data

pred.forestN <- predict(RFN, testSparseNegative, type = "response")
cm.forestN <- table(actualN, pred.forestN)
cm.forestN

##        pred.forestN
## actualN FALSE TRUE
##   FALSE   262    0
##   TRUE      0   38

pred.forestP <- predict(RFP, testSparsePositive, type = "response")
cm.forestP <- table(actualP, pred.forestP)
cm.forestP

##        pred.forestP
## actualP FALSE TRUE
##   FALSE    38    0
##   TRUE      0  262

5.2 Compute Accuracy

perf_rfn <- (cm.forestN[1,1] + cm.forestN[2,2]) / (cm.forestN[1,1] + cm.forestN[2,2] + cm.forestN[1,2] + cm.forestN[2,1])
round(perf_rfn, 3)

## [1] 1

perf_rfp <- (cm.forestP[1,1] + cm.forestP[2,2]) / (cm.forestP[1,1] + cm.forestP[2,2] + cm.forestP[1,2] + cm.forestP[2,1])
round(perf_rfp, 3)

## [1] 1

acc_total <- (perf_rfn + perf_rfp) / 2
round(acc_total, 3)

## [1] 1
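The accuracy formula used above is the sum of the diagonal of the confusion matrix divided by the total count. A minimal sketch using the counts printed in 5.1; the helper accuracy() and the majority-class baseline are additions for illustration and not part of the original report:

```r
# Correct predictions sit on the diagonal of the confusion matrix
accuracy <- function(cm) sum(diag(cm)) / sum(cm)

# Counts copied from the negative-class confusion matrix in 5.1
cm <- matrix(
  c(262, 0,
      0, 38),
  nrow = 2, byrow = TRUE,
  dimnames = list(actual = c("FALSE", "TRUE"), predicted = c("FALSE", "TRUE"))
)

acc <- accuracy(cm)
acc

# Accuracy of always predicting the larger class, as a reference point
baseline <- max(rowSums(cm)) / sum(cm)
round(baseline, 3)
```

Because 262 of the 300 test tweets are non-negative, always predicting the larger class would already be right about 87% of the time, so accuracy is best read against that reference.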

6. Recommendation

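6.1 The random forest models reach 100% accuracy on the test set (the value printed above is 1). A perfect score like this should be read with caution: each training frame still contains the other label column (Positive when predicting Negative, and vice versa), which is the exact complement of the target.
6.2 Try a different algorithm.
6.3 Do further data cleaning.
6.4 Experiment further with the data analysis.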