This report describes sentiment analysis using a classification model. The dataset used in this project is from Kaggle: https://www.kaggle.com/prakharrathi25/google-play-store-reviews. Our challenge is to see whether we can correctly classify a review as negative or positive.
Report outline:
1. Data Extraction
2. Exploratory Data Analysis
3. Data Preparation
4. Modeling
5. Evaluation
6. Recommendation
1.1 Load Libraries:
library(tm)
## Loading required package: NLP
library(SnowballC)
library(caTools)
library(rpart)
library(rpart.plot)
library(ROCR)
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
library(e1071)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
## The following object is masked from 'package:NLP':
##
## annotate
library(wordcloud)
## Loading required package: RColorBrewer
1.2 Load the Data:
dataframe <- read.csv("data/reviews.csv", stringsAsFactors = FALSE)
Setting the parameter stringsAsFactors to FALSE means we don't want strings converted to factors.
1.3 Get Dataframe Summary:
summary(dataframe)
## reviewId userName userImage content
## Length:12495 Length:12495 Length:12495 Length:12495
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## score thumbsUpCount reviewCreatedVersion at
## Min. :1.000 Min. : 0.000 Length:12495 Length:12495
## 1st Qu.:2.000 1st Qu.: 0.000 Class :character Class :character
## Median :3.000 Median : 0.000 Mode :character Mode :character
## Mean :3.094 Mean : 3.047
## 3rd Qu.:4.000 3rd Qu.: 1.000
## Max. :5.000 Max. :397.000
## replyContent repliedAt sortOrder appId
## Length:12495 Length:12495 Length:12495 Length:12495
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
As seen above, the summary() function gives us information about the dataframe, such as the minimum, maximum, and mean values. In the next step, we will use score to derive the target variable, since we want to predict the sentiment behind the reviews. We can also see several columns that are unrelated to our goal; we will remove them in the data preparation step.
1.4 Explore Data Structure
str(dataframe)
## 'data.frame': 12495 obs. of 12 variables:
## $ reviewId : chr "gp:AOqpTOEhZuqSqqWnaKRgv-9ABYdajFUB0WugPGh-SG-fgH355YH_t7J2q4xYo6ZzN3Mc7iSrrTV6ke8hG_fl4Q" "gp:AOqpTOH0WP4IQKBZ2LrdNmFy_YmpPCVrV3diEU9KGm3fAX6VG0NAZCudCQpQRRI3GLL_tr8DQzUTP1hrOYG74A" "gp:AOqpTOEMCkJB8Iq1p-r9dPwnSYadA5BkPWTf32Z1azuuTvqA9KWdTQqNNXWZsJEhmSuYUY_LmL-OdUIl4j70wg" "gp:AOqpTOGFrUWuKGycpje8kszj3uwHN6tU_fd4gLVFy9z7hfGM7Gan22TJrN89NmGVEdj5o4U6W4I6slbTx8OsQw" ...
## $ userName : chr "Eric Tie" "john alpha" "Sudhakar .S" "SKGflorida@bellsouth.net DAVID S" ...
## $ userImage : chr "https://play-lh.googleusercontent.com/a-/AOh14GiGET2XHTvsSEsA07ZBPu2s1E6fOXd9WyT_ahChpw" "https://play-lh.googleusercontent.com/a-/AOh14GjpfgjOEbD3brypMeMT3KvhYlWG_nO2bMnMIfY9" "https://play-lh.googleusercontent.com/a-/AOh14GidHUHTvHZTXBX36CdxFeccVR2IasC1MHUHXLuFpg" "https://play-lh.googleusercontent.com/-75aK0WFniac/AAAAAAAAAAI/AAAAAAAAAAA/AMZuucn_nhfTJ2FT63nZ53feI1vVx58DJg/photo.jpg" ...
## $ content : chr "I cannot open the app anymore" "I have been begging for a refund from this app for over a month and nobody is replying me" "Very costly for the premium version (approx Indian Rupees 910 per year). Better to download the premium version"| __truncated__ "Used to keep me organized, but all the 2020 UPDATES have made a mess of things !!! Y cudn't u leave well enuf a"| __truncated__ ...
## $ score : int 1 1 1 1 1 1 1 1 1 1 ...
## $ thumbsUpCount : int 0 0 0 0 0 1 0 0 0 1 ...
## $ reviewCreatedVersion: chr "5.4.0.6" "" "" "" ...
## $ at : chr "2020-10-27 21:24:41" "2020-10-27 14:03:28" "2020-10-27 08:18:40" "2020-10-26 13:28:07" ...
## $ replyContent : chr "" "Please note that from checking our records, your email has been answered, and there was no subscription registe"| __truncated__ "" "What do you find troublesome about the update? We'd love to get your feedback, by writing to us at https://www."| __truncated__ ...
## $ repliedAt : chr "" "2020-10-27 15:05:52" "" "2020-10-26 14:58:29" ...
## $ sortOrder : chr "newest" "newest" "newest" "newest" ...
## $ appId : chr "com.anydo" "com.anydo" "com.anydo" "com.anydo" ...
We have 12,495 observations of 12 variables, and we only need two of them, content and score, so we will remove the other columns.
1.5 Remove Unnecessary Columns
# use just content and score column
dataframe <- dataframe[ , c("content", "score")]
Since we want to predict the sentiment behind the reviews, we only keep the relevant columns and drop the rest.
2.1 Plotting Variable ‘score’
hist(dataframe$score, breaks = 5)
The histogram shows that the 'score' variable takes values from 1 to 5, where 1 is most negative and 5 is most positive. In this project, we will collapse the scores into just positive and negative sentiment: negative reviews have a score of 1-2, and positive reviews have a score of 4-5.
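For the exact counts behind the histogram bars, here is a quick check (a minimal sketch using the dataframe loaded above):
# exact number of reviews per score value
table(dataframe$score)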
2.2 Building WordCloud
library(wordcloud)
library(RColorBrewer)
library(wordcloud2)
# building wordcloud
set.seed(2021)
wordcloud(dataframe$content, min.freq = 10, colors = brewer.pal(8, "Dark2"), random.color = TRUE, max.words = 100)
There are still many unnecessary words in the dataframe; if not removed, they can hurt the prediction model.
2.3 Get Sentiment Type based on Reviews Content
# get sentiment based on score
library(syuzhet)
mysentiment_data <- get_nrc_sentiment(dataframe$content)
# turn it to dataframe
Sentimentscores_data <- data.frame(colSums(mysentiment_data[,]))
# rename column name
names(Sentimentscores_data) <- "Score"
# bind column
Sentimentscores_data <- cbind("sentiment"=rownames(Sentimentscores_data), Sentimentscores_data)
# remove row names
rownames(Sentimentscores_data) <- NULL
library(ggplot2)
# plotting
ggplot(data = Sentimentscores_data,
aes(x = sentiment,
y = Score)) +
geom_bar(aes(fill = sentiment),
stat = "identity") +
theme(legend.position = "none") +
xlab("Sentiments") +
ylab("scores") +
ggtitle("Sentiments of people behind their reviews")
From the graph above, we can see that people express a variety of emotions in their reviews. In general, positive sentiment outweighs negative.
2.4 Split Data Into Positive and Negative
# split data into negative sentiment based on score
dataframe$Negative <- as.factor(dataframe$score <= 2)
We can see how many reviews fall into the negative category with the table() function.
table(dataframe$Negative)
##
## FALSE TRUE
## 7645 4850
# split data into positive sentiment based on score
dataframe$Positive <- as.factor(dataframe$score >= 4)
table(dataframe$Positive)
##
## FALSE TRUE
## 6841 5654
We are only interested in detecting clearly negative and clearly positive sentiment, so we define two new variables in the dataframe with the following rule (a quick cross-check follows the list):
- Content with a score <= 2 is labeled a negative review.
- Content with a score >= 4 is labeled a positive review.
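Note that reviews with a score of 3 are treated as neutral: they end up FALSE on both flags. A quick cross-tabulation of the two columns we just created confirms this (a minimal sanity check):
# reviews with score == 3 should be FALSE on both flags
table(Negative = dataframe$Negative, Positive = dataframe$Positive)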
In this step, we will remove unnecessary data such as stopwords and punctuation, and build a data matrix that our classification models can use.
3.1 Create Corpus
In this part, we remove unnecessary symbols and tokens, such as punctuation, '&', non-ASCII characters, and stopwords ("I", "you", "and", "or", etc.), and build a corpus. A corpus is a collection of documents.
library(tm)
# create corpus
corpus <- Corpus(VectorSource(dataframe$content))
We can access a single document in the corpus with double brackets [[ ]]. Let's inspect the first element of our corpus.
corpus[[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 29
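The default print method only shows metadata. To see the underlying text itself, we can coerce the document to character (a small sketch; content() works as well):
# view the raw text of the first document
as.character(corpus[[1]])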
3.2 Data Cleaning
For sentiment analysis, text cleaning is required, since the text may contain unnecessary symbols and inconsistent casing. We will follow the standard steps to build and pre-process the corpus.
First, transform all text to lower case:
# make all content to lower case
corpus <- tm_map(corpus, tolower)
Second, convert the corpus to plain text documents:
# create plain text document
corpus <- tm_map(corpus, PlainTextDocument)
## Warning in tm_map.SimpleCorpus(corpus, PlainTextDocument): transformation drops
## documents
Third, remove non-ASCII characters:
# remove non-ASCII char
corpus <- tm_map(corpus, function(x) iconv(x, "latin1", "ASCII", sub=""))
## Warning in tm_map.SimpleCorpus(corpus, function(x) iconv(x, "latin1", "ASCII", :
## transformation drops documents
Fourth, remove punctuation and extra whitespace:
# remove punctuation
corpus <- tm_map(corpus, removePunctuation)
## Warning in tm_map.SimpleCorpus(corpus, removePunctuation): transformation drops
## documents
# remove extra white space
corpus <- tm_map(corpus, stripWhitespace)
## Warning in tm_map.SimpleCorpus(corpus, stripWhitespace): transformation drops
## documents
Fifth, remove stop words. Removing stop words in R can be done by passing removeWords to the tm_map() function, together with any extra words you want removed. In this project we also remove the word "app", since almost every review contains it and it won't be useful for our prediction problem.
# remove stopwords
corpus <- tm_map(corpus, removeWords, c("app", stopwords("english")))
## Warning in tm_map.SimpleCorpus(corpus, removeWords, c("app",
## stopwords("english"))): transformation drops documents
Now we will check out our corpus again.
corpus[[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 16
Sixth, stemming. Finally, we stem our documents by passing stemDocument to the tm_map() function:
corpus <- tm_map(corpus, stemDocument)
## Warning in tm_map.SimpleCorpus(corpus, stemDocument): transformation drops
## documents
corpus[[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 11
3.3 Create a Document Term Matrix
Now we will extract word frequencies to be used in our prediction models. We can use the DocumentTermMatrix() function, which generates a matrix where:
- the rows correspond to the review contents, and
- the columns correspond to the words in those contents.
The value in each cell is the number of times that word appears in that content.
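As a minimal illustration, here is what such a matrix looks like for a hypothetical two-review mini-corpus (toy data, not our dataset):
# toy example: rows are documents, columns are terms, cells are raw counts
toy <- Corpus(VectorSource(c("good app", "bad bad app")))
inspect(DocumentTermMatrix(toy))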
# create DTM
dtm <- DocumentTermMatrix(corpus)
dtm
## <<DocumentTermMatrix (documents: 12495, terms: 8729)>>
## Non-/sparse entries: 152862/108915993
## Sparsity : 100%
## Maximal term length: 94
## Weighting : term frequency (tf)
We see that our corpus contains 8,729 unique words. Next, let's see what this matrix looks like using the inspect() function.
inspect(dtm)
## <<DocumentTermMatrix (documents: 12495, terms: 8729)>>
## Non-/sparse entries: 152862/108915993
## Sparsity : 100%
## Maximal term length: 94
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs calendar can good great just like task time use work
## 10112 0 1 0 0 1 2 3 1 3 0
## 10655 2 1 0 0 0 0 0 1 2 0
## 1307 0 0 0 0 0 0 1 0 0 0
## 2778 0 4 0 0 4 2 0 5 5 1
## 3199 0 1 0 0 1 1 1 0 0 0
## 3245 0 1 1 0 0 0 1 0 4 0
## 3291 0 2 0 2 1 2 0 0 1 1
## 4947 0 0 1 0 0 0 0 0 3 1
## 5819 0 0 0 0 1 1 4 1 4 4
## 5851 1 1 1 1 1 1 9 0 1 0
We see above that the word "calendar" appears in docs 10655 and 5851 but not in the other sampled docs. The matrix is dominated by zero values; such data is called sparse.
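We can verify this directly: tm stores the DTM as a sparse triplet structure, where dtm$v holds only the non-zero counts, so the fraction of non-zero cells is easy to compute (a rough sketch):
# fraction of non-zero entries: roughly 0.0014, i.e. about 99.9% of cells are zero
length(dtm$v) / (nDocs(dtm) * as.numeric(nTerms(dtm)))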
Since the zeros carry little information for our prediction model, we will keep only relatively frequent words. First, let's look at the words that appear at least 500 times across the whole matrix, using the findFreqTerms() function.
3.4 Count Terms Frequency
# find terms that appear >= 500 times
freq <- findFreqTerms(dtm, lowfreq = 500)
freq
## [1] "list" "premium" "use" "version" "year" "chang"
## [7] "keep" "make" "need" "thing" "updat" "cant"
## [13] "now" "doesnt" "even" "get" "like" "look"
## [19] "phone" "see" "show" "time" "way" "work"
## [25] "free" "new" "dont" "app" "fix" "just"
## [31] "one" "realli" "tri" "will" "good" "great"
## [37] "much" "set" "sync" "option" "task" "widget"
## [43] "love" "pay" "remind" "want" "also" "complet"
## [49] "notif" "still" "pleas" "featur" "day" "calendar"
## [55] "event" "googl" "can" "add" "star" "easi"
## [61] "ive" "nice" "help" "best" "habit"
Out of the 8,729 unique words in our corpus, only 65 appear 500 times or more. Keeping the long tail of rare terms would hurt us in two ways: the time it takes to build our models, and how well our models generalize.
Therefore, let’s remove terms that don’t appear very often:
# remove terms that don't often appear
sparse_dtm <- removeSparseTerms(dtm, 0.995)
This function requires two parameters: the document-term matrix and a sparsity threshold. The threshold must be a value between 0 and 1. Here we used 0.995, which means we keep only terms that appear in more than 0.5% of the review contents.
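Equivalently, a term survives the 0.995 threshold only if it appears in more than 0.5% of the documents. With 12,495 reviews, that works out as follows (a quick sketch):
# minimum number of documents a term must appear in to be kept: 63 of 12495
ceiling((1 - 0.995) * nDocs(dtm))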
Now let's see what the pruned matrix looks like:
sparse_dtm
## <<DocumentTermMatrix (documents: 12495, terms: 485)>>
## Non-/sparse entries: 115895/5944180
## Sparsity : 98%
## Maximal term length: 10
## Weighting : term frequency (tf)
Now we can see that our matrix only contains 485 unique terms.
3.5 Convert DTM to Dataframe
Next, we will convert our latest DTM to a dataframe so we can use it with our classification models. We will also make sure the column names are syntactically valid.
# turn it into dataframe
contentSparse <- as.data.frame(as.matrix(sparse_dtm))
# make syntactically valid names out of character vectors
colnames(contentSparse) <- make.names(colnames(contentSparse))
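The make.names() step matters because some terms are not syntactically valid R column names, which would break the model formula later. For example (illustrative values only):
# terms starting with a digit get an "X" prefix; ordinary terms are unchanged
make.names(c("247", "2nd", "dont"))  # "X247" "X2nd" "dont"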
3.6 Set Target Variable
Now that the data is cleaned, we add the target variables to the sparse dataframe, and we are ready to build our machine learning models.
# create sparse dtm for each sentiment
contentSparse$Negative <- dataframe$Negative
contentSparse$Positive <- dataframe$Positive
3.7 Split Data Into Training and Test
# split sparsed dtm into train and test data
set.seed(2021)
# negative train and test
splitNegative <- sample.split(contentSparse$Negative, SplitRatio = 0.7)
trainSparseNegative <- subset(contentSparse, splitNegative == TRUE)
testSparseNegative <- subset(contentSparse, splitNegative == FALSE)
# positive train and test
splitPositive <- sample.split(contentSparse$Positive, SplitRatio = 0.7)
trainSparsePositive <- subset(contentSparse, splitPositive == TRUE)
testSparsePositive <- subset(contentSparse, splitPositive == FALSE)
We use set.seed() again so that the random split is reproducible every time we run the project.
# keep the actual test-set labels for evaluation later
actualN <- testSparseNegative$Negative
actualP <- testSparsePositive$Positive
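As a sanity check, sample.split() performs a stratified split, so the class proportions should be nearly identical in the training and test sets (a quick sketch):
# class proportions in train vs. test should closely match
prop.table(table(trainSparseNegative$Negative))
prop.table(table(testSparseNegative$Negative))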
4.1 Building Models
In this project, we will use the Random Forest algorithm and then compute accuracy, precision, recall, and F1-score.
set.seed(2021)
# build random forest model for negative terms
RFN <- randomForest(Negative ~ ., data = trainSparseNegative, na.action = na.roughfix)
set.seed(2021)
# build random forest model for positive terms
RFP <- randomForest(Positive ~ ., data = trainSparsePositive, na.action = na.roughfix)
We build two separate models, one per sentiment target, which we expect to keep training time manageable. Since the Random Forest algorithm involves randomness (bootstrap samples and random feature subsets), we use the set.seed() function to make sure the results don't change every time we run the program.
The na.roughfix argument replaces NA values with the column median for numeric variables and with the most frequent level for factor variables.
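To make that behavior concrete, here is a tiny illustrative call on synthetic data (not our dataset):
# the NA in a numeric column is imputed with the column median (here 2)
na.roughfix(data.frame(x = c(1, NA, 3)))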
RFN
##
## Call:
## randomForest(formula = Negative ~ ., data = trainSparseNegative, na.action = na.roughfix)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 22
##
## OOB estimate of error rate: 14.61%
## Confusion matrix:
## FALSE TRUE class.error
## FALSE 4438 914 0.1707773
## TRUE 364 3031 0.1072165
RFP
##
## Call:
## randomForest(formula = Positive ~ ., data = trainSparsePositive, na.action = na.roughfix)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 22
##
## OOB estimate of error rate: 13.73%
## Confusion matrix:
## FALSE TRUE class.error
## FALSE 4094 695 0.1451242
## TRUE 506 3452 0.1278423
Our models are ready; now we will predict the sentiment on the test data.
# predict negative data
pred.forestN <- predict(RFN, testSparseNegative, type = "response")
# predict positive data
pred.forestP <- predict(RFP, testSparsePositive, type = "response")
5.1 Compute Confusion Matrix
Here we will use confusion matrices to compute the accuracy, precision, recall, and F1-score of our models.
# create confusion matrix
cm.forestN <- table(actualN, pred.forestN)
cm.forestN
## pred.forestN
## actualN FALSE TRUE
## FALSE 1880 413
## TRUE 158 1297
# create confusion matrix
cm.forestP <- table(actualP, pred.forestP)
cm.forestP
## pred.forestP
## actualP FALSE TRUE
## FALSE 1743 309
## TRUE 214 1482
The vectors actualN and actualP hold the actual labels from the test data; we compare them against our predictions.
5.2 Create Function to Compute Performance Metrics
performance <- function(prediction, method){
  # prediction is a confusion matrix built with table(actual, predicted),
  # so rows are actual labels and columns are predicted labels (FALSE, TRUE)
  TN <- prediction[1, 1]
  TP <- prediction[2, 2]
  FN <- prediction[2, 1]
  FP <- prediction[1, 2]
  accuracy  <- (TN + TP) / (TN + TP + FN + FP)
  precision <- TP / (TP + FP)
  recall    <- TP / (TP + FN)
  fmeasure  <- 2 * precision * recall / (precision + recall)
  result <- paste("===", method, "===",
                  "\nAccuracy: ", round(accuracy, 3),
                  "\nPrecision: ", round(precision, 3),
                  "\nRecall: ", round(recall, 3),
                  "\nF-Measure: ", round(fmeasure, 3)
  )
  cat(result)
}
We will use the function above to report each model's performance: accuracy, precision, recall, and F1-score.
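As a cross-check, the same metrics can also be obtained from caret's confusionMatrix(), which is already loaded (a sketch; we set the positive class to "TRUE"):
# alternative check with caret; mode = "prec_recall" reports precision, recall, F1
confusionMatrix(pred.forestN, actualN, positive = "TRUE", mode = "prec_recall")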
5.3 End Result
Now if we want to know our models performance, we just need to call the function:
perf_rfn <- performance(cm.forestN, "Random Forest Negative")
## === Random Forest Negative ===
## Accuracy: 0.848
## Precision: 0.758
## Recall: 0.891
## F-Measure: 0.82
perf_rfp <- performance(cm.forestP, "Random Forest Positive")
## === Random Forest Positive ===
## Accuracy: 0.86
## Precision: 0.827
## Recall: 0.874
## F-Measure: 0.85
From the results above, we can compute the overall accuracy of the Random Forest models: averaging the two gives 0.854, i.e. 85.4%. If that accuracy is not good enough, we can try to improve it by revisiting the data cleaning step (for example, removing more uninformative terms), tuning the algorithm's parameters, or trying a different algorithm.
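As a quick check, the overall figure is simply the mean of the two accuracies reported above:
# average accuracy across the negative and positive models
mean(c(0.848, 0.860))  # 0.854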