Sentiment analysis, also known as opinion mining, is the use of natural language processing, text analysis, and computational linguistics to systematically identify, extract, quantify, and study affective states and subjective information in text. It is widely applied to voice-of-the-customer materials such as reviews and survey responses, to online and social media, and to healthcare materials, for applications ranging from marketing to customer service to clinical medicine.
The data is the TripAdvisor Hotel Reviews dataset of customer reviews, containing over 20,000 reviews, each with a rating.
install required packages
# Install required packages
# install.packages("tm") # Text Mining package
# install.packages("SnowballC") # Snowball stemmer for text processing
# install.packages("syuzhet") # Sentiment analysis package
# install.packages("tidyverse") # Comprehensive data manipulation package
# install.packages("wordcloud") # Word cloud generation package
# install.packages("ggplot2") # Powerful visualization package
load the packages
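A plausible loading block, inferred from the functions used in this analysis:
library(tm)         # corpus handling: Corpus(), tm_map(), TermDocumentMatrix()
library(SnowballC)  # stemming support for tm
library(tidyverse)  # read_csv(), stringr, dplyr and ggplot2
library(wordcloud)  # attaches RColorBrewer, whose brewer.pal() is used below
library(paletteer)  # scale_color_paletteer_c() for the final dot plot
# syuzhet and ggdark are called with the :: prefix; wordcloud2 is loaded later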
read in the data
data = read_csv("tripadvisorhotelreviews.csv")
head(data)
## # A tibble: 6 × 3
## S.No. Review Rating
## <dbl> <chr> <dbl>
## 1 1 "nice hotel expensive parking got good deal stay hotel anniversa… 4
## 2 2 "ok nothing special charge diamond member hilton decided chain s… 2
## 3 3 "nice rooms not 4* experience hotel monaco seattle good hotel n'… 3
## 4 4 "unique \tgreat stay \twonderful time hotel monaco \tlocation ex… 5
## 5 5 "great stay great stay \twent seahawk game awesome \tdownfall vi… 5
## 6 6 "love monaco staff husband stayed hotel crazy weekend attending … 5
We need to convert the review column into a character vector and create a corpus, which is a collection of text documents.
# Convert review column of dataframe to character vector
corpus <- iconv(data$Review)
corpus_vector = str_remove_all(corpus, "n't")
# Note: the corpus is built from `corpus`, not the "n't"-stripped
# `corpus_vector`, so the removal above never takes effect; the leftover
# n't resurfaces in the word cloud later
corpus_vector = Corpus(VectorSource(corpus))
inspect( corpus_vector[1] )
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 1
##
## [1] nice hotel expensive parking got good deal stay hotel anniversary \tarrived late evening took advice previous reviews did valet parking \tcheck quick easy \tlittle disappointed non-existent view room room clean nice size \tbed comfortable woke stiff neck high pillows \tnot soundproof like heard music room night morning loud bangs doors opening closing hear people talking hallway \tmaybe just noisy neighbors \taveda bath products nice \tdid not goldfish stay nice touch taken advantage staying longer \tlocation great walking distance shopping \toverall nice experience having pay 40 parking night
This step cleans the text: lowercasing it, then removing punctuation, numbers, stop words, and extra whitespace.
# Caution: every call below reads corpus_vector again, so each assignment
# overwrites the previous one and only the final stripWhitespace() is
# actually applied (note the stop words, hyphens and the number 40 still
# present in the output); a chained fix is sketched after the output
cleaned_corpus = tm_map(corpus_vector, content_transformer(tolower))
cleaned_corpus = tm_map(corpus_vector, removePunctuation)
cleaned_corpus = tm_map(corpus_vector, removeNumbers)
cleaned_corpus = tm_map(corpus_vector, removeWords, stopwords('english'))
cleaned_corpus = tm_map(corpus_vector, stripWhitespace)
inspect( cleaned_corpus[1] )
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 1
##
## [1] nice hotel expensive parking got good deal stay hotel anniversary arrived late evening took advice previous reviews did valet parking check quick easy little disappointed non-existent view room room clean nice size bed comfortable woke stiff neck high pillows not soundproof like heard music room night morning loud bangs doors opening closing hear people talking hallway maybe just noisy neighbors aveda bath products nice did not goldfish stay nice touch taken advantage staying longer location great walking distance shopping overall nice experience having pay 40 parking night
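Because each tm_map() call above re-read corpus_vector, only the last step took effect. A chained version, where every step consumes the previous result, is what the cleaning was meant to do (a sketch; the outputs shown here come from the unchained run):
# each transformation consumes the previous step's result
cleaned_corpus = tm_map(corpus_vector, content_transformer(tolower))
cleaned_corpus = tm_map(cleaned_corpus, removePunctuation)
cleaned_corpus = tm_map(cleaned_corpus, removeNumbers)
cleaned_corpus = tm_map(cleaned_corpus, removeWords, stopwords('english'))
cleaned_corpus = tm_map(cleaned_corpus, stripWhitespace)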
Using a subset of the reviews is more manageable than trying to parse through the large dataset.
set.seed(123)
sampled_reviews = sample(data$Review, size = 200)
sampled_corpus = Corpus(VectorSource(iconv(sampled_reviews)))
# The same chaining issue as above: every step reads sampled_corpus, so only
# stripWhitespace() is actually applied to cleaned_sample_corpus
cleaned_sample_corpus = tm_map(sampled_corpus, content_transformer(tolower))
cleaned_sample_corpus = tm_map(sampled_corpus, removePunctuation)
cleaned_sample_corpus = tm_map(sampled_corpus, removeNumbers)
cleaned_sample_corpus = tm_map(sampled_corpus, removeWords, stopwords('english'))
cleaned_sample_corpus = tm_map(sampled_corpus, stripWhitespace)
inspect(cleaned_sample_corpus[1])
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 1
##
## [1] beautiful atmosphere customer service construction problems visited paradisius royal service march 24-31 cut trip short days mainly combination construction hotel customer service concerns visited star hotels caribbean basis comparisons hotel beautiful happened replacing sidewalks-the hammering incessant relentless day raining half time trip hammering intolerable fourth consecutive day spoke royal service staff said yes know tomorrow-and day.keep mind resort opened months ago-so staff new pretty inexperienced witnessed yelling matches unhappy customers staff members did n't skills offer resolution think boiled new staff focused policy customer service resulting inflexibility unhappy customers sure spanish translation handbook not bilingual hotel website claims staff bilingual probably 3-4 individuals speak english hard time asking language barrier n't waited restaurants passion waiter did n't speak english decided ignore completely.also aware tips not included inclusive claim staff tell tips not necessary individuals live gratuity course knowing chose tip frequently generously did notice tipped got best service did n't ignored royal service beach staff decided leave early husband spoke royal service staff let know leaving early need check not usual complainers construction conditions n't tolerable longer manager natalya spoke husband promised send bellman room 12 noon scheduled meeting prior departure contact information no bellman showed royal service twice 12:30pm bellman 12:45pm royal service called ask needed bellman manager natalya stood did n't bother return staff phone calls surprised royal service staff rude checked actually irritated n't trouble checked quietly exception husband meeting manager happened request.we hated leave bad flavor place example metaphor experience staff continuously follow staff request breakfast menu not picked everyday told internet hotel later told available time lobby housekeeping barged knocking maid stood barked not come later think aware issues trip helpful-definitely confirm no construction going
Create a sparse term-document matrix
A Term Document Matrix (TDM) is a mathematical representation used in natural language processing to analyze the frequency of words (terms) in a collection of documents. It is essentially a matrix where rows represent terms and columns represent documents, and each cell contains the frequency of a particular term in a specific document. Here the matrix is weighted by tf-idf (term frequency-inverse document frequency) rather than raw counts, which down-weights words that appear in almost every review; this is why the "frequencies" reported below are not integers.
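A toy sketch of the structure, using two invented one-line documents and the default raw counts:
toy_corpus = Corpus(VectorSource(c("great hotel great view", "noisy hotel")))
# rows = terms, columns = documents; "great" has count 2 in document 1
inspect(TermDocumentMatrix(toy_corpus))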
Most real-world text corpora exhibit a property known as Zipf’s Law, which states that the frequency of a word is inversely proportional to its rank in the frequency table. This means that a few words appear very frequently, while most words appear infrequently.
tdm_sparse = TermDocumentMatrix(cleaned_sample_corpus, control = list(weighting = weightTfIdf ))
tdm_m_square = as.matrix(tdm_sparse)
Convert the sparse matrix to a dataframe of term weights and display the highest-weighted terms.
# show frequency of terms
term_freq = rowSums(tdm_m_square)
term_freq_sorted = sort(term_freq, decreasing = TRUE)
tdm_d_sparse = data.frame(word= names(term_freq_sorted), freq= term_freq_sorted)
# show top 5 most frequent words
head( tdm_d_sparse, 5)
## word freq
## location location 2.759693
## great great 2.660237
## hotel hotel 2.628621
## not not 2.401555
## good good 2.308165
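As a rough check of Zipf's law mentioned earlier, plot log weight against log rank; Zipf predicts a roughly straight, downward-sloping line. (Strictly the law concerns raw counts, and this TDM is tf-idf weighted, so treat this as an illustrative sketch.)
# log-log plot of term weight against rank
zipf_df = data.frame(rank = seq_along(term_freq_sorted), freq = term_freq_sorted)
zipf_df = subset(zipf_df, freq > 0) # drop zero weights before taking logs
ggplot(zipf_df, aes(x = log10(rank), y = log10(freq))) +
geom_point(alpha = 0.3) +
labs(title = "Term Weight vs. Rank", x = "log10(rank)", y = "log10(weight)")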
Three different methods are used here (syuzhet, bing, afinn) to perform sentiment analysis on the text data.
# convert review column of dataframe to character vector
text = iconv(data$Review)
# text[1]
# sentiment scores
syuzhet_vector = syuzhet::get_sentiment(text, method = "syuzhet")
head(syuzhet_vector)
## [1] 3.25 10.70 5.10 8.75 6.30 12.20
see the summary of the Syuzhet vector
summary(syuzhet_vector)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -12.250 3.050 5.650 6.127 8.550 52.750
bing_vector = syuzhet::get_sentiment(text, method = "bing")
head(bing_vector)
## [1] 3 11 5 9 7 7
the summary of Bing vector
summary(bing_vector)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -23.000 2.000 6.000 5.931 9.000 43.000
AFINN is a sentiment lexicon, a resource containing a list of words and their associated sentiment scores. These scores range from -5 (extremely negative) to 5 (extremely positive), with 0 indicating neutrality.
afinn_vector = syuzhet::get_sentiment(text, method = "afinn")
head(afinn_vector)
## [1] 14 28 5 21 15 23
summary(afinn_vector)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -38.00 7.00 14.00 14.29 21.00 107.00
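A tiny illustration of how the AFINN totals accumulate, using two invented phrases (the first should come out negative, the second positive):
# get_sentiment() sums the integer AFINN score of every matched word
syuzhet::get_sentiment(c("a terrible experience", "a superb stay"), method = "afinn")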
Compare the sentiment scores from the three methods
# Compare the first six scores of each vector (sign function creates a common scale)
rbind(
sign(head( syuzhet_vector)),
sign(head(bing_vector)),
sign(head(afinn_vector))
)
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1 1 1 1 1 1
## [2,] 1 1 1 1 1 1
## [3,] 1 1 1 1 1 1
The sign() function creates a common scale for comparison, mapping each score to 1 (positive), -1 (negative), or 0 (neutral), and rbind() stacks the three vectors into a single matrix for row-by-row comparison. For the first six reviews, all three methods agree that the sentiment is positive.
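A quick extension of the comparison (a sketch, not part of the original analysis): measure how often the three methods agree on the sign across all reviews, not just the first six.
# proportion of reviews where all three methods assign the same sign
signs = cbind(syuzhet = sign(syuzhet_vector),
bing = sign(bing_vector),
afinn = sign(afinn_vector))
mean(signs[, 1] == signs[, 2] & signs[, 2] == signs[, 3])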
We will visualize the sentiment of the sampled and cleaned data.
Create a wordcloud for the most frequent terms used in the reviews text.
library(wordcloud2)
set.seed(123)
wordcloud2(data = tdm_d_sparse,
color = brewer.pal(9,"Set3"),
backgroundColor = "#666699",
minSize = 5
)
Words with higher frequencies appear larger and more prominent in the word cloud, and the colors are drawn from the specified palette. The token n't was supposed to be removed in the data cleaning process, yet it is still present: the str_remove_all() result was never passed into Corpus(), so the removal had no effect.
Use a histogram to visualize the distribution of sentiment scores from the Syuzhet method.
text_sampled = iconv(sampled_reviews)
syuzhet_vector_sampled = syuzhet::get_sentiment(text_sampled, method = "syuzhet")
syuz_df = as.data.frame(syuzhet_vector_sampled)
ggplot( syuz_df,
aes(x= syuzhet_vector_sampled)
) +
geom_histogram(binwidth = 0.4, fill = "blue", color = "grey0") +
scale_x_binned(nice.breaks = TRUE)+
labs(
title = "Sentiment Distribution using Syuzhet Method (Sampled Data)",
x= "Sentiment Score",
y= "Frequency"
) +
ggdark::dark_mode() +
theme(
plot.title = element_text(hjust = 0.5, size = 13, face = 'bold'),
axis.text.x = element_text(size = 10),
axis.text.y = element_text(size = 10),
)
The output is a histogram plot illustrating the distribution of sentiment scores obtained from the Syuzhet sentiment analysis method applied to the sampled data set. Each bar in the histogram represents a range of sentiment scores, and the height of the bar indicates the frequency of occurrence of sentiment scores within that range. This visualization allows for a quick assessment of the overall sentiment distribution within the sampled text data.
The NRC Lexicon is a sentiment analysis lexicon developed by the National Research Council of Canada. It contains a list of words and their associated scores, categorized into eight emotions (anger, anticipation, disgust, fear, joy, sadness, surprise, and trust) plus two overall polarities (negative and positive), which is why get_nrc_sentiment() returns ten columns.
# extract NRC sentiment scores from the text sample
nrc_sampled = syuzhet::get_nrc_sentiment(text_sampled)
# transpose so that each row is a sentiment category and each column is a review
nrc_df = data.frame(t(nrc_sampled))
# row sums give the total score for each category across the sampled reviews
nrc_df = data.frame(rowSums(nrc_df))
# add a 'sentiment' column containing the category names
nrc_df_sent = cbind("sentiment" = rownames(nrc_df), nrc_df)
# drop the row names and rename the columns
rownames(nrc_df_sent) = NULL
names(nrc_df_sent)[1] = "sentiment"
names(nrc_df_sent)[2] = "frequency"
# add a 'percent' column: each category's share of the total sentiment score
nrc_df_sent = nrc_df_sent %>% mutate(percent = frequency/ sum(frequency))
# keep the first 8 rows (the emotions); rows 9 and 10 are the overall
# negative/positive polarity scores
nrc_df_sent = nrc_df_sent[1:8, ]
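For reference, a more compact tidyverse equivalent of the reshaping above (a sketch that produces the same dataframe):
nrc_df_sent = nrc_sampled %>%
summarise(across(everything(), sum)) %>% # total score per category
pivot_longer(everything(), names_to = "sentiment", values_to = "frequency") %>%
mutate(percent = frequency / sum(frequency)) %>% # share of the total
filter(!sentiment %in% c("negative", "positive")) # keep the eight emotions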
Now that the dataframe transformation is done, we can plot it.
nrc_df_sent %>%
ggplot(
aes(x= reorder(sentiment, -frequency), y= frequency, fill = sentiment )
) +
geom_bar(stat = "identity", show.legend = FALSE) +
labs(
title = "NRC Emotion Distribution (Sampled Data)",
x= "Emotion",
y= "Frequency"
) +
scale_fill_brewer(palette = "Set3") +
ggdark::dark_mode() +
theme(
plot.title = element_text(hjust = 0.5, size = 13, face = 'bold'),
axis.text.x = element_text(size = 11, colour = 'grey70'),
axis.text.y = element_text(size = 11, colour = 'grey70'),
)
The output is a bar plot illustrating the distribution of emotions based on sentiment analysis using the NRC lexicon on the sampled dataset. Each bar represents a different emotion, and the height of the bar indicates the frequency of that emotion within the text data. The colors of the bars are determined by the specified color palette, allowing for easy visualization of different emotions.
A bar plot shows the most popular words in the text dataset, visualizing the distribution of term weights within the sample corpus.
tdm_d_sparse_sample = tdm_d_sparse[1:10,]
# Blanking the "n't" label leaves its row (and bar) in the data; a cleaner
# fix is sketched after the plot
tdm_d_sparse_sample$word = str_remove_all(tdm_d_sparse_sample$word, "n't")
tdm_d_sparse_sample$word = reorder(tdm_d_sparse_sample$word, tdm_d_sparse_sample$freq)
tdm_d_sparse_sample %>%
ggplot(
aes(x= word, y= freq , fill = word)
) +
geom_bar(stat = "identity", show.legend = FALSE) +
coord_flip()+
labs(
title = "Top 10 Most Popular Words ",
x= "Word",
y="Frequency"
) +
ggdark::dark_mode() +
theme(
plot.title = element_text(hjust = 0.5, size = 12, face = 'bold'),
axis.text.x = element_text(size = 11, colour = 'grey70'),
axis.text.y = element_text(size = 11, colour = 'grey70'),
)
The output is a horizontal bar plot of the 10 most popular words in the text data. Each bar represents a word, and its length indicates that word's weight in the dataset. The green bar with no label is the n't term: str_remove_all() blanked its label but left its row in the data, so it still appears in the plot.
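A cleaner alternative (a dplyr sketch, assuming the term is stored exactly as n't) drops the row before plotting instead of blanking its label:
tdm_d_sparse_sample = tdm_d_sparse %>%
filter(word != "n't") %>% # remove the leftover contraction token
slice_max(freq, n = 10) # then keep the ten heaviest terms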
Creating a pie chart of sentiment distribution involves visualizing the proportion of different sentiment categories within a dataset.
This time we create a dataframe of sentiment categories and their counts, classifying each sampled review by the sign of its syuzhet score.
sent_df = data.frame(
"sentiment" = c("Positive","Negative","Neutral"),
"count" = c( sum(syuzhet_vector_sampled > 0),
sum(syuzhet_vector_sampled < 0),
sum(syuzhet_vector_sampled == 0)
)
)
sent_df %>%
ggplot(
aes(x="", y= count, fill = sentiment)
) +
geom_bar(stat = "identity", width = 1) +
coord_polar(theta = "y", start = 0) +
labs(
title = "Pie Chart Sentiment Distribution",
x= "",
y=""
) +
scale_fill_brewer(palette = "Set1") +
ggdark::dark_theme_void()
The output is a pie chart illustrating the distribution of sentiment categories within the dataset. Each segment of the pie chart represents a sentiment category (“Positive”, “Negative”, “Neutral”), and the size of each segment corresponds to the count of that sentiment category in the dataset.
Let us see which words are most prominent in the data, filtering out the terms with very low weights.
tdm_d_sparse %>%
arrange(word, freq) %>%
filter(freq >1.5) %>%
ggplot(
aes(x= freq, y= word, color=freq)
) +
geom_point(size=3) +
scale_color_paletteer_c("ggthemes::Green") +
ggdark::dark_mode() +
labs(
title = "Word Frequency Sentiment Distribution",
y= "",
x="Frequency"
) +
theme(
plot.title = element_text(hjust = 0.5, size=12, face = 'bold'),
axis.text.x = element_text(size = 11, color = 'grey70'),
axis.text.y = element_text(size = 11, color = 'grey90', face = 'bold'),
)
The words location, hotel, great and not are the most common words found in the sampled data of the corpus; not survives only because the stop-word removal step never actually took effect.
The text analysis of our sampled TripAdvisor reviews corpus shows that the overall sentiment in the reviews is positive, based on the NRC lexicon and the distribution of syuzhet scores.