Sentiment Analysis

Sentiment analysis, also known as opinion mining, computationally identifies and categorizes opinions expressed in text. It uses natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice-of-the-customer materials such as reviews and survey responses, to online and social media, and to healthcare materials, for applications ranging from marketing to customer service to clinical medicine.

Steps in Sentiment Analysis

  1. Data Collection
  2. Data Preprocessing - text cleaning, tokenization, stop-word removal, and normalization (e.g., stemming)
  3. Text Vectorization - bag-of-words, term frequency-inverse document frequency (TF-IDF), word embeddings
  4. Sentiment Analysis - lexicon-based or machine learning models to classify the sentiment
  5. Evaluation and Validation
  6. Visualization and Interpretation

Load Data

The data is customer reviews from the TripAdvisor Hotel Reviews dataset, which contains over 20,000 reviews, each paired with a numeric rating.

install required packages

# Install required packages
# install.packages("tm")         # Text Mining package
# install.packages("SnowballC")  # Snowball stemmer for text processing
# install.packages("syuzhet")    # Sentiment analysis package
# install.packages("tidyverse")  # Comprehensive data manipulation package
# install.packages("wordcloud")  # Word cloud generation package
# install.packages("wordcloud2") # HTML-widget word clouds (used below)
# install.packages("ggplot2")    # Powerful visualization package
# install.packages("ggdark")     # Dark themes for ggplot2 (used below)
# install.packages("paletteer")  # Color palettes (used below)

load the packages
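
# Load the packages (wordcloud2, ggdark, and paletteer are loaded or
# namespace-qualified later, where they are used)
library(tm)
library(SnowballC)
library(syuzhet)
library(tidyverse)     # includes ggplot2, dplyr, stringr, readr
library(wordcloud)
library(RColorBrewer)  # for brewer.pal()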

read in the data

data = read_csv("tripadvisorhotelreviews.csv")
head(data)
## # A tibble: 6 × 3
##   S.No. Review                                                            Rating
##   <dbl> <chr>                                                              <dbl>
## 1     1 "nice hotel expensive parking got good deal stay hotel anniversa…      4
## 2     2 "ok nothing special charge diamond member hilton decided chain s…      2
## 3     3 "nice rooms not 4* experience hotel monaco seattle good hotel n'…      3
## 4     4 "unique \tgreat stay \twonderful time hotel monaco \tlocation ex…      5
## 5     5 "great stay great stay \twent seahawk game awesome \tdownfall vi…      5
## 6     6 "love monaco staff husband stayed hotel crazy weekend attending …      5

Inspect the corpus

We need to convert the review column into a character vector, strip the stray "n't" tokens, and create a corpus, which is a collection of text documents.

# Convert the review column of the dataframe to a character vector
corpus = iconv(data$Review)
# Strip stray "n't" tokens left over from contraction splitting
corpus_clean = str_remove_all(corpus, "n't")
# Build the corpus from the cleaned character vector
corpus_vector = Corpus(VectorSource(corpus_clean))

inspect( corpus_vector[1] )
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 1
## 
## [1] nice hotel expensive parking got good deal stay hotel anniversary \tarrived late evening took advice previous reviews did valet parking \tcheck quick easy \tlittle disappointed non-existent view room room clean nice size \tbed comfortable woke stiff neck high pillows \tnot soundproof like heard music room night morning loud bangs doors opening closing hear people talking hallway \tmaybe just noisy neighbors \taveda bath products nice \tdid not goldfish stay nice touch taken advantage staying longer \tlocation great walking distance shopping \toverall nice experience having pay 40 parking night

Data Cleaning

This step cleans the text: lowercasing, then removing punctuation, numbers, stop words, and extra whitespace.

cleaned_corpus = tm_map(corpus_vector, content_transformer(tolower))
cleaned_corpus = tm_map(cleaned_corpus, removePunctuation)
cleaned_corpus = tm_map(cleaned_corpus, removeNumbers)
cleaned_corpus = tm_map(cleaned_corpus, removeWords, stopwords('english'))
cleaned_corpus = tm_map(cleaned_corpus, stripWhitespace)

inspect( cleaned_corpus[1] )
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 1
## 
## [1] nice hotel expensive parking got good deal stay hotel anniversary arrived late evening took advice previous reviews did valet parking check quick easy little disappointed non-existent view room room clean nice size bed comfortable woke stiff neck high pillows not soundproof like heard music room night morning loud bangs doors opening closing hear people talking hallway maybe just noisy neighbors aveda bath products nice did not goldfish stay nice touch taken advantage staying longer location great walking distance shopping overall nice experience having pay 40 parking night
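
The SnowballC package installed earlier enables stemming, which collapses inflected forms to a common root; it is not applied in the rest of this walkthrough, but a minimal sketch using tm's stemDocument looks like this:

# Optional: stem the cleaned corpus so that, e.g., "staying" and "stayed"
# collapse to "stay" (uses the Snowball stemmer via SnowballC)
stemmed_corpus = tm_map(cleaned_corpus, stemDocument)
inspect(stemmed_corpus[1])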

Sampling the data

Working with a random subset of the reviews is more manageable than processing the entire dataset.

set.seed(123)

sampled_reviews = sample(data$Review,size = 200)
sampled_corpus = Corpus(VectorSource(iconv(sampled_reviews)))

Clean the sample data

cleaned_sample_corpus = tm_map(sampled_corpus, content_transformer(tolower))
cleaned_sample_corpus = tm_map(cleaned_sample_corpus, removePunctuation)
cleaned_sample_corpus = tm_map(cleaned_sample_corpus, removeNumbers)
cleaned_sample_corpus = tm_map(cleaned_sample_corpus, removeWords, stopwords('english'))
cleaned_sample_corpus = tm_map(cleaned_sample_corpus, stripWhitespace)

inspect(cleaned_sample_corpus[1])
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 1
## 
## [1] beautiful atmosphere customer service construction problems visited paradisius royal service march 24-31 cut trip short days mainly combination construction hotel customer service concerns visited star hotels caribbean basis comparisons hotel beautiful happened replacing sidewalks-the hammering incessant relentless day raining half time trip hammering intolerable fourth consecutive day spoke royal service staff said yes know tomorrow-and day.keep mind resort opened months ago-so staff new pretty inexperienced witnessed yelling matches unhappy customers staff members did n't skills offer resolution think boiled new staff focused policy customer service resulting inflexibility unhappy customers sure spanish translation handbook not bilingual hotel website claims staff bilingual probably 3-4 individuals speak english hard time asking language barrier n't waited restaurants passion waiter did n't speak english decided ignore completely.also aware tips not included inclusive claim staff tell tips not necessary individuals live gratuity course knowing chose tip frequently generously did notice tipped got best service did n't ignored royal service beach staff decided leave early husband spoke royal service staff let know leaving early need check not usual complainers construction conditions n't tolerable longer manager natalya spoke husband promised send bellman room 12 noon scheduled meeting prior departure contact information no bellman showed royal service twice 12:30pm bellman 12:45pm royal service called ask needed bellman manager natalya stood did n't bother return staff phone calls surprised royal service staff rude checked actually irritated n't trouble checked quietly exception husband meeting manager happened request.we hated leave bad flavor place example metaphor experience staff continuously follow staff request breakfast menu not picked everyday told internet hotel later told available time lobby housekeeping barged knocking maid stood barked not come later think aware issues trip helpful-definitely confirm no construction going

Text Vectorization

Create sparse term document matrix

A Term Document Matrix (TDM) is a mathematical representation used in natural language processing to analyze the frequency of words (terms) in a collection of documents. It’s essentially a matrix where rows represent terms and columns represent documents. Each cell in the matrix contains the frequency of a particular term in a specific document.
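
As a toy illustration (separate from the analysis below), a two-document corpus makes this structure easy to see:

# Toy example: rows are terms, columns are documents, cells are counts
toy_corpus = Corpus(VectorSource(c("good hotel good staff", "bad hotel")))
inspect(TermDocumentMatrix(toy_corpus))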

Most real-world text corpora exhibit a property known as Zipf’s Law, which states that the frequency of a word is inversely proportional to its rank in the frequency table. This means that a few words appear very frequently, while most words appear infrequently.

tdm_sparse = TermDocumentMatrix(cleaned_sample_corpus, control = list(weighting = weightTfIdf))
tdm_m_square = as.matrix(tdm_sparse)
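
To check Zipf's Law on this sample, we can build a raw-count TDM (the matrix above is TF-IDF weighted, whereas Zipf's Law concerns raw frequencies) and plot log rank against log frequency; a roughly straight, downward-sloping line is the expected pattern. A minimal sketch:

# Rank-frequency plot on raw counts to illustrate Zipf's Law
tdm_raw = TermDocumentMatrix(cleaned_sample_corpus)
raw_freq = sort(rowSums(as.matrix(tdm_raw)), decreasing = TRUE)
zipf_df = data.frame(rank = seq_along(raw_freq), freq = raw_freq)
ggplot(zipf_df, aes(x = log10(rank), y = log10(freq))) +
  geom_point(alpha = 0.4) +
  labs(title = "Rank-Frequency (Zipf) Plot",
       x = "log10 rank", y = "log10 frequency")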

Analyze term frequencies

Convert the sparse matrix to a dataframe of per-term scores and display the top terms. Because the matrix was built with TF-IDF weighting, the freq column holds each term's summed TF-IDF weight rather than a raw count.

# show frequency of terms
term_freq = rowSums(tdm_m_square)
term_freq_sorted = sort(term_freq, decreasing = TRUE)
tdm_d_sparse = data.frame(word= names(term_freq_sorted), freq= term_freq_sorted)
# show top 5 most frequent words
head( tdm_d_sparse, 5)
##              word     freq
## location location 2.759693
## great       great 2.660237
## hotel       hotel 2.628621
## not           not 2.401555
## good         good 2.308165

Sentiment Analysis

Three different methods are used here (syuzhet, bing, afinn) to perform sentiment analysis on the text data.

Syuzhet Sentiment

# convert review column of dataframe to character vector
text = iconv(data$Review)
# text[1]

# sentiment scores
syuzhet_vector = syuzhet::get_sentiment(text, method = "syuzhet")
head(syuzhet_vector)
## [1]  3.25 10.70  5.10  8.75  6.30 12.20

see the summary of the Syuzhet vector

summary(syuzhet_vector)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -12.250   3.050   5.650   6.127   8.550  52.750
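
Since the mean and median are both positive, most reviews lean positive; a quick one-line check of the share of reviews with a positive score:

# Proportion of reviews with a positive syuzhet score
mean(syuzhet_vector > 0)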

Bing Sentiment

bing_vector = syuzhet::get_sentiment(text, method = "bing")
head(bing_vector)
## [1]  3 11  5  9  7  7

the summary of the Bing vector

summary(bing_vector)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -23.000   2.000   6.000   5.931   9.000  43.000

AFINN sentiment

AFINN is a sentiment lexicon, a resource containing a list of words and their associated sentiment scores. These scores range from -5 (extremely negative) to 5 (extremely positive), with 0 indicating neutrality.
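
For a peek at the underlying word list, syuzhet exposes its bundled dictionaries through get_sentiment_dictionary(); a minimal sketch (assuming the function's word/value output format):

# Inspect a few AFINN entries (word/value pairs scored from -5 to 5)
afinn_lex = syuzhet::get_sentiment_dictionary(dictionary = "afinn")
head(afinn_lex)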

afinn_vector = syuzhet::get_sentiment(text, method = "afinn")
head(afinn_vector)
## [1] 14 28  5 21 15 23
summary(afinn_vector)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -38.00    7.00   14.00   14.29   21.00  107.00

Compare sentiment methods

Compare the sentiment scores from the three methods

# Compare the first six reviews across methods (sign creates a common scale)
rbind(
  sign(head( syuzhet_vector)),
  sign(head(bing_vector)),
  sign(head(afinn_vector))
)
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    1    1    1    1    1
## [2,]    1    1    1    1    1    1
## [3,]    1    1    1    1    1    1

The sign function puts all three methods on a common scale, mapping each score to 1 (positive), -1 (negative), or 0 (neutral). The rbind function stacks the three vectors as the rows of a single matrix, so each column is one review; here all three methods agree that the first six reviews are positive.
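
A minimal sketch extending this comparison to all reviews: the share of reviews where the three methods agree on polarity.

# Fraction of reviews where all three methods agree on the sign
signs = cbind(syuzhet = sign(syuzhet_vector),
              bing    = sign(bing_vector),
              afinn   = sign(afinn_vector))
mean(signs[, "syuzhet"] == signs[, "bing"] & signs[, "bing"] == signs[, "afinn"])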

Visualization of sentiment analysis

We will visualize the sentiment of the sampled and cleaned data.

WordClouds

Create a wordcloud for the most frequent terms used in the reviews text.

library(wordcloud2)

set.seed(123)
wordcloud2(data = tdm_d_sparse, 
           color = brewer.pal(9,"Set3"),
           backgroundColor = "#666699",
           minSize = 5
           )

Words with higher weights appear larger and more prominent in the word cloud, with colors drawn from the specified palette. Note that leftover contraction fragments such as n't (or nt once punctuation is stripped) can still appear here: the str_remove_all clean-up was applied only to the full corpus, not to the sampled reviews, and standalone n't is not in the English stop-word list.
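
If desired, such fragments can be filtered from the term table before plotting; a minimal sketch:

# Drop leftover contraction fragments from the term table, then redraw
tdm_d_filtered = tdm_d_sparse %>% filter(!word %in% c("n't", "nt"))
wordcloud2(data = tdm_d_filtered,
           color = brewer.pal(9, "Set3"),
           backgroundColor = "#666699")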

Sentiment Histogram

We use a histogram to visualize the distribution of sentiment scores computed with the Syuzhet method on the sampled reviews.

text_sampled = iconv(sampled_reviews)
syuzhet_vector_sampled = syuzhet::get_sentiment(text_sampled, method = "syuzhet")

syuz_df = as.data.frame(syuzhet_vector_sampled)

ggplot( syuz_df, 
        aes(x= syuzhet_vector_sampled)
        ) +
  geom_histogram(binwidth = 0.4, fill="blue", color="grey0") +
  scale_x_binned(nice.breaks = TRUE) +
  labs(
    title = "Sentiment Distribution using Syuzhet Method (Sampled Data)",
    x= "Sentiment Score",
    y= "Frequency"
  ) + 
  ggdark::dark_mode() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 13, face = 'bold'),
    axis.text.x = element_text(size = 10),
    axis.text.y = element_text(size = 10)
  )

The output is a histogram plot illustrating the distribution of sentiment scores obtained from the Syuzhet sentiment analysis method applied to the sampled data set. Each bar in the histogram represents a range of sentiment scores, and the height of the bar indicates the frequency of occurrence of sentiment scores within that range. This visualization allows for a quick assessment of the overall sentiment distribution within the sampled text data.

NRC Lexicon Bar Plot

The NRC Emotion Lexicon is a sentiment analysis resource developed by the National Research Council of Canada. It associates words with eight basic emotions (anger, anticipation, disgust, fear, joy, sadness, surprise, and trust) and with two sentiments (negative and positive).

# extract NRC sentiment scores for the sampled text: one row per review,
# one column per emotion/sentiment category
nrc_sampled = syuzhet::get_nrc_sentiment(text_sampled)

# transpose so each row is a category and each column is a review
nrc_df = data.frame(t(nrc_sampled))

# sum across reviews, giving one total score per category
nrc_df = data.frame(rowSums(nrc_df))

# move the category names out of the row names into a 'sentiment' column
nrc_df_sent = cbind("sentiment" = rownames(nrc_df), nrc_df)
rownames(nrc_df_sent) = NULL

# rename the score column to 'frequency'
names(nrc_df_sent)[2] = "frequency"

# add a 'percent' column: each category's share of the total score
nrc_df_sent = nrc_df_sent %>% mutate(percent = frequency / sum(frequency))

# keep the first 8 rows, the eight emotions (the last two rows hold the
# overall negative and positive sentiment totals)
nrc_df_sent = nrc_df_sent[1:8, ]

now that the dataframe transformation is done, we can plot it

nrc_df_sent %>% 
  ggplot(
    aes(x= reorder(sentiment, -frequency), y= frequency, fill = sentiment )
  ) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  labs(
    title = "NRC Emotion Distribution (Sampled Data)",
    x= "Emotion",
    y= "Frequency"
  ) +
  scale_fill_brewer(palette = "Set3") +
  ggdark::dark_mode() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 13, face = 'bold'),
    axis.text.x = element_text(size = 11, colour = 'grey70'),
    axis.text.y = element_text(size = 11, colour = 'grey70')
  )

The output is a bar plot illustrating the distribution of emotions based on sentiment analysis using the NRC lexicon on the sampled dataset. Each bar represents a different emotion, and the height of the bar indicates the frequency of that emotion within the text data. The colors of the bars are determined by the specified color palette, allowing for easy visualization of different emotions.

Pie Chart Sentiment Distribution

Creating a pie chart of sentiment distribution involves visualizing the proportion of different sentiment categories within a dataset.

This time we create a dataframe of sentiment categories and their counts.

sent_df = data.frame(
  "sentiment" = c("Positive","Negative","Neutral"),
  "count" = c( sum(syuzhet_vector_sampled > 0),
               sum(syuzhet_vector_sampled < 0),
               sum(syuzhet_vector_sampled == 0)
               )
)

sent_df %>% 
  ggplot(
    aes(x="", y= count, fill = sentiment)
  ) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y", start = 0) +
  labs(
    title = "Pie Chart Sentiment Distribution",
    x= "",
    y=""
  ) +
  scale_fill_brewer(palette = "Set1") +
  ggdark::dark_theme_void()

The output is a pie chart illustrating the distribution of sentiment categories within the dataset. Each segment of the pie chart represents a sentiment category (“Positive”, “Negative”, “Neutral”), and the size of each segment corresponds to the count of that sentiment category in the dataset.
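
To make the proportions explicit, percentage labels can be stacked onto the segments; a minimal sketch (scales ships with the tidyverse):

# Pie chart with a percentage label on each segment
sent_df %>% 
  mutate(pct = scales::percent(count / sum(count))) %>% 
  ggplot(aes(x = "", y = count, fill = sentiment)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y", start = 0) +
  geom_text(aes(label = pct), position = position_stack(vjust = 0.5)) +
  scale_fill_brewer(palette = "Set1") +
  ggdark::dark_theme_void()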

Scatterplot of Word Frequency

Let us look at the terms with the highest weights in the data, keeping only those whose score exceeds 1.5; this filters out numbers and rare words with very low weights.

tdm_d_sparse %>% 
  filter(freq > 1.5) %>% 
  ggplot(
    aes(x= freq, y= reorder(word, freq), color=freq)
  ) +
  geom_point(size=3) +
  paletteer::scale_color_paletteer_c("ggthemes::Green") +
  ggdark::dark_mode() +
  labs(
    title = "Word Frequency Sentiment Distribution",
    y= "",
    x= "Frequency"
  ) +
  theme(
    plot.title = element_text(hjust = 0.5, size=12, face = 'bold'),
    axis.text.x = element_text(size = 11, color = 'grey70'),
    axis.text.y = element_text(size = 11, color = 'grey90', face = 'bold')
  )

The terms location, great, hotel, and not carry the highest weights in the sampled corpus.

Conclusion

The text analysis of our sampled TripAdvisor reviews corpus shows that the overall sentiment in the reviews is positive, consistent with both the syuzhet score distribution and the NRC lexicon results.