LAB10_sentiment

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(tidytext)

## Warning: package 'tidytext' was built under R version 4.4.3

library(readr)
library(ggplot2)
library(syuzhet)

## Warning: package 'syuzhet' was built under R version 4.4.3

library(dplyr)

Loading data

In this assignment I perform sentiment analysis on a dataset of hotel reviews. I follow the lexicon-based approach from Chapter 2 of “Text Mining with R”, using the Bing, NRC, and AFINN lexicons, and then use the R package “syuzhet” for an additional analysis.

reviews_raw <- read_delim(
  "https://raw.githubusercontent.com/farhodibr/CUNY-SPS-MSDS/main/DATA607/LAB10/DATA/REVIEWS/dataset-CalheirosMoroRita-2017.csv",
  delim = ";",
  locale = locale(encoding = "latin1")
)

## Rows: 401 Columns: 1
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr (1): Review
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# View structure
glimpse(reviews_raw)

## Rows: 401
## Columns: 1
## $ Review <chr> " Everything from the weather, staff, food, property, fire pits…

# Create tidy tibble
review_data <- tibble(line = seq_len(nrow(reviews_raw)), text = reviews_raw$Review)

# Tokenize
tidy_reviews <- review_data |>
  unnest_tokens(word, text)

head(tidy_reviews)

## # A tibble: 6 × 2
##    line word      
##   <int> <chr>     
## 1     1 everything
## 2     1 from      
## 3     1 the       
## 4     1 weather   
## 5     1 staff     
## 6     1 food
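
To make the tokenization step concrete, here is a minimal sketch on two made-up sentences (the `toy` tibble is hypothetical and not part of the hotel dataset). It illustrates that `unnest_tokens()` by default lowercases the text, strips punctuation, and returns one row per word while keeping the `line` identifier.

```r
library(tibble)
library(tidytext)

# Hypothetical two-review example, not taken from the hotel data
toy <- tibble(
  line = 1:2,
  text = c("The staff was GREAT!", "Room was not clean.")
)

# One row per word; text is lowercased and punctuation is dropped
toy_tokens <- unnest_tokens(toy, word, text)
print(toy_tokens)
```

Keeping `line` on every token row is what later allows scores to be aggregated back to the review level.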

Sentiment analysis with “bing” lexicon

bing <- get_sentiments("bing")

bing_sentiment <- tidy_reviews |>
  inner_join(bing, by = "word") |>
  count(sentiment, sort = TRUE)
print(bing_sentiment)

## # A tibble: 2 × 2
##   sentiment     n
##   <chr>     <int>
## 1 positive   1877
## 2 negative    219

ggplot(bing_sentiment, aes(x = sentiment, y = n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  labs(title = "Hotel review sentiment (Bing Lexicon)", x = "Sentiment", y = "Word Count")

The Bing lexicon matched 1,877 positive words and 219 negative words, so positive words outnumber negative ones by more than eight to one, suggesting an overall favorable tone in the reviews. The bar plot visualizes these two totals.
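
The totals above pool all reviews together. A possible extension, sketched below under stated assumptions, is to compute a net score (positive minus negative word count) for each review. The tokens and the six-word `toy_lexicon` are hypothetical stand-ins for `tidy_reviews` and the Bing lexicon, used only so that the example runs on its own.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical tokens standing in for tidy_reviews
toy_tokens <- tibble(
  line = c(1, 1, 1, 2, 2, 3),
  word = c("great", "clean", "noisy", "dirty", "rude", "lovely")
)

# Hypothetical mini lexicon standing in for get_sentiments("bing")
toy_lexicon <- tibble(
  word = c("great", "clean", "lovely", "noisy", "dirty", "rude"),
  sentiment = c("positive", "positive", "positive",
                "negative", "negative", "negative")
)

# Count positive and negative words per review, then take the difference
net_by_review <- toy_tokens |>
  inner_join(toy_lexicon, by = "word") |>
  count(line, sentiment) |>
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) |>
  mutate(net = positive - negative)

print(net_by_review)
```

With the real data, the same pipeline should work by swapping in `tidy_reviews` and `bing`, which would show whether the negative words are spread thinly across many reviews or concentrated in a few.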

NRC emotions

nrc <- get_sentiments("nrc")

nrc_sentiment <- tidy_reviews |>
  inner_join(nrc, by = "word") |>
  count(sentiment, sort = TRUE)

## Warning in inner_join(tidy_reviews, nrc, by = "word"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 6 of `x` matches multiple rows in `y`.
## ℹ Row 5045 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.

print(nrc_sentiment)

## # A tibble: 10 × 2
##    sentiment        n
##    <chr>        <int>
##  1 positive      1703
##  2 joy           1073
##  3 trust          902
##  4 anticipation   621
##  5 surprise       446
##  6 negative       252
##  7 sadness        158
##  8 fear           114
##  9 anger           91
## 10 disgust         67

ggplot(nrc_sentiment, aes(x = reorder(sentiment, n), y = n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(title = "Emotion distribution (NRC lexicon)", x = "Emotion", y = "Word count")

The most frequent NRC categories were positive, joy, and trust, which suggests that guests frequently express satisfaction and confidence in their stay. In the NRC lexicon, positive and negative are general sentiment categories, while the other eight are emotions. The horizontal bar chart displays the distribution of all ten categories: negative, sadness, fear, anger, and disgust are far less frequent than the positive categories.
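
The many-to-many warning from `inner_join()` arises because a single word can belong to several NRC categories, so one token may produce several rows after the join. The sketch below illustrates this with a hypothetical mini lexicon (`toy_nrc` is made up for the example and is not the real NRC data). Passing `relationship = "many-to-many"`, available in dplyr 1.1.0 and later, declares that the duplication is expected and silences the warning.

```r
library(dplyr)
library(tibble)

# Hypothetical tokens: "food" occurs in two different reviews
toy_tokens <- tibble(line = c(1, 1, 2), word = c("food", "happy", "food"))

# Hypothetical mini lexicon in which each word has several categories
toy_nrc <- tibble(
  word = c("food", "food", "food", "happy", "happy"),
  sentiment = c("joy", "positive", "trust", "joy", "positive")
)

# Each token is repeated once per category it belongs to
joined <- toy_tokens |>
  inner_join(toy_nrc, by = "word", relationship = "many-to-many")

category_counts <- count(joined, sentiment, sort = TRUE)
print(category_counts)
```

Here three tokens become eight rows, which is why the NRC counts tally word-category pairs rather than distinct words and should not be summed across categories.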

AFINN analysis

afinn <- get_sentiments("afinn")

afinn_sentiment <- tidy_reviews |>
  inner_join(afinn, by = "word") |>
  group_by(line) |>
  summarise(sentiment_score = sum(value))
print(afinn_sentiment)

## # A tibble: 396 × 2
##     line sentiment_score
##    <int>           <dbl>
##  1     1               0
##  2     2               9
##  3     3              11
##  4     4               4
##  5     5               7
##  6     6               7
##  7     7              10
##  8     8              13
##  9     9               7
## 10    10               7
## # ℹ 386 more rows

ggplot(afinn_sentiment, aes(x = sentiment_score)) +
  geom_histogram(binwidth = 1, fill = "steelblue", color = "white") +
  labs(title = "Distribution of sentiment scores (AFINN lexicon)", x = "Score per review", y = "Count")

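This histogram shows the distribution of per-review AFINN scores, where each score is the sum of the AFINN values of the matched words in a review. Most reviews fall on the positive side of the scale, with only a few strongly negative outliers. Only 396 of the 401 reviews appear because the inner join drops reviews that contain no AFINN words.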
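
The AFINN table has 396 rows although the dataset has 401 reviews, because `inner_join()` drops any review with no matching AFINN word. If every review should appear in the result, one option is to join the scores back onto the full list of review ids and treat a missing score as 0. The sketch below uses hypothetical toy data and a hypothetical three-word lexicon in place of `review_data`, `tidy_reviews`, and `afinn`.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-ins for review_data and tidy_reviews
toy_reviews <- tibble(line = 1:3)
toy_tokens <- tibble(
  line = c(1L, 1L, 2L, 3L, 3L),
  word = c("great", "stay", "room", "bad", "awful")
)

# Hypothetical mini lexicon standing in for get_sentiments("afinn")
toy_afinn <- tibble(word = c("great", "bad", "awful"), value = c(3, -3, -3))

# Same pipeline as above: review 2 has no matches and disappears
scores <- toy_tokens |>
  inner_join(toy_afinn, by = "word") |>
  group_by(line) |>
  summarise(sentiment_score = sum(value))

# Restore the missing review with a neutral score of 0
all_scores <- toy_reviews |>
  left_join(scores, by = "line") |>
  mutate(sentiment_score = coalesce(sentiment_score, 0))

print(all_scores)
```

Whether a review without lexicon words should count as neutral or be excluded is a design choice; assigning 0 keeps the row count equal to the number of reviews.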

Syuzhet analysis

review_text <- reviews_raw$Review
syuzhet_scores <- get_sentiment(review_text, method = "syuzhet")

syuzhet_df <- tibble(
  line = seq_along(syuzhet_scores),
  sentiment = syuzhet_scores
)

ggplot(syuzhet_df, aes(x = line, y = sentiment)) +
  geom_line(color = "darkgreen") +
  labs(title = "Sentiment trajectory using Syuzhet lexicon",
       x = "Review number", y = "Sentiment score")

Using the syuzhet lexicon, the code above computes one sentiment score per review. Each score is the sum of the valence values of the words in that review that appear in the syuzhet dictionary, so it reflects word-level polarity rather than narrative structure.

The line plot shows how the score varies from one review to the next. Most reviews are slightly positive, with occasional dips into negative sentiment. Because the reviews are independent of one another, the x-axis reflects only their order in the file, so the line is best read as review-to-review variation rather than a trend over time.
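
Since the review order carries no particular meaning, a summary of the score distribution can complement the line plot. The following sketch, on three made-up reviews (`toy_reviews` is hypothetical), shows that `get_sentiment()` returns one numeric score per input string and how the scores might be summarized by sign.

```r
library(syuzhet)

# Hypothetical reviews, not taken from the hotel dataset
toy_reviews <- c(
  "The staff were wonderful and the room was lovely.",
  "Terrible service and an awful, dirty room.",
  "We stayed for two nights."
)

# One score per review using the default syuzhet dictionary
toy_scores <- get_sentiment(toy_reviews, method = "syuzhet")

# Distribution summary and counts of negative (-1), neutral (0), positive (1)
print(summary(toy_scores))
print(table(sign(toy_scores)))
```

Applied to `syuzhet_scores`, the same two summary calls would give the share of reviews that score below, at, or above zero.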

Summary

Overall, the hotel reviews show mostly positive sentiment across all of the tools used.

The Bing lexicon found more positive words than negative ones.
The NRC lexicon showed that emotions like joy, trust, and anticipation were the most common, meaning people often felt good about their hotel experience.
The AFINN lexicon assigned each review a numeric score, and most scores were on the positive side.
The Syuzhet scores show how the tone of the reviews varies from one to the next; most were positive, but a few were negative.

Taken together, these lexicons show not only whether guests were happy or unhappy, but also which kinds of emotions they expressed.