library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidytext)
## Warning: package 'tidytext' was built under R version 4.4.3
library(readr)
library(ggplot2)
library(syuzhet)
## Warning: package 'syuzhet' was built under R version 4.4.3
library(dplyr)
In this assignment I run a sentiment analysis on a dataset of hotel reviews. I follow the lexicon-based approach from Chapter 2 of “Text Mining with R”, using the Bing, NRC, and AFINN lexicons, and then use the “syuzhet” R package for an additional sentiment score.
reviews_raw <- read_delim(
"https://raw.githubusercontent.com/farhodibr/CUNY-SPS-MSDS/main/DATA607/LAB10/DATA/REVIEWS/dataset-CalheirosMoroRita-2017.csv",
delim = ";",
locale = locale(encoding = "latin1")
)
## Rows: 401 Columns: 1
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr (1): Review
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# View structure
glimpse(reviews_raw)
## Rows: 401
## Columns: 1
## $ Review <chr> " Everything from the weather, staff, food, property, fire pits…
# Create tidy tibble
review_data <- tibble(line = seq_len(nrow(reviews_raw)), text = reviews_raw$Review)
# Tokenize
tidy_reviews <- review_data |>
unnest_tokens(word, text)
head(tidy_reviews)
## # A tibble: 6 × 2
## line word
## <int> <chr>
## 1 1 everything
## 2 1 from
## 3 1 the
## 4 1 weather
## 5 1 staff
## 6 1 food
bing <- get_sentiments("bing")
bing_sentiment <- tidy_reviews |>
inner_join(bing, by = "word") |>
count(sentiment, sort = TRUE)
print(bing_sentiment)
## # A tibble: 2 × 2
## sentiment n
## <chr> <int>
## 1 positive 1877
## 2 negative 219
ggplot(bing_sentiment, aes(x = sentiment, y = n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
labs(title = "Hotel review sentiment (Bing Lexicon)", x = "Sentiment", y = "Word Count")
The Bing lexicon matched 1,877 positive words and 219 negative words, so roughly 90% of the matched words are positive, which suggests an overall favorable tone in the reviews. The bar plot visualizes the same two totals.
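To put a number on how strongly positive words dominate, the two counts reported above can be converted into shares. This is a minimal sketch: `bing_counts` is a hypothetical stand-in for `bing_sentiment`, filled with the counts 1877 and 219 so that the snippet runs on its own.

```r
library(dplyr)
library(tibble)

# Counts copied from the Bing output above; `bing_counts` is a hypothetical
# stand-in for `bing_sentiment` so the snippet runs without the lexicon.
bing_counts <- tibble(
  sentiment = c("positive", "negative"),
  n = c(1877L, 219L)
)

# Share of each polarity among all matched words.
bing_share <- bing_counts |>
  mutate(share = n / sum(n))

bing_share
```

The same `mutate()` call should work on `bing_sentiment` directly, since it has the same two columns.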
nrc <- get_sentiments("nrc")
# One word can carry several NRC categories, so a many-to-many join is expected
nrc_sentiment <- tidy_reviews |>
inner_join(nrc, by = "word", relationship = "many-to-many") |>
count(sentiment, sort = TRUE)
print(nrc_sentiment)
## # A tibble: 10 × 2
## sentiment n
## <chr> <int>
## 1 positive 1703
## 2 joy 1073
## 3 trust 902
## 4 anticipation 621
## 5 surprise 446
## 6 negative 252
## 7 sadness 158
## 8 fear 114
## 9 anger 91
## 10 disgust 67
ggplot(nrc_sentiment, aes(x = reorder(sentiment, n), y = n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
coord_flip() +
labs(title = "Emotion distribution (NRC lexicon)", x = "Emotion", y = "Word count")
The most frequent NRC categories are positive (1,703), joy (1,073), and trust (902), followed by anticipation (621), showing that guests frequently express satisfaction and confidence in their stay. Note that positive and negative are polarity labels rather than emotions, and that a single word can count toward several categories. In the horizontal bar chart, negative (252) and the negative emotions sadness (158), fear (114), anger (91), and disgust (67) are far less frequent.
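Because one word can belong to several NRC categories, the category counts add up to more than the number of matched words. The sketch below illustrates this with a toy lexicon. The names `toy_tokens` and `toy_lexicon` are hypothetical, and the category assignments are made up for illustration rather than taken from NRC.

```r
library(dplyr)
library(tibble)

# Hypothetical toy tokens and a toy lexicon, only to show the join behaviour;
# the category assignments are invented and are not the real NRC entries.
toy_tokens <- tibble(
  line = c(1L, 1L, 2L),
  word = c("friendly", "clean", "friendly")
)
toy_lexicon <- tibble(
  word = c("friendly", "friendly", "friendly", "clean"),
  sentiment = c("positive", "joy", "trust", "positive")
)

# Each occurrence of "friendly" matches three lexicon rows.
toy_counts <- toy_tokens |>
  inner_join(toy_lexicon, by = "word", relationship = "many-to-many") |>
  count(sentiment, sort = TRUE)

nrow(toy_tokens)   # 3 tokens go in
sum(toy_counts$n)  # 7 lexicon matches come out
```

For this reason the NRC counts are best read as counts of word-category matches, not as counts of distinct words.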
afinn <- get_sentiments("afinn")
afinn_sentiment <- tidy_reviews |>
inner_join(afinn, by = "word") |>
group_by(line) |>
summarise(sentiment_score = sum(value))
print(afinn_sentiment)
## # A tibble: 396 × 2
## line sentiment_score
## <int> <dbl>
## 1 1 0
## 2 2 9
## 3 3 11
## 4 4 4
## 5 5 7
## 6 6 7
## 7 7 10
## 8 8 13
## 9 9 7
## 10 10 7
## # ℹ 386 more rows
ggplot(afinn_sentiment, aes(x = sentiment_score)) +
geom_histogram(binwidth = 1, fill = "steelblue", color = "white") +
labs(title = "Distribution of sentiment scores (AFINN lexicon)", x = "Score per review", y = "Count")
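This histogram shows the distribution of per-review sentiment scores, where each score is the sum of the AFINN values of the matched words in a review. Most reviews cluster on the positive side of the scale, with only a few strongly negative outliers. Only 396 of the 401 reviews appear, because inner_join() drops reviews that contain no AFINN word.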
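The gap between the 401 reviews read in and the 396 scored rows comes from how `inner_join()` works: a review with no lexicon match contributes no rows at all. If every review should receive a score, one option is a `left_join()` that treats unmatched words as 0. The sketch below uses hypothetical toy data (`toy_tokens`, `toy_afinn`) rather than the real lexicon.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data: review 3 contains no word from the toy lexicon.
toy_tokens <- tibble(
  line = c(1L, 1L, 2L, 3L),
  word = c("great", "staff", "dirty", "room")
)
# Toy stand-in for the AFINN lexicon (illustrative values).
toy_afinn <- tibble(word = c("great", "dirty"), value = c(3, -2))

# inner_join() keeps only matched words, so review 3 disappears.
scores_inner <- toy_tokens |>
  inner_join(toy_afinn, by = "word") |>
  group_by(line) |>
  summarise(sentiment_score = sum(value))

# left_join() keeps every token; unmatched words become NA and are
# ignored by sum(..., na.rm = TRUE), giving review 3 a score of 0.
scores_all <- toy_tokens |>
  left_join(toy_afinn, by = "word") |>
  group_by(line) |>
  summarise(sentiment_score = sum(value, na.rm = TRUE))

nrow(scores_inner)  # 2
nrow(scores_all)    # 3
```

Whether a score of 0 is the right value for a review with no matches is a judgment call: it reads as neutral, although it really means that the lexicon had nothing to say.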
reviews_data <- reviews_raw$Review
syuzhet_scores <- get_sentiment(reviews_data, method = "syuzhet")
syuzhet_df <- tibble(
line = seq_along(syuzhet_scores),
sentiment = syuzhet_scores
)
ggplot(syuzhet_df, aes(x = line, y = sentiment)) +
geom_line(color = "darkgreen") +
labs(title = "Sentiment trajectory using Syuzhet lexicon",
x = "Review number", y = "Sentiment score")
Using the syuzhet package's default “syuzhet” lexicon, the code above computes one sentiment score per review. Each score is the sum of the valence values of the lexicon words found in that review, so longer reviews can reach larger absolute scores.
The line plot shows how the score varies from one review to the next. While most reviews are slightly positive, there are occasional dips into negative sentiment. Because the x-axis is simply the row order of the file, the line shows variation between reviews rather than a trend over time, unless the file happens to be ordered chronologically.
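A claim such as "most reviews are slightly positive" can be checked by tabulating the sign of each score. The following is a minimal sketch in base R. `toy_scores` is a hypothetical vector standing in for `syuzhet_scores`, and its values are invented for illustration.

```r
# Hypothetical scores standing in for `syuzhet_scores`; the values are
# invented for illustration and are not taken from the hotel reviews.
toy_scores <- c(1.25, 0.40, -0.75, 0, 2.10, -0.10)

# sign() maps each score to -1, 0 or 1, which is then labelled.
polarity <- factor(
  sign(toy_scores),
  levels = c(-1, 0, 1),
  labels = c("negative", "neutral", "positive")
)

# Counts and proportions per polarity class.
polarity_counts <- table(polarity)
polarity_counts
prop.table(polarity_counts)
```

Replacing `toy_scores` with `syuzhet_scores` should give the corresponding breakdown for the actual reviews.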
Overall, the hotel reviews show mostly positive sentiment across all the tools used.
The Bing lexicon found far more positive words than negative ones (1,877 versus 219).
The NRC lexicon showed that joy, trust, and anticipation were the most common emotions, meaning people often felt good about their hotel experience.
The AFINN scores gave each review a number, and most were on the positive side.
The syuzhet scores show how the tone varies from one review to the next, and while most were positive, a few were negative.
Taken together, these lexicons show not only whether guests were happy or unhappy, but also which kinds of emotions they expressed.