Executive summary

I set out to learn how management approaches the reviews that customers leave for the properties they visited during their vacations.

My questions

What is management's strategy: does it focus on happy or unhappy customers?
How does a guest's dissatisfaction, as reflected in the rating, relate to the length of the review?
I also checked how demographics affect ratings and expectations of what a 5-star experience is.

Data background

I used the publicly available TripAdvisor Hotel Reviews Dataset to study the differences among the ratings that customers leave: how satisfied and unsatisfied guests differ, how their ratings relate to the length of their feedback, and how management approaches the reviews.

Data cleaning

library(flexdashboard)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(plotly)

## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout

# Load the data
df <- read_csv("C:/Users/yudit/Downloads/final project/tripadvisor_hotel_reviews_dataset.csv") %>%
  mutate(
    has_response = !is.na(management_response),
    word_count = str_count(review_text, "\\S+"),
    trip_type = str_to_title(ifelse(is.na(trip_type) | trip_type == "NONE", "Other", trip_type))
  )

## Rows: 1098 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (13): hotel_name, place_type, category, hotel_url, review_title, review...
## dbl   (3): review_id, rating, helpful_votes
## date  (2): publishedDate, stay_date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
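
To make the three derived columns easier to check, here is a minimal sketch that applies the same `mutate()` logic to a few made-up rows. The toy rows and the uppercase `"FAMILY"` code are assumptions for illustration; only the column names and the `"NONE"` value come from the loading step above.

```r
library(dplyr)
library(stringr)

# Hypothetical toy rows standing in for the real CSV (not actual dataset values)
toy <- data.frame(
  review_text         = c("Great stay", "Room was dirty and the staff ignored us", NA),
  management_response = c(NA, "We are sorry to hear that.", NA),
  trip_type           = c("FAMILY", "NONE", NA),
  stringsAsFactors    = FALSE
)

toy_clean <- toy %>%
  mutate(
    # TRUE when management wrote any reply
    has_response = !is.na(management_response),
    # Count runs of non-whitespace characters, i.e. words
    word_count   = str_count(review_text, "\\S+"),
    # Missing or "NONE" trip types are grouped under "Other"
    trip_type    = str_to_title(ifelse(is.na(trip_type) | trip_type == "NONE", "Other", trip_type))
  )

toy_clean %>% select(has_response, word_count, trip_type)
```

Note that a review with missing text gets a `word_count` of `NA` rather than 0, so such rows drop out of any word-count plot.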

Individual figures

Figure 1

Management Response Strategy (Lollipop Chart) Visual: The Red-to-Green Gradient Lollipop.

Crisis prioritization - Notice the high response rate for 1-star reviews (the ‘Red’ zone).

From the data, it looks like management is currently in “Damage Control” mode: it responds most often to the 1- and 2-star reviews.

response_summary <- df %>%
  group_by(rating) %>%
  summarize(rate = mean(has_response) * 100)

p1 <- ggplot(response_summary, aes(x = as.factor(rating), y = rate, fill = as.factor(rating))) +
  geom_col() +
  scale_fill_brewer(palette = "RdYlGn") +
  labs(x = "Rating (1-5)", y = "% Answered") +
  theme_minimal() +
  guides(fill = "none")

ggplotly(p1)
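
One way to check the “Damage Control” reading is to pool the ratings into bands and compare response rates, keeping the group size next to each rate. The sketch below uses made-up rows, and the band cut-offs (1-2, 3, 4-5) are my assumption rather than part of the original analysis.

```r
library(dplyr)

# Hypothetical ratings and response flags; the real values would come from df
toy <- data.frame(
  rating       = c(1, 1, 1, 2, 2, 3, 4, 4, 5, 5, 5, 5),
  has_response = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE,
                   FALSE, TRUE, FALSE, FALSE, TRUE, FALSE)
)

rate_by_band <- toy %>%
  # Assumed banding: 1-2 = low, 3 = neutral, 4-5 = high
  mutate(band = case_when(
    rating <= 2 ~ "Low (1-2)",
    rating == 3 ~ "Neutral (3)",
    TRUE        ~ "High (4-5)"
  )) %>%
  group_by(band) %>%
  # n shows how many reviews each percentage is based on
  summarize(n = n(), rate = mean(has_response) * 100, .groups = "drop")

rate_by_band
```

Reporting `n` alongside `rate` matters here because low ratings are rare, so a high percentage may rest on only a handful of reviews.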

Figure 2

The “Shape” of guest happiness.

Satisfaction Density (Violin Plot)

The long ‘tails’ reaching down to the 1-star mark represent the critical risk areas.

This visual shows that while failures are rare, they are high-impact and receive the most responses.

# 1. Calculate the response rate per rating (stored as 'response_rate' for the plot below)
response_summary <- df %>%
  group_by(rating) %>%
  summarize(
    response_rate = mean(has_response) * 100
  )

# 2. CREATE the Lollipop Chart
p_lollipop <- ggplot(response_summary, aes(x = as.factor(rating), y = response_rate)) +
  # Draw the 'stick' of the lollipop
  geom_segment(aes(x = as.factor(rating), 
                   xend = as.factor(rating), 
                   y = 0, 
                   yend = response_rate), 
               color = "grey70") +
  # Draw the 'candy' (the dot)
  geom_point(aes(color = response_rate), size = 5) +
  scale_color_gradient(low = "#e41a1c", high = "#4daf4a") +
  coord_flip() + 
  theme_minimal() +
  labs(
    title = "Response Rate Efficiency",
    x = "Guest Rating", 
    y = "Percent Answered (%)"
  ) +
  theme(legend.position = "none")

# 3. Convert to interactive
# The plot relies on the 'response_rate' column created in step 1
ggplotly(p_lollipop)

Figure 3

Demographic Comparison (Faceted Boxplots)

Boxplots split by traveler type (Business, Family, etc.). The plot shows consistency across all guest types.

Across almost every group, the 1-star ‘boxes’ sit much higher on the Y-axis (word count) than the 5-star boxes.

This suggests that dissatisfaction creates a heavier reading workload for management, regardless of who the guest is.

p2 <- df %>%
  filter(!is.na(trip_type) & trip_type != "NONE") %>%
  ggplot(aes(x = trip_type, y = rating, fill = trip_type)) +
  geom_violin(alpha = 0.5) +
  coord_flip() +
  theme_minimal() +
  labs(x = "") +
  guides(fill = "none")


ggplotly(p2)

Figure 4

The ‘Venting’ Archetype (Faceted Scatter Plot) Scatter points with black regression lines.

According to the dataset, unhappy guests write more.

As the rating goes up, the word count goes down. We call this the Venting Archetype. When guests feel wronged, they write ‘novels’ to justify their score.

# 1. Filter data to include only valid traveler types
boxplot_data <- df %>%
  filter(!is.na(trip_type)) %>%
  filter(trip_type != "Other" & trip_type != "NONE")

# 2. Create the faceted boxplot
# Note: 'word_count' is created in the data-loading step
p_faceted <- ggplot(boxplot_data, aes(x = as.factor(rating), y = word_count, fill = as.factor(rating))) +
  geom_boxplot(outlier.shape = NA, alpha = 0.7) +
  facet_wrap(~trip_type) + 
  coord_cartesian(ylim = c(0, 600)) + 
  scale_fill_brewer(palette = "RdYlGn") +
  theme_minimal() +
  labs(
    x = "Rating (1 = Unhappy, 5 = Happy)",
    y = "Review Word Count",
    fill = "Rating"
  ) +
  theme(legend.position = "none")

# 3. Output the interactive version
ggplotly(p_faceted)
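
The scatter plot with regression lines described for this figure could be sketched roughly as follows. The rows are made up, and the choice of a Spearman rank correlation (because ratings are ordinal) is my assumption, not something taken from the original analysis.

```r
library(ggplot2)

# Hypothetical toy data; the real analysis would use rating, word_count and trip_type from df
toy <- data.frame(
  rating     = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5),
  word_count = c(420, 380, 300, 260, 200, 210, 150, 120, 90, 60),
  trip_type  = rep(c("Family", "Business"), times = 5)
)

# Rank correlation between rating and review length; a negative value
# means longer reviews tend to come with lower ratings
rho <- cor(toy$rating, toy$word_count, method = "spearman")

# Jittered points with one black linear fit per traveler type
p_scatter <- ggplot(toy, aes(x = rating, y = word_count)) +
  geom_jitter(width = 0.15, height = 0, alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "black") +
  facet_wrap(~trip_type) +
  theme_minimal() +
  labs(x = "Rating (1 = Unhappy, 5 = Happy)", y = "Review Word Count")
```

Passing `p_scatter` to `ggplotly()` would make it interactive like the other figures. A negative `rho` on the real data would support the Venting Archetype.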

Figure 5

Frequency of Effort (Word Count Distribution) The blue Histogram on the bottom left.

To optimize the efficiency of the responses, we first need to quantify the management workload.

While most reviews are short, the high-word-count reviews take up 80% of management’s reading time. By using Word Count as an early warning system, we can flag high-intensity complaints before they even get a response.

# 1. Prepare the data
# We filter out 'NONE' and 'Other' to focus on the main demographics
violin_data <- df %>%
  filter(!is.na(trip_type) & !trip_type %in% c("NONE", "Other")) %>%
  mutate(trip_type = str_to_title(trip_type))

# 2. Create the Violin Plot
p_violin <- ggplot(violin_data, aes(x = trip_type, y = rating, fill = trip_type)) +
  # The violin shows the density of the ratings
  geom_violin(alpha = 0.5, trim = FALSE) +
  # Adding a thin boxplot inside helps show the median and quartiles clearly
  geom_boxplot(width = 0.1, color = "black", outlier.shape = NA, alpha = 0.7) +
  scale_fill_brewer(palette = "Set3") +
  theme_minimal() +
  labs(
    title = "Distribution of Satisfaction by Traveler Type",
    subtitle = "The width of the violin represents the concentration of ratings",
    x = "Traveler Category",
    y = "Rating (1-5)"
  ) +
  theme(legend.position = "none")

# 3. Make it interactive for the dashboard
ggplotly(p_violin)
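
The word-count histogram and the early-warning idea described for this figure could be sketched like this. The word counts are made up, and the 80th-percentile cut-off is my assumption, chosen only for illustration. The last calculation shows one way to measure what share of all words sits in the flagged reviews.

```r
library(ggplot2)

# Hypothetical word counts; the real values would come from df$word_count
toy <- data.frame(word_count = c(20, 35, 40, 55, 60, 80, 90, 120, 400, 600))

# Assumed cut-off: flag reviews above the 80th percentile of word count
threshold <- quantile(toy$word_count, 0.8, names = FALSE)
toy$flagged <- toy$word_count > threshold

# Share of all review words contained in the flagged (longest) reviews
share_flagged <- sum(toy$word_count[toy$flagged]) / sum(toy$word_count) * 100

# Blue histogram with a dashed line marking the flagging threshold
p_hist <- ggplot(toy, aes(x = word_count)) +
  geom_histogram(binwidth = 50, fill = "steelblue", color = "white") +
  geom_vline(xintercept = threshold, linetype = "dashed") +
  theme_minimal() +
  labs(x = "Review Word Count", y = "Number of Reviews")
```

On the real data, `share_flagged` is a rough proxy for reading time, under the assumption that reading time scales with the number of words.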

Conclusion

Our goal is to move away from the large time commitment that reading these ‘novels’ requires.
My recommendation is to bridge the Gratitude Gap. We have mastered Damage Control; now we need to use this data to start rewarding our 5-star ‘promoters’ with the same energy we use to fix the problems of our 1-star ‘detractors’. That way we not only show appreciation to our promoters but also reduce the time it takes management to respond to long, wordy complaints.

TripAdvisor Strategic Analysis

Judy Shulman

3/18/2026