First, load the libraries and the dataset.
# Load necessary libraries
library(ggplot2)
library(dplyr)
library(tidyr)
library(purrr)
# Load the dataset
coffee <- read.csv("F:/CP8883/coffee.csv")
Next, I recreate each plot and give a short interpretation of it.
Note that I’ve added titles and axis labels for readability; some of
the plots in the instructions do not include them.
ggplot(coffee, aes(x = as.factor(avg_rating), y = hhincome)) +
  geom_boxplot() +
  labs(title = "Average Rating vs Household Income",
       x = "Average Rating",
       y = "Household Income (Median Annual)") +
  theme_gray()
Findings: The boxplot shows that businesses with an average
rating of 3 or 4 tend to be located in areas with a higher median
household income, with some outliers extending beyond $150,000, while
businesses with an average rating of 1 or 2 tend to be in lower-income
areas. This suggests a possible relationship in which businesses in
wealthier areas receive higher Yelp ratings, although the spread within
each rating category is large. The exception is the top category:
businesses with an average rating of 5 tend to be located in areas with
lower median household income, which runs against the trend seen for
the other ratings. One might expect the highest-rated businesses to be
in the wealthiest areas, but the plot suggests that businesses in
lower-income areas can still achieve excellent customer satisfaction,
and that factors other than income are influencing these high ratings.
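A boxplot does not show how many businesses sit behind each box, so one way to sanity-check the rating-5 pattern is to tabulate the count and median income per rating. The following is a minimal sketch of that check, assuming only the `avg_rating` and `hhincome` columns used above. The helper name `summarise_income_by_rating` and the toy data frame are hypothetical and are not taken from coffee.csv.

```r
library(dplyr)

# Hypothetical helper: number of businesses and median income per rating level
summarise_income_by_rating <- function(df) {
  df %>%
    group_by(avg_rating) %>%
    summarise(
      n = n(),
      median_income = median(hhincome, na.rm = TRUE),
      .groups = "drop"
    )
}

# Toy data for illustration only (NOT values from coffee.csv)
toy <- data.frame(
  avg_rating = c(1, 1, 3, 3, 3, 5),
  hhincome   = c(30000, 40000, 60000, 80000, 100000, 35000)
)

rating_summary <- summarise_income_by_rating(toy)
print(rating_summary)
```

Running `summarise_income_by_rating(coffee)` on the real data would show whether the rating-5 box rests on only a handful of businesses, in which case the pattern should be read with more caution.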
ggplot(coffee, aes(x = as.factor(avg_rating), y = hhincome)) +
  geom_boxplot() +
  facet_wrap(~ county) +
  labs(title = "Average Rating vs Household Income by County",
       x = "Average Yelp Rating",
       y = "Median Annual Household Income ($)") +
  theme_gray()
Findings: The faceted plot reveals clear differences between
counties. Fulton and Cobb show a broader spread of income levels across
all rating categories. DeKalb County shows income rising with rating,
suggesting that businesses in wealthier neighborhoods there tend to get
higher ratings. Clayton and Gwinnett Counties show less variation, with
most businesses clustering in lower or middle income brackets regardless
of rating. Overall, the relationship between Yelp ratings and income
varies by county, and the panels also reflect the distinct income
distributions of the counties themselves. As in the first plot, ratings
of 5 seem to cluster in lower-income areas.
ggplot(coffee, aes(x = review_count_log, y = hhincome, color = pct_white)) +
  geom_point(alpha = 0.6, size = 3) +
  facet_wrap(~ county, ncol = 3) +  # lay the county panels out in 3 columns
  scale_color_gradient(low = "blue", high = "red") +
  labs(title = "Scatterplot: Review Count vs. Household Income",
       x = "Review Count (log)",
       y = "Median Annual Household Income",
       color = "Proportion of residents \nwho self-identified as white") +
  theme_minimal() +
  # Facet strip, panel border, and grid styling.
  # `linewidth` replaces the deprecated `size` for lines in ggplot2 >= 3.4.0.
  theme(
    strip.background = element_rect(fill = "gray90", color = "black", linewidth = 0.3),
    strip.text = element_text(size = 8),
    panel.border = element_rect(color = "black", fill = NA, linewidth = 0.3),
    panel.grid.major = element_line(linewidth = 0.5)
  )
Findings: The color gradient indicates that areas with
a higher proportion of white residents tend to have more reviews,
particularly in Fulton and Cobb counties. There is a weak positive
relationship between household income and review count in most counties,
though its strength varies. For example, in DeKalb and Clayton counties,
lower-income areas tend to have fewer reviews. Taken together,
businesses in predominantly white and wealthier areas tend to attract
more reviews, which suggests that both the demographic and the economic
characteristics of a neighborhood are related to customer engagement
and online business visibility.
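The per-county statements above are read off the scatterplots by eye. A rough way to put numbers on them is to compute the Pearson correlation within each county. This is only a sketch: the helper name `cor_by_county` and the toy data are hypothetical, and it assumes the `county`, `review_count_log`, `hhincome`, and `pct_white` columns used in the plot.

```r
library(dplyr)

# Hypothetical helper: within-county Pearson correlation of logged review
# count with household income and with the share of white residents
cor_by_county <- function(df) {
  df %>%
    group_by(county) %>%
    summarise(
      n        = n(),
      r_income = cor(review_count_log, hhincome, use = "complete.obs"),
      r_white  = cor(review_count_log, pct_white, use = "complete.obs"),
      .groups  = "drop"
    )
}

# Toy data for illustration only (NOT values from coffee.csv)
toy <- data.frame(
  county           = rep(c("A", "B"), each = 4),
  review_count_log = c(1, 2, 3, 4, 1, 2, 3, 4),
  hhincome         = c(10, 20, 30, 40, 40, 30, 20, 10),
  pct_white        = c(0.2, 0.4, 0.5, 0.9, 0.5, 0.5, 0.6, 0.4)
)

county_cor <- cor_by_county(toy)
print(county_cor)
```

Calling `cor_by_county(coffee)` would give one row per county, which makes it easier to see where the income and race relationships are strongest and how many businesses each estimate is based on.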
First, I’m going to perform some data and label preparation.
# Reshape to long format: one row per business and neighborhood variable
coffee_long <- coffee %>%
  pivot_longer(cols = c(hhincome, pct_pov_log, pct_white, pop),
               names_to = "variable", values_to = "value")
# Compute Pearson's R and p-value for each variable (pooled across counties)
cor_results_overall <- coffee_long %>%
  group_by(variable) %>%
  summarise(cor_test = list(cor.test(review_count_log, value, method = "pearson"))) %>%
  mutate(
    R = map_dbl(cor_test, ~ .x$estimate),
    p_value = map_dbl(cor_test, ~ .x$p.value)
  )
# Add formatted R and p-values as labels (2 significant digits)
cor_results_overall <- cor_results_overall %>%
  mutate(
    label = paste0("R = ", signif(R, 2), ", p = ", signif(p_value, 2))
  )
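To make the reshape step above more concrete, here is a small sketch of what `pivot_longer()` does to a wide table. The toy data frame is hypothetical and uses only two of the four neighborhood variables. Each input row becomes one output row per pivoted column, with the column name stored in `variable` and the number in `value`.

```r
library(tidyr)

# Toy data for illustration only (NOT values from coffee.csv)
toy <- data.frame(
  review_count_log = c(1.1, 2.3),
  hhincome         = c(50000, 72000),
  pct_white        = c(0.35, 0.80)
)

# Two rows x two pivoted columns gives four rows in long format
toy_long <- pivot_longer(toy, cols = c(hhincome, pct_white),
                         names_to = "variable", values_to = "value")
print(toy_long)
```

This long layout is what lets a single `facet_wrap(~ variable)` call draw one panel per neighborhood characteristic, and lets `group_by(variable)` compute one correlation per panel.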
Now, plotting.
ggplot(coffee_long, aes(x = review_count_log, y = value, color = county)) +
  geom_point(alpha = 1) +
  geom_smooth(method = "lm", se = FALSE) +  # per-county linear trend lines
  facet_wrap(~ variable, scales = "free", labeller = labeller(variable = c(
    hhincome = "Median Annual Household Income ($)",
    pct_pov_log = "Percent Residents Under Poverty (log)",
    pct_white = "Percent White Resident",
    pop = "Total Population"
  ))) +
  # Overall (pooled) R and p-value label pinned to the top of each panel
  geom_text(data = cor_results_overall, aes(x = 1, y = Inf, label = label),
            vjust = 1.5, hjust = 0.2, inherit.aes = FALSE, size = 3) +
  labs(title = "Scatterplot between logged review count & neighborhood characteristics",
       subtitle = "Using Yelp data in Five Counties Around Atlanta, GA",
       x = "Review Count Logged",
       y = "Values",
       color = "County") +
  theme_minimal() +
  # `linewidth` replaces the deprecated `size` for lines in ggplot2 >= 3.4.0
  theme(strip.background = element_rect(fill = "gray90", color = "black", linewidth = 0.3),
        strip.text = element_text(size = 8),
        panel.border = element_rect(color = "black", fill = NA, linewidth = 0.3))
Findings: The top-left panel shows a weak positive
correlation between household income and logged review count (R = 0.13,
p = 0.011), while the logged percentage of residents under poverty has a
weak negative correlation with review count (R = -0.19, p = 0.00032).
The percentage of white residents shows the strongest correlation of the
four, although it is still modest (R = 0.28, p = 7.5e-08), while total
population has no significant relationship (R = -0.017, p = 0.75). These
correlations are pooled across all five counties. They suggest that
neighborhood race and income composition are associated with the number
of reviews a business receives, whereas population size is not.
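Several of the correlations above are weak yet have very small p-values. This happens because the p-value from a Pearson `cor.test()` depends on the sample size as well as on R: the test statistic is t = R * sqrt(n - 2) / sqrt(1 - R^2) with n - 2 degrees of freedom. The sketch below reproduces that calculation by hand on simulated toy data, which is an assumption for illustration and not the coffee dataset.

```r
# Toy data for illustration only (NOT values from coffee.csv)
set.seed(1)
x <- rnorm(50)
y <- 0.3 * x + rnorm(50)

# Built-in Pearson correlation test
ct <- cor.test(x, y, method = "pearson")

# The same quantities computed by hand
r        <- cor(x, y)
n        <- length(x)
t_stat   <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_manual <- 2 * pt(-abs(t_stat), df = n - 2)  # two-sided p-value

# Side-by-side comparison of the built-in and manual results
print(c(t_builtin = unname(ct$statistic), t_manual = t_stat))
print(c(p_builtin = ct$p.value, p_manual = p_manual))
```

For a fixed R, a larger n gives a larger t and a smaller p, so statistical significance here speaks to how reliably the association is detected, not to how strong it is.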