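knitr::opts_chunk$set(fig.width = 10, fig.height = 7, fig.align = "center", dpi = 300)
# Load the packages used below: here(), %>%, select(), pivot_longer(), ggplot()
library(here)
library(tidyverse)
# Load dataset
coffee <- read.csv(here("coffee.csv"))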
# Plot 1 - box plot
# Keep only the columns used in the plots below
coffee_clean <- coffee %>%
  select(GEOID, avg_rating, county, hhincome, review_count_log, pct_white, pct_pov_log, pop)
# Generate box plot of household income by average rating
coffee_rating_boxplot <- ggplot(coffee_clean, aes(x = factor(avg_rating), y = hhincome)) +
  geom_boxplot() +
  theme(
    plot.title = element_text(size = 15, face = "bold", hjust = 0.5),
    axis.title = element_text(size = 10, face = "bold"),
    axis.text = element_text(size = 14),
    legend.title = element_text(size = 16, face = "bold"),
    legend.text = element_text(size = 14)
  ) +
  labs(
    x = "Average Rating (avg_rating)",
    y = "Median Annual Household Income (hhincome, $)"
  )
print(coffee_rating_boxplot)
Based on the box plot, median household income (hhincome) is highest where the average rating equals 4. The distribution of hhincome also shows the most outliers where the average rating is 2. This suggests that coffee shops with moderately high ratings tend to be located in areas with higher median household incomes, whereas lower-rated shops (average rating = 2) appear across a wider range of income levels. Note that outliers mark individual shops far from the bulk of the distribution; the height of the box (the interquartile range) is the more direct measure of variability in the surrounding socioeconomic context.
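The reading of the box plot can be cross-checked numerically. The following is a minimal sketch, not the analysis actually run for this report: `toy` is a made-up stand-in for `coffee_clean`, and `count_outliers()` is a hypothetical helper that applies the 1.5 × IQR rule, which is the default outlier rule of `geom_boxplot()`.

```r
library(dplyr)

# Hypothetical toy data standing in for coffee_clean (all values are made up)
toy <- data.frame(
  avg_rating = c(2, 2, 2, 2, 2, 4, 4, 4, 4, 4),
  hhincome   = c(30000, 32000, 34000, 36000, 90000,
                 60000, 65000, 70000, 75000, 80000)
)

# Hypothetical helper: count values more than 1.5 * IQR beyond the quartiles
count_outliers <- function(x) {
  x <- x[!is.na(x)]
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  sum(x < q[1] - fence | x > q[2] + fence)
}

# One row per rating level: sample size, median, spread, and outlier count
rating_summary <- toy %>%
  group_by(avg_rating) %>%
  summarise(
    n_shops         = n(),
    median_hhincome = median(hhincome, na.rm = TRUE),
    iqr_hhincome    = IQR(hhincome, na.rm = TRUE),
    n_outliers      = count_outliers(hhincome),
    .groups = "drop"
  )

print(rating_summary)
```

Replacing `toy` with `coffee_clean` would give the same summary for the real data; comparing `iqr_hhincome` across rating levels separates "more outliers" from "more variability".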
# Plot 2 - box plot by county
coffee_rating_by_county_boxplot <- ggplot(coffee_clean, aes(x = factor(avg_rating), y = hhincome)) +
  geom_boxplot() +
  theme(
    plot.title = element_text(size = 15, face = "bold", hjust = 0.5),
    axis.title = element_text(size = 10, face = "bold"),
    axis.text = element_text(size = 14),
    legend.title = element_text(size = 16, face = "bold"),
    legend.text = element_text(size = 14)
  ) +
  labs(
    x = "Average Rating",
    y = "Median Annual Household Income ($)"
  ) +
  facet_wrap(~ county)
print(coffee_rating_by_county_boxplot)
When we break the box plots out by county, we observe patterns that
offer insight into each county’s socioeconomic dynamics. In Clayton
County, the boxes are very tight, which indicates little variation in
household income around its coffee shops and may also reflect a limited
number of shops in the dataset. Moreover, Clayton has no coffee shops
with an average rating of 5, which may reflect either a smaller sample
size or relatively lower customer satisfaction compared to other
counties. In contrast, the box plots for Fulton County exhibit long
whiskers across almost all rating categories (except average rating = 1),
indicating a wide range of household incomes (hhincome) along the
y-axis. These patterns suggest that more urbanized and economically
diverse areas, such as Fulton County, host coffee shops catering to a
broader socioeconomic spectrum, while smaller or less economically
vibrant counties like Clayton display more uniform market conditions.
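Whether tight boxes reflect a small sample or genuinely uniform incomes can be checked by counting shops per county and rating. The sketch below is illustrative only: `toy` and its county labels and values are made up, and the real `county` column may code counties differently.

```r
library(dplyr)

# Hypothetical toy data standing in for coffee_clean (all values are made up)
toy <- data.frame(
  county     = c(rep("Clayton County", 3), rep("Fulton County", 5)),
  avg_rating = c(3, 4, 4, 2, 3, 4, 5, 5),
  hhincome   = c(40000, 42000, 45000, 30000, 55000, 80000, 120000, 150000)
)

# Number of shops behind each box in the faceted plot
shops_by_cell <- toy %>%
  count(county, avg_rating, name = "n_shops")

# Per-county sample size, income range, and whether any shop is rated 5
spread_by_county <- toy %>%
  group_by(county) %>%
  summarise(
    n_shops      = n(),
    income_range = max(hhincome) - min(hhincome),
    has_rating_5 = any(avg_rating == 5),
    .groups = "drop"
  )

print(shops_by_cell)
print(spread_by_county)
```

A county with few shops per cell and a narrow `income_range` would point to the small-sample explanation; many shops with a narrow range would point to uniform incomes.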
# Plot 3 - scatterplot: review count vs. household income
coffee_review_by_county_scatterplot <- ggplot(coffee_clean, aes(x = review_count_log, y = hhincome, color = pct_white)) +
  geom_point() +
  theme(
    plot.title = element_text(size = 15, face = "bold", hjust = 0),
    axis.title = element_text(size = 10, face = "bold"),
    axis.text = element_text(size = 14),
    legend.title = element_text(size = 10, face = "bold", lineheight = 0.9),
    legend.text = element_text(size = 5)
  ) +
  labs(
    title = "Scatterplot: Review Count vs. Household Income",
    x = "Review Count (log)",
    y = "Median Annual Household Income",
    color = "Proportion of residents who self-\nidentified as white"
  ) +
  scale_color_gradient(low = "darkblue", high = "red") +
  facet_wrap(~ county)
print(coffee_review_by_county_scatterplot)
In this scatter plot, we observe that counties whose coffee shops have
higher review counts also tend to have a higher proportion of residents
who self-identify as White. Additionally, in these counties,
particularly DeKalb County, a higher proportion of such residents is
associated with higher median annual household income. Comparing this
with the previous box plots, even without background knowledge about
U.S. counties, we can infer that DeKalb County likely has a more
affluent population, whereas Clayton County appears to be home to
relatively less affluent residents.
# Plot 4 - scatterplots of logged review count vs. neighborhood characteristics
# Reshape to long format so each characteristic gets its own facet
coffee_long <- coffee_clean %>%
  pivot_longer(
    cols = c(hhincome, pct_white, pct_pov_log, pop),
    names_to = "characteristic",
    values_to = "value"
  )
coffee_long_characteristics <- ggplot(coffee_long, aes(x = review_count_log, y = value, color = county)) +
  geom_point(alpha = 0.6) +
  # State the formula explicitly to avoid the geom_smooth() message
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  # Pearson correlation per facet, pooled across counties (inherit.aes = FALSE drops the color grouping)
  ggpubr::stat_cor(
    data = coffee_long,
    inherit.aes = FALSE,
    aes(x = review_count_log, y = value,
        # after_stat() replaces the dot-dot notation deprecated in ggplot2 3.4.0
        label = paste(after_stat(r.label), after_stat(p.label), sep = "~`,`~")),
    method = "pearson",
    label.x.npc = "left",
    label.y.npc = "top",
    size = 3.5
  ) +
  facet_wrap(
    ~ characteristic,
    scales = "free_y",
    labeller = as_labeller(c(
      hhincome = "Median Annual Household Income ($)",
      pct_white = "Residents who self-identify as White (%)",
      pct_pov_log = "Residents Under Poverty Level (%; log)",
      pop = "Total Population"
    ))
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 13, face = "bold", hjust = 0),
    axis.title = element_text(size = 12, face = "bold"),
    axis.text = element_text(size = 10),
    legend.title = element_text(size = 12, face = "bold"),
    legend.text = element_text(size = 10),
    strip.text = element_text(size = 10, face = "bold", color = "black"),
    strip.background = element_rect(fill = "lightgray")
  ) +
  labs(
    title = "Scatterplot between logged review count & neighborhood characteristics",
    x = "Review Count (log)",
    y = "Values",
    color = "County"
  )
print(coffee_long_characteristics)
The scatter plots between logged review counts and neighborhood characteristics provide insights into the demographic and socioeconomic composition of the neighborhoods in each county. For median annual household income, we observe a weak but statistically significant positive relationship with the log of review counts: areas with higher review counts tend to have slightly higher median incomes, though the effect size is small. Similarly, there is a weak negative relationship between the proportion of residents living below the poverty level and the log of review counts, suggesting that neighborhoods with higher poverty rates tend to have slightly lower review counts.
Among all variables considered, the strongest association is between the proportion of residents who self-identify as White and the log of review counts, with DeKalb County showing the most pronounced relationship. Interestingly, total population exhibits a negative association with log review counts, suggesting that a larger population does not necessarily translate into more engagement with coffee shops or more review activity. Overall, these patterns suggest that neighborhoods with higher socioeconomic status and a larger share of White residents tend to be associated with more reviews, while areas with higher poverty or larger populations tend to show lower review activity.
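The correlation labels in the plot can also be tabulated directly, which makes it easier to compare strength and direction across characteristics. This is a minimal sketch under stated assumptions: `toy` is made-up data standing in for `coffee_clean`, only two of the four characteristics are included, and the correlations are pooled across counties, as in the plot annotation.

```r
# Hypothetical toy data standing in for coffee_clean (all values are made up)
toy <- data.frame(
  review_count_log = c(1, 2, 3, 4, 5, 6),
  hhincome         = c(40000, 45000, 43000, 60000, 58000, 70000),
  pop              = c(9000, 8500, 8800, 7000, 7400, 6000)
)

characteristics <- c("hhincome", "pop")

# One Pearson test per characteristic against logged review count
cor_table <- do.call(rbind, lapply(characteristics, function(v) {
  test <- cor.test(toy$review_count_log, toy[[v]], method = "pearson")
  data.frame(
    characteristic = v,
    r              = unname(test$estimate),
    p_value        = test$p.value
  )
}))

print(cor_table)
```

With the real data, extending `characteristics` to all four columns and sorting by `abs(r)` would show which association is strongest; repeating the test within each county would check the county-specific claim about DeKalb.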