library(ggplot2)
library(tidyverse)
library(tidyr)
library(dplyr)
coffee0 <- read.csv("coffee.csv")
head(coffee0)
## X GEOID county hhincome hhincome_log pct_pov pct_pov_log
## 1 1 13063040202 Clayton County 47946 10.77804 0.1450162 -1.864226
## 2 2 13063040302 Clayton County 45791 10.73206 0.2070401 -1.527673
## 3 3 13063040308 Clayton County 34032 10.43535 0.1587766 -1.779179
## 4 4 13063040407 Clayton County 70364 11.16158 0.1517782 -1.821529
## 5 5 13063040410 Clayton County 52441 10.86763 0.1400612 -1.896712
## 6 6 13063040414 Clayton County 59519 10.99422 0.2449726 -1.366599
## pct_white pop review_count review_count_log avg_rating avg_price n
## 1 0.07664628 2779 57.00000 4.060443 2 1 1
## 2 0.16588302 5317 29.00000 3.401197 1 1 1
## 3 0.20111483 4485 13.00000 1.975622 3 1 2
## 4 0.18585944 4724 29.33333 3.320837 2 1 3
## 5 0.05908257 8175 41.00000 3.737670 1 1 1
## 6 0.18159659 1641 31.00000 3.465247 2 1 2
# prepare the dataset (remove NAs, classify ratings)
coffee <- coffee0 %>%
filter(!is.na(hhincome), !is.na(avg_rating)) %>%
mutate(rating_cat = factor(round(avg_rating), levels = 1:5))
# recreate the boxplot
ggplot(coffee, aes(x = rating_cat, y = hhincome)) +
geom_boxplot(color = "black", fill = "white", outlier.shape = 16, outlier.size = 1) +
labs(x = "avg_rating", y = "hhincome") +
theme_gray(base_size = 10) +
theme(panel.background = element_rect(fill = "gray95"))
My findings: The plot shows a clear upward trend from ratings 1 to 4: higher coffee shop ratings are generally associated with higher household incomes. However, the 5-rating category presents an exception. While some tracts with an average rating of 5 fall among the wealthiest communities, others are located in much lower-income areas, resulting in a wide spread for this group. In addition, several outliers across the 2- to 5-rating categories stand out at both the low and high ends of income, suggesting that certain neighborhoods deviate from the overall socioeconomic gradient.
# plot boxplots, faceted by county
ggplot(coffee, aes(x = rating_cat, y = hhincome)) +
geom_boxplot(color = "black", fill = "white", outlier.shape = 16, outlier.size = 1) +
labs(x = "Average Rating", y = "Median Annual Household Income ($)") +
facet_wrap(~ county) +
theme_gray(base_size = 10) +
theme(panel.background = element_rect(fill = "gray95"))
My findings: Compared with the overall boxplot, the faceted version highlights how the relationship between coffee shop ratings and household income differs across counties.
# scatterplot faceted by county
ggplot(coffee, aes(x = review_count_log, y = hhincome, color = pct_white)) +
geom_point(alpha = 0.7, size = 3) +
scale_color_gradient(low = "blue", high = "red",
name = "Proportion of residents\nwho self-identified as white") +
labs(x = "Review Count (log)", y = "Median Annual Household Income ($)",
title = "Scatterplot: Review Count vs. Household Income") +
facet_wrap(~ county) +
theme_bw(base_size = 10)
My findings: Across all counties, review activity tends to rise with household income, but the strength of this relationship differs. Fulton and DeKalb show clearer socioeconomic and racial divides, as indicated by the shift from blue to red points. Clayton and Gwinnett remain more concentrated at lower income levels, while Cobb falls somewhere in between.
# use pivot_longer
coffee_long <- coffee %>%
select(county, review_count_log, hhincome, pct_pov_log, pct_white, pop) %>%
pivot_longer(
cols = c(hhincome, pct_pov_log, pct_white, pop),
names_to = "variable",
values_to = "value"
)
# compute correlations with Pearson
corrs <- coffee_long %>%
group_by(variable) %>%
summarise(
cor_test = list(cor.test(review_count_log, value, method = "pearson")),
.groups = "drop"
) %>%
mutate(tidy = map(cor_test, broom::tidy)) %>%
unnest(tidy) %>%
mutate(
label = paste0("R = ", signif(estimate, 2),
", p = ", signif(p.value, 2))
) %>%
select(variable, label)
# get facet-wise y-max for annotation placement
y_max <- coffee_long %>%
group_by(variable) %>%
summarise(ypos = max(value, na.rm = TRUE), .groups = "drop")
corrs <- left_join(corrs, y_max, by = "variable") %>%
mutate(xpos = min(coffee_long$review_count_log, na.rm = TRUE) + 0.1)
# labels for facets
var_labels <- c(
hhincome = "Median Annual Household Income ($)",
pct_pov_log = "Residents Under Poverty Level (%; log)",
pct_white = "Residents who self-identify as White (%)",
pop = "Total Population"
)
# plot
ggplot(coffee_long, aes(x = review_count_log, y = value, color = county)) +
geom_point(size = 1) +
geom_smooth(method = "lm", se = FALSE) +
facet_wrap(~ variable, scales = "free_y", labeller = labeller(variable = var_labels)) +
geom_text(
data = corrs,
aes(x = xpos, y = ypos, label = label),
inherit.aes = FALSE,
hjust = 0.05, vjust = 1.2, size = 2.5
) +
labs(x = "Review Count (log)", y = "Values",
title = "Scatterplot between logged review count & neighborhood characteristics") +
theme_bw(base_size = 8) +
theme(
plot.title = element_text(size = 9)
)
My findings: The results show no meaningful association between review count and total population or poverty rate, as both correlations are weak and either non-significant or only marginal. By contrast, review count is positively related to median household income and, more strongly, to the share of residents who self-identify as White. In particular, DeKalb County exhibits the clearest positive relationship for the latter, where neighborhoods with higher proportions of White residents tend to have coffee shops that attract more reviews.