Replicating the Plots

Import data

coffee0 <- read.csv("coffee.csv")
head(coffee0)

##   X       GEOID         county hhincome hhincome_log   pct_pov pct_pov_log
## 1 1 13063040202 Clayton County    47946     10.77804 0.1450162   -1.864226
## 2 2 13063040302 Clayton County    45791     10.73206 0.2070401   -1.527673
## 3 3 13063040308 Clayton County    34032     10.43535 0.1587766   -1.779179
## 4 4 13063040407 Clayton County    70364     11.16158 0.1517782   -1.821529
## 5 5 13063040410 Clayton County    52441     10.86763 0.1400612   -1.896712
## 6 6 13063040414 Clayton County    59519     10.99422 0.2449726   -1.366599
##    pct_white  pop review_count review_count_log avg_rating avg_price n
## 1 0.07664628 2779     57.00000         4.060443          2         1 1
## 2 0.16588302 5317     29.00000         3.401197          1         1 1
## 3 0.20111483 4485     13.00000         1.975622          3         1 2
## 4 0.18585944 4724     29.33333         3.320837          2         1 3
## 5 0.05908257 8175     41.00000         3.737670          1         1 1
## 6 0.18159659 1641     31.00000         3.465247          2         1 2

Re-create the four given plots

Plot 1

# prepare the dataset (remove NAs, classify ratings)
coffee <- coffee0 %>%
  filter(!is.na(hhincome), !is.na(avg_rating)) %>%
  mutate(rating_cat = factor(round(avg_rating), levels = 1:5))

# recreate the boxplot
ggplot(coffee, aes(x = rating_cat, y = hhincome)) +
  geom_boxplot(color = "black", fill = "white", outlier.shape = 16, outlier.size = 1) +
  labs(x = "avg_rating", y = "hhincome") +
  theme_gray(base_size = 10) +
  theme(panel.background = element_rect(fill = "gray95"))

My findings: The plot shows a clear upward trend from ratings 1 to 4: higher coffee shop ratings are generally associated with higher household incomes. However, the 5-rating category presents an exception. While some tracts with an average rating of 5 fall among the wealthiest communities, others are located in much lower-income areas, resulting in a wide spread for this group. In addition, several outliers across the 2- to 5-rating categories stand out at both the low and high ends of income, suggesting that certain neighborhoods deviate from the overall socioeconomic gradient.

Plot 2

# plot boxplots, faceted by county
ggplot(coffee, aes(x = rating_cat, y = hhincome)) +
  geom_boxplot(color = "black", fill = "white", outlier.shape = 16, outlier.size = 1) +
  labs(x = "Average Rating", y = "Median Annual Household Income ($)") +
  facet_wrap(~ county) +
  theme_gray(base_size = 10) +
  theme(panel.background = element_rect(fill = "gray95"))

My findings: Compared with the overall boxplot, the faceted version highlights how the relationship between coffee shop ratings and household income differs across counties.

Clayton County: Coffee shop ratings show little variation across income levels. Median household income remains relatively low and stable, regardless of average rating.
Cobb County: A moderate upward trend is visible: tracts with higher ratings (2–4) tend to have higher household incomes, though the 5-rating group drops slightly.
DeKalb County: Ratings 1–4 show a positive association with income, but the 5-rating group shifts downward, concentrated in lower-income tracts despite some outliers.
Fulton County: The strongest gradient: ratings increase from 1 to 4 alongside rising household income, but the 5-rating group again shows wide variation spanning both low and high incomes.
Gwinnett County: Patterns are less consistent: incomes rise slightly from ratings 1 to 3, but flatten or decline at 4–5, with more variability and several outliers.

Plot 3

# scatterplot faceted by county
ggplot(coffee, aes(x = review_count_log, y = hhincome, color = pct_white)) +
  geom_point(alpha = 0.7, size = 3) +
  scale_color_gradient(low = "blue", high = "red",
                       name = "Proportion of residents\nwho self-identified as white") +
  labs(x = "Review Count (log)", y = "Median Annual Household Income ($)",
       title = "Scatterplot: Review Count vs. Household Income") +
  facet_wrap(~ county) +
  theme_bw(base_size = 10)

My findings: Across all counties, review activity tends to rise with household income, but the strength of this relationship differs. Fulton and DeKalb show clearer socioeconomic and racial divides, as indicated by the shift from blue to red points. Clayton and Gwinnett remain more concentrated at lower income levels, while Cobb falls somewhere in between.

Clayton County: Coffee shops are clustered around lower household incomes with relatively few reviews.
Cobb County: There is a moderate spread: higher review counts are associated with middle-to-higher incomes.
DeKalb County: A wide range of incomes is observed, but review counts tend to cluster in the midrange. Many low-income tracts are shaded blue or purple, showing lower proportions of White residents, while higher-income tracts lean red.
Fulton County: The strongest variation appears here. Review counts and incomes span broadly, and both very wealthy and lower-income neighborhoods have high review activity. The color gradient highlights a racial divide: higher-income tracts are more often majority-White (red), while lower-income tracts lean more blue/purple.
Gwinnett County: Patterns are less pronounced, but review counts increase slightly with income.

Plot 4

# use pivot_longer
coffee_long <- coffee %>%
  select(county, review_count_log, hhincome, pct_pov_log, pct_white, pop) %>%
  pivot_longer(
    cols = c(hhincome, pct_pov_log, pct_white, pop),
    names_to = "variable",
    values_to = "value"
  )

# compute correlations with Pearson
corrs <- coffee_long %>%
  group_by(variable) %>%
  summarise(
    cor_test = list(cor.test(review_count_log, value, method = "pearson")),
    .groups = "drop"
  ) %>%
  mutate(tidy = map(cor_test, broom::tidy)) %>%
  unnest(tidy) %>%
  mutate(
    label = paste0("R = ", signif(estimate, 2),
                   ", p = ", signif(p.value, 2))
  ) %>%
  select(variable, label)

# get facet-wise y-max for annotation placement
y_max <- coffee_long %>%
  group_by(variable) %>%
  summarise(ypos = max(value, na.rm = TRUE), .groups = "drop")

corrs <- left_join(corrs, y_max, by = "variable") %>%
  mutate(xpos = min(coffee_long$review_count_log, na.rm = TRUE) + 0.1)

# labels for facets
var_labels <- c(
  hhincome = "Median Annual Household Income ($)",
  pct_pov_log = "Residents Under Poverty Level (%; log)",
  pct_white = "Residents who self-identify as White (%)",
  pop = "Total Population"
)

# plot
ggplot(coffee_long, aes(x = review_count_log, y = value, color = county)) +
  geom_point(size = 1) +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ variable, scales = "free_y", labeller = labeller(variable = var_labels)) +
  geom_text(
    data = corrs,
    aes(x = xpos, y = ypos, label = label),
    inherit.aes = FALSE,
    hjust = 0.05, vjust = 1.2, size = 2.5
  ) +
  labs(x = "Review Count (log)", y = "Values",
       title = "Scatterplot between logged review count & neighborhood characteristics") +
  theme_bw(base_size = 8) +
  theme(
    plot.title = element_text(size = 9)
  )

My findings: The results show no meaningful association between review count and total population or poverty rate, as both correlations are weak and either non-significant or only marginal. By contrast, review count is positively related to median household income and, more strongly, to the share of residents who self-identify as White. In particular, DeKalb County exhibits the clearest positive relationship for the latter, where neighborhoods with higher proportions of White residents tend to have coffee shops that attract more reviews.

Mini-Assignment 4

Yuhan Xu

2025-10-03