library(tidyverse); library(scales)
coffee <- read_csv("/home/rstudio/data/coffee.csv", show_col_types = FALSE) |>
  select(!starts_with("...")) |>
  mutate(
    county = factor(county),
    avg_rating = factor(avg_rating, levels = sort(unique(avg_rating)))
  )
## New names:
## • `` -> `...1`
p1 <- ggplot(coffee, aes(x = avg_rating, y = hhincome)) +
  geom_boxplot(fill = "white", color = "black", outlier.alpha = 0.8) +
  labs(x = "avg_rating", y = "hhincome") +
  theme_minimal(base_size = 14)
p1
Median income generally rises from 1→4 stars, with the widest spread at 4. The 5-star group dips slightly, possibly because few tracts have uniformly perfect ratings, which would make that median less stable. Overall, better-rated coffee areas tend to be wealthier, but the income distributions still overlap widely across ratings.
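The suggestion that the 5-star dip reflects a small group can be checked by tabulating group sizes next to the medians. A minimal sketch, assuming a small made-up data frame `toy` standing in for `coffee` (the values are invented for illustration and are not from the data):

```r
# Hypothetical stand-in for `coffee`; values are invented for illustration.
toy <- data.frame(
  avg_rating = factor(c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5)),
  hhincome   = c(30000, 42000, 46000, 50000, 55000, 61000,
                 58000, 72000, 90000, 120000, 68000)
)

# Group size, median, and IQR of income for each rating level.
rating_summary <- data.frame(
  avg_rating = levels(toy$avg_rating),
  n_tracts   = as.vector(table(toy$avg_rating)),
  median_inc = as.vector(tapply(toy$hhincome, toy$avg_rating, median)),
  iqr_inc    = as.vector(tapply(toy$hhincome, toy$avg_rating, IQR))
)
rating_summary
```

Swapping `toy` for `coffee` would show whether the 5-star group is small enough for its median to be unreliable, and whether the IQR really peaks at 4 stars.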
p2 <- ggplot(coffee, aes(x = avg_rating, y = hhincome)) +
  geom_boxplot(fill = "white", color = "black", outlier.alpha = 0.6) +
  facet_wrap(~ county, ncol = 3) +
  labs(x = "Average Rating", y = "Median Annual Household Income ($)",
       title = "Income–rating relationship varies by county") +
  theme_minimal(base_size = 13)
p2
The income–rating gradient is not uniform across counties. Fulton and Cobb show higher medians and greater dispersion across ratings. Clayton stays comparatively low across all ratings. DeKalb and Gwinnett show moderate increases in income with rating.
p3 <- ggplot(coffee, aes(x = review_count_log, y = hhincome, color = pct_white)) +
  geom_point(size = 3, alpha = 0.85) +
  scale_color_gradientn(
    colours = c("#2c7fb8", "#f03b20"),
    values = rescale(c(0, 1)),
    name = "Proportion of white residents"
  ) +
  facet_wrap(~ county, ncol = 3) +
  labs(x = "Review Count (log)", y = "Median Annual Household Income",
       title = "Review activity vs. neighborhood income") +
  theme_minimal(base_size = 13)
p3
## Plot 3: Review volume vs. income (color = % White)
Logged review count is positively related to income; more reviews appear in higher-income tracts. Warmer colors (higher % White) are more common in the upper-income/high-review space, especially in Fulton/Cobb, though there are exceptions. Relationship strength varies by county, suggesting local context matters.
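One way to put numbers on "relationship strength varies by county" is a per-county correlation between logged review count and income. A minimal sketch, assuming a made-up data frame `toy` with hypothetical county names and invented values in place of `coffee`:

```r
# Hypothetical stand-in for `coffee`; counties and values are invented.
toy <- data.frame(
  county           = rep(c("A County", "B County"), each = 5),
  review_count_log = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
  hhincome         = c(40000, 50000, 60000, 70000, 80000,
                       52000, 48000, 55000, 50000, 53000)
)

# Pearson correlation between logged reviews and income, one value per county.
county_r <- sapply(
  split(toy, toy$county),
  function(d) cor(d$review_count_log, d$hhincome, use = "complete.obs")
)
round(county_r, 2)
```

Run on the real data, a wide range in these coefficients would support the reading that the relationship depends on local context. Similar values across counties would suggest the facets differ mainly in income level.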
library(tidyverse)
library(scales)
coffee <- readr::read_csv("/home/rstudio/data/coffee.csv", show_col_types = FALSE) %>%
  select(!starts_with("...")) %>% # remove unnamed column like ...1
  mutate(county = factor(county))
## New names:
## • `` -> `...1`
# Labels (names) -> variable columns (values)
vars <- c(
  "Median Annual Household Income ($)" = "hhincome",
  "Residents Under Poverty Level (%; log)" = "pct_pov_log",
  "Residents who self-identify as White (%)" = "pct_white",
  "Total Population" = "pop"
)
# Correct mapping: .x = column name in data, .y = facet label
df_long <- purrr::imap_dfr(
  vars,
  ~ coffee %>%
    transmute(
      review_count_log,
      value = .data[[.x]], # <-- variable/column name
      facet = .y,          # <-- label for facet strip
      county
    )
)
# Per-facet correlation label (overall across counties)
ann <- df_long %>%
  group_by(facet) %>%
  summarise(
    r = cor(review_count_log, value, use = "complete.obs"),
    p = cor.test(review_count_log, value)$p.value,
    x = min(review_count_log, na.rm = TRUE),
    y = max(value, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(
    label = paste0("R = ", sprintf("%.2f", r), ", p = ", scales::pvalue(p)),
    x = x + 0.02 * (max(df_long$review_count_log, na.rm = TRUE) -
                      min(df_long$review_count_log, na.rm = TRUE)),
    y = y - 0.05 * y
  )
ggplot(df_long, aes(review_count_log, value, color = county)) +
  geom_point(alpha = 0.75, size = 2.1) +
  # `linewidth` replaces `size` for lines as of ggplot2 3.4.0
  geom_smooth(method = "lm", se = FALSE, linewidth = 1) +
  facet_wrap(~ facet, scales = "free_y", ncol = 2) +
  geom_text(data = ann, aes(x = x, y = y, label = label),
            inherit.aes = FALSE, hjust = 0, vjust = 1,
            fontface = "italic", size = 4) +
  scale_color_manual(
    values = c("Clayton County" = "#e76f51", "Cobb County" = "#6a994e",
               "DeKalb County" = "#2a9d8f", "Fulton County" = "#1d77ff",
               "Gwinnett County" = "#bc6ff1"),
    name = "County"
  ) +
  labs(
    title = "Scatterplot between logged review count & neighborhood characteristics",
    x = "Review Count (log)",
    y = "Values"
  ) +
  theme_minimal(base_size = 14) +
  theme(legend.position = "right")
## `geom_smooth()` using formula = 'y ~ x'
## Plot 4: Rating vs. review volume
Ratings show a weak positive trend with review volume: areas with more reviews rate slightly higher. Larger points (more shops) cluster at moderate-to-high review counts, indicating denser retail areas. Price level varies across the cloud and shows no clear monotonic link to rating.
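Because star ratings are ordinal, a rank correlation is one reasonable way to quantify a "weak positive trend". A minimal sketch, assuming a made-up data frame `toy` with invented values (the choice of Spearman's rho is an assumption here, not something the analysis above specifies):

```r
# Hypothetical stand-in for `coffee`; values are invented for illustration.
toy <- data.frame(
  review_count_log = c(1.2, 2.0, 2.5, 3.1, 3.8, 4.4, 5.0, 5.6),
  avg_rating       = c(3, 2, 3, 4, 3, 4, 4, 5)
)

# Spearman rank correlation; exact = FALSE avoids the tie warning that
# repeated rating values would otherwise trigger.
rho_test <- cor.test(toy$review_count_log, toy$avg_rating,
                     method = "spearman", exact = FALSE)
unname(rho_test$estimate)  # rank correlation coefficient
rho_test$p.value           # approximate p-value
```

If `avg_rating` has been converted to a factor, as in the first chunk, it would need converting back with `as.numeric(as.character(avg_rating))` before this call.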