Mini Assignment 4

Plot 2 — Income by rating, faceted by county

The income–rating gradient is not uniform across counties. Fulton and Cobb show higher medians and greater dispersion across ratings. Clayton stays comparatively low across all ratings. DeKalb and Gwinnett show moderate increases in income with rating.
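The county differences described above can be checked numerically with a per-county summary. The following is a minimal sketch, not part of the original assignment: `county_income_summary` is a hypothetical helper name, and it assumes a data frame with the `county` and `hhincome` columns used elsewhere in this report. It ignores rating, because the rating column name is not shown in this section.

```r
library(dplyr)

# Hypothetical helper (an illustration, not the author's code):
# median and interquartile range of household income within each county.
county_income_summary <- function(df) {
  df %>%
    group_by(county) %>%
    summarise(
      median_income = median(hhincome, na.rm = TRUE),  # central tendency
      iqr_income    = IQR(hhincome, na.rm = TRUE),     # dispersion
      .groups = "drop"
    ) %>%
    arrange(desc(median_income))                       # highest median first
}

# Usage, assuming `coffee` has been loaded as in the setup chunk:
# county_income_summary(coffee)
```

A larger `iqr_income` in a county would be consistent with the greater dispersion noted for Fulton and Cobb, although it pools all ratings together.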

p3 <- ggplot(coffee, aes(x = review_count_log, y = hhincome, color = pct_white)) +
  geom_point(size = 3, alpha = 0.85) +
  scale_color_gradientn(
    colours = c("#2c7fb8", "#f03b20"),
    values  = scales::rescale(c(0, 1)),
    name    = "Proportion of white residents"
  ) +
  facet_wrap(~ county, ncol = 3) +
  labs(x = "Review Count (log)", y = "Median Annual Household Income",
       title = "Review activity vs. neighborhood income") +
  theme_minimal(base_size = 13)
p3

Plot 3 — Review volume vs. income (color = % White)

Logged review count is positively related to income: higher-income tracts tend to have more reviews. Warmer colors (higher % White) are more common in the upper-income, high-review region, especially in Fulton and Cobb, though there are exceptions. The strength of the relationship varies by county, suggesting that local context matters.
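One way to put numbers on how the relationship varies by county is to compute the correlation separately within each county. This is a hedged sketch rather than the author's method: `county_cor` is a hypothetical helper, and it assumes the `county`, `review_count_log`, and `hhincome` columns that appear in the plotting code.

```r
library(dplyr)

# Hypothetical helper (an illustration, not the author's code):
# Pearson correlation between logged review count and household income,
# computed separately for each county.
county_cor <- function(df) {
  df %>%
    filter(!is.na(review_count_log), !is.na(hhincome)) %>%  # complete pairs only
    group_by(county) %>%
    summarise(
      n = n(),                               # observations used per county
      r = cor(review_count_log, hhincome),   # within-county Pearson r
      .groups = "drop"
    ) %>%
    arrange(desc(r))                         # strongest positive first
}

# Usage, assuming `coffee` has been loaded as in the setup chunk:
# county_cor(coffee)
```

Counties with few observations will give unstable estimates, so `n` is reported next to `r`.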

library(tidyverse)
library(scales)

coffee <- readr::read_csv("/home/rstudio/data/coffee.csv", show_col_types = FALSE) %>%
  select(!starts_with("...")) %>%     # remove unnamed column like ...1
  mutate(county = factor(county))

## New names:
## • `` -> `...1`

# Labels (names) -> variable columns (values)
vars <- c(
  "Median Annual Household Income ($)"       = "hhincome",
  "Residents Under Poverty Level (%; log)"   = "pct_pov_log",
  "Residents who self-identify as White (%)" = "pct_white",
  "Total Population"                         = "pop"
)

# Correct mapping: .x = column name in data, .y = facet label
df_long <- purrr::imap_dfr(
  vars,
  ~ coffee %>%
      transmute(
        review_count_log,
        value = .data[[.x]],   # <-- variable/column name
        facet = .y,            # <-- label for facet strip
        county
      )
)

# Per-facet correlation label (overall across counties)
ann <- df_long %>%
  group_by(facet) %>%
  summarise(
    r = cor(review_count_log, value, use = "complete.obs"),
    p = cor.test(review_count_log, value)$p.value,
    x = min(review_count_log, na.rm = TRUE),
    y = max(value, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(
    label = paste0("R = ", sprintf("%.2f", r), ",  p = ", scales::pvalue(p)),
    x = x + 0.02 * (max(df_long$review_count_log, na.rm=TRUE) -
                    min(df_long$review_count_log, na.rm=TRUE)),
    y = y - 0.05 * y
  )

ggplot(df_long, aes(review_count_log, value, color = county)) +
  geom_point(alpha = 0.75, size = 2.1) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 1) +
  facet_wrap(~ facet, scales = "free_y", ncol = 2) +
  geom_text(data = ann, aes(x = x, y = y, label = label),
            inherit.aes = FALSE, hjust = 0, vjust = 1,
            fontface = "italic", size = 4) +
  scale_color_manual(
    values = c("Clayton County"="#e76f51","Cobb County"="#6a994e",
               "DeKalb County"="#2a9d8f","Fulton County"="#1d77ff",
               "Gwinnett County"="#bc6ff1"),
    name = "County"
  ) +
  labs(
    title = "Scatterplot between logged review count & neighborhood characteristics",
    x = "Review Count (log)",
    y = "Values"
  ) +
  theme_minimal(base_size = 14) +
  theme(legend.position = "right")

## `geom_smooth()` using formula = 'y ~ x'

Plot 4 — Rating vs. review volume

Ratings show a weak positive trend with review volume: more heavily reviewed areas rate slightly higher. Larger points (more shops) cluster at moderate–high review counts, indicating denser retail areas. Price level varies across the cloud and does not show a clear monotonic link to rating.

Quan Duong

2025-10-12

Plot 1 — Household income by average rating
