R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

library(tidyverse)  # assumed setup: supplies %>%, dplyr, tidyr, and ggplot2 used below
coffee <- read.csv("coffee.csv")

Including Plots

You can also embed plots, for example:

Median household income tends to rise with average rating, peaking at a rating of 4 and dipping at 5. There are a few high-income outliers at ratings 2, 3, 4, and 5, with rating 2 having the most.

coffee2 <- coffee %>%
  mutate(
    rating_group = round(avg_rating) 
  ) %>%
  filter(!is.na(rating_group), !is.na(hhincome), !is.na(county))

ggplot(coffee2, aes(x = factor(rating_group), y = hhincome, fill = factor(rating_group))) +
  geom_boxplot(alpha = 0.75, outlier.shape = 16, outlier.alpha = 0.5) +
  stat_summary(fun = mean, geom = "point", size = 1.8, color = "black", position = position_dodge(width = 0.75)) +
  labs(
    title = "Median Household Income by Average Rating by County",
    x = "Average Coffee Shop Rating",
    y = "Median Household Income $"
  ) +
  scale_fill_brewer(palette = "Set3") +
  theme_minimal() + 
  facet_wrap(~ county, ncol = 3, scales = "fixed") +
  theme(legend.position = "none")
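One detail of the binning step is worth checking. Base R's `round()` follows the IEC 60559 convention of sending exact halves to the nearest even integer, so an `avg_rating` of exactly 2.5 falls in group 2 while 3.5 and 4.5 both fall in group 4. A minimal sketch with made-up ratings (the `ratings` vector is hypothetical and not taken from `coffee.csv`):

```r
# Hypothetical ratings chosen to sit on or near bin boundaries
ratings <- c(2.5, 3.5, 4.5, 4.4, 4.6)

# round() sends exact halves to the nearest even integer
rating_group <- round(ratings)

# An alternative that always rounds halves up, shown only for comparison
rating_group_half_up <- floor(ratings + 0.5)

print(data.frame(ratings, rating_group, rating_group_half_up))
```

Whether this matters for the plot depends on how `avg_rating` is recorded. If ratings come in steps of 0.5, the two rules would place some shops in different boxes.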

Across the counties, higher neighborhood incomes align with higher coffee shop ratings. In most counties, income peaks at a rating of 4. Fulton County has the highest median incomes across the rating groups. Clayton County has the lowest incomes and shows less variation than the other counties.
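These readings can be checked numerically by tabulating the median income in each county and rating group. The following is a minimal base-R sketch that uses a small made-up data frame (`toy`, with placeholder counties "A" and "B") in place of `coffee2`. The values are illustrative only.

```r
# Made-up stand-in for coffee2; values are illustrative, not real data
toy <- data.frame(
  county       = c("A", "A", "A", "A", "B", "B", "B", "B"),
  rating_group = c(3, 3, 4, 4, 3, 3, 4, 4),
  hhincome     = c(40000, 50000, 70000, 90000, 30000, 34000, 36000, 44000)
)

# Median income for every county x rating-group cell
cell_medians <- aggregate(hhincome ~ county + rating_group,
                          data = toy, FUN = median)
print(cell_medians)

# Rating group with the highest median income in each county
peak <- sapply(
  split(cell_medians, cell_medians$county),
  function(d) d$rating_group[which.max(d$hhincome)]
)
print(peak)
```

Running the same two steps on `coffee2` would show which rating group actually carries the highest median in each county, rather than relying on the boxplots alone.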


ggplot(coffee, aes(x = review_count_log, y = hhincome, color = pct_white)) +
  geom_point(alpha = 0.7, size = 3) +
  facet_wrap(~ county, ncol = 3, scales = "fixed") +
  coord_cartesian(ylim = c(0, 250000)) +
  scale_color_gradientn(colors = c("blue", "red")) +
  labs(
    title = "Scatterplot: Review Count vs. Household Income",
    x = "Review Count (log)",
    y = "Median Annual Household Income",
    color = "Proportion of residents\nwho self-identified as white"
  ) +
  theme_minimal()
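How strongly review count and income move together within each county can also be quantified with a per-county correlation, which is harder to judge by eye from a faceted scatterplot. The following is a minimal base-R sketch on made-up data (`toy` and its county labels are hypothetical stand-ins for `coffee`):

```r
# Made-up stand-in for coffee; values are illustrative, not real data
toy <- data.frame(
  county           = rep(c("A", "B"), each = 5),
  review_count_log = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
  hhincome         = c(40, 50, 60, 70, 80, 60, 40, 70, 50, 65) * 1000
)

# Pearson correlation between log review count and income, per county
r_by_county <- sapply(
  split(toy, toy$county),
  function(d) cor(d$review_count_log, d$hhincome, use = "complete.obs")
)
print(round(r_by_county, 2))
```

A value near zero for a county would support calling its relationship weak. A value near 1 would indicate a tight positive trend.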

Clayton County has the lowest income band overall and also fewer reviews. Cobb County has mid-to-high incomes; mid-range log review counts coincide with higher incomes and a larger proportion of residents who identify as white. DeKalb County has a wide spread in income, with higher incomes having more reviews. Fulton County shows the most dispersion, and its relationship between reviews and income is weak. Gwinnett County has mid-range incomes. Overall, DeKalb Coun

metric_levels <- c(
  "hhincome"    = "Median Annual Household Income ($)",
  "pct_pov_log" = "Residents Under Poverty Level (%; log)",
  "pct_white"   = "Residents who self-identify as White (%)",
  "pop"         = "Total Population"
)

coffee_long <- coffee %>%
  select(county, review_count_log, hhincome, pct_pov_log, pct_white, pop) %>%
  pivot_longer(
    cols = c(hhincome, pct_pov_log, pct_white, pop),
    names_to = "metric",
    values_to = "value"
  ) %>%
  mutate(metric_label = factor(recode(metric, !!!metric_levels),
                               levels = unname(metric_levels)))

panel_ranges <- coffee_long %>%
  group_by(metric_label) %>%
  summarise(
    x_min = min(review_count_log, na.rm = TRUE),
    x_max = max(review_count_log, na.rm = TRUE),
    y_min = min(value, na.rm = TRUE),
    y_max = max(value, na.rm = TRUE),
    .groups = "drop"
  )

panel_stats <- coffee_long %>%
  group_by(metric_label) %>%
  summarise(
    r = cor(review_count_log, value, use = "complete.obs"),
    p = cor.test(review_count_log, value)$p.value,
    .groups = "drop"
  ) %>%
  left_join(panel_ranges, by = "metric_label") %>%
  mutate(
    label = paste0("R = ", round(r, 2), ", p = ", signif(p, 2)),
    x = x_min + 0.05 * (x_max - x_min),
    y = y_max - 0.07 * (y_max - y_min)
  )

ggplot(coffee_long, aes(x = review_count_log, y = value, color = county)) +
  geom_point(alpha = 0.9, size = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ metric_label, ncol = 2, scales = "free_y") +
  labs(
    title = "Scatterplot between logged review count & neighborhood characteristics",
    x = "Review Count (log)",
    y = "Values",
    color = "County"
  ) +
  geom_text(
    data = panel_stats,
    aes(x = x, y = y, label = label),
    inherit.aes = FALSE,
    hjust = 0, vjust = 1, size = 3.8
  ) + 
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

Median household income is positively associated with logged review count. The fitted line for DeKalb County has a distinctly positive slope. For residents under the poverty level, the relationship is negative: neighborhoods with more reviews tend to have lower poverty. The share of residents who self-identify as white has the strongest positive association with review count. For total population, there’s basically no association.
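One caveat when reading the figure: the `R` and `p` annotations in `panel_stats` are computed per metric with all counties pooled, while `geom_smooth()` draws a separate line per county because `color = county` is mapped in the global aesthetics. The pooled statistic and the per-county slopes need not agree. The following is a minimal base-R sketch with made-up data (`toy` is hypothetical), deliberately constructed so that the two disagree:

```r
# Made-up data: within each county the trend is upward, but county B
# sits at higher review counts and lower values than county A
toy <- data.frame(
  county           = rep(c("A", "B"), each = 4),
  review_count_log = c(1, 2, 3, 4, 5, 6, 7, 8),
  value            = c(10, 12, 14, 16, 4, 6, 8, 10)
)

# Pooled correlation, analogous to the annotation in each panel
pooled_r <- cor(toy$review_count_log, toy$value)

# Per-county least-squares slopes, analogous to the coloured lines
slopes <- sapply(
  split(toy, toy$county),
  function(d) unname(coef(lm(value ~ review_count_log, data = d))[2])
)

print(pooled_r)
print(slopes)
```

In this toy case both county slopes are positive while the pooled correlation is negative. Whether anything similar happens in the coffee data would need to be checked by computing both quantities on `coffee_long`.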