R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.4     ✔ tibble    3.3.0
## ✔ purrr     1.1.0     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidyr)
library(ggpmisc)
## Loading required package: ggpp
## Registered S3 methods overwritten by 'ggpp':
##   method                  from   
##   heightDetails.titleGrob ggplot2
##   widthDetails.titleGrob  ggplot2
## 
## Attaching package: 'ggpp'
## 
## The following object is masked from 'package:ggplot2':
## 
##     annotate

Instructions

Directions

  1. Download the data prepared for this assignment from here. The data is based on Census 2023 ACS 5-year estimates and Yelp POI data. It includes Census Tracts in Fulton, DeKalb, Cobb, Gwinnett, and Clayton Counties that have at least one coffee shop. The columns are:
  2. Using this dataset, re-create the four given plots as closely as possible. Be sure to include the code you used to generate each plot. When replicating, you do not need to exactly match the aesthetics. For example:
  3. For each plot, write a few sentences describing your findings.

Loading Data

coffee <- read.csv("coffee.csv")
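
Before plotting, it can help to confirm that the loaded table has the columns the later chunks rely on. The sketch below is a minimal check, assuming the column names used in the plotting code in this document. `check_columns()` is a hypothetical helper, and the small data frame is made-up stand-in data, not the assignment file.

```r
# Hypothetical helper: returns the names in `required` that are missing from `df`.
check_columns <- function(df, required) {
  setdiff(required, names(df))
}

# Column names taken from the plotting code in this document.
required_cols <- c("avg_rating", "hhincome", "county",
                   "review_count_log", "pct_white", "pct_pov", "pop")

# Made-up stand-in for read.csv("coffee.csv"); values and units are illustrative only.
toy <- data.frame(
  avg_rating = c(4, 5),
  hhincome = c(60000, 85000),
  county = c("County A", "County B"),
  review_count_log = c(3.2, 4.1),
  pct_white = c(0.45, 0.62),
  pct_pov = c(0.12, 0.08),
  pop = c(4200, 3900)
)

# Stop early with a readable message if any expected column is absent.
missing_cols <- check_columns(toy, required_cols)
if (length(missing_cols) > 0) {
  stop("Missing columns: ", paste(missing_cols, collapse = ", "))
}
```

With the real data, `toy` would be replaced by `coffee`.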

Plot 1

ggplot(coffee, aes(x = factor(avg_rating), y = hhincome, fill = factor(avg_rating))) +
  geom_boxplot(alpha = 0.85, width = 0.6, outlier.shape = 21, outlier.fill = "white") +
  scale_fill_brewer(palette = "YlOrBr") +
  theme_minimal(base_size = 14) +
  labs(
    title = "Household Income by Average Coffee Shop Rating",
    x = "Average Rating",
    y = "Household Income (USD)",
    fill = "Avg Rating",
    caption = "Source: 2023 ACS 5-year estimates and Yelp POI data"
  ) +
  theme(
    plot.title = element_text(face = "bold", size = 16, hjust = 0.5),
    plot.subtitle = element_text(size = 13, hjust = 0.5),
    legend.position = "none",
    panel.grid.minor = element_blank(),
    axis.title = element_text(face = "bold")
  )

The boxplot in Plot 1 shows household income by average coffee shop rating. The group with an average rating of 4 has the highest median household income, at about $105,000. The lowest-rated group has the lowest household income, below $5,000, while the group with a rating of 5 has a median of about $65,000. Overall, the group medians fall between about $48,000 and $105,000. The highest household incomes appear mostly in the group rated 2, with some also in the group rated 4.
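
The medians read off a boxplot can be cross-checked numerically. The sketch below assumes the `avg_rating` and `hhincome` columns used in the plot code and runs on made-up values, so the numbers it prints are not results from the assignment data.

```r
# Made-up tract-level data; values are illustrative only.
toy <- data.frame(
  avg_rating = c(3, 3, 3, 4, 4, 4, 5, 5, 5),
  hhincome = c(40000, 52000, 61000, 70000, 95000, 110000, 58000, 64000, 90000)
)

# Median household income within each rating group,
# i.e. the value drawn as the middle line of each box.
med_by_rating <- tapply(toy$hhincome, toy$avg_rating, median)
print(med_by_rating)
```

Running the same `tapply()` call on `coffee` would give the medians for the actual boxes.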

Plot 2

ggplot(coffee, aes(x = factor(avg_rating), y = hhincome, fill = factor(avg_rating))) +
  geom_boxplot(alpha = 0.85, width = 0.6, outlier.shape = 21, outlier.fill = "white") +
  scale_fill_brewer(palette = "YlOrBr") +
  facet_wrap(~county, scales = "free_y") +
  theme_minimal(base_size = 12) +
  labs(
    title = "Household Income by Average Coffee Shop Rating",
    x = "Average Rating",
    y = "Median Annual Household Income (USD)",
    fill = "Avg Rating",
    caption = "Source: 2023 ACS 5-year estimates and Yelp POI data"
  ) +
  theme(
    plot.title = element_text(face = "bold", size = 16, hjust = 0.5),
    plot.subtitle = element_text(size = 13, hjust = 0.5),
    legend.position = "none",
    panel.grid.minor = element_blank(),
    axis.title = element_text(face = "bold")
  )

Plot 2 shows household income by average coffee shop rating across the five selected counties. Overall, higher-rated coffee shops (ratings of 4 or 5) tend to be located in higher-income neighborhoods, such as those in Fulton and DeKalb Counties. Median household incomes are generally lower in Clayton County and higher in Cobb and Fulton Counties. This pattern suggests a positive association between neighborhood affluence and coffee shop ratings.
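
Because the panels in Plot 2 use `scales = "free_y"`, each county has its own y-axis, so comparing counties by eye takes some care. A small summary table can make the comparison explicit. The sketch below assumes the `county`, `avg_rating`, and `hhincome` columns from the plot code and uses made-up values and placeholder county labels.

```r
# Made-up data with placeholder county labels; values are illustrative only.
toy <- data.frame(
  county = c("A", "A", "A", "B", "B", "B"),
  avg_rating = c(4, 4, 5, 4, 4, 5),
  hhincome = c(50000, 70000, 90000, 30000, 40000, 45000)
)

# Median household income per county, on one common scale.
med_by_county <- aggregate(hhincome ~ county, data = toy, FUN = median)

# Median household income per county and rating group,
# matching the boxes drawn in each facet.
med_by_group <- aggregate(hhincome ~ county + avg_rating, data = toy, FUN = median)

print(med_by_county)
print(med_by_group)
```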

Plot 3

ggplot(data = coffee) +
  geom_point(mapping = aes(x = review_count_log, y = hhincome, color = pct_white),
             size = 5) +
  facet_wrap(~county, scales = "free_y") +
  labs(
    title = "Reviews by Household Income levels among White Residents across all counties",
    x = "Review count (log)",
    y = "Median Annual Household Income (USD)",
    color = "White Self-identifying",
    caption = "Source: 2023 ACS 5-year estimates and Yelp POI data"
  ) +
  scale_color_gradient(low = "darkblue", high = "red") +
  theme_bw() +
  # theme() must be joined with `+`; the legend is kept so the color scale stays readable
  theme(
    plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
    panel.grid.minor = element_blank(),
    axis.title = element_text(face = "bold")
  )

Plot 3 illustrates the relationship between the number of coffee shop reviews and median household income across the five counties, with the color gradient representing the proportion of White residents. In general, counties with higher median incomes, such as Fulton, Cobb, and DeKalb, tend to have more reviews and a higher proportion of White residents. Clayton County shows fewer reviews and lower income levels overall. The gradient pattern suggests a positive association between neighborhood affluence, racial composition, and local coffee shop activity.

Plot 4

# Reshape the data into long format
coffee_long <- coffee %>%
  pivot_longer(
    cols = c(hhincome, pct_pov, pct_white, pop),
    names_to = "variable",
    values_to = "value"
  )

# Calculate r and p values for each variable
cor_values <- coffee_long %>%
  group_by(variable) %>%
  summarise(
    r = cor(review_count_log, value, use = "complete.obs", method = "pearson"),
    p = cor.test(review_count_log, value, method = "pearson")$p.value
  ) %>%
  mutate(
    label = paste0("R = ", round(r, 3), "\n", "P = ", signif(p, 3))
  )

# Rename variables for facet labels
variable_labels <- c(
  "hhincome"  = "Median annual household income",
  "pct_pov"   = "Residents under poverty level (%)",
  "pct_white" = "Residents who self-identify as White (%)",
  "pop"       = "Total population"
)

# Merge correlation labels with plotting data
coffee_long <- coffee_long %>%
  left_join(cor_values, by = "variable")
# Create scatter plot with regression lines and facet-specific r & p

ggplot(data = coffee_long, aes(x = review_count_log, y = value, color = county)) +
  geom_point(alpha = 0.7, size = 2) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 0.8) +

  # Add facet-specific correlation text
  geom_text(
    data = cor_values,
    aes(x = -Inf, y = Inf, label = label),
    hjust = -0.1, vjust = 1.5,
    size = 4,
    inherit.aes = FALSE
  ) +

  facet_wrap(~ variable, scales = "free_y", ncol = 2, labeller = as_labeller(variable_labels)) +
  scale_color_brewer(palette = "Dark2") +

  labs(
    title = "Scatterplot between logged review count and neighborhood characteristics",
    x = "Review count (log)",
    y = "Value",
    color = "County",
    caption = "Source: 2023 ACS 5-year estimates and Yelp POI data"
  ) +

  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
    panel.grid.minor = element_blank(),
    legend.position = "bottom",
    axis.title = element_text(face = "bold"),
    strip.text = element_text(face = "bold", size = 11)
  )
## `geom_smooth()` using formula = 'y ~ x'

Plot 4 shows scatterplots of the relationship between logged coffee shop review counts and several neighborhood characteristics across five Atlanta-area counties. Each panel presents a different socioeconomic variable, with trend lines drawn by county. The results show a weak but statistically significant positive correlation between review count and median household income (R = 0.164, p < 0.001), suggesting that neighborhoods with higher incomes tend to have more coffee shop reviews. The proportion of residents who self-identify as White shows a somewhat stronger, though still weak, positive correlation (R = 0.236, p < 0.001), indicating that areas with a higher White population share tend to have more reviews. Conversely, the share of residents under the poverty level exhibits a weak negative correlation (R = -0.147, p = 0.002), implying that economically disadvantaged areas tend to have fewer reviews. Finally, the relationship between total population and review count is weak and not statistically significant (R = -0.031, p = 0.52), suggesting that population size alone is not associated with review activity.
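
The R and P values quoted above are Pearson correlations with their test p-values. As a minimal sketch of how such a pair is obtained, the example below computes Pearson's r from its definition and compares it with `cor.test()`. The vectors `x` and `y` are made-up stand-ins for `review_count_log` and one neighborhood variable, so the printed numbers are not results from the assignment data.

```r
# Made-up paired observations; values are illustrative only.
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2.1, 2.9, 4.2, 3.8, 5.5, 6.1, 6.8, 8.3)

# Pearson's r from its definition: the summed cross-products of the
# deviations, scaled by the spread of each variable.
dx <- x - mean(x)
dy <- y - mean(y)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# cor.test() reports the same coefficient together with a p-value.
ct <- cor.test(x, y, method = "pearson")
r <- unname(ct$estimate)
p <- ct$p.value

# Same label format as the facet annotations in Plot 4.
label <- paste0("R = ", round(r, 3), "\n", "P = ", signif(p, 3))
cat(label, "\n")
```

Taking both the estimate and the p-value from a single `cor.test()` call is one way to keep the two numbers based on the same set of observations.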