1. Load Data

1.1 Import CSV containing Census 2023 ACS 5-year Estimates and Yelp POI data

library(tidyverse)  # loads ggplot2, dplyr, and tidyr, which the plots and pivot rely on
coffee <- read.csv("coffee.csv")

2. Plot 1

ggplot(data = coffee) +
  #plot boxplot
  geom_boxplot(aes(x = avg_rating, y = hhincome, group = avg_rating),
               fill = "white", color = "black") +
  #remove minor breaks
  scale_x_continuous(breaks = seq(1, 5, by = 1), minor_breaks = NULL)

Findings

    Across average ratings 1 to 5, coffee shops with an average rating of 4 are located in census tracts with the highest median of median annual household income. This suggests a general pattern in which coffee shops with higher average ratings tend to be located in census tracts with higher median household incomes. Coffee shops with an average rating of 1 are located in tracts with the lowest median of median annual household income, and this group also has the smallest range of income values. Also notable is the large number of outliers in median annual household income among coffee shops with an average rating of 2.
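
The boxplot reading above can be cross-checked numerically. The following is a minimal sketch rather than part of the original analysis: `summarise_by_rating()` is a hypothetical helper, and it assumes `avg_rating` and `hhincome` are numeric columns of `coffee`, as they are used in the plot. For each rating it reports the median, the range, and the number of points outside the 1.5 × IQR fences, which mirrors the default outlier rule of `geom_boxplot()`.

```r
# Hypothetical helper (not from the original notebook): per-rating summary
# of tract median household income.
summarise_by_rating <- function(df) {
  # Drop rows with a missing rating or income before grouping
  df <- df[!is.na(df$avg_rating) & !is.na(df$hhincome), ]
  groups <- split(df$hhincome, df$avg_rating)

  rows <- lapply(names(groups), function(r) {
    y <- groups[[r]]
    # Quartiles and median of income within this rating group
    q <- quantile(y, c(0.25, 0.5, 0.75), names = FALSE)
    iqr <- q[3] - q[1]
    data.frame(
      avg_rating    = as.numeric(r),
      n             = length(y),
      median_income = q[2],
      income_range  = max(y) - min(y),
      # Points beyond 1.5 * IQR from the quartiles
      n_outliers    = sum(y < q[1] - 1.5 * iqr | y > q[3] + 1.5 * iqr)
    )
  })
  do.call(rbind, rows)
}

# Example call, assuming `coffee` has been loaded as in section 1:
# summarise_by_rating(coffee)
```

Comparing `median_income`, `income_range`, and `n_outliers` across rows gives a direct check on which rating group has the highest median, the narrowest range, and the most outliers.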

3. Plot 2

ggplot(data = coffee) +
  #plot boxplot
  geom_boxplot(aes(x = avg_rating, y = hhincome, group = avg_rating),
               fill = "white", color = "black") +
  #remove minor breaks
  scale_x_continuous(breaks = seq(1, 5, by = 1), minor_breaks = NULL) +
  #create separate boxplots for each county
  facet_wrap(~county) +
  #add x- and y-axis labels
  labs(x = "Average Rating",
       y = "Median Annual Household Income ($)")

Findings

    Splitting the boxplots by county reveals conditions specific to each county. Among all five counties, coffee shops in Clayton County are generally located in tracts with low medians and narrow ranges of median household income across all average ratings, followed by Gwinnett County. On the other hand, Cobb and Fulton counties generally follow the pattern that coffee shops with higher average ratings are located in tracts with higher median household income.
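
A county-by-rating table of medians makes the same comparison without reading it off the facets. This is only an illustrative sketch: `median_income_table()` is a hypothetical helper, and it assumes the `county`, `avg_rating`, and `hhincome` columns used in the plot.

```r
# Hypothetical helper (not from the original notebook): median tract income
# for each county x rating cell.
median_income_table <- function(df) {
  # tapply returns a matrix with counties as rows and ratings as columns;
  # cells with no coffee shops are NA.
  tapply(df$hhincome,
         list(county = df$county, avg_rating = df$avg_rating),
         median, na.rm = TRUE)
}

# Example call, assuming `coffee` has been loaded as in section 1:
# median_income_table(coffee)
```

Reading along a row shows whether the median rises with rating inside one county; reading down a column compares counties at the same rating.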

4. Plot 3

ggplot(data = coffee) +
  #plot scatterplot
  geom_point(mapping = aes(x=review_count_log, y = hhincome, 
                           color=pct_white), alpha = 0.5, size=3.4) + 
  #create color gradient
  scale_color_gradient(low = "#1F00F3", high = "#F6010D") +
  #create separate scatter plots for each county
  facet_wrap(~county) +
  #add x- and y-axis labels and change legend title
  labs(x = "Review Count (log)",
       y = "Median Annual Household Income ($)",
       color = "Proportion of residents\nwho self-identified as white") +
  #add title
  ggtitle("Scatterplot: Review Count vs. Household Income") +
  #apply the complete theme first so it does not reset the legend settings
  theme_bw() +
  #set font size of legend title and text
  theme(
    legend.title = element_text(size = 10),
    legend.text  = element_text(size = 8)
  )

Findings

    Extending the conclusions from the boxplots, the Clayton County points cluster tightly at low median household income values. Clayton County census tracts also generally have a lower proportion of white residents. Across the other four counties, census tracts with a higher percentage of residents who self-identify as white tend to coincide with census tracts with higher median household income. In DeKalb County specifically, census tracts with a higher proportion of white residents coincide not only with higher median household income but also with higher (logged) review counts for coffee shops.
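
The visual association between the color gradient and the y-axis can be quantified per county. The sketch below is an assumption-laden illustration, not the author's method: `cor_by_county()` is a hypothetical helper, and it treats every row of `coffee` as one observation, so a tract that hosts several coffee shops would be counted once per shop.

```r
# Hypothetical helper (not from the original notebook): Pearson correlation
# between two columns, computed separately within each county.
cor_by_county <- function(df, x = "pct_white", y = "hhincome") {
  sapply(split(df, df$county), function(d) {
    # complete.obs drops rows where either variable is missing
    cor(d[[x]], d[[y]], use = "complete.obs")
  })
}

# Example calls, assuming `coffee` has been loaded as in section 1:
# cor_by_county(coffee)                                   # pct_white vs. hhincome
# cor_by_county(coffee, "pct_white", "review_count_log")  # pct_white vs. review count
```

The result is a named numeric vector with one coefficient per county, which makes it easy to see whether any county departs from the pattern described above.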

5. Plot 4

#Pivot longer
coffee_long <- coffee %>%
  select(county, review_count_log, hhincome, pct_white, pct_pov_log, pop) %>%
  pivot_longer(cols = c(hhincome, pct_white, pct_pov_log, pop),
               names_to = "variable", values_to = "value")

ggplot(data = coffee_long) +
  #plot scatterplot
  geom_point(mapping = aes(x=review_count_log, y = value, 
                           color=county), size=1.5) + 
  #create separate scatter plots for each variable
  facet_wrap(~variable, scales = "free_y",
             labeller = as_labeller(c(
               "hhincome" = "Median Annual Household Income ($)",
               "pct_white" = "Residents who self-identify as White (%)",
               "pct_pov_log" = "Residents Under Poverty Level (%; log)",
               "pop" = "Total Population"))
             ) +
  #create regression line for each county within each variable
  geom_smooth(mapping = aes(x=review_count_log, y = value, color=county), method = "lm", se = FALSE) +
  #add x- and y-axis labels and change legend title
  labs(x = "Review Count (log)",
       y = "Values",
       color = "County") +
  #annotate correlation and p value for each variable's scatterplot
  ggpubr::stat_cor(mapping = aes(x=review_count_log, y = value), 
                 method = "pearson", size=3.5) +
  #add title
  ggtitle("Scatterplot between logged review count & neighborhood characteristics") + 
  theme_bw()

Findings

    Among all variables, the proportion of white residents has the strongest linear relationship with logged review count, followed by median household income. In contrast, the percentage of residents under the poverty level and total population are both negatively related to logged review count: the negative relationship is stronger for the poverty percentage and very weak for total population.
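
The coefficients annotated on each panel can also be collected in a single table for side-by-side comparison. This is a minimal sketch under stated assumptions: `cor_with_reviews()` is a hypothetical helper, it pools all counties as the panel annotations do, and it uses base R's `cor.test()`, whose output should correspond to the annotated Pearson r and p-value.

```r
# Hypothetical helper (not from the original notebook): Pearson r and p-value
# of logged review count against each neighborhood variable, pooled across
# counties.
cor_with_reviews <- function(df,
                             vars = c("hhincome", "pct_white",
                                      "pct_pov_log", "pop")) {
  rows <- lapply(vars, function(v) {
    test <- cor.test(df$review_count_log, df[[v]], method = "pearson")
    data.frame(variable = v,
               r        = unname(test$estimate),
               p_value  = test$p.value)
  })
  do.call(rbind, rows)
}

# Example call, assuming `coffee` has been loaded as in section 1:
# cor_with_reviews(coffee)
```

Sorting the result by the absolute value of `r` ranks the variables by the strength of their linear relationship, independent of sign.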