coffee <- read.csv("https://ujhwang.github.io/urban-analytics-2024/Assignment/mini_4/coffee.csv")
library(ggplot2)

ggplot(data= coffee) +
  geom_boxplot(aes(x = as.factor(avg_rating), y = hhincome),
               color="black",fill="white") +
  labs(title = "Plot 1",
       x = "avg_rating",
       y = "hhincome")

Coffee shops with lower average ratings tend to be located in areas with lower median annual household incomes. Shops with an average rating of 1 are predominantly found in lower-income areas, whereas shops with average ratings between 2 and 4 tend to be located in areas with above-average median annual household incomes.
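The reading of the boxplot above could be checked numerically. The following is a minimal sketch in base R, assuming a hypothetical helper `income_by_rating()` (not part of the original analysis) and a small toy data frame whose values are made up purely for illustration:

```r
# Hypothetical helper: median household income and shop count per rating level.
income_by_rating <- function(df) {
  # Keep only rows where both columns are observed
  ok <- !is.na(df$avg_rating) & !is.na(df$hhincome)
  df <- df[ok, ]
  med <- tapply(df$hhincome, df$avg_rating, median)
  n <- tapply(df$hhincome, df$avg_rating, length)
  data.frame(avg_rating = as.numeric(names(med)),
             n_shops = as.vector(n),
             median_hhincome = as.vector(med))
}

# Toy values, NOT taken from coffee.csv
toy <- data.frame(avg_rating = c(1, 1, 3, 3, 3, 5),
                  hhincome = c(30000, 40000, 60000, 80000, 70000, 90000))
result <- income_by_rating(toy)
print(result)
```

Calling `income_by_rating(coffee)` would return the same table for the actual data. The shop count matters here because a rating level with only a handful of shops gives an unstable median.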

ggplot(data= coffee) +
  geom_boxplot(aes(x = as.factor(avg_rating), y = hhincome),
               color="black",fill="white") +
  facet_wrap(~county) +
  labs(title = "Plot 2",
       x = "avg_rating",
       y = "hhincome")

In Fulton County, the range of median annual household income across the different average ratings is much wider than in the other counties. This suggests that coffee shops in Fulton County are located in areas with a more diverse range of median annual household incomes. In contrast, Clayton County shows the narrowest range of median annual household incomes.
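One way to quantify how wide each county's income range is, rather than judging it from the facets, is to tabulate the spread per county. This is a sketch only: `income_spread_by_county()` is a hypothetical helper, and the toy counties "A" and "B" and their values are invented for illustration.

```r
# Hypothetical helper: spread of household income within each county.
income_spread_by_county <- function(df) {
  df <- df[!is.na(df$hhincome) & !is.na(df$county), ]
  # as.character() avoids empty groups from unused factor levels
  groups <- split(df$hhincome, as.character(df$county))
  lo <- vapply(groups, min, numeric(1))
  hi <- vapply(groups, max, numeric(1))
  data.frame(county = names(groups),
             min_hhincome = lo,
             max_hhincome = hi,
             range_hhincome = hi - lo,
             iqr_hhincome = vapply(groups, IQR, numeric(1)),
             row.names = NULL)
}

# Toy values, NOT taken from coffee.csv
toy <- data.frame(county = c("A", "A", "A", "B", "B", "B"),
                  hhincome = c(20000, 50000, 110000, 40000, 45000, 50000))
result <- income_spread_by_county(toy)
print(result)
```

Applied to the real data as `income_spread_by_county(coffee)`, the county with the largest `range_hhincome` or `iqr_hhincome` would be the one with the most varied incomes. The interquartile range is included because the full range is sensitive to a single extreme tract.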

ggplot(data = coffee) +
  geom_point(mapping = aes(x = review_count_log, y = hhincome, color = pct_white)) +
  facet_wrap(~county) +
  labs(title = "Plot 3",
       x = "Review Count (log)",
       y = "Median Annual Household Income") +
  theme_bw()

In Clayton County, coffee shops tend to have a lower review count compared to other counties, and the areas where these shops are located generally have a lower proportion of residents who self-identify as white. In contrast, Fulton County’s coffee shops are situated in areas with a more diverse range of median household incomes, and the proportion of residents who self-identify as white also shows greater variability. Overall, in Cobb County and Gwinnett County, coffee shops are more likely to be located in areas with a higher proportion of white residents.
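The county comparisons above could be backed by per-county medians of the plotted variables. The sketch below assumes a hypothetical helper `county_medians()` built on base R's `aggregate()`, and the toy data are invented for illustration.

```r
# Hypothetical helper: per-county medians of the selected columns.
county_medians <- function(df, vars = c("review_count_log", "pct_white")) {
  aggregate(df[vars], by = list(county = df$county),
            FUN = median, na.rm = TRUE)
}

# Toy values, NOT taken from coffee.csv
toy <- data.frame(county = c("A", "A", "A", "B", "B", "B"),
                  review_count_log = c(1, 2, 3, 4, 5, 6),
                  pct_white = c(0.1, 0.2, 0.3, 0.6, 0.7, 0.8))
result <- county_medians(toy)
print(result)
```

With the real data, `county_medians(coffee)` would show whether Clayton County's median `review_count_log` and `pct_white` are in fact lower than those of the other counties.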

library(tidyr)
library(dplyr)  # group_by() and summarise() used below come from dplyr
coffee_long <- coffee %>%
  pivot_longer(cols = c(hhincome, pct_pov_log, pct_white, pop),
               names_to = "variable", values_to = "Values")

correlation_results <- coffee_long %>%
  group_by(variable) %>%
  # use = "complete.obs" drops incomplete pairs, as cor.test() already does
  summarise(correlation = cor(review_count_log, Values, use = "complete.obs"),
            p_value = cor.test(review_count_log, Values)$p.value)

ggplot(coffee_long, aes(x = review_count_log, y = Values, color = county)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ variable, scales = "free_y") +
  labs(title = "Scatterplot between logged review count & neighborhood characteristics",
       subtitle = "Using Yelp data in Five Counties Around Atlanta, GA",
       x = "Review Count Logged",
       y = "Values",
       color = "County") +
  theme_minimal()+
  geom_text(data = correlation_results,
            aes(x = Inf, y = Inf,
                # paste0() avoids the extra spaces that paste() inserts around each piece
                label = paste0("R = ", round(correlation, 2),
                               ", p = ", format.pval(p_value, digits = 2))),
            hjust = 1.1, vjust = 1.5, inherit.aes = FALSE, size = 2)
## `geom_smooth()` using formula = 'y ~ x'

Overall, the values appear relatively flat across the logged review counts, which suggests that review count is only weakly associated with most of the neighborhood characteristics. Fulton County is the exception: as review counts increase, the proportion of white residents and the median annual household income tend to rise, while the (logged) percentage of residents living in poverty falls.
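The correlations annotated on the plot are pooled across all counties, so they cannot confirm a pattern specific to Fulton County. A per-county breakdown might look like the sketch below. `cor_by_county()` is a hypothetical helper, the toy data are invented for illustration, and `cor.test()` needs at least three complete observations per county.

```r
# Hypothetical helper: Pearson correlation of x with each variable, by county.
cor_by_county <- function(df, x = "review_count_log",
                          vars = c("hhincome", "pct_pov_log", "pct_white", "pop")) {
  rows <- list()
  for (cty in sort(unique(as.character(df$county)))) {
    # which() drops rows whose county is NA
    sub <- df[which(df$county == cty), ]
    for (v in vars) {
      test <- cor.test(sub[[x]], sub[[v]])  # drops incomplete pairs itself
      rows[[length(rows) + 1]] <- data.frame(county = cty,
                                             variable = v,
                                             correlation = unname(test$estimate),
                                             p_value = test$p.value)
    }
  }
  do.call(rbind, rows)
}

# Toy values, NOT taken from coffee.csv
toy <- data.frame(county = rep(c("A", "B"), each = 5),
                  review_count_log = rep(1:5, times = 2),
                  hhincome = c(10, 20, 25, 40, 50, 50, 40, 45, 20, 10))
result <- cor_by_county(toy, vars = "hhincome")
print(result)
```

Running `cor_by_county(coffee)` on the real data would give one row per county and variable, which makes it possible to see whether the Fulton County correlations differ in sign or size from those of the other counties.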