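library(tidyverse)  # provides %>%, select(), pivot_longer(), ggplot()
library(ggpubr)     # provides stat_cor(), used in Plot 4
coffee <- read.csv("/Users/helenalindsay/Documents/Fall_23/CP8883/Mini4/coffee.csv") %>%
  select(-X)  # drop the row-index column written by write.csv()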

Plot 1

ggplot(data = coffee) +
  geom_boxplot(aes(x=factor(avg_rating), y=hhincome), 
               fill = "white", color = "black")+
  labs(x = "Average Yelp Rating", y = "Median Annual Household Income ($)")

Plot 1 suggests that household income tends to be higher where the average Yelp rating is 3 or 4 than where it is lower or higher.
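One way to check this reading of the boxplots is to tabulate the median income at each rating level. The sketch below uses a small toy data frame with made-up values (only the column names `avg_rating` and `hhincome` come from `coffee`), and it uses base R so it runs without extra packages; swapping `toy` for `coffee` should give the real table.

```r
# Toy data with made-up values; only the column names mirror `coffee`.
toy <- data.frame(
  avg_rating = c(2, 2, 3, 3, 4, 4, 5, 5),
  hhincome   = c(40000, 50000, 70000, 90000, 80000, 100000, 55000, 65000)
)

# Median income and number of observations for each rating level.
med <- aggregate(hhincome ~ avg_rating, data = toy, FUN = median)
cnt <- aggregate(hhincome ~ avg_rating, data = toy, FUN = length)

income_by_rating <- data.frame(
  avg_rating    = med$avg_rating,
  median_income = med$hhincome,
  n             = cnt$hhincome
)
income_by_rating
```

Reporting the group size next to each median helps flag rating levels that rest on only a handful of observations.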

Plot 2

ggplot(data = coffee) +
  geom_boxplot(aes(x=factor(avg_rating), y=hhincome), 
               fill = "white", color = "black")+
  labs(x = "Average Yelp Rating", y = "Median Annual Household Income ($)") +
  facet_wrap(~county)

Plot 2 indicates that the pattern seen in Plot 1 (higher household income is associated with average Yelp ratings of 3 and 4) holds across counties.

Plot 3

ggplot(data = coffee) +
  geom_point(aes(x = review_count_log, y = hhincome, color = pct_white)) +
  labs(x = "Review Count (log)", y = "Median Annual Household Income ($)") +
  facet_wrap(~county)

Plot 3 suggests that, in each county, tracts with a higher percentage of residents who self-identify as white tend to have a higher median household income. In addition, the log of the average number of reviews appears to be positively correlated with median household income.
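Because the plot is faceted by county, the two visual impressions can be checked with a Pearson correlation computed separately for each county. This is a minimal sketch on toy data with made-up values and hypothetical county labels "A" and "B"; only the column names are taken from `coffee`.

```r
# Toy data with made-up values; county labels "A" and "B" are hypothetical.
toy <- data.frame(
  county           = rep(c("A", "B"), each = 5),
  review_count_log = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
  hhincome         = c(30, 45, 50, 70, 80, 40, 42, 60, 65, 90) * 1000,
  pct_white        = c(0.2, 0.3, 0.5, 0.6, 0.8, 0.3, 0.2, 0.5, 0.7, 0.9)
)

# Split the rows by county and compute both correlations within each group.
# use = "complete.obs" drops rows with missing values, which real data may have.
cor_by_county <- sapply(split(toy, toy$county), function(d) {
  c(
    income_vs_reviews = cor(d$hhincome, d$review_count_log, use = "complete.obs"),
    income_vs_white   = cor(d$hhincome, d$pct_white, use = "complete.obs")
  )
})
round(cor_by_county, 2)
```

The result is a small matrix with one column per county, so a county where the sign flips or the correlation is weak stands out immediately.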

Plot 4

long_coffee <- coffee %>%
  pivot_longer(cols = c(pct_pov_log, hhincome, pct_white, race.tot),
               names_to = "var",
               values_to = "Values")

long_coffee <- long_coffee %>%
  mutate(
    var = case_when(
      var == "pct_pov_log" ~ "Percent Residents Under Poverty (log)",
      var == "hhincome" ~ "Median Annual Household Income ($)",
      var == "pct_white" ~ "Percent White Residents",
      var == "race.tot" ~ "Total Population",
      TRUE ~ as.character(var)  # Keep other values unchanged
    )
  )

# x and y are mapped once in ggplot() so every layer, including stat_cor(), shares them
scatterplots <- ggplot(long_coffee, aes(x = review_count_log, y = Values)) +
  geom_point(aes(color = county)) +
  stat_smooth(aes(color = county), method = "lm", se = FALSE) +  # one fitted line per county
  labs(
    x = "Review Count Logged",
    y = "Values",
    color = "County"
  ) +
  facet_wrap(~ var, scales = "free") +
  stat_cor(method = "pearson")  # one overall Pearson R and p-value per panel


scatterplots

The log of the average number of reviews appears to be positively correlated with median annual household income and with the percentage of residents who self-identify as white. In contrast, total population and the percentage of residents under poverty appear to be negatively correlated with the log of the average number of reviews.
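The signs described above can also be read off a table rather than the plot labels. The sketch below runs `cor.test()` for each variable against `review_count_log`, which should correspond to the R and p values that `stat_cor()` prints in each panel. It uses toy data with made-up values and only two of the four variables; the column names are the only part taken from `coffee`.

```r
# Toy data with made-up values; only the column names mirror `coffee`.
toy <- data.frame(
  review_count_log = c(1, 2, 3, 4, 5, 6),
  hhincome         = c(35, 40, 55, 60, 75, 90) * 1000,
  pct_pov_log      = c(3.2, 3.0, 2.5, 2.6, 2.0, 1.5)
)

# Variables to compare against review_count_log; extend this vector for real data.
vars <- c("hhincome", "pct_pov_log")

# One Pearson test per variable, collected into a single data frame.
cor_table <- do.call(rbind, lapply(vars, function(v) {
  ct <- cor.test(toy$review_count_log, toy[[v]], method = "pearson")
  data.frame(var = v, r = unname(ct$estimate), p_value = ct$p.value)
}))
cor_table
```

With the real data, replacing `toy` with `coffee` and adding `pct_white` and `race.tot` to `vars` would cover all four panels.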