Mini Project 4

Download the data prepared for this assignment. This data was prepared using the following steps:

Yelp data was downloaded for categories = ‘coffee’. This data covers Fulton, DeKalb, Clayton, Cobb, and Gwinnett counties.
American Community Survey 5-Year Estimate for 2019 was downloaded for the counties specified above. It contains:
1. median annual household income (hhincome)
2. percent of residents under poverty (pct_pov)
3. percent of residents who self-identify as white (pct_white)
4. total population (race.tot)
5. log-transformed version of median annual household income (hhincome_log)
6. log-transformed version of percent of residents under poverty (pct_pov_log)

The two data are spatially joined. After joining, a few additional columns were generated, including

the total number of businesses (yelp_n)
average rating (avg_rating)
average number of reviews (review_count)
log of the average number of reviews (review_count_log)
average price (avg_price)
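
The tract-level columns listed above could be derived along the following lines. This is only a minimal sketch on a made-up business-level table: the input columns `rating` and `reviews`, the toy values, and the use of the natural log are assumptions, not the actual preparation code.

```r
library(dplyr)

# Hypothetical business-level data: one row per coffee shop, tagged with the
# GEOID of the census tract it falls in (toy values, not the real Yelp data).
shops <- tibble(
  GEOID   = c("001", "001", "002"),
  rating  = c(4, 5, 3),
  reviews = c(10, 30, 5)
)

# Collapse to one row per tract, mirroring the column names in the assignment.
tract_summary <- shops %>%
  group_by(GEOID) %>%
  summarise(
    yelp_n           = n(),               # number of businesses in the tract
    avg_rating       = mean(rating),
    review_count     = mean(reviews),     # average number of reviews
    review_count_log = log(review_count)  # assumption: natural log
  )

tract_summary
```

Inside `summarise()`, `review_count_log` is computed from the `review_count` column created one line earlier, so it is the log of the average, not the average of the logs.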

Using this data, re-create the following plots as closely as possible. Make sure you provide the code you wrote to generate each plot. When you re-create them, you DO NOT need to make the plots' aesthetics similar. For example, when custom colors are used, the choice of colors does not matter as long as you appropriately use some custom colors of your choice to display the designated data; when opacity is used, the actual level of opacity doesn’t matter as long as a reasonable level of opacity is applied. Other minor aesthetics, such as the aspect ratio and theme of the plots, do not matter. If you want to modify them for aesthetics, feel free to do so.

For each of the plots, write a few sentences to describe your findings.

Load in Data

library(tidyverse); library(ggdark)  # ggdark provides dark_mode()
coffee_raw <- read_csv('./coffee.csv') %>% select(-'...1')

## New names:
## Rows: 363 Columns: 14
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (1): county dbl (13): ...1, GEOID, hhincome, pct_pov, review_count, avg_rating,
## race.tot...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
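
The `...1` column dropped in the load step is what readr calls a header cell that was left blank. The sketch below reproduces that on a tiny inline CSV; the toy data is hypothetical, and the guess that the blank header comes from saved row names is an assumption about how coffee.csv was written.

```r
library(readr)
library(dplyr)

# Toy CSV with a blank first header cell, as write.csv() produces when
# row names are kept (hypothetical data, not coffee.csv).
csv_text <- ",a,b\n1,10,x\n2,20,y\n"

# I() marks the string as literal data rather than a file path.
raw <- read_csv(I(csv_text), show_col_types = FALSE)
names(raw)      # the unnamed column is repaired to "...1"

# Drop the repaired column by name, as in the load step above.
cleaned <- raw %>% select(-'...1')
names(cleaned)
```

Passing `show_col_types = FALSE` quiets the column specification message; the "New names" note about the repaired column is still printed.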

Recreate Plots

Recreate boxplot where x = avg_rating, y = hhincome

boxplot_rating <- ggplot(coffee_raw, aes(x = avg_rating, group = avg_rating, y = hhincome)) +
  geom_boxplot(col = 'lightblue') +
  labs(x = 'Average Rating', y = 'Median Household Income') +
  dark_mode()

## Inverted geom defaults of fill and color/colour.
## To change them back, use invert_geom_defaults().

boxplot_rating

Higher-income tracts tend to have coffee shops with average ratings of around 3 to 4. Surprisingly, tracts with an average rating of 1 and tracts with an average rating of 5 show similar income distributions.
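
The boxplot code sets `group = avg_rating` because `avg_rating` is numeric: with a continuous x and no grouping, `geom_boxplot()` pools everything into a single box. A small sketch on hypothetical toy data (the `toy` data frame and its values are made up) shows the difference by counting the boxes that get drawn.

```r
library(ggplot2)

# Toy data (hypothetical): a numeric rating with three distinct values.
toy <- data.frame(
  rating = rep(c(1, 2, 3), each = 5),
  income = 1:15
)

# Without `group`, the continuous x is treated as one group: a single box.
p_single <- ggplot(toy, aes(x = rating, y = income)) +
  geom_boxplot()

# With `group = rating`, one box is drawn per distinct rating.
p_grouped <- ggplot(toy, aes(x = rating, y = income, group = rating)) +
  geom_boxplot()

# layer_data() returns one row per drawn box.
n_single  <- suppressWarnings(nrow(layer_data(p_single)))
n_grouped <- nrow(layer_data(p_grouped))
c(n_single, n_grouped)
```

Mapping `x = factor(rating)` is a common alternative; it yields one box per level but turns the axis into a discrete one.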

Facet wrap boxplot on county

boxplot_rating + facet_wrap(~county)

Fulton County shows the most variability in rating scores across income levels, while Clayton County shows the least. Surprisingly, Clayton County has no 5-star businesses! Furthermore, Cobb County has no 1-star businesses!
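
Missing rating levels like these are easy to overlook in a faceted plot, so a cross-tabulation is a useful check. The sketch below uses hypothetical toy values and assumes ratings take whole values from 1 to 5; with the real data, `coffee_raw$county` and `coffee_raw$avg_rating` would take the place of the toy columns.

```r
# Toy tract-level data (hypothetical values, not the real coffee data).
toy <- data.frame(
  county     = c("Clayton", "Clayton", "Cobb", "Cobb", "Fulton", "Fulton"),
  avg_rating = c(1, 3, 4, 5, 1, 5)
)

# Fix the rating levels so that absent ratings still appear as zero counts.
toy$avg_rating <- factor(toy$avg_rating, levels = 1:5)

# Rows are counties, columns are rating levels.
rating_by_county <- table(toy$county, toy$avg_rating)
rating_by_county
```

A zero in a cell means that county has no tract with that rating.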

Scatterplot where x = log review count, y = hhincome, col = prop white, facet wrap on county

scatter_review <- ggplot(coffee_raw, aes(x = review_count_log, y = hhincome, col = pct_white)) +
  geom_point(alpha = 0.5, size = 3) +
  scale_color_gradient(low = 'blue', high = 'red') +
  facet_wrap(~county) +
  labs(x = 'Review Count (Log)', y = 'Median Annual Household Income', col = str_wrap('Proportion of residents who self-identify as white', width = 30)) +
  theme_light() +
  dark_mode() +
  theme(legend.title = element_text(size = 8), legend.key.size = unit(12, 'pt'), legend.text = element_text(size = 6))

scatter_review

Once again, Fulton County has the largest spread of review counts across income levels. Judging by the plot alone, DeKalb County appears to have the strongest correlation between the number of reviews and income level. Furthermore, every county except Clayton appears to show an association among the number of reviews, income level, and the percentage of residents who identify as white.
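
Reading correlations off a scatterplot is a rough judgment, and a per-county correlation would make the comparison concrete. A minimal sketch on hypothetical toy data follows (the values are made up; the column names are borrowed from the coffee data). With the real data, `coffee_raw` would take the place of `toy`.

```r
library(dplyr)

# Toy data (hypothetical): two counties with opposite trends.
toy <- tibble(
  county           = rep(c("A", "B"), each = 4),
  review_count_log = rep(1:4, times = 2),
  hhincome         = c(2, 4, 6, 8, 4, 3, 2, 1)
)

# Pearson correlation between logged review count and income, per county.
cor_by_county <- toy %>%
  group_by(county) %>%
  summarise(r = cor(review_count_log, hhincome))

cor_by_county
```

If the real columns contain missing values, `cor()` returns `NA` unless `use = 'complete.obs'` is passed.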

Scatter plot of the following four regressions, colored by county:

Household Income ~ Review Count (Log)
Poverty Rate ~ Review Count (Log)
Prop White ~ Review Count (Log)
Total Pop ~ Review Count (Log)

#Build models to get R and p-value labels
hhi_lm <- lm(hhincome ~ review_count_log, data = coffee_raw) %>% summary()
poverty_lm <- lm(pct_pov_log ~ review_count_log, data = coffee_raw) %>% summary()
white_lm <- lm(pct_white ~ review_count_log, data = coffee_raw) %>% summary()
pop_lm <- lm(race.tot ~ review_count_log, data = coffee_raw) %>% summary()

labels <- data.frame(var_type = character(), r = numeric(), pval = numeric())

#For every model
for (model in list(hhi_lm, poverty_lm, white_lm, pop_lm)){

  #Extract the predictor (x) and response (y) names from the model formula
  x <- model[['terms']][[3]] %>% as.character()
  y <- model[['terms']][[2]] %>% as.character()

  #Build the row of the dataframe: Pearson's r plus the slope p-value
  row <- data.frame(var_type = y, r = cor(coffee_raw[[x]], coffee_raw[[y]]), pval = model[['coefficients']][2,4])

  #Bind the row to the labels dataframe
  labels <- labels %>%
    bind_rows(row)
}

#Format the text statement
labels <- labels %>%
  mutate(text = paste0('R = ', r %>% round(2), ', p = ', pval %>% signif(2)))

#Prep data for plotting via pivot_longer
coffee_pivot <- coffee_raw %>%
  pivot_longer(cols = c('hhincome', 'pct_pov_log', 'pct_white', 'race.tot'), names_to = 'var_type')

facet_labels = c(`hhincome` = 'Median Annual Household Income ($)', `pct_pov_log` = 'Percent Residents Under Poverty', `pct_white` = 'Percent White Resident', `race.tot` = 'Total Population')

#Plot the data
lm_plot <- ggplot(coffee_pivot, aes(x = review_count_log, y = value, col = county)) +
  facet_wrap(.~var_type, scales = 'free', labeller = as_labeller(facet_labels)) +
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE) +
  geom_text(data = labels, aes(x = -Inf, y = Inf, label = text, fontface = 'italic'), size = 3, hjust = -.02, vjust = 1, inherit.aes = FALSE) +
  labs(x = 'Review Count Logged', y = 'Value', title = 'Scatterplot between logged review count & neighborhood characteristics', subtitle = 'Using Yelp data in Five Counties Around Atlanta, GA', col = 'County') +
  dark_mode() +
  theme(legend.title = element_text(size = 8), legend.key.size = unit(12, 'pt'), legend.text = element_text(size = 6), plot.title = element_text(size = 10), plot.subtitle = element_text(size = 8), axis.text = element_text(size = 4))

lm_plot

## `geom_smooth()` using formula 'y ~ x'

Based on the correlation coefficients (R), the number of reviews is most strongly associated with the proportion of white residents in a tract, and this relationship is statistically significant. The association appears strongest in DeKalb County. Similarly, the poverty rate has a significant negative relationship with the number of reviews, and median annual household income has a significant positive relationship with it (at an alpha of 0.05). I’m surprised that the percentage of white residents is the variable most strongly correlated with the number of reviews. It could be interesting to analyze the demographics of Yelp users further (for example, are white users more likely to leave reviews for white businesses?).
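
The plot labels pair a Pearson correlation (R) with the p-value of the regression slope. For a regression with a single predictor these describe the same relationship: R squared equals the model's R-squared, and the slope's p-value equals the p-value of a correlation test. A minimal check of this, using the built-in `mtcars` data purely as a stand-in for the coffee data:

```r
# mtcars ships with base R; it stands in for the coffee data here.
fit <- summary(lm(mpg ~ wt, data = mtcars))
ct  <- cor.test(mtcars$wt, mtcars$mpg)

r_value      <- unname(ct$estimate)     # Pearson's r
slope_pvalue <- fit$coefficients[2, 4]  # p-value of the slope in lm()

# For a single predictor, both comparisons should agree.
r2_matches   <- isTRUE(all.equal(r_value^2, fit$r.squared))
pval_matches <- isTRUE(all.equal(ct$p.value, slope_pvalue))
c(r2_matches, pval_matches)
```

This also means `cor.test()` alone could supply both numbers for a label, without fitting a model first.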

Samuel Martinez

2022-10-06
