Download the data prepared for this assignment. This data was prepared using the following steps:
Yelp data was downloaded for categories = ‘coffee’. This data covers Fulton, DeKalb, Clayton, Cobb, and Gwinnett counties.
American Community Survey 5-Year Estimate for 2019 was downloaded for the counties specified above. It contains
The two datasets were spatially joined. After joining, a few additional columns were generated, including
Using this data, re-create the following plots as closely as possible. Make sure you provide the code you wrote to generate each plot. When you re-create them, you DO NOT need to match the plot aesthetics. For example, when custom colors are used, the choice of colors does not matter as long as you use custom colors of your choice to display the designated data; when opacity is used, the exact level does not matter as long as a reasonable level of opacity is applied. Other minor aesthetics, such as the aspect ratio and theme of the plots, do not matter. If you want to modify them for aesthetics, feel free to do so.
For each plot, write a few sentences to describe your findings.
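library(tidyverse) # read_csv(), dplyr/tidyr verbs, ggplot2, stringr
library(ggdark) # dark_mode(); package inferred from the message it prints below
coffee_raw <- read_csv('./coffee.csv') %>% select(-'...1')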
## New names:
## Rows: 363 Columns: 14
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (1): county dbl (13): ...1, GEOID, hhincome, pct_pov, review_count, avg_rating,
## race.tot...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
Recreate the boxplot with x = avg_rating, y = hhincome
boxplot_rating <- ggplot(coffee_raw, aes(x = avg_rating, group = avg_rating, y = hhincome)) +
geom_boxplot(col = 'lightblue') +
labs(x = 'Average Rating', y = 'Median Household Income') +
dark_mode()
## Inverted geom defaults of fill and color/colour.
## To change them back, use invert_geom_defaults().
boxplot_rating
Here it appears that higher-income tracts tend to have businesses rated
around 3-4. Surprisingly, tracts with 1-rated and 5-rated businesses
show similar income distributions.
Facet wrap boxplot on county
boxplot_rating + facet_wrap(~county)
Fulton County tends to have the most variability in income within each
rating score, and Clayton County the least. Surprisingly, Clayton County
has no 5-star businesses! Furthermore, Cobb County has no 1-star
businesses!
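The missing ratings noted above can be checked directly instead of being read off the facets. The following is a minimal sketch, assuming only the `county` and `avg_rating` columns already used in the plot; `rating_counts` is a hypothetical helper, not part of the assignment code.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: count businesses for every county/rating pair,
# keeping pairs that never occur in the data as explicit zeros.
rating_counts <- function(df) {
  df %>%
    count(county, avg_rating, name = "n_businesses") %>%
    complete(county, avg_rating, fill = list(n_businesses = 0L))
}
```

Calling `rating_counts(coffee_raw)` and filtering for `n_businesses == 0` would list any county/rating combinations that have no businesses.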
Scatterplot where x = log review count, y = hhincome, col = prop white, facet wrap on county
scatter_review <- ggplot(coffee_raw, aes(x = review_count_log, y = hhincome, col = pct_white)) +
  geom_point(alpha = 0.5, size = 3) +
  scale_color_gradient(low = 'blue', high = 'red') +
  facet_wrap(~county) +
  labs(x = 'Review Count (Log)', y = 'Median Annual Household Income', col = str_wrap('Proportion of residents who self-identify as white', width = 30)) +
  #Pass theme_light() to dark_mode(); adding dark_mode() after theme_light() would replace it
  dark_mode(theme_light()) +
  theme(legend.title = element_text(size = 8), legend.key.size = unit(12, 'pt'), legend.text = element_text(size = 6))
scatter_review
Once again, Fulton County has the largest spread of review counts across
income levels. Judging by eye, DeKalb County appears to have the
strongest correlation between review count and income level.
Furthermore, in every county except Clayton, review count, income level,
and the share of residents who identify as white appear to rise
together.
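These impressions come from reading the plot, so they could be checked numerically. The following is a minimal sketch, assuming the `review_count_log`, `hhincome`, and `pct_white` columns used above; `county_correlations` is a hypothetical helper name, and Pearson correlation is one reasonable choice of measure rather than the only one.

```r
library(dplyr)

# Hypothetical helper: Pearson correlations of logged review count with
# income and with the share of white residents, computed per county.
county_correlations <- function(df) {
  df %>%
    group_by(county) %>%
    summarise(
      n = n(),
      r_income = cor(review_count_log, hhincome, use = "complete.obs"),
      r_white = cor(review_count_log, pct_white, use = "complete.obs"),
      .groups = "drop"
    )
}
```

`county_correlations(coffee_raw)` would return one row per county, which makes it easy to see whether DeKalb really has the largest `r_income` and whether Clayton stands apart.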
Scatter plot of the following four regressions, colored by county:
#Build models to get R and p-value labels
hhi_lm <- lm(hhincome ~ review_count_log, data = coffee_raw) %>% summary()
poverty_lm <- lm(pct_pov_log ~ review_count_log, data = coffee_raw) %>% summary()
white_lm <- lm(pct_white ~ review_count_log, data = coffee_raw) %>% summary()
pop_lm <- lm(race.tot ~ review_count_log, data = coffee_raw) %>% summary()
labels <- data.frame(var_type = character(), r = numeric(), pval = numeric())
#For every model
for (model in list(hhi_lm, poverty_lm, white_lm, pop_lm)){
  #Pull the predictor and response names out of the model formula as strings
  x <- as.character(model[['terms']][[3]])
  y <- as.character(model[['terms']][[2]])
  #Build the row of the dataframe: Pearson correlation plus the slope's p-value
  row <- data.frame(var_type = y, r = cor(coffee_raw[[x]], coffee_raw[[y]]), pval = model[['coefficients']][2,4])
  #Bind the row to the labels dataframe
  labels <- labels %>%
    bind_rows(row)
}
#Format the text statement
labels <- labels %>%
mutate(text = paste0('R = ', r %>% round(2), ', p = ', pval %>% signif(2)))
#Prep data for plotting via pivot_longer
coffee_pivot <- coffee_raw %>%
pivot_longer(cols = c('hhincome', 'pct_pov_log', 'pct_white', 'race.tot'), names_to = 'var_type')
facet_labels = c(`hhincome` = 'Median Annual Household Income ($)', `pct_pov_log` = 'Percent Residents Under Poverty', `pct_white` = 'Percent White Resident', `race.tot` = 'Total Population')
#Plot the data
lm_plot <- ggplot(coffee_pivot, aes(x = review_count_log, y = value, col = county)) +
facet_wrap(.~var_type, scales = 'free', labeller = as_labeller(facet_labels)) +
geom_point() +
geom_smooth(method = 'lm', se = FALSE) +
geom_text(data = labels, aes(x = -Inf, y = Inf, label = text, fontface = 'italic'), size = 3, hjust = -.02, vjust = 1, inherit.aes = FALSE) +
labs(x = 'Review Count Logged', y = 'Value', title = 'Scatterplot between logged review count & neighborhood characteristics', subtitle = 'Using Yelp data in Five Counties Around Atlanta, GA', col = 'County') +
dark_mode() +
theme(legend.title = element_text(size = 8), legend.key.size = unit(12, 'pt'), legend.text = element_text(size = 6), plot.title = element_text(size = 10), plot.subtitle = element_text(size = 8), axis.text = element_text(size = 4))
lm_plot
## `geom_smooth()` using formula 'y ~ x'
Based on the correlation coefficients (R), the number of reviews is most strongly associated with the proportion of white residents in a tract, and this correlation is statistically significant. It is strongest in DeKalb County. Similarly, the poverty rate has a significant negative relationship with the number of reviews, and median annual household income has a significant positive relationship (at an alpha of 0.05). I’m surprised to see that % white is the strongest correlate of the number of reviews. It could be interesting to analyze the demographics of Yelp users further: are white users more likely to leave reviews on white businesses?
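The R and p values discussed here can also be obtained without fitting a model first. In a simple linear regression with one predictor, the p-value of the slope equals the p-value of the Pearson correlation test, so `cor.test()` gives both numbers in one call. The following is a minimal sketch; `cor_label` is a hypothetical helper, not part of the assignment code.

```r
# Hypothetical helper: Pearson r and its p-value for one pair of columns,
# returned in the same shape as the `labels` data frame used for the plot.
cor_label <- function(df, x, y) {
  ct <- cor.test(df[[x]], df[[y]]) # drops incomplete pairs before testing
  data.frame(var_type = y, r = unname(ct$estimate), pval = ct$p.value)
}

# Possible usage with the assignment data (assumes coffee_raw is loaded):
# cor_label(coffee_raw, 'review_count_log', 'pct_white')
```

Applying it to each of the four variables and stacking the rows would produce an equivalent label table, provided the data have no missing values that `cor()` and `lm()` would treat differently.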