Import Libraries

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(sf)
## Linking to GEOS 3.9.1, GDAL 3.4.3, PROJ 7.2.1; sf_use_s2() is TRUE
library(tmap)
library(leaflet)
library(here)
## here() starts at C:/Users/chan303/OneDrive - Georgia Institute of Technology/CP8883BK/UA_module1
library(tidycensus)

Load Data

yelp_poi <- read.csv(file = here("Data", "coffee.csv"))
head(yelp_poi)
##   X       GEOID         county hhincome    pct_pov review_count avg_rating
## 1 1 13063040202 Clayton County    33276 0.20134228     57.00000          2
## 2 2 13063040308 Clayton County    28422 0.21071800     13.00000          3
## 3 3 13063040407 Clayton County    49271 0.10825507     29.33333          2
## 4 4 13063040408 Clayton County    44551 0.18095661     20.00000          4
## 5 5 13063040410 Clayton County    49719 0.11468019     41.00000          1
## 6 6 13063040411 Clayton County    57924 0.09068942     18.00000          2
##   race.tot avg_price  pct_white hhincome_log review_count_log pct_pov_log
## 1     2850         1 0.07508772     10.41289         4.060443   -1.554276
## 2     4262         1 0.26067574     10.25527         1.975622   -1.510869
## 3     4046         1 0.20514088     10.80529         3.320837   -2.134911
## 4     8489         1 0.16868889     10.70461         3.044522   -1.655709
## 5     7166         1 0.19369244     10.81434         3.737670   -2.082003
## 6    13311         1 0.16512659     10.96706         2.944439   -2.295715
##   yelp_n
## 1      1
## 2      2
## 3      3
## 4      1
## 5      1
## 6      1

Plot 1. Box Plot - avg_rating, hhincome

boxplot(hhincome ~ avg_rating, data = yelp_poi, main = "Median Household Income of Points of Interest by Average Rating", xlab = "Average Yelp Rating", ylab = "Median Annual Household Income ($)")

No clear trend between median annual household income and average rating is apparent, except that most points of interest are located in neighborhoods where median household income is below $100,000.
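One way to put a number on the visual impression above is a rank correlation between rating and income. The following is only a sketch: `rating_income_rho` is a hypothetical helper name, and the toy data frame contains just the six rows printed by `head(yelp_poi)` (all from Clayton County), so its result says nothing about the full data set.

```r
# Toy data: the six rows shown in the head(yelp_poi) output.
# Replace `toy` with the full yelp_poi data frame for a meaningful estimate.
toy <- data.frame(
  hhincome   = c(33276, 28422, 49271, 44551, 49719, 57924),
  avg_rating = c(2, 3, 2, 4, 1, 2)
)

# Hypothetical helper: Spearman rank correlation between rating and income.
# Spearman is used because avg_rating takes only a few ordered values.
rating_income_rho <- function(df) {
  cor(df$avg_rating, df$hhincome, method = "spearman", use = "complete.obs")
}

rho <- rating_income_rho(toy)
print(rho)
```

A coefficient near zero on the full data would support the reading that rating and income are largely unrelated.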

Plot 2. Box Plot by County - avg_rating, hhincome

# Axis labels belong in labs(); inside aes() they are silently ignored
ggplot(data = yelp_poi, aes(x = factor(avg_rating), y = hhincome)) +
  geom_boxplot() +
  labs(x = "Average Yelp Rating", y = "Median Annual Household Income ($)") +
  facet_wrap(~county)

Fulton County has points of interest with higher ratings spread across a wide range of household income levels. Clayton County, by contrast, consists mostly of low-income neighborhoods and has no points of interest with a five-star rating. In Cobb, DeKalb, and Gwinnett Counties, ratings are fairly evenly distributed across household income levels.

Plot 3. Scatterplot by County - review_count_log, hhincome, county, pct_white

ggplot(data=yelp_poi, aes(x = review_count_log, y = hhincome, color = pct_white)) +
  geom_point(alpha=0.5) +
  labs(title = "Scatterplot: Review Count vs. Household Income", 
       x = "Review Count(log)", 
       y = "Median Annual Household Income ($)",
       color = "Proportion of residents who self-identified as white")+
  scale_color_gradient(low="darkblue", high="red")+
  facet_wrap(~county, nrow=2, ncol=3)

Points of interest in Fulton County have the highest review counts. Across counties, neighborhoods with higher household income also tend to have a higher proportion of White residents. Clayton County, by contrast, has the lowest review counts, and its points of interest are mostly located in low-income neighborhoods with a low proportion of White residents.
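The county comparison above could be checked with a small summary table. This is a minimal sketch in base R: `summarise_by_county` is a hypothetical helper, and the toy data again contains only the six Clayton County rows from the `head(yelp_poi)` output, so it yields a single row. On the full `yelp_poi` it should return one row per county.

```r
# Toy data: the six rows shown in the head(yelp_poi) output (one county only).
toy <- data.frame(
  county       = rep("Clayton County", 6),
  review_count = c(57, 13, 29.33333, 20, 41, 18),
  hhincome     = c(33276, 28422, 49271, 44551, 49719, 57924),
  pct_white    = c(0.07508772, 0.26067574, 0.20514088,
                   0.16868889, 0.19369244, 0.16512659)
)

# Hypothetical helper: per-county medians of review count, income,
# and proportion of White residents.
summarise_by_county <- function(df) {
  aggregate(cbind(review_count, hhincome, pct_white) ~ county,
            data = df, FUN = median)
}

county_summary <- summarise_by_county(toy)
print(county_summary)
```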

Plot 4. Scatterplot - hhincome, pct_white, pct_pov_log, race.tot, review_count_log, county

library(ggpubr)

new_labels <- c("hhincome" = "Median Annual Household Income ($)", "pct_pov_log" = "Percent Residents Under Poverty (log)", "pct_white" = "Percent White Residents", "race.tot" = "Total Population")

ggplot(yelp_poi %>% pivot_longer(cols = c('hhincome','pct_pov_log', 'pct_white','race.tot'), names_to = 'neighborhood_characteristics', values_to='Values')) + 
  geom_point(aes(x=review_count_log, y=Values, color=county)) +
  geom_smooth(aes(x=review_count_log, y = Values, color = county), se=FALSE, method="lm") +
  labs(title="Scatterplot between logged review count & neighborhood characteristics",
       subtitle= "Using Yelp data in Five Counties Around Atlanta, GA",
       x = "Review Count Logged",
       y = "Values",
       color = "County") + 
  stat_cor(aes(x=review_count_log, y = Values))+
  facet_wrap(~neighborhood_characteristics, scales = "free_y", labeller = labeller(neighborhood_characteristics = new_labels))
## `geom_smooth()` using formula 'y ~ x'

In the top-left plot, DeKalb County shows a slight positive trend between median household income and review counts; the other counties show no clear trend. In the top-right plot, DeKalb and Fulton Counties show a slight negative trend between review counts and the (logged) percentage of residents under poverty. Together, these two plots suggest some relationship between review counts and income level: poorer neighborhoods tend to have fewer reviews.

The bottom-left plot shows that DeKalb and Fulton Counties have a comparatively stronger positive relationship between review counts and the percentage of White residents: in these counties, neighborhoods with more White residents tend to have more reviews. The bottom-right plot shows little relationship between total population and review counts.
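The county-level readings above come from the fitted lines, whereas the `stat_cor()` layer, which has no color mapping, appears to report one pooled coefficient per panel. A per-county Pearson correlation would make the comparison explicit. The sketch below assumes a hypothetical helper, `cor_by_county`, and uses only the six Clayton County rows printed by `head(yelp_poi)` as toy input; run it on the full `yelp_poi` to compare counties.

```r
# Toy data: the six rows shown in the head(yelp_poi) output (one county only).
toy <- data.frame(
  county           = rep("Clayton County", 6),
  review_count_log = c(4.060443, 1.975622, 3.320837,
                       3.044522, 3.737670, 2.944439),
  hhincome         = c(33276, 28422, 49271, 44551, 49719, 57924),
  pct_white        = c(0.07508772, 0.26067574, 0.20514088,
                       0.16868889, 0.19369244, 0.16512659)
)

# Hypothetical helper: Pearson correlation between logged review count
# and one neighborhood characteristic, computed separately per county.
cor_by_county <- function(df, var) {
  groups <- split(df, df$county)
  sapply(groups, function(g) {
    cor(g$review_count_log, g[[var]], use = "complete.obs")
  })
}

r_white  <- cor_by_county(toy, "pct_white")
r_income <- cor_by_county(toy, "hhincome")
print(r_white)
print(r_income)
```

The result is a named numeric vector with one coefficient per county, which can be read side by side with the slopes in the faceted plot.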