library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(sf)
## Linking to GEOS 3.9.1, GDAL 3.4.3, PROJ 7.2.1; sf_use_s2() is TRUE
library(tmap)
library(leaflet)
library(here)
## here() starts at C:/Users/chan303/OneDrive - Georgia Institute of Technology/CP8883BK/UA_module1
library(tidycensus)
yelp_poi <- read.csv(file = here("Data", "coffee.csv"))
head(yelp_poi)
## X GEOID county hhincome pct_pov review_count avg_rating
## 1 1 13063040202 Clayton County 33276 0.20134228 57.00000 2
## 2 2 13063040308 Clayton County 28422 0.21071800 13.00000 3
## 3 3 13063040407 Clayton County 49271 0.10825507 29.33333 2
## 4 4 13063040408 Clayton County 44551 0.18095661 20.00000 4
## 5 5 13063040410 Clayton County 49719 0.11468019 41.00000 1
## 6 6 13063040411 Clayton County 57924 0.09068942 18.00000 2
## race.tot avg_price pct_white hhincome_log review_count_log pct_pov_log
## 1 2850 1 0.07508772 10.41289 4.060443 -1.554276
## 2 4262 1 0.26067574 10.25527 1.975622 -1.510869
## 3 4046 1 0.20514088 10.80529 3.320837 -2.134911
## 4 8489 1 0.16868889 10.70461 3.044522 -1.655709
## 5 7166 1 0.19369244 10.81434 3.737670 -2.082003
## 6 13311 1 0.16512659 10.96706 2.944439 -2.295715
## yelp_n
## 1 1
## 2 2
## 3 3
## 4 1
## 5 1
## 6 1
boxplot(hhincome ~ avg_rating, data = yelp_poi,
  main = "Average Ratings of Points of Interest by Household Income",
  xlab = "Average Yelp Rating", ylab = "Median Annual Household Income ($)")
No clear trend is visible between median annual household income and average rating, except that most points of interest are located in neighborhoods where median household income is below $100,000.
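The $100,000 observation can be checked numerically rather than read off the boxplot. The following is a minimal base R sketch, assuming `yelp_poi` has been loaded as above; `share_below()` is a hypothetical helper introduced only for illustration and is not part of the original analysis.

```r
# Hypothetical helper: share of non-missing values in `x` below `threshold`.
share_below <- function(x, threshold) {
  mean(x < threshold, na.rm = TRUE)
}

# Only runs when the data frame from the earlier chunk is available.
if (exists("yelp_poi")) {
  # Overall share of rows with median household income under $100,000
  print(share_below(yelp_poi$hhincome, 100000))
  # The same share, computed separately for each average rating
  print(tapply(yelp_poi$hhincome, yelp_poi$avg_rating,
               share_below, threshold = 100000))
}
```

A share close to 1 in every rating group would support the reading that income alone does not separate the ratings.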
ggplot(data = yelp_poi, aes(x = factor(avg_rating), y = hhincome)) +
  geom_boxplot() +
  # Axis labels are set with labs(); they are not aesthetics and have no effect inside aes()
  labs(x = "Average Yelp Rating", y = "Median Annual Household Income ($)") +
  facet_wrap(~county)
Fulton County has highly rated points of interest spread across a wide
range of household income levels. Clayton County, by contrast, consists
mostly of low-income neighborhoods and has no points of interest with a
five-star rating. In Cobb, DeKalb, and Gwinnett counties, ratings are
spread fairly evenly across household income levels.
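The county comparison above can be backed with a small summary table. This sketch uses base R only and assumes the `county`, `avg_rating`, and `hhincome` columns shown in the `head()` output; `summarise_by_county()` is a hypothetical helper, not something the original analysis defines.

```r
# Hypothetical helper: highest rating and income quartiles for each county.
summarise_by_county <- function(df) {
  groups <- split(df, as.character(df$county))
  out <- lapply(groups, function(g) {
    data.frame(
      county        = g$county[1],
      max_rating    = max(g$avg_rating, na.rm = TRUE),
      income_q1     = unname(quantile(g$hhincome, 0.25, na.rm = TRUE)),
      income_median = median(g$hhincome, na.rm = TRUE),
      income_q3     = unname(quantile(g$hhincome, 0.75, na.rm = TRUE))
    )
  })
  do.call(rbind, out)
}

# Only runs when the data frame from the earlier chunk is available.
if (exists("yelp_poi")) {
  print(summarise_by_county(yelp_poi))
}
```

If the reading of the boxplots is right, `max_rating` for Clayton County should be below 5 and its income quartiles should sit lower than those of the other counties.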
ggplot(data=yelp_poi, aes(x = review_count_log, y = hhincome, color = pct_white)) +
geom_point(alpha=0.5) +
labs(title = "Scatterplot: Review Count vs. Household Income",
x = "Review Count (log)",
y = "Median Annual Household Income ($)",
color = "Proportion of residents who self-identified as white")+
scale_color_gradient(low="darkblue", high="red")+
facet_wrap(~county, nrow=2, ncol=3)
Points of interest are most heavily reviewed in Fulton County. Places in neighborhoods with higher household income tend to be places where the proportion of White residents is high. Conversely, Clayton County has the fewest reviews, and its points of interest are mostly located in low-income neighborhoods with a low proportion of White residents.
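As a rough numerical companion to the scatterplots, county-level medians can be tabulated with `aggregate()`. This is a sketch under the assumption that the `review_count`, `hhincome`, and `pct_white` columns are as shown in the `head()` output; `county_medians()` is a hypothetical name used only here.

```r
# Hypothetical helper: median review count, income, and share of White
# residents for each county. The formula interface drops rows with missing
# values in any of the listed columns.
county_medians <- function(df) {
  aggregate(cbind(review_count, hhincome, pct_white) ~ county,
            data = df, FUN = median)
}

# Only runs when the data frame from the earlier chunk is available.
if (exists("yelp_poi")) {
  print(county_medians(yelp_poi))
}
```

Medians are used rather than means because review counts are strongly right-skewed, which is also why the plots use the logged count.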
library(ggpubr)
new_labels <- c("hhincome" = "Median Annual Household Income ($)",
  "pct_pov_log" = "Percent Residents Under Poverty (log)",
  "pct_white" = "Percent White Residents",
  "race.tot" = "Total Population")
ggplot(yelp_poi %>% pivot_longer(cols = c('hhincome','pct_pov_log', 'pct_white','race.tot'), names_to = 'neighborhood_characteristics', values_to='Values')) +
geom_point(aes(x=review_count_log, y=Values, color=county)) +
geom_smooth(aes(x=review_count_log, y = Values, color = county), se=FALSE, method="lm") +
labs(title="Scatterplot between logged review count & neighborhood characteristics",
subtitle= "Using Yelp data in Five Counties Around Atlanta, GA",
x = "Review Count Logged",
y = "Values",
color = "County") +
stat_cor(aes(x=review_count_log, y = Values))+
facet_wrap(~neighborhood_characteristics, scales = "free_y", labeller = labeller(neighborhood_characteristics = new_labels))
## `geom_smooth()` using formula 'y ~ x'
In the top-left plot, DeKalb County shows a slight positive trend between median household income and review count; the other counties show no clear trend. In the top-right plot, DeKalb and Fulton counties show a slight negative trend between review count and the (logged) percentage of residents under poverty. Taken together, these two plots suggest some relationship between review count and income level: poorer neighborhoods tend to have fewer reviews.
The bottom-left plot shows that DeKalb and Fulton counties have a comparatively stronger positive relationship between review count and the percentage of White residents. In these counties, neighborhoods with more White residents tend to have more reviews. The bottom-right plot shows little relationship between total population and review count.
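The per-county trends described here are read from the fitted lines, while the `stat_cor()` layer has no `color` mapping and so reports one pooled correlation per panel. A per-county Pearson correlation makes the comparison explicit. The sketch below uses base R; `cor_by_county()` is a hypothetical helper, and dropping non-finite values is an assumption made so that logged zeros do not produce `NaN`.

```r
# Hypothetical helper: Pearson correlation between columns `x` and `y`,
# computed separately within each county.
cor_by_county <- function(df, x, y) {
  groups <- split(df, as.character(df$county))
  vapply(groups, function(g) {
    # Keep only pairs where both values are finite (drops NA, NaN, Inf, -Inf)
    keep <- is.finite(g[[x]]) & is.finite(g[[y]])
    cor(g[[x]][keep], g[[y]][keep])
  }, numeric(1))
}

# Only runs when the data frame from the earlier chunk is available.
if (exists("yelp_poi")) {
  vars <- c("hhincome", "pct_pov_log", "pct_white", "race.tot")
  # Rows are counties, columns are neighborhood characteristics
  print(round(sapply(vars, function(v) {
    cor_by_county(yelp_poi, "review_count_log", v)
  }), 2))
}
```

These coefficients describe the strength of a linear association only; they do not by themselves establish that any of the trends is statistically significant.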