MiniAssignment4

setwd("C:\\Users\\srini\\OneDrive\\Documents\\Urban Analytics")

Loading the libraries

library(tidycensus)
library(sf)

## Linking to GEOS 3.12.1, GDAL 3.8.4, PROJ 9.3.1; sf_use_s2() is TRUE

library(tmap)

## Breaking News: tmap 3.x is retiring. Please test v4, e.g. with
## remotes::install_github('r-tmap/tmap')

library(jsonlite)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()  masks stats::filter()
## ✖ purrr::flatten() masks jsonlite::flatten()
## ✖ dplyr::lag()     masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(httr)
library(reshape2)

## 
## Attaching package: 'reshape2'

## The following object is masked from 'package:tidyr':
## 
##     smiths

library(yelpr)
library(knitr)
library(skimr)
library(ggplot2)
library(dplyr) 
library(tidyr)
library(here)

## here() starts at C:/Users/srini/OneDrive/Documents/Urban Analytics

library(ggpmisc)

## Loading required package: ggpp

## Registered S3 methods overwritten by 'ggpp':
##   method                  from   
##   heightDetails.titleGrob ggplot2
##   widthDetails.titleGrob  ggplot2

## 
## Attaching package: 'ggpp'

## The following object is masked from 'package:ggplot2':
## 
##     annotate

library(ggpubr)

## 
## Attaching package: 'ggpubr'

## The following objects are masked from 'package:ggpp':
## 
##     as_npc, as_npcx, as_npcy

Loading the yelp data for the coffee shops for Fulton, DeKalb, Clayton, Cobb, and Gwinnett counties.

yelp_coffee = read.csv(file = "coffee.csv")
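Before plotting, it can help to confirm that the columns used later in this report are actually present in the file. The following is a minimal sketch, not part of the original assignment: `check_columns` is a hypothetical helper, the column list is taken from the plotting code below, and `toy` is a made-up stand-in for `yelp_coffee`.

```r
# Hypothetical helper: return the expected columns that are missing from df.
check_columns <- function(df, required) {
  setdiff(required, names(df))
}

# Column names taken from the plotting code in this report.
required_cols <- c("avg_rating", "hhincome", "county", "review_count_log",
                   "pct_white", "pct_pov_log", "pop")

# Toy stand-in for yelp_coffee (made-up values, for illustration only).
toy <- data.frame(avg_rating = c(3, 4), hhincome = c(50000, 72000),
                  county = c("Fulton", "Cobb"), review_count_log = c(2.1, 3.4),
                  pct_white = c(0.4, 0.7), pct_pov_log = c(-2.0, -3.1))

# The toy data deliberately leaves out the 'pop' column.
missing_cols <- check_columns(toy, required_cols)
missing_cols
```

Running `check_columns(yelp_coffee, required_cols)` on the real data should return an empty character vector if every column is available.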

Using this data, plotting the required graphs

Plot 1: Boxplot with x = avg_rating, y = hhincome

boxplot_rating = ggplot(yelp_coffee, aes(x = avg_rating, group = avg_rating, y = hhincome)) +
  geom_boxplot(col = 'lightblue') +
  labs(x = 'Average Rating', y = 'Median Household Income')

boxplot_rating

Higher-income census tracts tend to have more businesses rated between 3 and 4 stars. Interestingly, the distribution of 1-star and 5-star businesses shows little variation across income levels. This suggests that mid-range ratings are more common in wealthier areas, while extreme ratings, whether very low (1 star) or very high (5 stars), occur at similar rates in lower- and higher-income tracts.
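A boxplot shows the spread of income within each rating group, but not how many businesses fall into each group. One way to check both is to tabulate counts and median income per rating. The sketch below uses a made-up `toy` data frame with the same column names as `yelp_coffee`; the values are illustrative only and are not results from the real data.

```r
# Toy stand-in for yelp_coffee (made-up values, for illustration only).
toy <- data.frame(
  avg_rating = c(1, 3, 3, 4, 4, 4, 5),
  hhincome   = c(40000, 55000, 65000, 70000, 80000, 90000, 60000)
)

# Number of businesses at each rating level.
rating_counts <- table(toy$avg_rating)

# Median tract income at each rating level (the centre line of each box).
median_income <- aggregate(hhincome ~ avg_rating, data = toy, FUN = median)

rating_counts
median_income
```

Replacing `toy` with `yelp_coffee` would give the counts and medians behind Plot 1.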

Plot 2: Facet wrap boxplot on county

boxplot_rating + facet_wrap(~county)

Fulton County shows the most variation in business ratings across income levels, while Clayton County shows the least. Notably, Clayton County has no 5-star businesses, and Cobb County has no 1-star businesses. Fulton therefore has the widest range of ratings, while Clayton and Cobb are more uniform, particularly at the extremes of the rating scale.
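Claims about ratings that are absent from a county can be verified with a county-by-rating cross-tabulation. This is a minimal sketch on a made-up `toy` data frame that was constructed to mimic the pattern described above; it does not reproduce the real counts.

```r
# Toy stand-in for yelp_coffee (made-up values, for illustration only).
toy <- data.frame(
  county     = c("Fulton", "Fulton", "Fulton", "Clayton", "Clayton", "Cobb", "Cobb"),
  avg_rating = c(1, 3, 5, 1, 3, 4, 5)
)

# Rows are counties, columns are rating levels, cells are business counts.
rating_by_county <- table(toy$county, toy$avg_rating)

# Counties with no businesses at the extreme ratings.
no_five_star <- rownames(rating_by_county)[rating_by_county[, "5"] == 0]
no_one_star  <- rownames(rating_by_county)[rating_by_county[, "1"] == 0]

rating_by_county
no_five_star
no_one_star
```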

Plot 3: Scatterplot where x = log review count, y = hhincome, col = prop white, facet wrap on county

#Scatterplot of logged review count vs. household income, colored by proportion white
scatter_review = ggplot(yelp_coffee, aes(x = review_count_log, y = hhincome, col = pct_white)) +
  geom_point(alpha = 0.5, size = 3) +
  scale_color_gradient(low = 'blue', high = 'red') +
  facet_wrap(~county) +
  labs(x = 'Review Count (Log)', y = 'Median Annual Household Income', col = str_wrap('Proportion of residents who self-identify as white', width = 30)) +
  theme_light() +
  theme(legend.title = element_text(size = 8), legend.key.size = unit(12, 'pt'), legend.text = element_text(size = 6))

scatter_review

Fulton County again shows the widest range of review counts across income levels. From the plot, DeKalb County appears to have the strongest correlation between the number of reviews and income. Additionally, all counties except Clayton show a visible association between the number of reviews, income levels, and the percentage of residents identifying as white. This suggests that income and racial demographics are related to business review activity in most counties, with Clayton being the notable exception.
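The visual impression of which county has the strongest relationship can be checked by computing the correlation separately for each county. The sketch below uses a made-up `toy` data frame with two counties; the values are chosen only to illustrate a strong and a weak relationship and say nothing about the real data.

```r
# Toy stand-in for yelp_coffee (made-up values, for illustration only).
toy <- data.frame(
  county           = rep(c("DeKalb", "Clayton"), each = 4),
  review_count_log = c(1, 2, 3, 4, 1, 2, 3, 4),
  hhincome         = c(40000, 50000, 60000, 70000, 50000, 45000, 52000, 48000)
)

# Split the rows by county and compute Pearson's R within each group.
county_cor <- sapply(split(toy, toy$county), function(d) {
  cor(d$review_count_log, d$hhincome)
})

county_cor
```

If the real columns contain missing values, `cor()` would need `use = "complete.obs"` to avoid returning `NA`.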

Plot 4:

Scatter plot of the following four regressions, colored by county:

1. Household Income ~ Review Count (Log)
2. Poverty Rate ~ Review Count (Log)
3. Prop White ~ Review Count (Log)
4. Total Pop ~ Review Count (Log)

#Build models to get the R and p-value labels
hhi_lm = lm(hhincome ~ review_count_log, data = yelp_coffee) %>% summary()
poverty_lm = lm(pct_pov_log ~ review_count_log, data = yelp_coffee) %>% summary()
white_lm = lm(pct_white ~ review_count_log, data = yelp_coffee) %>% summary()
pop_lm = lm(pop ~ review_count_log, data = yelp_coffee) %>% summary()

labels = data.frame(var_type = character(), r = numeric(), pval = numeric())

#For every model
for (model in list(hhi_lm, poverty_lm, white_lm, pop_lm)){
  
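  #Get the predictor (right-hand side) and response (left-hand side) names from the model formula
  x = model[['terms']][[3]] %>% as.character()
  y = model[['terms']][[2]] %>% as.character()
  #Build the row of the dataframe: Pearson's R from cor(), slope p-value from the model summary
  row = data.frame(var_type = y, r = cor(yelp_coffee[[x]], yelp_coffee[[y]]), pval = model[['coefficients']][2,4])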
  
  #Bind the row to the labels dataframe
  labels = labels %>%
    bind_rows(row)
}

#Formatting the text statement
labels = labels %>%
  mutate(text = paste0('R = ', r %>% signif(2), ', p = ', pval %>% signif(2)))

#Prep data for plotting via pivot_longer
yelp_coffee_pivot <- yelp_coffee %>%
  pivot_longer(cols = c('hhincome', 'pct_pov_log', 'pct_white', 'pop'), names_to = 'var_type')

facet_labels = c(`hhincome` = 'Median Annual Household Income ($)', `pct_pov_log` = 'Percent Residents Under Poverty (Log)', `pct_white` = 'Percent White Residents', `pop` = 'Total Population')

#Plotting the data
lm_plot = ggplot(yelp_coffee_pivot, aes(x = review_count_log, y = value, col = county)) +
  facet_wrap(.~var_type, scales = 'free', labeller = as_labeller(facet_labels)) +
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE) +
  geom_text(data = labels, aes(x = -Inf, y = Inf, label = text, fontface = 'italic'), size = 3, hjust = -.02, vjust = 1, inherit.aes = FALSE) +
  labs(x = 'Review Count Logged', y = 'Value', title = 'Scatterplot between logged review count & neighborhood characteristics', subtitle = 'Using Yelp data in Five Counties Around Atlanta, GA', col = 'County') +
  theme(legend.title = element_text(size = 8), legend.key.size = unit(12, 'pt'), legend.text = element_text(size = 6), plot.title = element_text(size = 10), plot.subtitle = element_text(size = 8), axis.text = element_text(size = 4))

lm_plot

## `geom_smooth()` using formula = 'y ~ x'

The correlation coefficients (R) show that the number of reviews is most strongly associated with the proportion of white residents in a tract, and this relationship is statistically significant. The correlation is especially prominent in DeKalb County. Additionally, the poverty rate has a significant negative relationship with the number of reviews, while median annual household income has a significant positive relationship (both at an alpha level of 0.05). I’m surprised that the percentage of white residents has the strongest association with review counts. Further analysis of Yelp user demographics, such as whether white users are more likely to review white-owned businesses, could be insightful.
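The labels in Plot 4 pair Pearson's R from `cor()` with the slope p-value from `lm()`. For a regression with a single predictor these describe the same test, so `cor.test()` can supply both values in one call. The following is a minimal sketch on a made-up `toy` data frame; the numbers are illustrative and are not the results reported above.

```r
# Toy stand-in for yelp_coffee (made-up values, for illustration only).
toy <- data.frame(
  review_count_log = c(0.7, 1.1, 1.6, 2.3, 2.9, 3.4, 4.0, 4.6),
  pct_white        = c(0.21, 0.35, 0.30, 0.48, 0.52, 0.61, 0.58, 0.74)
)

# Approach used in this report: slope p-value from the lm() summary.
fit  <- summary(lm(pct_white ~ review_count_log, data = toy))
p_lm <- fit$coefficients[2, 4]

# Alternative: Pearson's R and its p-value from a single cor.test() call.
ct    <- cor.test(toy$review_count_log, toy$pct_white)
r     <- unname(ct$estimate)
p_cor <- ct$p.value

# Same label format as the plot annotation.
label <- paste0("R = ", signif(r, 2), ", p = ", signif(p_cor, 2))
label
```

In this single-predictor setting, `p_lm` and `p_cor` agree and `r^2` equals the model's R-squared, so either route gives the same annotation.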