In this exercise, we will examine the potential relationship between the attractiveness of Points of Interest (POIs) in a neighborhood and its status as an advantaged area. By analyzing data, we aim to determine which neighborhood characteristic shows the strongest correlation with POI attractiveness. This assignment will specifically utilize the ggplot2 package in R to explore the link between neighborhood advantages and the presence of attractive POIs.
library(tidycensus)
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tmap)
## Warning: package 'tmap' was built under R version 4.3.3
## Breaking News: tmap 3.x is retiring. Please test v4, e.g. with
## remotes::install_github('r-tmap/tmap')
library(sf)
## Warning: package 'sf' was built under R version 4.3.3
## Linking to GEOS 3.11.2, GDAL 3.8.2, PROJ 9.3.1; sf_use_s2() is TRUE
library(jsonlite)
##
## Attaching package: 'jsonlite'
##
## The following object is masked from 'package:purrr':
##
## flatten
library(dplyr)
library(ggplot2)
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.3.2
library(broom)
## Warning: package 'broom' was built under R version 4.3.2
For the data analysis, we need to load the dataset required for the analysis provided from the course.
coffee <- read.csv("E:/Georgia_Tech_MCRP/2024_FALL/CP8883_Urban_Analytics/Assignment/Assignment_4/coffee.csv")
Then, we will examine the structure of the dataset and clean the data if necessary.
summary(coffee)
## X GEOID county hhincome
## Min. : 1.0 Min. :1.306e+10 Length:363 Min. : 12485
## 1st Qu.: 91.5 1st Qu.:1.307e+10 Class :character 1st Qu.: 50344
## Median :182.0 Median :1.312e+10 Mode :character Median : 71482
## Mean :182.0 Mean :1.310e+10 Mean : 78313
## 3rd Qu.:272.5 3rd Qu.:1.312e+10 3rd Qu.: 96042
## Max. :363.0 Max. :1.314e+10 Max. :236149
##
## pct_pov review_count avg_rating pop
## Min. :0.008508 Min. : 1.00 Min. :1.000 Min. : 1254
## 1st Qu.:0.056117 1st Qu.: 22.25 1st Qu.:2.000 1st Qu.: 4170
## Median :0.106396 Median : 40.25 Median :3.000 Median : 5879
## Mean :0.132704 Mean : 65.76 Mean :3.091 Mean : 6365
## 3rd Qu.:0.187150 3rd Qu.: 71.50 3rd Qu.:4.000 3rd Qu.: 7914
## Max. :0.767031 Max. :1326.00 Max. :5.000 Max. :26399
##
## avg_price pct_white hhincome_log review_count_log
## Min. :1.000 Min. :0.0000 Min. : 9.433 Min. :0.6931
## 1st Qu.:1.000 1st Qu.:0.3161 1st Qu.:10.827 1st Qu.:2.9174
## Median :1.000 Median :0.5052 Median :11.177 Median :3.4756
## Mean :1.349 Mean :0.4914 Mean :11.166 Mean :3.4348
## 3rd Qu.:2.000 3rd Qu.:0.7105 3rd Qu.:11.473 3rd Qu.:3.9889
## Max. :2.000 Max. :0.9630 Max. :12.372 Max. :7.1907
## NA's :39
## pct_pov_log yelp_n
## Min. :-3.9895 Min. : 1.000
## 1st Qu.:-2.7163 1st Qu.: 1.000
## Median :-2.1508 Median : 2.000
## Mean :-2.1913 Mean : 2.543
## 3rd Qu.:-1.6238 3rd Qu.: 3.000
## Max. :-0.2523 Max. :19.000
##
head(coffee)
## X GEOID county hhincome pct_pov review_count avg_rating
## 1 1 13063040202 Clayton County 33276 0.20134228 57.00000 2
## 2 2 13063040308 Clayton County 28422 0.21071800 13.00000 3
## 3 3 13063040407 Clayton County 49271 0.10825507 29.33333 2
## 4 4 13063040408 Clayton County 44551 0.18095661 20.00000 4
## 5 5 13063040410 Clayton County 49719 0.11468019 41.00000 1
## 6 6 13063040411 Clayton County 57924 0.09068942 18.00000 2
## pop avg_price pct_white hhincome_log review_count_log pct_pov_log yelp_n
## 1 2850 1 0.07508772 10.41289 4.060443 -1.554276 1
## 2 4262 1 0.26067574 10.25527 1.975622 -1.510869 2
## 3 4046 1 0.20514088 10.80529 3.320837 -2.134911 3
## 4 8489 1 0.16868889 10.70461 3.044522 -1.655709 1
## 5 7166 1 0.19369244 10.81434 3.737670 -2.082003 1
## 6 13311 1 0.16512659 10.96706 2.944439 -2.295715 1
Next, let’s create a boxplot that explains the relationship between the average rating of POIs and household income. From this boxplot, we can observe that as the median household income increases, the average rating also increases until it reaches an average rating of 5. However, the trend then goes downward for average ratings above 5. This gives us the curiosity to look at the relationship between the two variables at the county level, so that we can explore whether median household income actually correlates with POI attractiveness in county level.
ggplot(coffee, aes(x=as.factor(avg_rating), y=hhincome)) +
geom_boxplot() +
labs(title = "Boxplot: Average Rating vs Household Income",
x = "Average Rating",
y = "Household Income")
From this boxplot, we can observe that there is no clear correlation between median household income and the average rating for two counties, Clayton and Cobb. On the other hand, the remaining three counties display a similar trend that we could observe from the first boxplot. This result encourages us to explore other neighborhood characteristics as potential correlating factors with POI attractiveness.
ggplot(coffee, aes(x=as.factor(avg_rating), y=hhincome)) +
geom_boxplot() +
facet_wrap(~county) +
labs(title = "Boxplot: Average Rating vs Household Income by County",
x = "Average Rating",
y = "Household Income")
From this scatterplot, we can observe that: 1) there is a clear correlation between the % of white residents and median household income, while 2) the % of white residents and review counts exhibit a positive correlation, with the exception of Clayton County. Clayton County may not be a good counterexample to the hypothesis of a positive correlation between the % of white residents and review counts, as a limited number of census tracts with over 50% white residents are shown. This result encourages us to explore the coefficients of various socio-economic variables in relation to review counts to determine which variable is the strongest correlating factor with POI attractiveness.
ggplot(coffee, aes(x = review_count_log, y=hhincome, color=pct_white)) +
geom_point(alpha = 0.4, size = 2) +
facet_wrap(~county, ncol = 2) +
labs(title = "Scatterplot: Review Count vs Household Income",
x = "Review Count (Log)",
y = "Household Income",
color = "% of White") +
scale_color_gradient(low = "purple", high = "green")
From the scatterplot, we can observe that, at the level of p=0.05, there is a significant and positive relationship (R=0.28) between the % of white residents and the review counts of coffee shops. Also, at the level of p=0.05, % of residents under poverty and the review counts of coffee shops show a week, negative relationship (R=-0.19). In county level, Dekalb and Fulton County show relatively stronger positive/negative relationships, while Gwinett, Cobb, and Clayton shows relatively weaker positive/negative relationships. On the other hand, at the level of p=0.05 and p=0.1, median household income and total population have no statistically significant relationships with the review counts.
scatplot <- coffee %>% # Reshape the data using pivot_longer()
pivot_longer(cols = c(hhincome, pct_pov_log, pct_white, pop),
names_to = "variable",
values_to = "value")
get_stats <- function(data, x_var, y_vars) { # Function: R and p-value for multiple variables
sapply(y_vars, function(y_var) {
model <- lm(as.formula(paste(y_var, "~", x_var)), data = data)
r <- round(cor(data[[x_var]], data[[y_var]]), 2)
p_val <- round(summary(model)$coefficients[2, 4], 2)
paste("R =", r, "\np =", p_val)
})
}
stats_df <- data.frame( # Calculate statistics for each variable
variable = c("pct_pov_log", "hhincome", "pct_white", "pop"),
stats = get_stats(coffee, "review_count_log", c("pct_pov_log", "hhincome", "pct_white", "pop")),
y = rep(Inf, 4) # Define annotating position
)
ggplot(scatplot, aes(x = review_count_log, y = value, color = county)) + # Create scatter plot
geom_point(size = 1) +
geom_smooth(method = "lm", se = FALSE, linewidth = 0.8) +
facet_wrap(~ variable, scales = "free_y") +
labs(
x = "Review Count (log)",
y = "Value",
title = "Scatterplot between Logged Review Count & Neighborhood Characteristics",
subtitle = "Using Yelp Data in Five Counties Around Atlanta, GA"
) +
geom_text(data = stats_df, aes(x = 0.7, y = y, label = stats),
vjust = 1.0, size = 2, inherit.aes = FALSE) # Annotate stats to plot
## `geom_smooth()` using formula = 'y ~ x'