1. Research Question

In this exercise, we will examine the potential relationship between the attractiveness of Points of Interest (POIs) in a neighborhood and its status as an advantaged area. By analyzing data, we aim to determine which neighborhood characteristic shows the strongest correlation with POI attractiveness. This assignment will specifically utilize the ggplot2 package in R to explore the link between neighborhood advantages and the presence of attractive POIs.

2. Data Preparation & Cleaning

2.1. Loading required library

library(tidycensus)
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tmap)
## Warning: package 'tmap' was built under R version 4.3.3
## Breaking News: tmap 3.x is retiring. Please test v4, e.g. with
## remotes::install_github('r-tmap/tmap')
library(sf)
## Warning: package 'sf' was built under R version 4.3.3
## Linking to GEOS 3.11.2, GDAL 3.8.2, PROJ 9.3.1; sf_use_s2() is TRUE
library(jsonlite)
## 
## Attaching package: 'jsonlite'
## 
## The following object is masked from 'package:purrr':
## 
##     flatten
library(dplyr)
library(ggplot2)
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.3.2
library(broom)
## Warning: package 'broom' was built under R version 4.3.2

2.2. Load the dataset

For the data analysis, we need to load the dataset required for the analysis provided from the course.

coffee <- read.csv("E:/Georgia_Tech_MCRP/2024_FALL/CP8883_Urban_Analytics/Assignment/Assignment_4/coffee.csv")

Then, we will examine the structure of the dataset and clean the data if necessary.

summary(coffee)
##        X             GEOID              county             hhincome     
##  Min.   :  1.0   Min.   :1.306e+10   Length:363         Min.   : 12485  
##  1st Qu.: 91.5   1st Qu.:1.307e+10   Class :character   1st Qu.: 50344  
##  Median :182.0   Median :1.312e+10   Mode  :character   Median : 71482  
##  Mean   :182.0   Mean   :1.310e+10                      Mean   : 78313  
##  3rd Qu.:272.5   3rd Qu.:1.312e+10                      3rd Qu.: 96042  
##  Max.   :363.0   Max.   :1.314e+10                      Max.   :236149  
##                                                                         
##     pct_pov          review_count       avg_rating         pop       
##  Min.   :0.008508   Min.   :   1.00   Min.   :1.000   Min.   : 1254  
##  1st Qu.:0.056117   1st Qu.:  22.25   1st Qu.:2.000   1st Qu.: 4170  
##  Median :0.106396   Median :  40.25   Median :3.000   Median : 5879  
##  Mean   :0.132704   Mean   :  65.76   Mean   :3.091   Mean   : 6365  
##  3rd Qu.:0.187150   3rd Qu.:  71.50   3rd Qu.:4.000   3rd Qu.: 7914  
##  Max.   :0.767031   Max.   :1326.00   Max.   :5.000   Max.   :26399  
##                                                                      
##    avg_price       pct_white       hhincome_log    review_count_log
##  Min.   :1.000   Min.   :0.0000   Min.   : 9.433   Min.   :0.6931  
##  1st Qu.:1.000   1st Qu.:0.3161   1st Qu.:10.827   1st Qu.:2.9174  
##  Median :1.000   Median :0.5052   Median :11.177   Median :3.4756  
##  Mean   :1.349   Mean   :0.4914   Mean   :11.166   Mean   :3.4348  
##  3rd Qu.:2.000   3rd Qu.:0.7105   3rd Qu.:11.473   3rd Qu.:3.9889  
##  Max.   :2.000   Max.   :0.9630   Max.   :12.372   Max.   :7.1907  
##  NA's   :39                                                        
##   pct_pov_log          yelp_n      
##  Min.   :-3.9895   Min.   : 1.000  
##  1st Qu.:-2.7163   1st Qu.: 1.000  
##  Median :-2.1508   Median : 2.000  
##  Mean   :-2.1913   Mean   : 2.543  
##  3rd Qu.:-1.6238   3rd Qu.: 3.000  
##  Max.   :-0.2523   Max.   :19.000  
## 
head(coffee)
##   X       GEOID         county hhincome    pct_pov review_count avg_rating
## 1 1 13063040202 Clayton County    33276 0.20134228     57.00000          2
## 2 2 13063040308 Clayton County    28422 0.21071800     13.00000          3
## 3 3 13063040407 Clayton County    49271 0.10825507     29.33333          2
## 4 4 13063040408 Clayton County    44551 0.18095661     20.00000          4
## 5 5 13063040410 Clayton County    49719 0.11468019     41.00000          1
## 6 6 13063040411 Clayton County    57924 0.09068942     18.00000          2
##     pop avg_price  pct_white hhincome_log review_count_log pct_pov_log yelp_n
## 1  2850         1 0.07508772     10.41289         4.060443   -1.554276      1
## 2  4262         1 0.26067574     10.25527         1.975622   -1.510869      2
## 3  4046         1 0.20514088     10.80529         3.320837   -2.134911      3
## 4  8489         1 0.16868889     10.70461         3.044522   -1.655709      1
## 5  7166         1 0.19369244     10.81434         3.737670   -2.082003      1
## 6 13311         1 0.16512659     10.96706         2.944439   -2.295715      1

3. Boxplot: average rating vs household income

Next, let’s create a boxplot that explains the relationship between the average rating of POIs and household income. From this boxplot, we can observe that as the median household income increases, the average rating also increases until it reaches an average rating of 5. However, the trend then goes downward for average ratings above 5. This gives us the curiosity to look at the relationship between the two variables at the county level, so that we can explore whether median household income actually correlates with POI attractiveness in county level.

ggplot(coffee, aes(x=as.factor(avg_rating), y=hhincome)) +
  geom_boxplot() +
  labs(title = "Boxplot: Average Rating vs Household Income",
       x = "Average Rating",
       y = "Household Income") 

4. Boxplot: average rating vs household income by county

From this boxplot, we can observe that there is no clear correlation between median household income and the average rating for two counties, Clayton and Cobb. On the other hand, the remaining three counties display a similar trend that we could observe from the first boxplot. This result encourages us to explore other neighborhood characteristics as potential correlating factors with POI attractiveness.

ggplot(coffee, aes(x=as.factor(avg_rating), y=hhincome)) +
  geom_boxplot() +
  facet_wrap(~county) +
  labs(title = "Boxplot: Average Rating vs Household Income by County",
       x = "Average Rating",
       y = "Household Income") 

5. Scatterplot: review counts vs % of white residents

From this scatterplot, we can observe that: 1) there is a clear correlation between the % of white residents and median household income, while 2) the % of white residents and review counts exhibit a positive correlation, with the exception of Clayton County. Clayton County may not be a good counterexample to the hypothesis of a positive correlation between the % of white residents and review counts, as a limited number of census tracts with over 50% white residents are shown. This result encourages us to explore the coefficients of various socio-economic variables in relation to review counts to determine which variable is the strongest correlating factor with POI attractiveness.

ggplot(coffee, aes(x = review_count_log, y=hhincome, color=pct_white)) +
  geom_point(alpha = 0.4, size = 2) +
  facet_wrap(~county, ncol = 2) +
  labs(title = "Scatterplot: Review Count vs Household Income",
       x = "Review Count (Log)",
       y = "Household Income",
       color = "% of White") +
  scale_color_gradient(low = "purple", high = "green")

6. Scatterplot: review counts vs socio-economic variables

From the scatterplot, we can observe that, at the level of p=0.05, there is a significant and positive relationship (R=0.28) between the % of white residents and the review counts of coffee shops. Also, at the level of p=0.05, % of residents under poverty and the review counts of coffee shops show a week, negative relationship (R=-0.19). In county level, Dekalb and Fulton County show relatively stronger positive/negative relationships, while Gwinett, Cobb, and Clayton shows relatively weaker positive/negative relationships. On the other hand, at the level of p=0.05 and p=0.1, median household income and total population have no statistically significant relationships with the review counts.

scatplot <- coffee %>% # Reshape the data using pivot_longer()
  pivot_longer(cols = c(hhincome, pct_pov_log, pct_white, pop), 
               names_to = "variable", 
               values_to = "value")

get_stats <- function(data, x_var, y_vars) { # Function: R and p-value for multiple variables
  sapply(y_vars, function(y_var) {
    model <- lm(as.formula(paste(y_var, "~", x_var)), data = data)
    r <- round(cor(data[[x_var]], data[[y_var]]), 2)
    p_val <- round(summary(model)$coefficients[2, 4], 2)
    paste("R =", r, "\np =", p_val)
  })
}

stats_df <- data.frame( # Calculate statistics for each variable
  variable = c("pct_pov_log", "hhincome", "pct_white", "pop"),
  stats = get_stats(coffee, "review_count_log", c("pct_pov_log", "hhincome", "pct_white", "pop")),
  y = rep(Inf, 4)  # Define annotating position
)

ggplot(scatplot, aes(x = review_count_log, y = value, color = county)) + # Create scatter plot
  geom_point(size = 1) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 0.8) +
  facet_wrap(~ variable, scales = "free_y") +
  labs(
    x = "Review Count (log)", 
    y = "Value", 
    title = "Scatterplot between Logged Review Count & Neighborhood Characteristics",
    subtitle = "Using Yelp Data in Five Counties Around Atlanta, GA"
  ) +
  geom_text(data = stats_df, aes(x = 0.7, y = y, label = stats), 
            vjust = 1.0, size = 2, inherit.aes = FALSE)  # Annotate stats to plot
## `geom_smooth()` using formula = 'y ~ x'