Introduction
This assignment analyzes points of interest (POIs) in five counties of Georgia, USA (Fulton, Cobb, DeKalb, Clayton, and Gwinnett) and examines how they are associated with the socio-economic characteristics of their neighborhoods. The analysis uses Yelp POI data and the 2019 ACS 5-year estimates.
Load Packages
library(tidyverse)
library(knitr)
library(skimr)
library(units)
library(ggplot2)
library(ggpubr)
Load Data
df <- read.csv("https://ujhwang.github.io/urban-analytics-2024/Assignment/mini_4/coffee.csv")
skim(df)
Name | df |
Number of rows | 363 |
Number of columns | 14 |
_______________________ | |
Column type frequency: | |
character | 1 |
numeric | 13 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
county | 0 | 1 | 11 | 15 | 0 | 5 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
X | 0 | 1.00 | 1.820000e+02 | 104.93 | 1.000000e+00 | 9.150000e+01 | 1.8200e+02 | 2.725000e+02 | 3.630000e+02 | ▇▇▇▇▇ |
GEOID | 0 | 1.00 | 1.310284e+10 | 26584603.18 | 1.306304e+10 | 1.306703e+10 | 1.3121e+10 | 1.312101e+10 | 1.313505e+10 | ▃▃▁▁▇ |
hhincome | 0 | 1.00 | 7.831331e+04 | 36236.17 | 1.248500e+04 | 5.034400e+04 | 7.1482e+04 | 9.604200e+04 | 2.361490e+05 | ▆▇▃▁▁ |
pct_pov | 0 | 1.00 | 1.300000e-01 | 0.10 | 1.000000e-02 | 6.000000e-02 | 1.1000e-01 | 1.900000e-01 | 7.700000e-01 | ▇▃▁▁▁ |
review_count | 0 | 1.00 | 6.576000e+01 | 98.08 | 1.000000e+00 | 2.225000e+01 | 4.0250e+01 | 7.150000e+01 | 1.326000e+03 | ▇▁▁▁▁ |
avg_rating | 0 | 1.00 | 3.090000e+00 | 1.04 | 1.000000e+00 | 2.000000e+00 | 3.0000e+00 | 4.000000e+00 | 5.000000e+00 | ▁▇▅▇▁ |
pop | 0 | 1.00 | 6.364500e+03 | 3322.49 | 1.254000e+03 | 4.170500e+03 | 5.8790e+03 | 7.914000e+03 | 2.639900e+04 | ▇▅▁▁▁ |
avg_price | 39 | 0.89 | 1.350000e+00 | 0.48 | 1.000000e+00 | 1.000000e+00 | 1.0000e+00 | 2.000000e+00 | 2.000000e+00 | ▇▁▁▁▅ |
pct_white | 0 | 1.00 | 4.900000e-01 | 0.26 | 0.000000e+00 | 3.200000e-01 | 5.1000e-01 | 7.100000e-01 | 9.600000e-01 | ▅▅▇▇▅ |
hhincome_log | 0 | 1.00 | 1.117000e+01 | 0.46 | 9.430000e+00 | 1.083000e+01 | 1.1180e+01 | 1.147000e+01 | 1.237000e+01 | ▁▂▇▇▂ |
review_count_log | 0 | 1.00 | 3.430000e+00 | 1.01 | 6.900000e-01 | 2.920000e+00 | 3.4800e+00 | 3.990000e+00 | 7.190000e+00 | ▁▅▇▂▁ |
pct_pov_log | 0 | 1.00 | -2.190000e+00 | 0.72 | -3.990000e+00 | -2.720000e+00 | -2.1500e+00 | -1.620000e+00 | -2.500000e-01 | ▂▆▇▆▁ |
yelp_n | 0 | 1.00 | 2.540000e+00 | 2.20 | 1.000000e+00 | 1.000000e+00 | 2.0000e+00 | 3.000000e+00 | 1.900000e+01 | ▇▁▁▁▁ |
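The summary above hints at how the logged columns were built: the minimum of `review_count` is 1 while the minimum of `review_count_log` is 0.69, which is approximately log(2), so the column may have been computed as log(review_count + 1). This is an inference from the table, not something the data source states. A minimal sketch for checking it, where `matches_log1p()` is a hypothetical helper and the vectors are made-up values:

```r
# Hypothetical helper: TRUE if `logged` equals log(raw + 1) within a tolerance.
matches_log1p <- function(raw, logged, tol = 1e-6) {
  all(abs(log1p(raw) - logged) < tol)
}

# Made-up review counts for illustration only (not taken from coffee.csv).
raw_counts <- c(1, 40, 1326)

check_log1p <- matches_log1p(raw_counts, log1p(raw_counts))  # built with log(x + 1)
check_plain <- matches_log1p(raw_counts, log(raw_counts))    # built with log(x)
print(c(log1p = check_log1p, plain_log = check_plain))
```

On the assignment data the call would be `matches_log1p(df$review_count, df$review_count_log)`; a larger `tol` may be needed if the stored values were rounded.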
1. Box Plot - Average Rating vs Median Household Income
ggplot(data = df) +
  geom_boxplot(aes(x = avg_rating, y = hhincome, group = avg_rating),
               color = "black", fill = "white") +
  labs(
    x = "Average Yelp Rating",
    y = "Median Annual Household Income ($)",
    title = "Box Plot - Average Yelp Rating VS Median Annual Household Income"
  ) +
  theme_minimal()
The box plot suggests that businesses rated 3 to 4 stars on Yelp tend to be located in higher-income neighborhoods, with 4-star businesses most strongly associated with affluent areas. The lowest-rated businesses (1 star) tend to be located in lower-income neighborhoods. Notably, 5-star businesses also appear in lower-income neighborhoods, which suggests that exceptional business quality is not confined to areas with high median household income.
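The reading of the box plot can be cross-checked numerically by computing the median income within each rating group. The sketch below is a minimal illustration in base R: `median_income_by_rating()` is a hypothetical helper, and `toy` is made-up data standing in for `df`, which has the same `avg_rating` and `hhincome` columns.

```r
# Hypothetical helper: median household income for each Yelp rating level.
median_income_by_rating <- function(data) {
  out <- aggregate(hhincome ~ avg_rating, data = data, FUN = median)
  out[order(out$avg_rating), ]
}

# Made-up values for illustration only (not taken from coffee.csv).
toy <- data.frame(
  avg_rating = c(1, 1, 3, 3, 4, 4, 5),
  hhincome   = c(40000, 50000, 70000, 90000, 95000, 105000, 60000)
)

income_by_rating <- median_income_by_rating(toy)
print(income_by_rating)
```

Calling `median_income_by_rating(df)` on the assignment data would return one row per rating level, which can be compared directly with the medians drawn in the box plot.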
2. Box Plot - Average Rating vs Median Household Income by County
ggplot(data = df) +
  geom_boxplot(aes(x = avg_rating, y = hhincome, group = avg_rating),
               color = "black") +
  labs(
    x = "Average Yelp Rating",
    y = "Median Annual Household Income ($)",
    title = "Box Plot - Average Yelp Rating VS Median Annual Household Income by County"
  ) +
  facet_wrap(~county) +
  theme_minimal() +
  theme(strip.text = element_text(face = "bold"))
In the box plots of median annual household income against average Yelp rating by county, Cobb, DeKalb, and Fulton Counties show a positive association between rating (2 to 4 stars) and median household income. In contrast, 5-star businesses are unexpectedly associated with lower-income neighborhoods.
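One way to put a number on the per-county pattern is a rank correlation between rating and income within each county, restricted to the 2 to 4 star range discussed above. The following is only a sketch under stated assumptions: `rating_income_cor_by_county()` is a hypothetical helper, Spearman correlation is a choice made here rather than a method used in the analysis, and `toy` contains made-up values with placeholder county names.

```r
# Hypothetical helper: Spearman rank correlation between rating and income,
# computed separately for each county and restricted to the given ratings.
rating_income_cor_by_county <- function(data, ratings = 2:4) {
  kept <- data[data$avg_rating %in% ratings, ]
  groups <- split(kept, kept$county)
  vapply(groups, function(g) {
    cor(g$avg_rating, g$hhincome, method = "spearman")
  }, numeric(1))
}

# Made-up values for illustration only; the 5-star row is filtered out.
toy <- data.frame(
  county     = c(rep("County A", 4), rep("County B", 3)),
  avg_rating = c(2, 3, 4, 5, 2, 3, 4),
  hhincome   = c(50000, 60000, 70000, 30000, 70000, 60000, 50000)
)

rho <- rating_income_cor_by_county(toy)
print(rho)
```

With the assignment data, `rating_income_cor_by_county(df)` would return one coefficient per county; values near zero would indicate that the visual trend in a facet is weak.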
3. Scatter Plot - Review Count vs Median Household Income by County
ggplot(data = df, aes(x = review_count_log, y = hhincome, color = pct_white)) +
  geom_point(alpha = 0.7, size = 2) +
  labs(x = "Review Count (log)",
       y = "Median Annual Household Income",
       color = "Proportion of residents who\nself-identified as white",
       title = "Scatter Plot - Review Count vs Median Household Income by County") +
  scale_color_gradient(low = "darkblue", high = "red") +
  facet_wrap(~county) +
  theme_minimal() +
  theme(strip.text = element_text(face = "bold"),
        legend.text = element_text(size = 7),
        legend.title = element_text(size = 8))
The scatter plot shows the number of reviews (log-transformed) against median household income, with color representing the proportion of white residents. The distribution of the data points indicates that lower-income neighborhoods tend to have a lower proportion of white residents (the blue-to-purple points at the low end of the color scale).
DeKalb and Fulton Counties show greater variation in the proportion of white residents than the other counties. Clayton County stands out as a predominantly non-white area.
In DeKalb County, lower review counts and lower household incomes coincide with a lower proportion of white residents (blue-to-purple points). In Fulton County, the pattern is less distinct, with lower review counts spread across both lower- and higher-income areas. In Clayton and Gwinnett Counties, there is less variation in both review count and income, with most businesses located in lower-income neighborhoods. In Cobb County, incomes and review counts are more widely distributed, but no clear trend or association is visible.
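The statement that some counties show greater variation in the proportion of white residents can be checked with a per-county standard deviation. This is a minimal sketch: `spread_by_county()` is a hypothetical helper, and `toy` holds made-up values with placeholder county names.

```r
# Hypothetical helper: standard deviation of one column within each county.
spread_by_county <- function(data, column) {
  vapply(split(data[[column]], data$county), sd, numeric(1))
}

# Made-up values for illustration only (not taken from coffee.csv).
toy <- data.frame(
  county    = c(rep("County A", 3), rep("County B", 3)),
  pct_white = c(0.1, 0.5, 0.9, 0.2, 0.2, 0.2)
)

spread <- spread_by_county(toy, "pct_white")
print(spread)
```

On the assignment data, `spread_by_county(df, "pct_white")` and `spread_by_county(df, "hhincome")` would give the corresponding figures for racial composition and income.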
4. Scatter Plot - Review Count vs Neighborhood Characteristics
# Make long format dataframe
df_long <- df %>%
  pivot_longer(cols = c(hhincome, pct_white, pct_pov_log, pop),
               names_to = "variable",
               values_to = "value")
# Calculate R and p-values for each variable (one cor.test() call per variable)
vars <- c("hhincome", "pct_white", "pct_pov_log", "pop")
tests <- lapply(vars, function(v) cor.test(df$review_count_log, df[[v]]))
cor_results <- data.frame(
  variable = vars,
  R = sapply(tests, function(ct) unname(ct$estimate)),
  p_value = sapply(tests, function(ct) ct$p.value)
)
# Create an annotation for R and p-value
p_digits <- c(3, 9, 5, 3) # decimal places shown for each p-value
annotations <- data.frame(
  variable = vars,
  label = paste("R =", round(cor_results$R, 2), ", p =", round(cor_results$p_value, p_digits)),
  x = 2.2,
  y = c(250000, 1, -0.5, 25000)
)
# Create scatterplot and manually add the R and p-values
ggplot(df_long, aes(x = review_count_log, y = value, color = county)) +
  geom_point(alpha = 0.9, size = 0.85) +
  # `linewidth` replaces `size` for lines as of ggplot2 3.4.0
  geom_smooth(method = "lm", se = FALSE, linewidth = 0.75) +
  facet_wrap(~variable, scales = "free_y", ncol = 2, labeller = labeller(variable = c(
    hhincome = "Median Annual Household Income ($)",
    pct_white = "Percent White Residents",
    pct_pov_log = "Percent Residents Under Poverty (log)",
    pop = "Total Population"
  ))) + # Faceting by the variable
  labs(x = "Review Count Logged", y = "Values",
       title = "Scatter Plot between Logged Review Count & Neighborhood Characteristics",
       subtitle = "Using Yelp Data in Five Counties Around Atlanta, GA") +
  theme_minimal() +
  scale_color_manual(
    values = c(
      "Clayton County" = "#4E79A7", # Using Tableau theme colors!
      "Cobb County" = "#F28E2B",
      "DeKalb County" = "#59A14F",
      "Fulton County" = "#E15759",
      "Gwinnett County" = "#B07AA1"
    )
  ) +
  theme(strip.text = element_text(face = "bold", size = 10)) +
  geom_text(data = annotations, aes(x = x, y = y, label = label), inherit.aes = FALSE, size = 3, color = "black")
## `geom_smooth()` using formula = 'y ~ x'
The scatter plots show review counts (log-transformed) against four socio-economic characteristics: median household income, percent of residents under poverty (log-transformed), percent of white residents, and total population.
There is a slight tendency for businesses in wealthier areas (with lower poverty rates) to have more reviews, especially in DeKalb County (green). The proportion of white residents shows a stronger positive correlation with review count than income or poverty does, especially in Fulton (red) and DeKalb (green) Counties, where the trend lines are steepest. Total population, by contrast, does not appear to be related to the number of reviews a business receives.
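The R values annotated on the plot are pooled over all five counties, whereas the trend lines are fitted per county. To compare counties on the same footing, the correlation can be computed within each county. The sketch below assumes Pearson correlation (the default of `cor()`); `cor_by_county()` is a hypothetical helper and `toy` is made-up data with placeholder county names.

```r
# Hypothetical helper: Pearson correlation between column `x` and each column
# in `vars`, computed within each county. Rows are counties, columns are vars.
cor_by_county <- function(data, x, vars) {
  groups <- split(data, data$county)
  do.call(rbind, lapply(groups, function(g) {
    vapply(vars, function(v) cor(g[[x]], g[[v]]), numeric(1))
  }))
}

# Made-up values for illustration only (not taken from coffee.csv).
toy <- data.frame(
  county           = c(rep("County A", 3), rep("County B", 3)),
  review_count_log = c(1, 2, 3, 1, 2, 3),
  pct_white        = c(0.2, 0.4, 0.6, 0.5, 0.3, 0.1),
  pop              = c(3000, 2000, 1000, 1000, 2000, 3000)
)

county_cor <- cor_by_county(toy, "review_count_log", c("pct_white", "pop"))
print(county_cor)
```

For the assignment data, the equivalent call would be `cor_by_county(df, "review_count_log", c("hhincome", "pct_white", "pct_pov_log", "pop"))`, giving a five-by-four matrix that mirrors the facets of the plot.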
Conclusion
The analysis of POIs and their association with socio-economic characteristics reveals several key trends. Businesses rated 3 to 4 stars on Yelp are generally located in higher-income neighborhoods, particularly in Cobb, DeKalb, and Fulton Counties, while 5-star businesses are unexpectedly found in lower-income areas. Lower-income neighborhoods tend to have a lower proportion of white residents, as seen in DeKalb and Fulton Counties, while Clayton County is predominantly non-white and shows less variation in both income and review count. The proportion of white residents is positively correlated with the number of business reviews, especially in Fulton and DeKalb Counties, indicating that businesses in predominantly white areas tend to receive more reviews. In contrast, the total population of a neighborhood shows no clear association with review counts.