In this assignment, I recreated four plots using the ggplot2 package.
library(tidyverse)
library(scales) # for number formatting if needed
coffee <- read_csv("C://Intro to UA//coffee.csv")
# Quick check
glimpse(coffee)
summary(coffee)
This plot shows how median household income (hhincome) varies across census tracts with different average coffee shop ratings (avg_rating).
# Boxplot of tract-level income by average rating (all counties pooled)
ggplot(coffee, aes(x = factor(avg_rating), y = hhincome)) +
  geom_boxplot(color = "black", fill = "white", outlier.shape = 19) +
  theme_gray(base_size = 14) +
  labs(
    x = "Average Rating",
    y = "Median Annual Household Income ($)"
  )
1. Tracts with coffee shops rated 4 and above generally have higher median household incomes.
2. The lowest-rated shops (ratings 1–2) are more common in lower-income tracts.
3. The spread of income widens at higher ratings: highly rated shops appear across a wider range of neighborhood incomes, though mostly in more affluent areas.
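These boxplot readings can be checked numerically by summarizing income at each rating level. The following is a minimal sketch, assuming a small data frame `toy` that stands in for the real `coffee` data. The column names follow the plot code, but the values are invented purely for illustration and are not results from the assignment.

```r
# Toy stand-in for `coffee`; the values are invented for illustration only.
toy <- data.frame(
  avg_rating = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5),
  hhincome   = c(30000, 34000, 38000, 45000, 52000, 60000,
                 70000, 95000, 80000, 120000)
)

# Median and interquartile range of income for each rating level:
# the centre line and the box height that geom_boxplot() draws.
med <- tapply(toy$hhincome, toy$avg_rating, median)
iqr <- tapply(toy$hhincome, toy$avg_rating, IQR)

income_by_rating <- data.frame(
  avg_rating = names(med),
  median     = as.numeric(med),
  iqr        = as.numeric(iqr)
)
print(income_by_rating)
```

On the real data, the same two `tapply()` calls applied to `coffee$hhincome` and `coffee$avg_rating` would show whether the median and the spread actually rise with rating, as the boxes suggest.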
This faceted plot breaks down the relationship between coffee shop ratings and household income across the five Metro Atlanta counties: Clayton, Cobb, DeKalb, Fulton, and Gwinnett.
# Same boxplot, split into one panel per county
ggplot(coffee, aes(x = factor(avg_rating), y = hhincome)) +
  geom_boxplot(color = "black", fill = "white", outlier.shape = 19) +
  facet_wrap(~ county) +
  theme_gray(base_size = 14) +
  labs(
    x = "avg_rating",
    y = "hhincome"
  )
1. Fulton and Cobb counties show higher median incomes and greater variability across ratings, reflecting economic diversity within these urban areas.
2. Clayton County exhibits consistently lower incomes across all rating levels.
3. The pattern of higher income aligning with higher ratings remains consistent, though the effect is more pronounced in certain counties (notably Fulton and DeKalb).
This scatterplot compares median household income and review count (log scale) for coffee shops, colored by the proportion of White residents (pct_white), with separate panels for each county.
ggplot(coffee, aes(
  x = review_count_log,
  y = hhincome,
  color = pct_white
)) +
  geom_point(size = 5, alpha = 0.7) +
  facet_wrap(~ county) +
  scale_color_gradient(
    name = "Proportion of residents\nwho self-identified as white",
    low = "blue",
    high = "red"
  ) +
  theme_gray(base_size = 14) +
  labs(
    title = "Scatterplot: Review Count vs. Household Income",
    x = "Review Count (log)",
    y = "Median Annual Household Income"
  )
1. Higher-income tracts tend to have coffee shops with more reviews, indicating stronger customer engagement or higher foot traffic.
2. Tracts with a higher share of White residents (shown in red) are generally clustered toward higher income and review counts.
3. This suggests potential socioeconomic and racial concentration of popular coffee shop activity, especially visible in Fulton and Cobb counties.
Plot 4 presents a series of scatterplots exploring the relationships between logged review count of coffee shops (a proxy for local business activity and online engagement) and four key neighborhood characteristics across Census tracts in Fulton, DeKalb, Cobb, Gwinnett, and Clayton Counties. Each point represents a Census tract, colored by county, and linear regression lines summarize county-level trends. The analysis uses logged review counts to reduce skewness, allowing fairer comparisons between areas with few and many reviews.
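To illustrate why logging reduces skewness, here is a minimal sketch on invented review counts. The dataset already ships a `review_count_log` column, and it is not stated whether that column was built with `log(x)` or `log(x + 1)`; the sketch assumes a plain natural log. The `skewness()` helper is hypothetical, written here only for the demonstration.

```r
# Toy review counts (invented): most tracts have few reviews, a few have many.
review_count <- c(2, 3, 5, 8, 12, 20, 35, 60, 150, 900)

# Simple moment-based sample skewness; 0 means a symmetric distribution.
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

skew_raw <- skewness(review_count)       # dominated by the single large count
skew_log <- skewness(log(review_count))  # large counts are pulled in

cat("Skewness of raw counts:   ", round(skew_raw, 2), "\n")
cat("Skewness of logged counts:", round(skew_log, 2), "\n")
```

On the log scale, a tract with 900 reviews no longer sits far from all the others, so it has less leverage on the fitted lines and correlations.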
The plots collectively show that Yelp review activity for coffee shops is not randomly distributed across neighborhoods: it is associated with income, poverty, and racial composition, but not with total population. This highlights underlying spatial and socioeconomic inequalities in how digital engagement reflects neighborhood conditions.
# Reshape the data loaded above into long format:
# one row per tract and neighborhood variable
coffee_long <- coffee %>%
  pivot_longer(
    cols = c(hhincome, pct_pov_log, pct_white, pop),
    names_to = "variable",
    values_to = "value"
  ) %>%
  mutate(
    # Replace column names with the panel titles used in the plot
    variable = recode(variable,
      hhincome = "Median Annual Household Income ($)",
      pct_pov_log = "Residents Under Poverty Level (%; log)",
      pct_white = "Residents who self-identify as White (%)",
      pop = "Total Population"
    )
  )
# Correlation labels, one per panel (R and p-values are hard-coded as plotmath strings)
corr_labels <- tibble(
  variable = c(
    "Median Annual Household Income ($)",
    "Residents Under Poverty Level (%; log)",
    "Residents who self-identify as White (%)",
    "Total Population"
  ),
  label = c(
    "italic(R) == 0.16*','~~italic(p) == 6*e-04",
    "italic(R) == -0.12*','~~italic(p) == 0.014",
    "italic(R) == 0.24*','~~italic(p) == 6.7*e-07",
    "italic(R) == -0.031*','~~italic(p) == 0.52"
  )
)
# Merge labels into the plotting dataframe
coffee_long <- coffee_long %>%
left_join(corr_labels, by = "variable")
# Create the plot
ggplot(coffee_long, aes(x = review_count_log, y = value, color = county)) +
  geom_point(alpha = 0.8, size = 2) +
  # One linear fit per county; `linewidth` replaces the deprecated `size`
  # argument for lines (ggplot2 >= 3.4.0)
  geom_smooth(method = "lm", se = FALSE, linewidth = 1) +
  facet_wrap(~ variable, scales = "free_y") +
  # Place each panel's correlation label in its top-left corner
  geom_text(
    data = corr_labels,
    aes(x = -Inf, y = Inf, label = label),
    parse = TRUE,
    hjust = -0.1, vjust = 1.5,
    size = 4.5, color = "black"
  ) +
  labs(
    title = "Scatterplot between logged review count & neighborhood characteristics",
    x = "Review Count (log)",
    y = "Values",
    color = "County"
  ) +
  scale_color_manual(
    values = c(
      "Clayton County" = "#F8766D",
      "Cobb County" = "#7CAE00",
      "DeKalb County" = "#00BFC4",
      "Fulton County" = "#619CFF",
      "Gwinnett County" = "#C77CFF"
    )
  ) +
  theme_bw(base_size = 14) +
  theme(
    legend.position = "right",
    strip.background = element_rect(fill = "grey85", color = "grey60"),
    panel.grid.minor = element_blank(),
    panel.grid.major = element_line(color = "grey90")
  )
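Because `color = county` defines a group, `geom_smooth(method = "lm")` fits a separate least-squares line for each county within each panel. The sketch below reproduces that idea outside ggplot2 for a single variable. It assumes a made-up data frame `toy` with two hypothetical counties; the numbers are invented so that the slopes are easy to verify by eye.

```r
# Toy data (invented): two counties with different income-review slopes.
toy <- data.frame(
  county = rep(c("County A", "County B"), each = 5),
  review_count_log = rep(1:5, times = 2),
  hhincome = c(40000, 45000, 50000, 55000, 60000,   # rises 5000 per unit
               70000, 72000, 74000, 76000, 78000)   # rises 2000 per unit
)

# Fit one least-squares line per county and keep intercept and slope,
# mirroring what geom_smooth(method = "lm") does for each colour group.
fits <- lapply(split(toy, toy$county), function(d) {
  coef(lm(hhincome ~ review_count_log, data = d))
})
county_lines <- do.call(rbind, fits)
print(county_lines)
```

Comparing the slope column across counties is one way to put numbers on statements such as "the effect is more pronounced in certain counties".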
1. Median Annual Household Income ($)
There is a weak positive correlation (R = 0.16, p = 6e-04) between logged review count and median household income. Neighborhoods with higher incomes tend to have slightly more reviews, suggesting that wealthier areas have greater online visibility or engagement with coffee shops.
2. Residents Under Poverty Level (%; log)
A weak negative relationship (R = -0.12, p = 0.014) exists between review count and poverty level. Areas with higher poverty rates tend to see slightly fewer reviews, which may reflect less Yelp engagement or fewer businesses in lower-income neighborhoods.
3. Residents Who Self-Identify as White (%)
This variable shows the strongest positive correlation of the four, although it is still weak (R = 0.24, p = 6.7e-07). Census tracts with a higher proportion of White residents are associated with greater review activity, suggesting possible demographic disparities in both business density and online participation.
4. Total Population
The relationship between population size and review count is not statistically significant (R = -0.031, p = 0.52). Larger populations do not necessarily correspond to more Yelp activity, which suggests that social and economic context matters more than population size alone.
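The R and p-values above are hard-coded in `corr_labels`. As a sketch of how such labels could be computed instead, the hypothetical helper `cor_label()` below runs `cor.test()` and formats the result as a plotmath string. It assumes the reported R is a Pearson correlation pooled across all counties, which the source does not state, and it uses randomly generated toy data rather than the real `coffee` data, so its output will not match the reported values.

```r
# Hypothetical helper: Pearson correlation and p-value as a plotmath label
# in the same style as the hard-coded strings in `corr_labels`.
cor_label <- function(x, y) {
  test <- cor.test(x, y, method = "pearson")
  sprintf(
    "italic(R) == %s*','~~italic(p) == %s",
    signif(unname(test$estimate), 2),
    signif(test$p.value, 2)
  )
}

# Toy data (randomly generated, for illustration only).
set.seed(42)
toy <- data.frame(review_count_log = rnorm(200))
toy$hhincome <- 60000 + 8000 * toy$review_count_log + rnorm(200, sd = 20000)
toy$pop <- rnorm(200, mean = 4000, sd = 1500)

# One label per neighborhood variable, named by variable.
labels <- sapply(c("hhincome", "pop"), function(v) {
  cor_label(toy$review_count_log, toy[[v]])
})
print(labels)
```

Applied to the real data, the same helper could be called once per neighborhood variable, and the resulting strings passed to `geom_text(..., parse = TRUE)` in place of the hard-coded ones.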
Overall, the results suggest that online engagement with local businesses (as measured by Yelp review activity) is unevenly distributed across socioeconomic and demographic lines. Wealthier and predominantly White neighborhoods show higher activity levels, while poorer areas show lower participation, although all of these correlations are weak (|R| ≤ 0.24). These findings illustrate how digital traces of urban life can mirror, and possibly amplify, existing patterns of urban inequality.
This analysis indicates that digital engagement, as measured through Yelp coffee shop reviews, is not evenly distributed across the Atlanta metropolitan area. While one might expect population size to drive online review activity, the data show otherwise: review activity is more closely associated with income and demographic composition than with the number of residents. Wealthier and predominantly White neighborhoods tend to exhibit greater Yelp activity, which may reflect higher business visibility, consumer participation, and possibly greater access to digitally connected amenities.
Conversely, tracts with higher poverty rates show somewhat lower review counts, pointing to both economic and digital divides. The weak but consistent correlations between review activity and socioeconomic indicators suggest that online platforms can serve as proxies for offline inequalities. In essence, Yelp review patterns reflect, and may reinforce, the uneven geography of opportunity, visibility, and participation across urban neighborhoods.
In the broader urban analytics context, this study highlights both the potential and the bias of using volunteered digital data as indicators of neighborhood vitality. It underscores the importance of combining such digital metrics with socioeconomic datasets to interpret them critically and equitably.