In this assignment, I recreated four plots using the ggplot2 package.
library(tidyverse)
library(scales) # for number formatting if needed
coffee <- read_csv("C:/Intro to UA/coffee.csv") # single forward slashes are enough in R paths
# Quick check
glimpse(coffee)
summary(coffee)
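Before plotting, it can help to confirm that every column the four plots rely on is present. The block below is a minimal sketch: the column names are taken from the plotting code in this report, while `check_columns()` and the `toy` data frame are hypothetical and exist only so the example runs by itself.

```r
# Hypothetical helper: returns the required column names that are missing from df.
check_columns <- function(df, required) {
  setdiff(required, names(df))
}

# Column names used by the plotting code in this report.
required <- c("avg_rating", "hhincome", "county", "review_count_log",
              "pct_white", "pct_pov_log", "pop")

# Toy stand-in for coffee (made-up values) so the sketch is self-contained.
toy <- data.frame(avg_rating = 4, hhincome = 50000, county = "Fulton County")

missing_cols <- check_columns(toy, required)
print(missing_cols)
```

With the real data, `stopifnot(length(check_columns(coffee, required)) == 0)` would stop the script early if a column were renamed or dropped.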
This plot shows how median household income (hhincome) varies across census tracts with different average coffee shop ratings (avg_rating).
# One box per rating level; all counties are pooled (no faceting)
ggplot(coffee, aes(x = factor(avg_rating), y = hhincome)) +
  geom_boxplot(color = "black", fill = "white", outlier.shape = 19) +
  theme_gray(base_size = 14) +
  labs(
    x = "Average Rating",
    y = "Median Annual Household Income ($)"
  )
1. Tracts with coffee shops rated 4 and above generally have higher median household incomes.
2. The lowest-rated shops (rating 1–2) are more common in lower-income tracts.
3. The spread of income increases with higher ratings, suggesting that highly rated shops exist across a wider range of neighborhood incomes, but mostly in more affluent areas.
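The pattern read off the boxplot can also be checked numerically. This is a minimal sketch, assuming `avg_rating` and `hhincome` are the columns used in the plot; the `toy` tibble holds made-up values only so the block runs by itself and should be swapped for `coffee`.

```r
library(dplyr)

# Made-up rows standing in for coffee; replace toy with coffee for real results.
toy <- tibble(
  avg_rating = c(2, 2, 3, 3, 4, 4, 5),
  hhincome   = c(30000, 40000, 45000, NA, 70000, 90000, 80000)
)

# Median and interquartile range of tract income at each rating level.
income_by_rating <- toy %>%
  group_by(avg_rating) %>%
  summarise(
    n_tracts      = n(),
    median_income = median(hhincome, na.rm = TRUE),
    iqr_income    = IQR(hhincome, na.rm = TRUE),
    .groups = "drop"
  )

print(income_by_rating)
```

The median column corresponds to the center line of each box and the IQR column to the box height, so the table gives a direct check on the first and third observations.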
This faceted plot breaks down the relationship between coffee shop ratings and household income across the five Metro Atlanta counties: Clayton, Cobb, DeKalb, Fulton, and Gwinnett.
# Same boxplot, split into one panel per county
print(
  ggplot(coffee, aes(x = factor(avg_rating), y = hhincome)) +
    geom_boxplot(color = "black", fill = "white", outlier.shape = 19) +
    facet_wrap(~ county) +
    theme_gray(base_size = 14) +
    labs(
      x = "avg_rating",
      y = "hhincome"
    )
)
1. Fulton and Cobb counties show higher median incomes and greater variability across ratings, reflecting economic diversity within these urban areas.
2. Clayton County exhibits consistently lower incomes across all rating levels.
3. The pattern of higher income aligning with higher ratings remains consistent, though the effect is more pronounced in certain counties (notably Fulton and DeKalb).
This scatterplot compares median household income and review count (log scale) for coffee shops, colored by the proportion of White residents (pct_white), with separate panels for each county.
# ggplot2 and dplyr are already attached by library(tidyverse)
plot3 <- ggplot(coffee, aes(
  x = review_count_log,
  y = hhincome,
  color = pct_white
)) +
  geom_point(size = 1.8, alpha = 0.8) +
  facet_wrap(~ county, ncol = 3) +
  scale_color_gradient(
    name = "Proportion of residents\nwho self-identified as white",
    low = "blue",
    high = "red",
    # Cap the legend at 0.75 to match the target figure. Note that tracts with
    # pct_white outside these limits are drawn in grey (the default na.value).
    limits = c(0, 0.75),
    breaks = seq(0, 0.75, 0.25)
  ) +
  labs(
    title = "Scatterplot: Review Count vs. Household Income",
    x = "Review Count (log)",
    y = "Median Annual Household Income ($)"
  ) +
  # theme_light() gives the grid lines and background of the target figure
  theme_light(base_size = 8) +
  theme(
    # Left-align the title and add space below it to push the facets down
    plot.title = element_text(size = 6, face = "plain", hjust = 0, margin = margin(b = 5)),
    # Facet strip styling
    strip.background = element_rect(fill = "lightgray", color = "darkgray"),
    strip.text = element_text(size = 5, face = "plain", color = "black"),
    # Axis titles and tick labels
    axis.title = element_text(size = 5),
    axis.text = element_text(size = 4),
    # Legend
    legend.position = "right",
    legend.title = element_text(size = 5),
    legend.text = element_text(size = 4),
    # Tight outer margins so the title starts near the edge
    plot.margin = margin(t = 2, r = 2, b = 2, l = 2),
    panel.spacing = unit(0.5, "lines")
  )
print(plot3)
1. Higher-income tracts tend to have coffee shops with more reviews, indicating stronger customer engagement or higher foot traffic.
2. Tracts with a higher share of White residents (shown in red) are generally clustered toward higher income and review counts.
3. This suggests potential socioeconomic and racial concentration of popular coffee shop activity, especially visible in Fulton and Cobb counties.
Plot 4 presents a series of scatterplots exploring the relationships between logged review count of coffee shops (a proxy for local business activity and online engagement) and four key neighborhood characteristics across Census tracts in Fulton, DeKalb, Cobb, Gwinnett, and Clayton Counties. Each point represents a Census tract, colored by county, and linear regression lines summarize county-level trends. The analysis uses logged review counts to reduce skewness, allowing fairer comparisons between areas with few and many reviews.
The plots collectively show that Yelp review activity for coffee shops is not randomly distributed across neighborhoods: it is associated with income, poverty, and racial composition, but not with total population. This highlights underlying spatial and socioeconomic inequalities in how digital engagement reflects neighborhood conditions.
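The skew-reduction argument for logging review counts can be illustrated with simulated data. This is a sketch only: the counts below are randomly generated rather than drawn from the coffee data, `skewness()` is a hypothetical helper, and it is an assumption that `review_count_log` was produced with a plain natural log.

```r
set.seed(42)

# Hypothetical helper: moment-based sample skewness.
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Simulated right-skewed review counts (log-normal, rounded up to at least 1).
review_count <- ceiling(rlnorm(500, meanlog = 3, sdlog = 1))

skew_raw <- skewness(review_count)
skew_log <- skewness(log(review_count))

cat("Skewness of raw counts:   ", round(skew_raw, 2), "\n")
cat("Skewness of logged counts:", round(skew_log, 2), "\n")
```

A skewness near zero after the transform means a handful of heavily reviewed shops no longer dominate the x-axis, which is what lets the fitted lines reflect the bulk of the tracts.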
library(tidyverse) # attaches ggplot2, dplyr, and tidyr

# 1. Reshape to long format: one row per tract and neighborhood variable
coffee_long <- coffee %>%
  pivot_longer(
    cols = c(hhincome, pct_pov_log, pct_white, pop),
    names_to = "variable",
    values_to = "value"
  ) %>%
  mutate(
    # Replace column names with readable facet titles
    variable = recode(variable,
      hhincome = "Median Annual Household Income ($)",
      pct_pov_log = "Residents Under Poverty Level (%; log)",
      pct_white = "Residents who self-identify as White (%)",
      pop = "Total Population"
    )
  )

# 2. Correlation labels for annotation (plotmath syntax, one per facet).
#    The values are hard-coded and listed in the same order as the facets.
corr_labels <- tibble(
  variable = unique(coffee_long$variable),
  label = c(
    "italic(R) == 0.16*','~~italic(p) == 6*e-04",
    "italic(R) == -0.12*','~~italic(p) == 0.014",
    "italic(R) == 0.24*','~~italic(p) == 6.7*e-07",
    "italic(R) == -0.031*','~~italic(p) == 0.52"
  )
)

# 3. Create the plot using facet_wrap()
plot4 <- ggplot(coffee_long, aes(x = review_count_log, y = value, color = county)) +
  geom_point(alpha = 0.6, size = 1.5) +
  # One linear fit per county; linewidth replaces the deprecated size argument
  # for lines (ggplot2 >= 3.4.0)
  geom_smooth(method = "lm", se = FALSE, linewidth = 1.2) +
  facet_wrap(~ variable, scales = "free_y") +
  # Add R and p-value labels in the top-left corner of each panel
  geom_text(
    data = corr_labels,
    aes(x = -Inf, y = Inf, label = label),
    parse = TRUE,
    hjust = -0.15, vjust = 1.2,
    size = 2.5,
    color = "black"
  ) +
  labs(
    title = "Scatterplot between logged review count & neighborhood characteristics",
    x = "Review Count (log)",
    y = "Values",
    color = "County"
  ) +
  scale_color_manual(
    # Default ggplot2 hues, fixed per county
    values = c(
      "Clayton County" = "#F8766D", # Light red
      "Cobb County" = "#7CAE00", # Olive green
      "DeKalb County" = "#00BFC4", # Teal
      "Fulton County" = "#619CFF", # Blue
      "Gwinnett County" = "#C77CFF" # Purple
    )
  ) +
  # theme_light() gives the grid lines and background of the target figure
  theme_light(base_size = 8) +
  theme(
    # Left-align the title and add space below it
    plot.title = element_text(size = 6, face = "plain", hjust = 0, margin = margin(t = 0, b = 10)),
    # Facet strip styling
    strip.background = element_rect(fill = "lightgray", color = "darkgray"),
    strip.text = element_text(size = 5, face = "bold"),
    # Space between each axis title and its axis
    axis.title.x = element_text(margin = margin(t = 5)),
    axis.title.y = element_text(margin = margin(r = 5)),
    legend.position = "right",
    legend.box.margin = margin(t = 100), # Push the legend down
    panel.grid.minor = element_blank(),
    plot.margin = margin(t = 5, r = 5, b = 2, l = 2) # Tight outer margins
  )
print(plot4)
1. Median Annual Household Income ($)
There is a weak positive correlation (R = 0.16, p = 6e-04) between logged review count and median household income. Neighborhoods with higher incomes tend to have slightly more reviews, suggesting that wealthier areas have greater online visibility or engagement with coffee shops.
2. Residents Under Poverty Level (%; log)
A weak negative relationship (R = -0.12, p = 0.014) exists between logged review count and the logged poverty rate. This indicates that areas with higher poverty rates generally see fewer reviews, implying less Yelp engagement or fewer businesses in lower-income neighborhoods.
3. Residents Who Self-Identify as White (%)
This variable shows the strongest correlation of the four, although it is still weak (R = 0.24, p = 6.7e-07). Census tracts with a higher proportion of White residents are associated with greater review activity, suggesting possible demographic disparities in both business density and online participation.
4. Total Population
The relationship between population size and logged review count is not statistically significant (R = -0.031, p = 0.52). This implies that larger populations do not necessarily correspond to more Yelp activity, reinforcing that social and economic context matters more than population size alone.
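The R and p values quoted in this list are hard-coded in `corr_labels`. One way to recompute them from the data is sketched below, assuming they are Pearson correlations over all tracts pooled together (the report does not state which method produced them). `cor_summary()` and the `toy` data frame are hypothetical, with made-up values so the block runs by itself.

```r
# Hypothetical helper: Pearson correlation and p-value between two numeric vectors.
cor_summary <- function(x, y) {
  test <- cor.test(x, y, method = "pearson")
  c(R = unname(test$estimate), p = test$p.value)
}

# Toy stand-in for coffee with made-up values.
set.seed(1)
toy <- data.frame(review_count_log = rnorm(200))
toy$hhincome <- 60000 + 8000 * toy$review_count_log + rnorm(200, sd = 20000)
toy$pop      <- rnorm(200, mean = 4000, sd = 1000)

# One row of R and p per neighborhood variable.
vars <- c("hhincome", "pop")
results <- t(sapply(vars, function(v) cor_summary(toy$review_count_log, toy[[v]])))
print(signif(results, 2))
```

With the real data, replacing `toy` with `coffee` and extending `vars` to include `pct_pov_log` and `pct_white` would give a table to compare against the annotated values.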
Overall, the results suggest that online engagement with local businesses (as measured by Yelp review activity) is unevenly distributed across socioeconomic and demographic lines. Wealthier and predominantly White neighborhoods show higher activity levels, while poorer areas exhibit lower participation. These findings illustrate how digital traces of urban life can mirror and even amplify existing patterns of urban inequality.
This analysis reveals that digital engagement, as measured through Yelp coffee shop reviews, is far from evenly distributed across the Atlanta metropolitan area. While one might expect population size to drive online review activity, the data show otherwise: review activity is shaped more by income and demographic composition than by the sheer number of residents. Wealthier and predominantly White neighborhoods tend to exhibit greater Yelp activity, reflecting higher business visibility, consumer participation, and possibly greater access to digitally connected amenities.
Conversely, tracts with higher poverty rates show markedly lower review counts, pointing to both economic and digital divides. The weak but consistent correlations between review activity and socioeconomic indicators underscore how online platforms can serve as proxies for offline inequalities. In essence, Yelp review patterns not only reflect but may reinforce the uneven geography of opportunity, visibility, and participation across urban neighborhoods.
In the broader urban analytics context, this study highlights the potential and the bias of using volunteered digital data as indicators of neighborhood vitality. It underscores the importance of combining such digital metrics with socioeconomic datasets to interpret them critically and equitably.