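#load packages used throughout: ggplot2 for plotting, dplyr and tidyr for reshaping
library(ggplot2)
library(dplyr)
library(tidyr)
#read coffee shop data
coffee <- read.csv("coffee.csv")
ggplot(data = coffee) +
  #plot boxplot
  geom_boxplot(aes(x = avg_rating, y = hhincome, group = avg_rating),
               fill = "white", color = "black") +
  #remove minor breaks
  scale_x_continuous(breaks = seq(1, 5, by = 1), minor_breaks = NULL)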
Across average ratings 1 to 5, coffee shops with an average rating of 4 are located in census tracts with the highest median of median annual household income. This suggests a general tendency for coffee shops with higher average ratings to be located in census tracts with higher median household incomes. Coffee shops with an average rating of 1 are located in tracts with the lowest median of median annual household income, and this group also has the smallest range of household income values. Also notable is the large number of outliers in median annual household income among coffee shops with an average rating of 2.
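The reading of the boxplot above can be checked numerically. The following is a minimal sketch, not part of the original analysis: the helper name `summarise_income_by_rating` and the `toy` data frame are hypothetical, and the code assumes a data frame with `avg_rating` and `hhincome` columns, as in `coffee`. It reports the median, interquartile range, and number of points beyond the 1.5 × IQR fences for each rating.

```r
# Hypothetical helper: numeric summary behind each box in the boxplot.
# Assumes columns avg_rating and hhincome, as in the coffee data.
summarise_income_by_rating <- function(df) {
  groups <- split(df$hhincome, df$avg_rating)
  rows <- lapply(groups, function(x) {
    x <- x[!is.na(x)]
    q1 <- unname(quantile(x, 0.25))
    q3 <- unname(quantile(x, 0.75))
    iqr <- q3 - q1
    # points beyond 1.5 * IQR from the quartiles are counted as outliers
    is_outlier <- x < q1 - 1.5 * iqr | x > q3 + 1.5 * iqr
    data.frame(n = length(x), median = median(x),
               iqr = iqr, n_outliers = sum(is_outlier))
  })
  res <- do.call(rbind, rows)
  res$avg_rating <- as.numeric(names(groups))
  res[, c("avg_rating", "n", "median", "iqr", "n_outliers")]
}

# Toy data for illustration only (made-up values, not from coffee.csv)
toy <- data.frame(
  avg_rating = c(1, 1, 1, 2, 2, 2, 2, 2),
  hhincome   = c(30000, 32000, 34000, 40000, 41000, 42000, 43000, 150000)
)
toy_summary <- summarise_income_by_rating(toy)
print(toy_summary)
```

On the real data, the call would be `summarise_income_by_rating(coffee)`; comparing the `median` and `n_outliers` columns across ratings gives a direct check on the statements about ratings 1, 2, and 4.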
ggplot(data = coffee) +
  #plot boxplot
  geom_boxplot(aes(x = avg_rating, y = hhincome, group = avg_rating),
               fill = "white", color = "black") +
  #remove minor breaks
  scale_x_continuous(breaks = seq(1, 5, by = 1), minor_breaks = NULL) +
  #create separate boxplots for each county
  facet_wrap(~county) +
  #add x- and y-axis labels
  labs(x = "Average Rating",
       y = "Median Annual Household Income ($)")
Splitting the boxplots by county reveals details specific to each county. Among the five counties, coffee shops in Clayton County generally have low medians and narrow ranges of median household income across all average ratings, followed by Gwinnett County. In contrast, Cobb and Fulton counties generally follow the pattern that coffee shops with higher average ratings are located in tracts with higher median household income.
ggplot(data = coffee) +
  #plot scatterplot
  geom_point(mapping = aes(x = review_count_log, y = hhincome,
                           color = pct_white), alpha = 0.5, size = 3.4) +
  #create color gradient
  scale_color_gradient(low = "#1F00F3", high = "#F6010D") +
  #create separate scatter plots for each county
  facet_wrap(~county) +
  #add x- and y-axis labels and change legend title
  labs(x = "Review Count (log)",
       y = "Median Annual Household Income",
       color = "Proportion of residents\nwho self-identified as white") +
  #add title
  ggtitle("Scatterplot: Review Count vs. Household Income") +
  #apply the complete theme first so it does not reset the legend settings below
  theme_bw() +
  #set font size of legend title and text
  theme(
    legend.title = element_text(size = 10),
    legend.text = element_text(size = 8)
  )
Extending the conclusion from the boxplots, we can also see that points for Clayton County are tightly clustered at low median household income values. Clayton County census tracts also generally have a lower proportion of white residents. Across the other four counties, census tracts with a higher percentage of residents who self-identify as white coincide with census tracts with higher median household income. In DeKalb County specifically, census tracts with a higher proportion of white residents coincide not only with higher median household income but also with higher logged review counts of coffee shops.
#Pivot longer
coffee_long <- coffee %>%
  select(county, review_count_log, hhincome, pct_white, pct_pov_log, pop) %>%
  pivot_longer(cols = c(hhincome, pct_white, pct_pov_log, pop),
               names_to = "variable", values_to = "value")
ggplot(data = coffee_long) +
  #plot scatterplot
  geom_point(mapping = aes(x = review_count_log, y = value,
                           color = county), size = 1.5) +
  #create separate scatter plots for each variable
  facet_wrap(~variable, scales = "free_y",
             labeller = as_labeller(c(
               "hhincome" = "Median Annual Household Income ($)",
               "pct_white" = "Residents who self-identify as White (%)",
               "pct_pov_log" = "Residents Under Poverty Level (%; log)",
               "pop" = "Total Population"))
  ) +
  #create regression line for each county within each variable
  geom_smooth(mapping = aes(x = review_count_log, y = value, color = county),
              method = "lm", se = FALSE) +
  #add x- and y-axis labels and change legend title
  labs(x = "Review Count (log)",
       y = "Values",
       color = "County") +
  #annotate correlation and p value for each variable's scatterplot
  ggpubr::stat_cor(mapping = aes(x = review_count_log, y = value),
                   method = "pearson", size = 3.5) +
  #add title
  ggtitle("Scatterplot between logged review count & neighborhood characteristics") +
  theme_bw()
Among all variables, the proportion of white residents has the strongest linear relationship with logged review count, followed by median household income. In contrast, the percentage of residents under the poverty level and total population are both negatively related to logged review count: the relationship with poverty is the stronger of the two, while the relationship with total population is very weak.
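The correlations annotated on each panel can also be tabulated directly, which makes the ranking of the variables easier to compare. The following is a minimal sketch under stated assumptions: the helper name `cor_with_reviews` and the `toy` data frame are hypothetical, and the sketch assumes the panel annotations correspond to a Pearson `cor.test()` of each variable against `review_count_log`.

```r
# Hypothetical helper: Pearson correlation and p value of each variable
# against the x-axis variable (review_count_log by default).
cor_with_reviews <- function(df, vars, x = "review_count_log") {
  rows <- lapply(vars, function(v) {
    test <- cor.test(df[[x]], df[[v]], method = "pearson")
    data.frame(variable = v,
               r = unname(test$estimate),
               p_value = test$p.value)
  })
  do.call(rbind, rows)
}

# Toy data for illustration only (made-up values, not from coffee.csv)
toy <- data.frame(
  review_count_log = 1:6,
  up   = c(1, 3, 2, 5, 4, 6),   # tends to rise with review_count_log
  down = c(6, 4, 5, 2, 3, 1)    # mirror image of `up`
)
toy_cor <- cor_with_reviews(toy, c("up", "down"))
print(toy_cor)
```

On the real data, the call would be `cor_with_reviews(coffee, c("hhincome", "pct_white", "pct_pov_log", "pop"))`; sorting the result by the absolute value of `r` orders the variables by the strength of their linear relationship.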