This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
library(tidyverse)  # loads dplyr, tidyr, and ggplot2, which the code below relies on
coffee <- read.csv("coffee.csv")
Median household income tends to rise with the average rating, peaking
at a rating of 4 and dipping at 5. There are a few high-income outliers
at ratings 2, 3, 4, and 5, with rating 2 having the most.
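
The outlier counts described above can be checked numerically rather than read off the plot. The sketch below uses base R only and applies the common 1.5 × IQR rule to count points beyond the whiskers in each rating group. The `toy` data frame and the helper `count_outliers()` are hypothetical, with values invented purely for illustration. On the real data, a rounded-rating column and `hhincome` would take their place.

```r
# Hypothetical toy data: values are invented for illustration only.
toy <- data.frame(
  rating_group = c(2, 2, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4),
  hhincome = c(40000, 42000, 45000, 47000, 50000, 150000,
               60000, 62000, 65000, 67000, 70000, 72000)
)

# Count values lying more than 1.5 * IQR beyond the first or third quartile.
count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])
  sum(x < q[1] - fence | x > q[2] + fence, na.rm = TRUE)
}

# One count per rating group.
outlier_counts <- tapply(toy$hhincome, toy$rating_group, count_outliers)
print(outlier_counts)
```

In this toy example only the 150000 value in rating group 2 falls outside the fences. Counts from this rule should broadly match the points drawn by `geom_boxplot()`, though small differences are possible if the quartiles are computed differently.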
# Round the average rating to the nearest whole number and drop incomplete rows
coffee2 <- coffee %>%
  mutate(
    rating_group = round(avg_rating)
  ) %>%
  filter(!is.na(rating_group), !is.na(hhincome), !is.na(county))

# Boxplots of income by rating group, one panel per county; black dots mark group means
ggplot(coffee2, aes(x = factor(rating_group), y = hhincome, fill = factor(rating_group))) +
  geom_boxplot(alpha = 0.75, outlier.shape = 16, outlier.alpha = 0.5) +
  stat_summary(fun = mean, geom = "point", size = 1.8, color = "black", position = position_dodge(width = 0.75)) +
  labs(
    title = "Median Household Income by Average Rating by County",
    x = "Average Coffee Shop Rating",
    y = "Median Household Income ($)"
  ) +
  scale_fill_brewer(palette = "Set3") +
  theme_minimal() +
  facet_wrap(~ county, ncol = 3, scales = "fixed") +
  theme(legend.position = "none")
Across the counties, higher neighborhood incomes generally align with
higher coffee shop ratings. In most counties the peak falls at a rating
of 4. Fulton County has the highest median incomes across the rating
groups. Clayton County has the lowest incomes, with little variation
compared to the other counties.
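
A table of medians per county and rating group makes these comparisons easier to verify than reading boxplots. The following is a minimal base R sketch, assuming columns named `county`, `avg_rating`, and `hhincome` as in the plotting code. The `toy` data frame, its county labels, and its values are invented for illustration and do not come from `coffee.csv`.

```r
# Hypothetical toy data: county labels and values are invented for illustration only.
toy <- data.frame(
  county = c("County A", "County A", "County A", "County A",
             "County B", "County B", "County B", "County B"),
  avg_rating = c(3.6, 4.4, 4.6, 5.0, 3.7, 4.2, 4.8, 4.9),
  hhincome = c(90000, 110000, 80000, 100000, 40000, 44000, 41000, 43000)
)

# Same grouping step as the plot: round the average rating to a whole number.
toy$rating_group <- round(toy$avg_rating)

# Median income for every county and rating group combination.
medians <- aggregate(hhincome ~ county + rating_group, data = toy, FUN = median)
print(medians)
```

Sorting the result by `county` and then `rating_group` shows at a glance where each county's median peaks.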
ggplot(coffee, aes(x = review_count_log, y = hhincome, color = pct_white)) +
  geom_point(alpha = 0.7, size = 3) +
  facet_wrap(~ county, ncol = 3, scales = "fixed") +
  coord_cartesian(ylim = c(0, 250000)) +
  scale_color_gradientn(colors = c("blue", "red")) +
  labs(
    title = "Scatterplot: Review Count vs. Household Income",
    x = "Review Count (log)",
    y = "Median Annual Household Income",
    color = "Proportion of residents\nwho self-identified as white"
  ) +
  theme_minimal()
Clayton County has the lowest income band overall and also a lower
number of reviews. In Cobb County, incomes are mid-to-high, and mid-tier
log review counts coincide with higher incomes and a larger proportion
of residents who identify as white. DeKalb County has a wide spread in
income, with higher-income areas having more reviews. Fulton County
shows the most dispersion, and its relationship between reviews and
income is weak. Gwinnett County has mid-range incomes. Overall, DeKalb Coun
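
Statements about how strong or weak the review and income relationship is within a county can be backed by a per-county correlation. The sketch below is one possible way to compute it in base R, assuming the column names `county`, `review_count_log`, and `hhincome` from the scatterplot code. The `toy` data frame and its values are invented for illustration only.

```r
# Hypothetical toy data: county labels and values are invented for illustration only.
toy <- data.frame(
  county = rep(c("A", "B"), each = 5),
  review_count_log = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
  hhincome = c(30000, 40000, 50000, 60000, 70000,
               50000, 80000, 40000, 90000, 45000)
)

# Pearson correlation between logged review count and income, one value per county.
county_r <- sapply(split(toy, toy$county), function(d) {
  cor(d$review_count_log, d$hhincome, use = "complete.obs")
})
print(round(county_r, 2))
```

A value near 1 or -1 indicates a tight linear relationship, while a value near 0 corresponds to the kind of dispersed cloud described for a weak relationship.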
# Map column names to readable panel labels
metric_levels <- c(
  "hhincome" = "Median Annual Household Income ($)",
  "pct_pov_log" = "Residents Under Poverty Level (%; log)",
  "pct_white" = "Residents who self-identify as White (%)",
  "pop" = "Total Population"
)

# Reshape to long format: one row per observation and metric
coffee_long <- coffee %>%
  select(county, review_count_log, hhincome, pct_pov_log, pct_white, pop) %>%
  pivot_longer(
    cols = c(hhincome, pct_pov_log, pct_white, pop),
    names_to = "metric",
    values_to = "value"
  ) %>%
  mutate(metric_label = factor(recode(metric, !!!metric_levels),
                               levels = unname(metric_levels)))
# Axis ranges per panel, used to position the annotation text
panel_ranges <- coffee_long %>%
  group_by(metric_label) %>%
  summarise(
    x_min = min(review_count_log, na.rm = TRUE),
    x_max = max(review_count_log, na.rm = TRUE),
    y_min = min(value, na.rm = TRUE),
    y_max = max(value, na.rm = TRUE),
    .groups = "drop"
  )

# Pearson correlation and p-value per panel, placed near the top-left corner
panel_stats <- coffee_long %>%
  group_by(metric_label) %>%
  summarise(
    r = cor(review_count_log, value, use = "complete.obs"),
    p = cor.test(review_count_log, value)$p.value,
    .groups = "drop"
  ) %>%
  left_join(panel_ranges, by = "metric_label") %>%
  mutate(
    label = paste0("R = ", round(r, 2), ", p = ", signif(p, 2)),
    x = x_min + 0.05 * (x_max - x_min),
    y = y_max - 0.07 * (y_max - y_min)
  )
ggplot(coffee_long, aes(x = review_count_log, y = value, color = county)) +
  geom_point(alpha = 0.9, size = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ metric_label, ncol = 2, scales = "free_y") +
  labs(
    title = "Scatterplot between logged review count & neighborhood characteristics",
    x = "Review Count (log)",
    y = "Values",
    color = "County"
  ) +
  geom_text(
    data = panel_stats,
    aes(x = x, y = y, label = label),
    inherit.aes = FALSE,
    hjust = 0, vjust = 1, size = 3.8
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
Median household income is positively associated with the logged review
count. The slope for DeKalb County is largely positive. For residents
under the poverty level, the relationship is negative: more reviews go
with lower poverty. The share of residents who self-identify as white
has the strongest positive association with review count. For total
population, there is essentially no association.
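
The county lines drawn by `geom_smooth(method = "lm")` are separate linear fits, so a claim about one county's slope can be checked by fitting the same model per county. The sketch below does this in base R for the income panel, assuming the column names `county`, `review_count_log`, and `hhincome`. The `toy` data frame and its values are invented for illustration only.

```r
# Hypothetical toy data: county labels and values are invented for illustration only.
toy <- data.frame(
  county = rep(c("A", "B"), each = 4),
  review_count_log = c(1, 2, 3, 4, 1, 2, 3, 4),
  hhincome = c(40000, 50000, 60000, 70000, 55000, 55000, 55000, 55000)
)

# Fit hhincome ~ review_count_log separately for each county and keep the slope.
county_slopes <- sapply(split(toy, toy$county), function(d) {
  coef(lm(hhincome ~ review_count_log, data = d))[["review_count_log"]]
})
print(county_slopes)
```

Each slope is the estimated change in income per one-unit increase in logged review count, so comparing the values across counties shows which lines are steepest.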