I was working on finding a meaningful metric to explain alcohol access in neighborhoods. One of the assumptions I made was that different types of alcohol outlets have different implications in social charasteristics and economic strength of a neighborhood. I am curious to find out whether this assumption is supported by statistics.
For this task, I am associating each liquor license with a few selected social economic metrics at the census block group level, i.e. assigning each liquor license an value of the indicator based on which census block group it is located; then the metrics are aggregated by census block groups, and finally, I use t-test and ANOVA to analyse the differences in these metrics between each liquor license type.
I have picked median house income and the percentage of white residents as the two indicators for the social fabric of the neighborhoods.
source("src/preparation.R")
source("src/alcohol.R")
bizz.cols <- c("CT_ID_10", "ALC_BIZ_TYPE", "ALC_TYPE")
acs.cols <- c("CT_ID_10", "MedHouseIncome", "White")
dat <- merge(bizz.alc[, bizz.cols], acs.ct[, acs.cols], all.x = TRUE)
When categorized by type of alcohol allowed, the licenses can be devided into two group: malte & wine, and all alcoholic beverages. I used a Student’s t-test to see whether census indicators for these two groups are significantly different from each other.
First, the median house income:
library(broom)
library(knitr)
dat.t <- t.test(MedHouseIncome ~ ALC_TYPE, dat) %>% tidy()
dat.t <- dat.t[, c(1:6)]
colnames(dat.t) <- c(
"mean difference",
"mean in group (All Alcohol)",
"mean in group (Malte & Wine)",
"t statistic",
"p value",
"df"
)
row.names(dat.t) <- c("value")
round(dat.t, 2) %>% t() %>% kable()
| value | |
|---|---|
| mean difference | 10765.01 |
| mean in group (All Alcohol) | 76030.47 |
| mean in group (Malte & Wine) | 65265.46 |
| t statistic | 5.76 |
| p value | 0.00 |
| df | 801.18 |
We get a t-test of 5.76, and the p value is as low as \(1.1e-8\), incidating a very strong difference. On average, a census block group containing full liquor licenses have a $10,765 higher median house income than those with malte & wine only liquor licenses. Note that each group has different number of members (liquor licenses)
This makes sense because full liquor licenses are much harder to get and have higher prices on the market. They naturally would be more distributed among rich neighborhoods.
Then let“s look at race:
dat.t <- t.test(White ~ ALC_TYPE, dat) %>% tidy()
dat.t <- dat.t[, c(1:6)]
colnames(dat.t) <- c(
"mean difference",
"mean in group (All Alcohol)",
"mean in group (Malte & Wine)",
"t statistic",
"p value",
"df"
)
row.names(dat.t) <- c("value")
round(dat.t, 2) %>% t() %>% kable()
| value | |
|---|---|
| mean difference | 0.02 |
| mean in group (All Alcohol) | 0.66 |
| mean in group (Malte & Wine) | 0.64 |
| t statistic | 1.23 |
| p value | 0.22 |
| df | 714.32 |
The p value is 0.22, not so significant. Statistically speaking, race is not related to the distribution of different alcohol types.
We have 6 different business types designated in the categorization of liquor licenses. An ANOVA test can be used to test whether the implications of these business types are statistically significant.
dat.anova <- aov(MedHouseIncome ~ ALC_BIZ_TYPE, dat)
broom::tidy(dat.anova) %>% knitr::kable()
| term | df | sumsq | meansq | statistic | p.value |
|---|---|---|---|---|---|
| ALC_BIZ_TYPE | 5 | 17121124445 | 3424224889 | 3.806447 | 0.0020108 |
| Residuals | 1097 | 986845509189 | 899585697 | NA | NA |
With F value’s p value as low as 0.00201, it is statitically signicant that at least one business type is different than others.
And based on Tukey’s “Honest Significant Difference” method, three group combinations were identified with significant differences.
broom::tidy(TukeyHSD(dat.anova))[2:6] %>% knitr::kable()
| comparison | estimate | conf.low | conf.high | adj.p.value |
|---|---|---|---|---|
| Common Vectualler-Club | 4563.333 | -7309.544 | 16436.210 | 0.8825740 |
| Farmer Distillery-Club | -4402.655 | -66038.275 | 57232.966 | 0.9999516 |
| General On-Premises-Club | -10090.611 | -31352.054 | 11170.831 | 0.7539779 |
| Hotel-Club | 17154.345 | 1467.223 | 32841.468 | 0.0226704 |
| Tavern-Club | -12757.655 | -74393.275 | 48877.966 | 0.9916508 |
| Farmer Distillery-Common Vectualler | -8965.987 | -69573.924 | 51641.949 | 0.9982903 |
| General On-Premises-Common Vectualler | -14653.944 | -32721.074 | 3413.186 | 0.1886058 |
| Hotel-Common Vectualler | 12591.013 | 1615.674 | 23566.351 | 0.0138650 |
| Tavern-Common Vectualler | -17320.987 | -77928.924 | 43286.949 | 0.9646641 |
| General On-Premises-Farmer Distillery | -5687.957 | -68810.105 | 57434.192 | 0.9998475 |
| Hotel-Farmer Distillery | 21557.000 | -39912.037 | 83026.037 | 0.9176134 |
| Tavern-Farmer Distillery | -8355.000 | -93978.049 | 77268.049 | 0.9997743 |
| Hotel-General On-Premises | 27244.957 | 6471.373 | 48018.540 | 0.0026154 |
| Tavern-General On-Premises | -2667.043 | -65789.192 | 60455.105 | 0.9999964 |
| Tavern-Hotel | -29912.000 | -91381.037 | 31557.037 | 0.7336351 |
They are Hotel-Club, Hotel-CommonVectualler, and Hotel-GeneralOnPremises.
broom::tidy(TukeyHSD(dat.anova))[c(4, 8, 13), 2:6] %>% knitr::kable()
| comparison | estimate | conf.low | conf.high | adj.p.value | |
|---|---|---|---|---|---|
| 4 | Hotel-Club | 17154.35 | 1467.223 | 32841.47 | 0.0226704 |
| 8 | Hotel-Common Vectualler | 12591.01 | 1615.674 | 23566.35 | 0.0138650 |
| 13 | Hotel-General On-Premises | 27244.96 | 6471.373 | 48018.54 | 0.0026154 |
They are all related to hotels, which indicates that liquor licenses issues to hotels might be the outlier here, i.e., hotels offering on-premise consumption alcohol beverages tend to be located in richer neighborhoods.
The differences between hotel and the remaining two categories, Farmer Distillery and Tavern, are not statistically significant. However, this might just be because of the very small number of licenses in these two categories (with just 2 and 6 each)–the degree of freedom is really low, making it impossible to get a significant result.
Using bar chart and error bars, the mean differences of median house income between different liquor licenses can be plotted as following:
library(reshape2)
dat.melted <- melt(dat[, c("ALC_BIZ_TYPE", "MedHouseIncome")],
id.vars = "ALC_BIZ_TYPE")
dat.means <- aggregate(value ~ ALC_BIZ_TYPE + variable, dat.melted, FUN = mean)
names(dat.means)[3]<-"mean"
dat.ses <- aggregate(value ~ ALC_BIZ_TYPE + variable, dat.melted,
function(x) sd(x, na.rm = TRUE) / sqrt(length(!is.na(x)))
)
names(dat.ses)[3]<-"se"
dat.means <- merge(dat.means, dat.ses, by = c("ALC_BIZ_TYPE", "variable"))
dat.means <- transform(dat.means, lower = mean - se, upper = mean + se)
ggplot(data = dat.means,
aes(x = ALC_BIZ_TYPE, y = mean)) +
geom_bar(stat = "identity", position = "dodge", fill = "tomato2") +
geom_errorbar(
aes(ymax = upper, ymin = lower),
position = position_dodge(.9)) +
theme(axis.text.x = element_text(angle = 70, hjust = 1)) +
ylab("Mean") + xlab("Business Type")
The high income level for liquor licenses in hotels is very obvious.