I will examine the housing cost burden for renters in Minneapolis, Minnesota in 2014 compared to 2019, using the Public Use Microdata Areas (PUMAs) that cover the city. In 2014, Minneapolis passed a city ordinance allowing single-family homeowners to construct accessory dwelling units (ADUs) on their property. In theory, this should add housing supply to the city and apply downward pressure on housing costs. Comparing housing cost burden before and after the ordinance passed provides evidence of the ordinance's effects.
Description of data
The data source is the U.S. Census Bureau's American Community Survey (ACS) Public Use Microdata Sample (PUMS) from 2014 and 2019. I will access the data via the get_pums() function from the tidycensus package in R. I will use the following variables:
PUMA: Public Use Microdata Area code. The PUMAs covering Minneapolis are 1405, 1406, and 1407. In the 2014 data this is a four-character variable; in the 2019 data it is a five-character variable.
TYPE: Type of unit, filtered for “1” which is a housing unit. This removes group quarters.
TEN: This variable is housing tenure and can have four values, 1 = “owned with mortgage,” 2 = “owned free and clear,” 3 = “rented,” and 4 = “occupied without payment of rent.”
RELP: The relationship of the respondent to the household reference person. In the 2014 data, this must be filtered to "00" to limit the data to one response per household. In the 2019 data, the variable is named RELSHIPP and the filter value is "20".
GRPIP: Gross rent as a percentage of household income past 12 months.
SEX: Sex
RAC1P: Race
AGEP: Age
HINCP: Household Income
HUPAC: Presence of children
CIT: Citizenship status
HHL: Household Language
Downloading data
First, let’s download and clean Minnesota PUMS data for 2014.
Code
mnpums2014 <- get_pums(
  variables = c("PUMA", "TYPE", "TEN", "HHT", "RELP", "GRPIP", "SEX",
                "RAC1P", "HISP", "AGEP", "HINCP", "HUPAC", "CIT", "HHL"),
  state = "MN",
  # SPORDER = 1 keeps one record per household; TYPE = 1 keeps housing
  # units and eliminates group quarters.
  variables_filter = list(SPORDER = 1, TYPE = 1),
  # puma = c(1405, 1406, 1407), # I can't figure out how to select by puma
  # here, so I subset to the Minneapolis PUMAs after downloading.
  survey = "acs1",
  year = 2014
)
Getting data from the 2014 1-year ACS Public Use Microdata Sample
This is a sample of PUMS respondents in Minnesota for 2014. Let’s check the sample size.
Code
nrow(mnpums2014)
[1] 21524
Let’s create a new dataframe with only the PUMAs that cover Minneapolis (1405, 1406, and 1407) in 2014.
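The subsetting code is not shown above, but a minimal sketch looks like the following. It assumes the PUMA column comes back from get_pums() as a character vector with the four-character 2014 codes described earlier; the codes may need zero-padding depending on the raw values.
Code
# Subset to the three PUMAs that cover Minneapolis.
dat2014 <- mnpums2014 %>%
  filter(PUMA %in% c("1405", "1406", "1407"))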
In order to get just one observation per household, we need to ensure RELP == 0 for every record.
Code
tabyl(dat2014$RELP)
 dat2014$RELP    n percent
            0 1023       1
So now we know there is only one observation per household. Let’s see the distribution of renters vs. non-renters. 1 is owned with a mortgage, 2 is owned free and clear, 3 is rented, 4 is occupied without payment of rent.
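The tenure distribution can be checked with the same tabyl() pattern used above; a one-line sketch:
Code
tabyl(dat2014$TEN) # tabyl() from the janitor package, as above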
In order to get just one observation per household in the 2019 data, we need to ensure RELSHIPP == 20 (the code for the household reference person) for every record.
Code
tabyl(dat2019$RELSHIPP)
 dat2019$RELSHIPP    n percent
               20 1054       1
So now we know there is only one observation per household in the 2019 data as well. Here is the racial/ethnic composition of the 2019 sample:
   dat2019$race_eth   n    percent
           Hispanic  33 0.07158351
 Non-hispanic Asian  30 0.06507592
 Non-hispanic Black  70 0.15184382
 Non-hispanic White 306 0.66377440
              other  22 0.04772234
For QQ plots, the majority of points should follow each line and stay within the curved 95% bootstrapped confidence bands for the data to be considered normally distributed. Again, this shows the two samples are not normally distributed.
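The QQ-plot code is not shown here; one way to draw such plots with bootstrapped 95% bands is the qqplotr package (the package choice is an assumption, since the original plotting code is not included):
Code
library(ggplot2)
library(qqplotr) # stat_qq_band(), stat_qq_line(), stat_qq_point()

# QQ plots of GRPIP by year with bootstrapped 95% confidence bands.
ggplot(dat, aes(sample = GRPIP)) +
  stat_qq_band(bandType = "boot", B = 1000, alpha = 0.3) +
  stat_qq_line() +
  stat_qq_point() +
  facet_wrap(~ year) +
  labs(x = "Theoretical quantiles", y = "Sample quantiles") +
  theme_bw()
Likewise, the Levene's test output below is consistent with car::leveneTest(); a sketch of the presumed calls (lev2 matches the object printed later):
Code
library(car) # leveneTest()

# Levene's test centered at the mean, plus the more robust
# Brown-Forsythe variant centered at the median.
lev1 <- leveneTest(GRPIP ~ year, data = dat, center = "mean")
lev2 <- leveneTest(GRPIP ~ year, data = dat, center = "median")
print(lev1)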
Levene's Test for Homogeneity of Variance (center = "mean")
       Df F value    Pr(>F)
group   1  14.616 0.0001406 ***
      920
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
print(lev2)
Levene's Test for Homogeneity of Variance (center = "median")
       Df F value   Pr(>F)
group   1  7.9994 0.004781 **
      920
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A significant p-value (p < 0.05) indicates that Satterthwaite (also known as Welch's) t-test results should be used instead of pooled t-test results. Both versions of the test are significant here, so Welch's t-test is used below.
Produce boxplots and visually check for outliers
Code
ggplot(dat, aes(x = year, y = GRPIP)) +
  stat_boxplot(geom = "errorbar", width = 0.5) +
  geom_boxplot(fill = "light blue") +
  stat_summary(fun = mean, geom = "point", shape = 10, size = 3.5, color = "black") +
  ggtitle("Boxplots of rent cost burden in 2014 and 2019") +
  theme_bw() +
  theme(legend.position = "none")
The boxplots show many outliers in 2019.
Produce descriptive statistics by group
Code
dat %>%
  select(GRPIP, year) %>%
  group_by(year) %>%
  summarise(
    n = n(),
    mean = mean(GRPIP, na.rm = TRUE),
    sd = sd(GRPIP, na.rm = TRUE),
    stderr = sd / sqrt(n),
    LCL = mean - qt(1 - (0.05 / 2), n - 1) * stderr,
    UCL = mean + qt(1 - (0.05 / 2), n - 1) * stderr,
    median = median(GRPIP, na.rm = TRUE),
    min = min(GRPIP, na.rm = TRUE),
    max = max(GRPIP, na.rm = TRUE),
    IQR = IQR(GRPIP, na.rm = TRUE)
  )
# A tibble: 2 × 11
  year      n  mean    sd stderr   LCL   UCL median   min   max   IQR
  <fct> <int> <dbl> <dbl>  <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>
1 2014    461 0.410 0.297 0.0138 0.383 0.437   0.3      0  1.01  0.35
2 2019    461 0.341 0.261 0.0122 0.317 0.365   0.26     0  1.01  0.25
year – This column identifies the levels of the grouping variable, the survey year.
n – This column identifies how many data points (households) are in each group.
mean – The mean value for each group.
sd – The standard deviation of each group.
stderr – The standard error of each group.
LCL, UCL – The lower and upper confidence limits of the mean. That is to say, you can be 95% certain that the true mean falls between the lower and upper values specified for each group, assuming the data is normally distributed.
median – The median value for each group.
min, max – The minimum and maximum values observed for each group.
IQR – The interquartile range of each group. The interquartile range is the 75th percentile minus the 25th percentile.
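The Welch test output below is consistent with a t.test() call like the following (the exact call is not shown above, so this is a sketch):
Code
# Welch's two-sample t-test; var.equal = FALSE (the default) uses the
# Satterthwaite degrees of freedom, per the Levene's test results above.
t.test(GRPIP ~ year, data = dat, var.equal = FALSE)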
Welch Two Sample t-test
data: GRPIP by year
t = 3.7674, df = 905.39, p-value = 0.0001756
alternative hypothesis: true difference in means between group 2014 and group 2019 is not equal to 0
95 percent confidence interval:
0.03321241 0.10544268
sample estimates:
mean in group 2014 mean in group 2019
0.4102820 0.3409544
t – The t-statistic: the ratio of the difference in means to the standard error of the difference.
df – The appropriate degrees of freedom. This varies between each type of independent samples t-test; Welch's test uses the Satterthwaite approximation.
p-value – The p-value associated with the test. Here the p-value = 0.0001756, so we reject the null hypothesis that the mean cost burdens of 2014 and 2019 are equal and conclude that there is a mean difference between the groups.
95% confidence interval – The interval for the mean difference between the two groups. That is to say, you can be 95% certain that the true mean difference in cost burden between 2014 and 2019 falls between 0.033 and 0.105.
sample means – The sample means of the 2014 and 2019 groups: 0.410 and 0.341.
What to do When Assumptions are Broken or Things Go Wrong
The lack of normality or the severe impact of outliers can violate the assumptions of the independent samples t-test and ultimately invalidate the results. If this happens, there are several available options:
The nonparametric Mann-Whitney U test is the most popular alternative. It is also known as the Mann-Whitney-Wilcoxon or Wilcoxon Rank Sum test. This test is robust to violations of normality and to outliers (among other issues) and tests for differences in mean ranks.
Additional options include permutation/randomization tests, bootstrap confidence intervals, and transforming the data, but each option has its own stipulations.
If you need to compare more than two independent groups, a one-way Analysis of Variance (ANOVA) or Kruskal-Wallis test may be appropriate.
An independent samples t-test is not appropriate if you have repeated measurements taken on the same experimental unit (subject). For example, in a pre-test/post-test study, each subject is measured at two different time points. In that case, a paired t-test is a more appropriate course of action.
Mann-Whitney U test or Wilcoxon Rank Sum test
The following assumptions must be met in order to run a Mann-Whitney U test:
Treatment groups are independent of one another. Experimental units only receive one treatment and they do not overlap.
The response variable of interest is ordinal or continuous.
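The output below, and the m2 object used afterward, comes from a call along these lines (a sketch, since the original call is not shown); conf.int = TRUE is required to get the confidence interval and the Hodges-Lehmann location estimate:
Code
# Mann-Whitney U / Wilcoxon rank sum test with a confidence interval
# for the location shift (the Hodges-Lehmann estimator).
m2 <- wilcox.test(GRPIP ~ year, data = dat, conf.int = TRUE)
m2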
Wilcoxon rank sum test with continuity correction
data: GRPIP by year
W = 122212, p-value = 7.892e-05
alternative hypothesis: true location shift is not equal to 0
95 percent confidence interval:
0.02007798 0.06992372
sample estimates:
difference in location
0.04005175
Hodges-Lehmann Estimator
Code
m2$estimate
difference in location
0.04005175
Interpretation of Mann-Whitney U test
Rental cost burden in each year is not normally distributed. Outliers exist in each group. A Mann-Whitney U test is more appropriate than a traditional independent samples t-test to compare the rental cost burden between the two years.
The Mann-Whitney U test gives a two-sided p-value = 7.892e-05. This indicates that we should reject the null hypothesis that the distributions are equal and conclude that there is a significant difference in cost burden between the two years. Descriptive statistics indicate that the median cost burden was 30% in 2014 and 26% in 2019, a difference of about 4 percentage points. The Hodges-Lehmann estimate is the same at about 4 points. We are 95% certain that the location difference between 2014 and 2019 is between 2 and 7 points. Thus, the roughly 4 percentage point decline in rental cost burden from 2014 to 2019 is statistically significant.
Let’s make summary tables with the combined data, using the weighted mean (weighted by the household weight WGTP).
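The code that builds t2 is not shown; by analogy with the t3 code below, it was presumably something like the following (the recoded sex variable with Female/Male labels is an assumption):
Code
t2 <- dat %>%
  group_by("Year" = year, "Sex" = sex) %>% # sex: recoded SEX, assumed
  summarise(`Cost burden` = weighted.mean(GRPIP, WGTP, na.rm = TRUE)) %>%
  pivot_wider(names_from = Year, values_from = `Cost burden`)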
Code
t2 <- t2 %>% mutate(diff = `2019` - `2014`)
t2
# A tibble: 2 × 4
  Sex    `2014` `2019`    diff
  <chr>   <dbl>  <dbl>   <dbl>
1 Female  0.453  0.371 -0.0826
2 Male    0.389  0.315 -0.0735
Code
gt(t2)
Sex      2014        2019        diff
Female   0.4532549   0.3706347   -0.08262016
Male     0.3886364   0.3151623   -0.07347414
Code
write.csv(t2, "sex.csv")
Table 3, change in cost burden for all households from 2014 to 2019, grouped by race/ethnicity
Code
t3 <- dat %>%
  group_by("Year" = year, "Race/Eth" = race_eth) %>%
  summarise(`Cost burden` = weighted.mean(GRPIP, WGTP, na.rm = TRUE)) %>%
  pivot_wider(names_from = `Year`, values_from = `Cost burden`)
Code
t3 <- t3 %>% mutate(diff = `2019` - `2014`)
t3
# A tibble: 10 × 4
   `Race/Eth`         `2014` `2019`   diff
   <chr>               <dbl>  <dbl>  <dbl>
 1 Hispanic            0.440  0.310 -0.130
 2 Non-Hispanic Asian  0.415 NA     NA
 3 Non-Hispanic Black  0.508 NA     NA
 4 Non-Hispanic White  0.372 NA     NA
 5 Other               0.525 NA     NA
 6 <NA>                0.208 NA     NA
 7 Non-hispanic Asian NA      0.430 NA
 8 Non-hispanic Black NA      0.373 NA
 9 Non-hispanic White NA      0.322 NA
10 other              NA      0.412 NA
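Note that diff is NA for every group except Hispanic because the race/ethnicity labels are capitalized differently in the two years ("Non-Hispanic Asian" vs. "Non-hispanic Asian", "Other" vs. "other"), so pivot_wider() treats them as distinct groups. A sketch of one fix, using stringr (an assumed dependency), normalizes the labels before grouping:
Code
library(stringr)

# Harmonize capitalization so the 2014 and 2019 labels match, e.g.
# "Non-hispanic Black" and "Non-Hispanic Black" both become
# "Non-Hispanic Black", and "other" becomes "Other".
dat <- dat %>% mutate(race_eth = str_to_title(race_eth))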
Code
gt(t3)
Race/Eth             2014        2019        diff
Hispanic             0.4397007   0.3097231   -0.1299776
Non-Hispanic Asian   0.4154056   NA          NA
Non-Hispanic Black   0.5083152   NA          NA
Non-Hispanic White   0.3720215   NA          NA
Other                0.5245358   NA          NA
NA                   0.2083099   NA          NA
Non-hispanic Asian   NA          0.4302271   NA
Non-hispanic Black   NA          0.3731761   NA
Non-hispanic White   NA          0.3217897   NA
other                NA          0.4124263   NA
Code
write.csv(t3, "race_eth.csv")
Table 4, change in median cost burden for all households from 2014 to 2019, grouped by race/ethnicity
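The code that builds t3a is not shown; given the Race/Eth column and the race_eth_median.csv filename below, it presumably mirrors the t3 code with a median in place of the weighted mean (an unweighted median is shown here as an assumption):
Code
t3a <- dat %>%
  group_by("Year" = year, "Race/Eth" = race_eth) %>%
  summarise(`Cost burden` = median(GRPIP, na.rm = TRUE)) %>%
  pivot_wider(names_from = `Year`, values_from = `Cost burden`)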
Code
t3a <- t3a %>% mutate(diff = `2019` - `2014`)
t3a
# A tibble: 10 × 4
   `Race/Eth`         `2014` `2019`   diff
   <chr>               <dbl>  <dbl>  <dbl>
 1 Hispanic            0.362  0.2   -0.162
 2 Non-Hispanic Asian  0.267 NA     NA
 3 Non-Hispanic Black  0.383 NA     NA
 4 Non-Hispanic White  0.27  NA     NA
 5 Other               0.355 NA     NA
 6 <NA>                0.208 NA     NA
 7 Non-hispanic Asian NA      0.26  NA
 8 Non-hispanic Black NA      0.28  NA
 9 Non-hispanic White NA      0.23  NA
10 other              NA      0.373 NA
Code
gt(t3a)
Race/Eth             2014        2019        diff
Hispanic             0.3621477   0.2000000   -0.1621477
Non-Hispanic Asian   0.2670468   NA          NA
Non-Hispanic Black   0.3826991   NA          NA
Non-Hispanic White   0.2700000   NA          NA
Other                0.3547458   NA          NA
NA                   0.2083099   NA          NA
Non-hispanic Asian   NA          0.2600000   NA
Non-hispanic Black   NA          0.2800000   NA
Non-hispanic White   NA          0.2300000   NA
other                NA          0.3727961   NA
Code
write.csv(t3a, "race_eth_median.csv")
Table 5, change in cost burden for all households from 2014 to 2019, grouped by age