Inequality Research Project

Author

Brian Surratt

Published

April 28, 2023

Research Proposal

I will examine the housing cost burden for renters in Minneapolis, Minnesota in 2014 compared to 2019 by census tract. In 2014, Minneapolis passed a city ordinance allowing the construction of accessory dwelling units (ADUs). The ordinance allows single-family homeowners to construct ADUs on their property. Theoretically, this should add housing supply to the city and apply downward pressure on housing costs. Comparing housing cost burden before and after the ordinance passed will provide evidence of the ordinance's effects.

Description of data

The source of data is the U.S. Census Bureau’s American Community Survey Public Use Microdata Sample (PUMS) from 2014 and 2019. I will access the data via the get_pums() function from the tidycensus package in R; the setup sketch after the variable list shows the packages this document assumes. The variables I will use are the following:

  • PUMA: Public Use Microdata Area code. The PUMAs for Minneapolis are 1405, 1406, and 1407. In the 2014 data, this is a 4-character variable. In the 2019 data, this is a 5-character variable.
  • TYPE: Type of unit, filtered for “1” which is a housing unit. This removes group quarters.
  • TEN: This variable is housing tenure and can have four values, 1 = “owned with mortgage,” 2 = “owned free and clear,” 3 = “rented,” and 4 = “occupied without payment of rent.”
  • RELP: This is the relationship of the respondent to the household reference person. For the 2014 data this must be filtered to “00” to limit to one response per household. For the 2019 data, the variable name is RELSHIPP and the filter must be set to “20”.
  • GRPIP: Gross rent as a percentage of household income past 12 months.
  • SEX: Sex
  • RAC1P: Race
  • AGEP: Age
  • HINCP: Household Income
  • HUPAC: Presence of children
  • CIT: Citizenship status
  • HHL: Household Language
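
A quick note on setup: the code below assumes the following packages are loaded and that a Census API key has been registered for get_pums(). This is a sketch of the session setup, which is not shown in this document.

Code
library(tidycensus)   # get_pums(), census_api_key()
library(tidyverse)    # dplyr, tidyr, ggplot2, stringr
library(janitor)      # tabyl()
library(matrixStats)  # weightedMedian(), weightedMean()
library(gt)           # gt() display tables
library(car)          # leveneTest()
library(qqplotr)      # stat_qq_band(), stat_qq_line(), stat_qq_point()
# census_api_key("YOUR_KEY_HERE")  # placeholder; a real key is required for get_pums()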

Downloading data

First, let’s download and clean Minnesota PUMS data for 2014.

Code
mnpums2014 <- get_pums(
  variables = c("PUMA", "TYPE", "TEN", "HHT", "RELP", "GRPIP", "SEX", "RAC1P", "HISP", "AGEP", "HINCP", "HUPAC", "CIT", "HHL"),
  state = "MN",
  variables_filter = list(SPORDER = 1, TYPE = 1), # SPORDER = 1 gets households, TYPE = 1 gets housing units and eliminates group quarters.
  #puma = c(1405, 1406, 1407), # I can't figure out how to select by puma.  Maybe capitalize "PUMA"?
  survey = "acs1",
  year = 2014
  )
Getting data from the 2014 1-year ACS Public Use Microdata Sample

This is a sample of PUMS respondents in Minnesota for 2014. Let’s check the sample size.

Code
nrow(mnpums2014)
[1] 21524

Let’s create a new dataframe with only the PUMAs that cover Minneapolis (1405, 1406, and 1407) in 2014.

Code
dat2014 <- mnpums2014 %>%
  filter(PUMA %in% c("1405", "1406", "1407"))

This is a sample of PUMS respondents in Minneapolis for 2014. Let’s check the sample size.

Code
nrow(dat2014)
[1] 1023

Let’s modify the PUMA variable in the 2014 dataframe so the PUMA is 5 characters, matching the 2019 data.

Code
dat2014 <- dat2014 %>%
  mutate(PUMA = case_when(.$PUMA == "1405" ~ "01405",
                          .$PUMA == "1406" ~ "01406",
                          .$PUMA == "1407" ~ "01407",
                          )
         )
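
An equivalent, more general way to left-pad the codes (a sketch; str_pad() comes with the tidyverse's stringr package) would be:

Code
# str_pad() left-pads with "0" to a fixed width, so it handles any PUMA code
dat2014 <- dat2014 %>%
  mutate(PUMA = str_pad(PUMA, width = 5, pad = "0"))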

Let’s check the distribution by PUMA.

Code
tabyl(dat2014$PUMA)
 dat2014$PUMA   n   percent
        01405 344 0.3362659
        01406 331 0.3235582
        01407 348 0.3401760

In order to get just one observation per household, we need to ensure RELP == 0 for every record.

Code
tabyl(dat2014$RELP)
 dat2014$RELP    n percent
            0 1023       1

So now we know there is only one observation per household. Let’s see the distribution of renters vs. non-renters. 1 is owned with a mortgage, 2 is owned free and clear, 3 is rented, 4 is occupied without payment of rent.

Code
tabyl(dat2014$TEN)
 dat2014$TEN   n     percent
           1 401 0.391984360
           2 158 0.154447703
           3 461 0.450635386
           4   3 0.002932551

Let’s filter for only renters.

Code
dat2014 <- dat2014 %>% 
  filter(TEN == 3)

tabyl(dat2014$PUMA)
 dat2014$PUMA   n   percent
        01405 155 0.3362256
        01406 114 0.2472885
        01407 192 0.4164859

Let’s check the distribution of rent as a percentage of income.

Code
dat2014$GRPIP <- (as.numeric(dat2014$GRPIP))/100

hist(dat2014$GRPIP)

Next, let’s download and clean Minnesota PUMS data for 2019.

Code
mnpums2019 <- get_pums(
  variables = c("PUMA", "TYPE", "TEN", "HHT", "RELSHIPP", "GRPIP", "SEX", "RAC1P", "HISP", "AGEP", "HINCP", "HUPAC", "CIT", "HHL"),
  state = "MN",
  variables_filter = list(SPORDER = 1, TYPE = 1), # SPORDER = 1 gets households, TYPE = 1 gets housing units and eliminates group quarters.
  #puma = c(1405, 1406, 1407), # I can't figure out how to select by puma.  Maybe capitalize "PUMA"?
  survey = "acs1",
  year = 2019
  )
Getting data from the 2019 1-year ACS Public Use Microdata Sample

This is a sample of PUMS respondents in Minnesota for 2019. Let’s check the sample size.

Code
nrow(mnpums2019)
[1] 22576

Let’s create a new dataframe with only the PUMAs that cover Minneapolis (1405, 1406, and 1407) in 2019.

Code
dat2019 <- mnpums2019 %>%
  filter(PUMA %in% c("01405", "01406", "01407"))

This is a sample of PUMS respondents in Minneapolis for 2019. Let’s check the sample size.

Code
nrow(dat2019)
[1] 1054

Let’s check the distribution by PUMA.

Code
tabyl(dat2019$PUMA)
 dat2019$PUMA   n   percent
        01405 350 0.3320683
        01406 336 0.3187856
        01407 368 0.3491461

In order to get just one observation per household, we need to ensure RELSHIPP == 20 for every record.

Code
tabyl(dat2019$RELSHIPP)
 dat2019$RELSHIPP    n percent
               20 1054       1

So now we know there is only one observation per household. Let’s see the distribution of renters vs. non-renters. 1 is owned with a mortgage, 2 is owned free and clear, 3 is rented, 4 is occupied without payment of rent.

Code
tabyl(dat2019$TEN)
 dat2019$TEN   n   percent
           1 413 0.3918406
           2 174 0.1650854
           3 461 0.4373814
           4   6 0.0056926

Let’s filter for only renters.

Code
dat2019 <- dat2019 %>% 
  filter(TEN == 3)

tabyl(dat2019$PUMA)
 dat2019$PUMA   n   percent
        01405 158 0.3427332
        01406 109 0.2364425
        01407 194 0.4208243

Let’s check the distribution of rent as a percentage of income.

Code
dat2019$GRPIP <- (as.numeric(dat2019$GRPIP))/100

hist(dat2019$GRPIP)

Compare cost burden in 2014 with cost burden in 2019

Renaming “RELP” in the 2014 data to “RELSHIPP” so the 2014 and 2019 dataframes have matching column names.

Code
dat2014 <- dat2014  %>%
  rename(RELSHIPP = RELP)

head(dat2014)
# A tibble: 6 × 19
  SERIALNO  WGTP PWGTP PUMA  TEN   HHT   RELSHIPP GRPIP SEX   RAC1P HISP  AGEP 
  <chr>    <dbl> <dbl> <chr> <chr> <chr> <chr>    <dbl> <chr> <chr> <chr> <chr>
1 2339        96    96 01405 3     6     0         0.17 2     1     1     49   
2 10042      106   106 01407 3     7     0         0.24 2     1     1     22   
3 22566      132   132 01405 3     3     0         0.31 2     1     1     44   
4 25036      350   350 01405 3     3     0         0.22 2     1     2     28   
5 27547      319   319 01406 3     3     0         0.21 2     2     1     26   
6 30186      178   178 01407 3     6     0         0.97 2     2     1     58   
# ℹ 7 more variables: HINCP <chr>, HUPAC <chr>, CIT <chr>, HHL <chr>,
#   SPORDER <chr>, TYPE <chr>, ST <chr>

I’m just going to do the summary work here and paste the results into Excel.

Change in cost burden for all households from 2014 to 2019.

Median and mean GRPIP for all households in 2014.

Code
summary(dat2014$GRPIP)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.2000  0.3000  0.4103  0.5500  1.0100 

Median and mean GRPIP for all households in 2019.

Code
summary(dat2019$GRPIP)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.170   0.260   0.341   0.420   1.010 

Change in cost burden for all households from 2014 to 2019, grouped by sex.

Code
sex14 <- dat2014 %>% 
  group_by(SEX) %>%
  summarize('n' = n(), 'N' = sum(WGTP), "Median" = weightedMedian(GRPIP, WGTP), "Mean" = weightedMean(GRPIP, WGTP))

# For SEX, 1 = male and 2 = female

gt(sex14)
SEX n N Median Mean
1 216 42241 0.27 0.3886364
2 245 44782 0.33 0.4532549
Code
sex19 <- dat2019 %>% 
  group_by(SEX) %>%
  summarize('n' = n(), 'N' = sum(WGTP), "Median" = weightedMedian(GRPIP, WGTP), "Mean" = weightedMean(GRPIP, WGTP))

# For SEX, 1 = male and 2 = female

gt(sex19)
SEX n N Median Mean
1 228 45730 0.23 0.3151623
2 233 49234 0.27 0.3706347

Change in cost burden for all households from 2014 to 2019, grouped by race/ethnicity.

First, consolidate race and ethnicity for the 2014 data.

Code
tabyl(dat2014$RAC1P)
 dat2014$RAC1P   n     percent
             1 306 0.663774403
             2  95 0.206073753
             3   5 0.010845987
             5   2 0.004338395
             6  20 0.043383948
             8  13 0.028199566
             9  20 0.043383948
Code
tabyl(dat2014$HISP)
 dat2014$HISP   n     percent
            1 433 0.939262473
           16   1 0.002169197
           17   2 0.004338395
            2  22 0.047722343
            3   1 0.002169197
            8   2 0.004338395
Code
dat2014 <- dat2014 %>%
  mutate(race_eth = case_when(.$HISP == 01 & .$RAC1P == 1 ~"Non-Hispanic White",
                              .$HISP == 01 & .$RAC1P == 2 ~"Non-Hispanic Black",
                              .$HISP == 01 & .$RAC1P == 6 ~"Non-Hispanic Asian",
                              .$HISP == 01 & .$RAC1P %in% c(3, 4, 5, 7, 8, 9) ~"Other",
                              .$HISP %in% c(2:16) & .$RAC1P %in% c(1:9) ~ "Hispanic"
                              # Note: HISP codes above 16 (here, the two HISP == 17 records) are not
                              # captured by 2:16 and fall through to NA in the table below.
                              )
        )

tabyl(dat2014$race_eth)
   dat2014$race_eth   n     percent valid_percent
           Hispanic  26 0.056399132    0.05664488
 Non-Hispanic Asian  20 0.043383948    0.04357298
 Non-Hispanic Black  95 0.206073753    0.20697168
 Non-Hispanic White 290 0.629067245    0.63180828
              Other  28 0.060737527    0.06100218
               <NA>   2 0.004338395            NA
Code
race14 <- dat2014 %>% 
  group_by(race_eth) %>%
  summarize('n' = n(), 'N' = sum(WGTP), "Median" = weightedMedian(GRPIP, WGTP), "Mean" = weightedMean(GRPIP, WGTP))

gt(race14)
race_eth n N Median Mean
Hispanic 26 8019 0.3621477 0.4397007
Non-Hispanic Asian 20 5326 0.2670468 0.4154056
Non-Hispanic Black 95 21178 0.3826991 0.5083152
Non-Hispanic White 290 47162 0.2700000 0.3720215
Other 28 4912 0.3547458 0.5245358
NA 2 426 0.2083099 0.2083099

Consolidate race and ethnicity for the 2019 data.

Code
tabyl(dat2019$RAC1P)
 dat2019$RAC1P   n     percent
             1 321 0.696312364
             2  72 0.156182213
             3   5 0.010845987
             5   2 0.004338395
             6  30 0.065075922
             8  10 0.021691974
             9  21 0.045553145
Code
tabyl(dat2019$HISP)
 dat2019$HISP   n     percent
           01 428 0.928416486
           02  17 0.036876356
           03   1 0.002169197
           04   1 0.002169197
           07   1 0.002169197
           10   1 0.002169197
           11   1 0.002169197
           15   1 0.002169197
           17   3 0.006507592
           18   1 0.002169197
           19   1 0.002169197
           20   1 0.002169197
           23   3 0.006507592
           24   1 0.002169197
Code
dat2019 <- dat2019 %>%
  mutate(race_eth = case_when(.$HISP == '01' & .$RAC1P == '1' ~"Non-hispanic White",
                              .$HISP == '01' & .$RAC1P == '2' ~"Non-hispanic Black",
                              .$HISP == '01' & .$RAC1P == '6' ~"Non-hispanic Asian",
                              .$HISP == '01' & .$RAC1P %in% c('3', '4', '5', '7', '8', '9') ~"other",
                              .$HISP != '01' ~ "Hispanic"
                              # Note: these labels use a lowercase "h" and "other", while the 2014 labels
                              # use "Non-Hispanic" and "Other"; this mismatch matters when the two years
                              # are combined for Tables 3 and 3a below.
                              )
        )

tabyl(dat2019$race_eth)
   dat2019$race_eth   n    percent
           Hispanic  33 0.07158351
 Non-hispanic Asian  30 0.06507592
 Non-hispanic Black  70 0.15184382
 Non-hispanic White 306 0.66377440
              other  22 0.04772234
Code
race19 <- dat2019 %>% 
  group_by(race_eth) %>%
  summarize('n' = n(), 'N' = sum(WGTP), "Median" = weightedMedian(GRPIP, WGTP), "Mean" = weightedMean(GRPIP, WGTP))

gt(race19)
race_eth n N Median Mean
Hispanic 33 8089 0.2000000 0.3097231
Non-hispanic Asian 30 4271 0.2600000 0.4302271
Non-hispanic Black 70 25355 0.2800000 0.3731761
Non-hispanic White 306 52468 0.2300000 0.3217897
other 22 4781 0.3727961 0.4124263

Change in cost burden for all households from 2014 to 2019, grouped by age groups.

First, make a new variable with age categories: 24 and under, 25 to 64, and 65 and over.

Code
dat2014 <- dat2014 %>% 
    mutate(AGEP = as.numeric(AGEP),  # AGEP is stored as character here; convert before the numeric comparisons
           agelvl = case_when(AGEP <= 24 ~ "24 and under",
                              AGEP > 24 & AGEP <= 64 ~ "25 to 64",
                              AGEP > 64 ~ "65 and over"
                              )
        )

tabyl(dat2014$agelvl)
 dat2014$agelvl   n    percent
   24 and under  67 0.14533623
       25 to 64 361 0.78308026
    65 and over  33 0.07158351
Code
age14 <- dat2014 %>% 
  group_by(agelvl) %>%
  summarize('n' = n(), 'N' = sum(WGTP), "Median" = weightedMedian(GRPIP, WGTP), "Mean" = weightedMean(GRPIP, WGTP))

gt(age14)
agelvl n N Median Mean
24 and under 67 12275 0.5600000 0.6082305
25 to 64 361 68358 0.2700000 0.3806627
65 and over 33 6390 0.3386926 0.5049577
Code
dat2019 <- dat2019 %>% 
    mutate(AGEP = as.numeric(AGEP),  # AGEP is stored as character here; convert before the numeric comparisons
           agelvl = case_when(AGEP <= 24 ~ "24 and under",
                              AGEP > 24 & AGEP <= 64 ~ "25 to 64",
                              AGEP > 64 ~ "65 and over"
                              )
        )

tabyl(dat2019$agelvl)
 dat2019$agelvl   n   percent
   24 and under  69 0.1496746
       25 to 64 338 0.7331887
    65 and over  54 0.1171367
Code
age19 <- dat2019 %>% 
  group_by(agelvl) %>%
  summarize('n' = n(), 'N' = sum(WGTP), "Median" = weightedMedian(GRPIP, WGTP), "Mean" = weightedMean(GRPIP, WGTP))

gt(age19)
agelvl n N Median Mean
24 and under 69 11530 0.32 0.4283608
25 to 64 338 73440 0.24 0.3343517
65 and over 54 9994 0.28 0.3168311

Change in cost burden for all households from 2014 to 2019, grouped by income levels.

Let’s see what the quartiles are.

Code
summary(as.numeric(dat2014$HINCP))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  -3700   13800   32000   46490   62500  459000 

Convert income to numeric.

Code
dat2014$HINCP <- as.numeric(dat2014$HINCP)
Code
dat2014 <- dat2014 %>% 
    mutate(inclvl = case_when(.$HINCP <= 13800 ~"1st quartile",
                              .$HINCP > 13800 & .$HINCP <= 32000 ~"2nd quartile",
                              .$HINCP > 32000 & .$HINCP <= 62500 ~"3rd quartile",
                              .$HINCP > 62500 ~"4th quartile"
                              )
        )

tabyl(dat2014$inclvl)
 dat2014$inclvl   n   percent
   1st quartile 116 0.2516269
   2nd quartile 115 0.2494577
   3rd quartile 115 0.2494577
   4th quartile 115 0.2494577
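
As an aside, the same quartile variable could be built without hard-coding the cut points by cutting at the sample quartiles directly. A sketch (inclvl2 is a hypothetical name, not used elsewhere in this document):

Code
dat2014 <- dat2014 %>%
  mutate(inclvl2 = cut(HINCP,
                       breaks = quantile(HINCP, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE),
                       labels = c("1st quartile", "2nd quartile", "3rd quartile", "4th quartile"),
                       include.lowest = TRUE))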
Code
inc14 <- dat2014 %>% 
  group_by(inclvl) %>%
  summarize('n' = n(), 'N' = sum(WGTP), "Median" = weightedMedian(GRPIP, WGTP), "Mean" = weightedMean(GRPIP, WGTP))

gt(inc14)
inclvl n N Median Mean
1st quartile 116 25262 0.8487671 0.7004726
2nd quartile 115 20363 0.4600000 0.4924068
3rd quartile 115 20667 0.2600000 0.2813534
4th quartile 115 20731 0.1500000 0.1532536
Code
summary(as.numeric(dat2019$HINCP))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0   23500   49000   65912   87850  476000 

Convert income to numeric.

Code
dat2019$HINCP <- as.numeric(dat2019$HINCP)
Code
dat2019 <- dat2019 %>% 
    mutate(inclvl = case_when(.$HINCP <= 23500 ~"1st quartile",
                              .$HINCP > 23500 & .$HINCP <= 49000 ~"2nd quartile",
                              .$HINCP > 49000 & .$HINCP <= 87850 ~"3rd quartile",
                              .$HINCP > 87850 ~"4th quartile"
                              )
        )

tabyl(dat2019$inclvl)
 dat2019$inclvl   n   percent
   1st quartile 116 0.2516269
   2nd quartile 115 0.2494577
   3rd quartile 115 0.2494577
   4th quartile 115 0.2494577
Code
inc19 <- dat2019 %>% 
  group_by(inclvl) %>%
  summarize('n' = n(), 'N' = sum(WGTP), "Median" = weightedMedian(GRPIP, WGTP), "Mean" = weightedMean(GRPIP, WGTP))

gt(inc19)

Change in cost burden for all households from 2014 to 2019, by presence of children.

Code
tabyl(dat2014$HUPAC)
 dat2014$HUPAC   n    percent
             1  30 0.06507592
             2  37 0.08026030
             3  18 0.03904555
             4 376 0.81561822
Code
dat2014 <- dat2014 %>% 
    # HUPAC = 4 means no children under 18 in the household; codes 1-3 indicate children are present
    mutate(child = (ifelse(.$HUPAC == 4, 'No children', 'Children')
                     )
           )

tabyl(dat2014$child)
 dat2014$child   n   percent
      Children  85 0.1843818
   No children 376 0.8156182
Code
child14 <- dat2014 %>% 
  group_by(child) %>%
  summarize('n' = n(), 'N' = sum(WGTP), "Median" = weightedMedian(GRPIP, WGTP), "Mean" = weightedMean(GRPIP, WGTP))

gt(child14)
child n N Median Mean
Children 85 18578 0.37 0.4749833
No children 376 68445 0.29 0.4074777
Code
tabyl(dat2019$HUPAC)
 dat2019$HUPAC   n    percent
             1  26 0.05639913
             2  26 0.05639913
             3  18 0.03904555
             4 391 0.84815618
Code
dat2019 <- dat2019 %>% 
    mutate(child = (ifelse(.$HUPAC == 4, 'No children', 'Children')
                     )
           )

tabyl(dat2019$child)
 dat2019$child   n   percent
      Children  70 0.1518438
   No children 391 0.8481562
Code
child19 <- dat2019 %>% 
  group_by(child) %>%
  summarize('n' = n(), 'N' = sum(WGTP), "Median" = weightedMedian(GRPIP, WGTP), "Mean" = weightedMean(GRPIP, WGTP))

gt(child19)
child n N Median Mean
Children 70 18656 0.3560757 0.4174496
No children 391 76308 0.2389815 0.3259456

Change in cost burden for all households from 2014 to 2019, by citizenship status.

Code
tabyl(dat2014$CIT)
 dat2014$CIT   n    percent
           1 380 0.82429501
           3   4 0.00867679
           4  35 0.07592191
           5  42 0.09110629
Code
dat2014 <- dat2014 %>% 
    # CIT = 5 means not a U.S. citizen; codes 1-4 are citizens by birth or naturalization
    mutate(citizen = (ifelse(.$CIT == 5, 'Non-citizen', 'Citizen')
                     )
           )

tabyl(dat2014$citizen)
 dat2014$citizen   n    percent
         Citizen 419 0.90889371
     Non-citizen  42 0.09110629
Code
citizen14 <- dat2014 %>%
  group_by(citizen) %>%
  summarize('n' = n(), 'N' = sum(WGTP), "Income" = weightedMedian(HINCP, WGTP),  "Median" = weightedMedian(GRPIP, WGTP), "Mean" = weightedMean(GRPIP, WGTP))

gt(citizen14)
citizen n N Income Median Mean
Citizen 419 74351 32000.00 0.3000000 0.4198246
Non-citizen 42 12672 24040.06 0.3743038 0.4340017
Code
tabyl(dat2019$CIT)
 dat2019$CIT   n     percent
           1 381 0.826464208
           2   1 0.002169197
           3   3 0.006507592
           4  43 0.093275488
           5  33 0.071583514
Code
dat2019 <- dat2019 %>% 
    mutate(citizen = (ifelse(.$CIT == 5, 'Non-citizen', 'Citizen')
                     )
           )

tabyl(dat2019$citizen)
 dat2019$citizen   n    percent
         Citizen 428 0.92841649
     Non-citizen  33 0.07158351
Code
citizen19 <- dat2019 %>%
  group_by(citizen) %>%
  summarize('n' = n(), 'N' = sum(WGTP), "Income" = weightedMedian(HINCP, WGTP),  "Median" = weightedMedian(GRPIP, WGTP), "Mean" = weightedMean(GRPIP, WGTP))

gt(citizen19)
citizen n N Income Median Mean
Citizen 428 87981 45000.00 0.25 0.3458514
Non-citizen 33 6983 46952.43 0.29 0.3196119

Independent samples t-test

https://stat-methods.com/home/independent-samples-t-test-r-2/

An independent samples t-test requires the following assumptions:

  • The response of interest is continuous and normally distributed for each treatment group.
  • Treatment groups are independent of one another. Experimental units only receive one treatment, and they do not overlap.
  • There are no major outliers.
  • A check for unequal variances will help determine which version of a t-test is most appropriate:
      • If variances are equal, then a pooled t-test is appropriate.
      • If variances are unequal, then a Satterthwaite (also known as Welch’s) t-test is appropriate.

The dependent response variable is cost burden.

The independent categorical variable is year.

Create a new year variable in each dataset.

Code
dat2014 <- dat2014 %>%
  mutate(year = "2014")

dat2019 <- dat2019 %>%
  mutate(year = "2019")

Rowbind the dataframes into one.

Code
dat <- rbind(dat2014, dat2019)
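
A slightly safer alternative here (a sketch) is dplyr’s bind_rows(), which matches columns by name rather than by position:

Code
dat <- bind_rows(dat2014, dat2019)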

Designate year as a categorical factor.

Code
dat$year <- as.factor(dat$year)

Ensure GRPIP is numeric.

Code
class(dat$GRPIP)
[1] "numeric"

Perform the Shapiro-Wilk Test for Normality on each group.

Code
dat %>%
  group_by(year) %>%
  summarise(`W Statistic` = shapiro.test(GRPIP)$statistic,
            `p-value` = shapiro.test(GRPIP)$p.value)
# A tibble: 2 × 3
  year  `W Statistic` `p-value`
  <fct>         <dbl>     <dbl>
1 2014          0.844  6.08e-21
2 2019          0.825  4.56e-22

Because the p-value is < 0.05 for both samples, we reject the null hypothesis of normality. The distributions are not normal.

Perform QQ plots by group.

Code
ggplot(data = dat, mapping = aes(sample = GRPIP, color = year, fill = year)) +
  stat_qq_band(alpha=0.5, conf=0.95, qtype=1, bandType = "ts") +
  stat_qq_line(identity=TRUE) +
  stat_qq_point(col="black") +
  facet_wrap(~ year, scales = "free") +
  labs(x = "Theoretical Quantiles", y = "Sample Quantiles") + theme_bw() 

For the QQ plots, the majority of points should follow each line and stay within the curved 95% bootstrapped confidence bands for the data to be considered normally distributed. Again, these plots show the two samples are not normally distributed.

Perform Levene’s Test of Equality of Variances.

Code
lev1<-leveneTest(GRPIP ~ year, data=dat, center="mean")
lev2<-leveneTest(GRPIP ~ year, data=dat, center="median")
print(lev1)
Levene's Test for Homogeneity of Variance (center = "mean")
       Df F value    Pr(>F)    
group   1  14.616 0.0001406 ***
      920                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
print(lev2)
Levene's Test for Homogeneity of Variance (center = "median")
       Df F value   Pr(>F)   
group   1  7.9994 0.004781 **
      920                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

A significant p-value (p < 0.05) indicates that Satterthwaite (also known as Welch’s) t-test results should be used instead of pooled t-test results.

Produce boxplots and visually check for outliers

Code
ggplot(dat, aes(x = year, y = GRPIP, fill = year)) +
  stat_boxplot(geom ="errorbar", width = 0.5) +
  geom_boxplot(fill = "light blue") + 
  stat_summary(fun=mean, geom="point", shape=10, size=3.5, color="black") + 
  ggtitle("Boxplots of rent cost burdent in 2014 and 2019") + 
  theme_bw() + theme(legend.position="none")

The boxplots show many outliers in 2019.

Produce descriptive statistics by group

Code
dat %>%
  select(GRPIP, year) %>%
  group_by(year) %>% 
  summarise(n = n(), 
            mean = mean(GRPIP, na.rm = TRUE), 
            sd = sd(GRPIP, na.rm = TRUE),
            stderr = sd/sqrt(n), 
            LCL = mean - qt(1 - (0.05 / 2), n - 1) * stderr,
            UCL = mean + qt(1 - (0.05 / 2), n - 1) * stderr,
            median=median(GRPIP, na.rm = TRUE),
            min=min(GRPIP, na.rm = TRUE), 
            max=max(GRPIP, na.rm = TRUE),
            IQR=IQR(GRPIP, na.rm = TRUE))
# A tibble: 2 × 11
  year      n  mean    sd stderr   LCL   UCL median   min   max   IQR
  <fct> <int> <dbl> <dbl>  <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>
1 2014    461 0.410 0.297 0.0138 0.383 0.437   0.3      0  1.01  0.35
2 2019    461 0.341 0.261 0.0122 0.317 0.365   0.26     0  1.01  0.25

In the table above:

  • year – This column identifies the levels of the grouping variable (2014 and 2019).
  • n – This column identifies how many data points (households) are in each group.
  • mean – The mean value for each group.
  • sd – The standard deviation of each group.
  • stderr – The standard error of each group.
  • LCL, UCL – The lower and upper confidence limits of the mean. That is to say, you can be 95% certain that the true mean falls between the lower and upper values specified for each group, assuming the data is normally distributed.
  • median – The median value for each group.
  • min, max – The minimum and maximum values observed for each group.
  • IQR – The interquartile range of each group. The interquartile range is the 75th percentile – 25th percentile.

Perform an Independent Samples T-test

Code
m1<-t.test(GRPIP ~ year, data=dat, var.equal=FALSE, na.rm=TRUE)
print(m1)

    Welch Two Sample t-test

data:  GRPIP by year
t = 3.7674, df = 905.39, p-value = 0.0001756
alternative hypothesis: true difference in means between group 2014 and group 2019 is not equal to 0
95 percent confidence interval:
 0.03321241 0.10544268
sample estimates:
mean in group 2014 mean in group 2019 
         0.4102820          0.3409544 

t – This is the t-statistic, the ratio of the difference in means to the standard error of the difference. df – The appropriate degrees of freedom; this varies between each type of independent samples t-test.

p-value – This is the p-value associated with the test. Here the p-value = 0.0001756, so we reject the null hypothesis that the mean cost burdens of 2014 and 2019 are equal and conclude that there is a difference in means between the groups. 95 percent confidence interval – These values bound the difference in means between the two groups. That is to say, you can be 95% certain that the true difference in mean cost burden between 2014 and 2019 falls between 0.033 and 0.105. sample estimates – The sample mean cost burden in each group: 0.410 in 2014 and 0.341 in 2019.
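
As a rough check on the arithmetic, the Welch statistic can be reproduced from the rounded descriptive statistics above (this is just a hand verification, not part of the analysis):

Code
mean14 <- 0.410; sd14 <- 0.297; n14 <- 461
mean19 <- 0.341; sd19 <- 0.261; n19 <- 461

se_diff <- sqrt(sd14^2 / n14 + sd19^2 / n19)  # standard error of the difference in means
(mean14 - mean19) / se_diff                   # about 3.75, in line with the t = 3.77 reported above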

What to do When Assumptions are Broken or Things Go Wrong

A lack of normality or a severe impact of outliers can violate the assumptions of the independent samples t-test and ultimately distort the results. If this happens, there are several available options:

The most popular alternative is the nonparametric Mann-Whitney U test, also known as the Mann-Whitney-Wilcoxon or the Wilcoxon Rank Sum test. This test is considered robust to violations of normality and outliers (among others) and tests for differences in mean ranks.

Additional options include considering permutation/randomization tests, bootstrap confidence intervals, and transforming the data but each option will have its own stipulations.

If you need to compare more than two independent groups, a one-way Analysis of Variance (ANOVA) or Kruskal-Wallis test may be appropriate.

An independent samples t-test is not appropriate if you have repeated measurements taken on the same experimental unit (subject). For example, if you have a pre-test post-test study, then each subject was measured at two different time intervals. If this is the case, then a paired t-test may be a more appropriate course of action.

Mann-Whitney U test or Wilcoxon Rank Sum test

The following assumptions must be met in order to run a Mann-Whitney U test:

  • Treatment groups are independent of one another. Experimental units only receive one treatment and they do not overlap.
  • The response variable of interest is ordinal or continuous.
  • Both samples are random.

The dependent response variable is cost burden.

The independent categorical variable is year.

Perform the Mann-Whitney U test

Code
m2<-wilcox.test(GRPIP ~ year, data=dat, na.rm=TRUE, paired=FALSE, exact=FALSE, conf.int=TRUE)
print(m2)

    Wilcoxon rank sum test with continuity correction

data:  GRPIP by year
W = 122212, p-value = 7.892e-05
alternative hypothesis: true location shift is not equal to 0
95 percent confidence interval:
 0.02007798 0.06992372
sample estimates:
difference in location 
            0.04005175 

Hodges Lehmann Estimator

Code
m2$estimate
difference in location 
            0.04005175 

Interpretation of Mann-Whitney U test

Rental cost burden in each year is not normally distributed. Outliers exist in each group. A Mann-Whitney U test is more appropriate than a traditional independent samples t-test to compare the rental cost burden between the two years.

The Mann-Whitney U test results in a two-sided p-value = 7.892e-05. This indicates that we should reject the null hypothesis that the distributions are equal and conclude that there is a significant difference in cost burden between the two years. Descriptive statistics indicate that the median cost burden was 30% in 2014 and 26% in 2019, a difference of about 4 percentage points; the Hodges-Lehmann estimate of the shift is the same, about 4 percentage points. We are 95% certain that the shift between 2014 and 2019 is between about 2 and 7 percentage points. Thus, the roughly 4 percentage point decline in rental cost burden from 2014 to 2019 is statistically significant.

Let’s try to make tables with the combined data using the mean.

Recode sex

Code
# Recode sex
dat <- dat %>%
  mutate(SEX = (ifelse(.$SEX == 1, 'Male', 'Female')))

Table 1, change in cost burden for all households from 2014 to 2019

Code
t1 <- dat %>%
  group_by("Year" = year) %>%
  summarise(`Cost burden` = weighted.mean(GRPIP, WGTP, na.rm = TRUE)) %>%
  pivot_wider(names_from = `Year`, values_from = `Cost burden`)

t1 <- t1 %>%
  mutate(diff = `2019` - `2014`)

t1
# A tibble: 1 × 3
  `2014` `2019`    diff
   <dbl>  <dbl>   <dbl>
1  0.422  0.344 -0.0780
Code
gt(t1)
2014 2019 diff
0.421889 0.3439219 -0.07796713
Code
write.csv(t1, "metro.csv")

Table 2, change in cost burden for all households from 2014 to 2019, grouped by sex

Code
t2 <- dat %>%
  group_by("Year" = year, "Sex" = SEX) %>%
  summarise(`Cost burden` = weighted.mean(GRPIP, WGTP, na.rm = TRUE)) %>%
  pivot_wider(names_from = `Year`, values_from = `Cost burden`)
`summarise()` has grouped output by 'Year'. You can override using the
`.groups` argument.
Code
t2 <- t2 %>%
  mutate(diff = `2019` - `2014`)

t2
# A tibble: 2 × 4
  Sex    `2014` `2019`    diff
  <chr>   <dbl>  <dbl>   <dbl>
1 Female  0.453  0.371 -0.0826
2 Male    0.389  0.315 -0.0735
Code
gt(t2)
Sex 2014 2019 diff
Female 0.4532549 0.3706347 -0.08262016
Male 0.3886364 0.3151623 -0.07347414
Code
write.csv(t2, "sex.csv")

Table 3, change in cost burden for all households from 2014 to 2019, grouped by race/ethnicity

Code
t3 <- dat %>%
  group_by("Year" = year, "Race/Eth" = race_eth) %>%
  summarise(`Cost burden` = weighted.mean(GRPIP, WGTP, na.rm = TRUE)) %>%
  pivot_wider(names_from = `Year`, values_from = `Cost burden`)
`summarise()` has grouped output by 'Year'. You can override using the
`.groups` argument.
Code
t3 <- t3 %>%
  mutate(diff = `2019` - `2014`)

t3
# A tibble: 10 × 4
   `Race/Eth`         `2014` `2019`   diff
   <chr>               <dbl>  <dbl>  <dbl>
 1 Hispanic            0.440  0.310 -0.130
 2 Non-Hispanic Asian  0.415 NA     NA    
 3 Non-Hispanic Black  0.508 NA     NA    
 4 Non-Hispanic White  0.372 NA     NA    
 5 Other               0.525 NA     NA    
 6 <NA>                0.208 NA     NA    
 7 Non-hispanic Asian NA      0.430 NA    
 8 Non-hispanic Black NA      0.373 NA    
 9 Non-hispanic White NA      0.322 NA    
10 other              NA      0.412 NA    
Code
gt(t3)
Race/Eth 2014 2019 diff
Hispanic 0.4397007 0.3097231 -0.1299776
Non-Hispanic Asian 0.4154056 NA NA
Non-Hispanic Black 0.5083152 NA NA
Non-Hispanic White 0.3720215 NA NA
Other 0.5245358 NA NA
NA 0.2083099 NA NA
Non-hispanic Asian NA 0.4302271 NA
Non-hispanic Black NA 0.3731761 NA
Non-hispanic White NA 0.3217897 NA
other NA 0.4124263 NA
Code
write.csv(t3, "race_eth.csv")

Table 4, change in cost burden for all households from 2014 to 2019, grouped by age

Code
t4 <- dat %>%
  group_by("Year" = year, "Age group" = agelvl) %>%
  summarise(`Cost burden` = weighted.mean(GRPIP, WGTP, na.rm = TRUE)) %>%
  pivot_wider(names_from = `Year`, values_from = `Cost burden`)
`summarise()` has grouped output by 'Year'. You can override using the
`.groups` argument.
Code
t4 <- t4 %>%
  mutate(diff = `2019` - `2014`)

t4
# A tibble: 3 × 4
  `Age group`  `2014` `2019`    diff
  <chr>         <dbl>  <dbl>   <dbl>
1 24 and under  0.608  0.428 -0.180 
2 25 to 64      0.381  0.334 -0.0463
3 65 and over   0.505  0.317 -0.188 
Code
gt(t4)
Age group 2014 2019 diff
24 and under 0.6082305 0.4283608 -0.17986975
25 to 64 0.3806627 0.3343517 -0.04631097
65 and over 0.5049577 0.3168311 -0.18812665
Code
write.csv(t4, "age.csv")

Table 5, change in cost burden for all households from 2014 to 2019, grouped by income quartile

Code
t5 <- dat %>%
  group_by("Year" = year, "Income quartile" = inclvl) %>%
  summarise(`Cost burden` = weighted.mean(GRPIP, WGTP, na.rm = TRUE)) %>%
  pivot_wider(names_from = `Year`, values_from = `Cost burden`)
`summarise()` has grouped output by 'Year'. You can override using the
`.groups` argument.
Code
t5 <- t5 %>%
  mutate(diff = `2019` - `2014`)

t5
# A tibble: 4 × 4
  `Income quartile` `2014` `2019`     diff
  <chr>              <dbl>  <dbl>    <dbl>
1 1st quartile       0.700  0.596 -0.104  
2 2nd quartile       0.492  0.331 -0.161  
3 3rd quartile       0.281  0.224 -0.0569 
4 4th quartile       0.153  0.148 -0.00551
Code
gt(t5)
Income quartile 2014 2019 diff
1st quartile 0.7004726 0.5962355 -0.104237104
2nd quartile 0.4924068 0.3309767 -0.161430079
3rd quartile 0.2813534 0.2244523 -0.056901098
4th quartile 0.1532536 0.1477416 -0.005511999
Code
write.csv(t5, "income.csv")

Table 6, change in cost burden for all households from 2014 to 2019, grouped by presence of children

Code
t6 <- dat %>%
  group_by("Year" = year, "Presence of children" = child) %>%
  summarise(`Cost burden` = weighted.mean(GRPIP, WGTP, na.rm = TRUE)) %>%
  pivot_wider(names_from = `Year`, values_from = `Cost burden`)
`summarise()` has grouped output by 'Year'. You can override using the
`.groups` argument.
Code
t6 <- t6 %>%
  mutate(diff = `2019` - `2014`)

t6
# A tibble: 2 × 4
  `Presence of children` `2014` `2019`    diff
  <chr>                   <dbl>  <dbl>   <dbl>
1 Children                0.475  0.417 -0.0575
2 No children             0.407  0.326 -0.0815
Code
gt(t6)
Presence of children 2014 2019 diff
Children 0.4749833 0.4174496 -0.05753370
No children 0.4074777 0.3259456 -0.08153204
Code
write.csv(t6, "children.csv")

Table 7, change in cost burden for all households from 2014 to 2019, grouped by citizenship status

Code
t7 <- dat %>%
  group_by("Year" = year, "Citizenship status" = citizen) %>%
  summarise(`Cost burden` = weighted.mean(GRPIP, WGTP, na.rm = TRUE)) %>%
  pivot_wider(names_from = `Year`, values_from = `Cost burden`)
`summarise()` has grouped output by 'Year'. You can override using the
`.groups` argument.
Code
t7 <- t7 %>%
  mutate(diff = `2019` - `2014`)

t7
# A tibble: 2 × 4
  `Citizenship status` `2014` `2019`    diff
  <chr>                 <dbl>  <dbl>   <dbl>
1 Citizen               0.420  0.346 -0.0740
2 Non-citizen           0.434  0.320 -0.114 
Code
gt(t7)
Citizenship status 2014 2019 diff
Citizen 0.4198246 0.3458514 -0.07397324
Non-citizen 0.4340017 0.3196119 -0.11438982
Code
write.csv(t7, "citizenship.csv")

Let’s try to make tables with the combined data using the median.

Table 1, change in cost burden for all households from 2014 to 2019

Code
t1a <- dat %>%
  group_by("Year" = year) %>%
  summarise(`Cost burden` = weightedMedian(GRPIP, WGTP, na.rm = TRUE)) %>%
  pivot_wider(names_from = `Year`, values_from = `Cost burden`)

t1a <- t1a %>%
  mutate(diff = `2019` - `2014`)

t1a
# A tibble: 1 × 3
  `2014` `2019`  diff
   <dbl>  <dbl> <dbl>
1    0.3   0.25 -0.05
Code
gt(t1a)
2014 2019 diff
0.3 0.25 -0.05
Code
write.csv(t1a, "metro_median.csv")

Table 2, change in cost burden for all households from 2014 to 2019, grouped by sex

Code
t2a <- dat %>%
  group_by("Year" = year, "Sex" = SEX) %>%
  summarise(`Cost burden` = weightedMedian(GRPIP, WGTP, na.rm = TRUE)) %>%
  pivot_wider(names_from = `Year`, values_from = `Cost burden`)
`summarise()` has grouped output by 'Year'. You can override using the
`.groups` argument.
Code
t2a <- t2a %>%
  mutate(diff = `2019` - `2014`)

t2a
# A tibble: 2 × 4
  Sex    `2014` `2019`  diff
  <chr>   <dbl>  <dbl> <dbl>
1 Female   0.33   0.27 -0.06
2 Male     0.27   0.23 -0.04
Code
gt(t2a)
Sex 2014 2019 diff
Female 0.33 0.27 -0.06
Male 0.27 0.23 -0.04
Code
write.csv(t2a, "sex_median.csv")

Table 3, change in cost burden for all households from 2014 to 2019, grouped by race/ethnicity

Code
t3a <- dat %>%
  group_by("Year" = year, "Race/Eth" = race_eth) %>%
  summarise(`Cost burden` = weightedMedian(GRPIP, WGTP, na.rm = TRUE)) %>%
  pivot_wider(names_from = `Year`, values_from = `Cost burden`)
`summarise()` has grouped output by 'Year'. You can override using the
`.groups` argument.
Code
t3a <- t3a %>%
  mutate(diff = `2019` - `2014`)

t3a
# A tibble: 10 × 4
   `Race/Eth`         `2014` `2019`   diff
   <chr>               <dbl>  <dbl>  <dbl>
 1 Hispanic            0.362  0.2   -0.162
 2 Non-Hispanic Asian  0.267 NA     NA    
 3 Non-Hispanic Black  0.383 NA     NA    
 4 Non-Hispanic White  0.27  NA     NA    
 5 Other               0.355 NA     NA    
 6 <NA>                0.208 NA     NA    
 7 Non-hispanic Asian NA      0.26  NA    
 8 Non-hispanic Black NA      0.28  NA    
 9 Non-hispanic White NA      0.23  NA    
10 other              NA      0.373 NA    
Code
gt(t3a)
Race/Eth 2014 2019 diff
Hispanic 0.3621477 0.2000000 -0.1621477
Non-Hispanic Asian 0.2670468 NA NA
Non-Hispanic Black 0.3826991 NA NA
Non-Hispanic White 0.2700000 NA NA
Other 0.3547458 NA NA
NA 0.2083099 NA NA
Non-hispanic Asian NA 0.2600000 NA
Non-hispanic Black NA 0.2800000 NA
Non-hispanic White NA 0.2300000 NA
other NA 0.3727961 NA
Code
write.csv(t3a, "race_eth_median.csv")

Table 4, change in cost burden for all households from 2014 to 2019, grouped by age

Code
t4a <- dat %>%
  group_by("Year" = year, "Age group" = agelvl) %>%
  summarise(`Cost burden` = weightedMedian(GRPIP, WGTP, na.rm = TRUE)) %>%
  pivot_wider(names_from = `Year`, values_from = `Cost burden`)
`summarise()` has grouped output by 'Year'. You can override using the
`.groups` argument.
Code
t4a <- t4a %>%
  mutate(diff = `2019` - `2014`)

t4a
# A tibble: 3 × 4
  `Age group`  `2014` `2019`    diff
  <chr>         <dbl>  <dbl>   <dbl>
1 24 and under  0.56    0.32 -0.24  
2 25 to 64      0.27    0.24 -0.0300
3 65 and over   0.339   0.28 -0.0587
Code
gt(t4a)
Age group 2014 2019 diff
24 and under 0.5600000 0.32 -0.24000000
25 to 64 0.2700000 0.24 -0.03000000
65 and over 0.3386926 0.28 -0.05869263
Code
write.csv(t4a, "age_median.csv")

Table 5, change in cost burden for all households from 2014 to 2019, grouped by income quartile

Code
t5a <- dat %>%
  group_by("Year" = year, "Income quartile" = inclvl) %>%
  summarise(`Cost burden` = weightedMedian(GRPIP, WGTP, na.rm = TRUE)) %>%
  pivot_wider(names_from = `Year`, values_from = `Cost burden`)
`summarise()` has grouped output by 'Year'. You can override using the
`.groups` argument.
Code
t5a <- t5a %>%
  mutate(diff = `2019` - `2014`)

t5a
# A tibble: 4 × 4
  `Income quartile` `2014` `2019`    diff
  <chr>              <dbl>  <dbl>   <dbl>
1 1st quartile       0.849  0.625 -0.224 
2 2nd quartile       0.46   0.29  -0.17  
3 3rd quartile       0.26   0.22  -0.04  
4 4th quartile       0.15   0.14  -0.0100
Code
gt(t5a)
Income quartile 2014 2019 diff
1st quartile 0.8487671 0.6245815 -0.2241856
2nd quartile 0.4600000 0.2900000 -0.1700000
3rd quartile 0.2600000 0.2200000 -0.0400000
4th quartile 0.1500000 0.1400000 -0.0100000
Code
write.csv(t5a, "income_median.csv")

Table 6, change in cost burden for all households from 2014 to 2019, grouped by presence of children

Code
t6a <- dat %>%
  group_by("Year" = year, "Presence of children" = child) %>%
  summarise(`Cost burden` = weightedMedian(GRPIP, WGTP, na.rm = TRUE)) %>%
  pivot_wider(names_from = `Year`, values_from = `Cost burden`)
`summarise()` has grouped output by 'Year'. You can override using the
`.groups` argument.
Code
t6a <- t6a %>%
  mutate(diff = `2019` - `2014`)

t6a
# A tibble: 2 × 4
  `Presence of children` `2014` `2019`    diff
  <chr>                   <dbl>  <dbl>   <dbl>
1 Children                 0.37  0.356 -0.0139
2 No children              0.29  0.239 -0.0510
Code
gt(t6a)
Presence of children 2014 2019 diff
Children 0.37 0.3560757 -0.01392427
No children 0.29 0.2389815 -0.05101847
Code
write.csv(t6a, "children_median.csv")

Table 7, change in cost burden for all households from 2014 to 2019, grouped by citizenship status

Code
t7a <- dat %>%
  group_by("Year" = year, "Citizenship status" = citizen) %>%
  summarise(`Cost burden` = weightedMedian(GRPIP, WGTP, na.rm = TRUE)) %>%
  pivot_wider(names_from = `Year`, values_from = `Cost burden`)
`summarise()` has grouped output by 'Year'. You can override using the
`.groups` argument.
Code
t7a <- t7a %>%
  mutate(diff = `2019` - `2014`)

t7a
# A tibble: 2 × 4
  `Citizenship status` `2014` `2019`    diff
  <chr>                 <dbl>  <dbl>   <dbl>
1 Citizen               0.3     0.25 -0.05  
2 Non-citizen           0.374   0.29 -0.0843
Code
gt(t7a)
Citizenship status 2014 2019 diff
Citizen 0.3000000 0.25 -0.0500000
Non-citizen 0.3743038 0.29 -0.0843038
Code
write.csv(t7a, "citizenship_median.csv")

Notes:

  • I need to add a count to the summaries. Done.
  • I need to add weights to summary statistics. Done.
  • Tests for statistical significance with t-test. Done.
  • Conduct Mann-Whitney test. Done.
  • Should I remove NAs, zeros, and outliers? Would it be justified? By what criteria would I exclude outliers?
  • Some things I need to do in the future:
      • Use IPUMS because it puts the data on your computer.
      • Combine years early.
      • Explore NAs and outliers early. Learn when to remove outliers.