Inequality Research Project

Author

Brian Surratt

Published

April 28, 2023

Research Proposal

I will examine the housing cost burden for renters in Minneapolis, Minnesota in 2014 compared to 2019 by census tract. In 2014, Minneapolis passed a city ordinance allowing the construction of accessory dwelling units (ADUs). The ordinance allows single-family homeowners to construct ADUs on their property. Theoretically, this should add housing supply to the city and apply downward pressure on housing costs. Comparing housing cost burden before and after the ordinance passed will provide evidence of the ordinance's effects.

Description of data

The source of data is the U.S. Census Bureau’s American Community Survey Public Use Microdata Sample (PUMS) from 2014 and 2019. I will access the data via the get_pums() function from the tidycensus package in R; the setup sketch after the variable list shows the packages this document assumes. The variables I will use are the following:

  • PUMA: Public Use Microdata Area code. The PUMAs for Minneapolis are 1405, 1406, and 1407. In the 2014 data, this is a 4-character variable. In the 2019 data, this is a 5-character variable.
  • TYPE: Type of unit, filtered for “1” which is a housing unit. This removes group quarters.
  • TEN: This variable is housing tenure and can have four values, 1 = “owned with mortgage,” 2 = “owned free and clear,” 3 = “rented,” and 4 = “occupied without payment of rent.”
  • RELP: This is the relationship of the respondent to the household reference person. For the 2014 data this must be filtered to “00” to limit to one response per household. For the 2019 data, the variable name is RELSHIPP and the filter must be set to “20”.
  • GRPIP: Gross rent as a percentage of household income past 12 months.
  • SEX: Sex
  • RAC1P: Race
  • AGEP: Age
  • HINCP: Household Income
  • HUPAC: Presence of children
  • CIT: Citizenship status
  • HHL: Household Language
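
A quick note on setup: the code below assumes the following packages are loaded and that a Census API key has been registered for get_pums(). This is a sketch of the session setup, which is not shown in this document.

Code
library(tidycensus)   # get_pums(), census_api_key()
library(tidyverse)    # dplyr, tidyr, ggplot2, stringr
library(janitor)      # tabyl()
library(matrixStats)  # weightedMedian(), weightedMean()
library(gt)           # gt() display tables
library(car)          # leveneTest()
library(qqplotr)      # stat_qq_band(), stat_qq_line(), stat_qq_point()
# census_api_key("YOUR_KEY_HERE")  # placeholder; a real key is required for get_pums()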

Downloading data

First, let’s download and clean Minnesota PUMS data for 2014.

Code
mnpums2014 <- get_pums(
  variables = c("PUMA", "TYPE", "TEN", "HHT", "RELP", "GRPIP", "SEX", "RAC1P", "HISP", "AGEP", "HINCP", "HUPAC", "CIT", "HHL"),
  state = "MN",
  variables_filter = list(SPORDER = 1, TYPE = 1), # SPORDER = 1 gets households, TYPE = 1 gets housing units and eliminates group quarters.
  #puma = c(1405, 1406, 1407), # I can't figure out how to select by puma.  Maybe capitalize "PUMA"?
  survey = "acs1",
  year = 2014
  )
Getting data from the 2014 1-year ACS Public Use Microdata Sample

This is a sample of PUMS respondents in Minnesota for 2014. Let’s check the sample size.

Code
nrow(mnpums2014)
[1] 21524

Let’s create a new dataframe with only the PUMAs that cover Minneapolis (1405, 1406, and 1407) in 2014.

Code
dat2014 <- mnpums2014 %>%
  filter(PUMA %in% c("1405", "1406", "1407"))

This is a sample of PUMS respondents in Minneapolis for 2014. Let’s check the sample size.

Code
nrow(dat2014)
[1] 1023

Let’s modify the PUMA variable in the 2014 dataframe so the PUMA is 5 characters, matching the 2019 data.

Code
dat2014 <- dat2014 %>%
  mutate(PUMA = case_when(.$PUMA == "1405" ~ "01405",
                          .$PUMA == "1406" ~ "01406",
                          .$PUMA == "1407" ~ "01407",
                          )
         )
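
An equivalent, more general way to left-pad the codes (a sketch; str_pad() comes with the tidyverse's stringr package) would be:

Code
# str_pad() left-pads with "0" to a fixed width, so it handles any PUMA code
dat2014 <- dat2014 %>%
  mutate(PUMA = str_pad(PUMA, width = 5, pad = "0"))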

Let’s check the distribution by PUMA.

Code
tabyl(dat2014$PUMA)
 dat2014$PUMA   n   percent
        01405 344 0.3362659
        01406 331 0.3235582
        01407 348 0.3401760

In order to get just one observation per household, we need to ensure RELP == 0 for every record.

Code
tabyl(dat2014$RELP)
 dat2014$RELP    n percent
            0 1023       1

So now we know there is only one observation per household. Let’s see the distribution of renters vs. non-renters. 1 is owned with a mortgage, 2 is owned free and clear, 3 is rented, 4 is occupied without payment of rent.

Code
tabyl(dat2014$TEN)
 dat2014$TEN   n     percent
           1 401 0.391984360
           2 158 0.154447703
           3 461 0.450635386
           4   3 0.002932551

Let’s filter for only renters.

Code
dat2014 <- dat2014 %>% 
  filter(TEN == 3)

tabyl(dat2014$PUMA)
 dat2014$PUMA   n   percent
        01405 155 0.3362256
        01406 114 0.2472885
        01407 192 0.4164859

Let’s check the distribution of rent as a percentage of income.

Code
dat2014$GRPIP <- (as.numeric(dat2014$GRPIP))/100

hist(dat2014$GRPIP)

Next, let’s download and clean Minnesota PUMS data for 2019.

Code
mnpums2019 <- get_pums(
  variables = c("PUMA", "TYPE", "TEN", "HHT", "RELSHIPP", "GRPIP", "SEX", "RAC1P", "HISP", "AGEP", "HINCP", "HUPAC", "CIT", "HHL"),
  state = "MN",
  variables_filter = list(SPORDER = 1, TYPE = 1), # SPORDER = 1 gets households, TYPE = 1 gets housing units and eliminates group quarters.
  #puma = c(1405, 1406, 1407), # I can't figure out how to select by puma.  Maybe capitalize "PUMA"?
  survey = "acs1",
  year = 2019
  )
Getting data from the 2019 1-year ACS Public Use Microdata Sample

This is a sample of PUMS respondents in Minnesota for 2019. Let’s check the sample size.

Code
nrow(mnpums2019)
[1] 22576

Let’s create a new dataframe with only the PUMAs that cover Minneapolis (1405, 1406, and 1407) in 2019.

Code
dat2019 <- mnpums2019 %>%
  filter(PUMA %in% c("01405", "01406", "01407"))

This is a sample of PUMS respondents in Minneapolis for 2019. Let’s check the sample size.

Code
nrow(dat2019)
[1] 1054

Let’s check the distribution by PUMA.

Code
tabyl(dat2019$PUMA)
 dat2019$PUMA   n   percent
        01405 350 0.3320683
        01406 336 0.3187856
        01407 368 0.3491461

In order to get just one observation per household, we need to ensure RELSHIPP == 20 for every record.

Code
tabyl(dat2019$RELSHIPP)
 dat2019$RELSHIPP    n percent
               20 1054       1

So now we know there is only one observation per household. Let’s see the distribution of renters vs. non-renters. 1 is owned with a mortgage, 2 is owned free and clear, 3 is rented, 4 is occupied without payment of rent.

Code
tabyl(dat2019$TEN)
 dat2019$TEN   n   percent
           1 413 0.3918406
           2 174 0.1650854
           3 461 0.4373814
           4   6 0.0056926

Let’s filter for only renters.

Code
dat2019 <- dat2019 %>% 
  filter(TEN == 3)

tabyl(dat2019$PUMA)
 dat2019$PUMA   n   percent
        01405 158 0.3427332
        01406 109 0.2364425
        01407 194 0.4208243

Let’s check the distribution of rent as a percentage of income.

Code
dat2019$GRPIP <- (as.numeric(dat2019$GRPIP))/100

hist(dat2019$GRPIP)

Compare cost burden in 2014 with cost burden in 2019

Renaming “RELP” in the 2014 data to “RELSHIPP” so the 2014 and 2019 dataframes have matching column names.

Code
dat2014 <- dat2014  %>%
  rename(RELSHIPP = RELP)

head(dat2014)
# A tibble: 6 × 19
  SERIALNO  WGTP PWGTP PUMA  TEN   HHT   RELSHIPP GRPIP SEX   RAC1P HISP  AGEP 
  <chr>    <dbl> <dbl> <chr> <chr> <chr> <chr>    <dbl> <chr> <chr> <chr> <chr>
1 2339        96    96 01405 3     6     0         0.17 2     1     1     49   
2 10042      106   106 01407 3     7     0         0.24 2     1     1     22   
3 22566      132   132 01405 3     3     0         0.31 2     1     1     44   
4 25036      350   350 01405 3     3     0         0.22 2     1     2     28   
5 27547      319   319 01406 3     3     0         0.21 2     2     1     26   
6 30186      178   178 01407 3     6     0         0.97 2     2     1     58   
# ℹ 7 more variables: HINCP <chr>, HUPAC <chr>, CIT <chr>, HHL <chr>,
#   SPORDER <chr>, TYPE <chr>, ST <chr>

I’m just going to do the summary work here and paste the results into Excel.

Change in cost burden for all households from 2014 to 2019.

Median and mean GRPIP for all households in 2014.

Code
summary(dat2014$GRPIP)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.2000  0.3000  0.4103  0.5500  1.0100 

Median and mean GRPIP for all households in 2019.

Code
summary(dat2019$GRPIP)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.170   0.260   0.341   0.420   1.010 

Change in cost burden for all households from 2014 to 2019, grouped by sex.

Code
sex14 <- dat2014 %>% 
  group_by(SEX) %>%
  summarize('n' = n(), 'N' = sum(WGTP), "Median" = weightedMedian(GRPIP, WGTP), "Mean" = weightedMean(GRPIP, WGTP))

# For SEX, 1 = male and 2 = female

gt(sex14)
SEX n N Median Mean
1 216 42241 0.27 0.3886364
2 245 44782 0.33 0.4532549
Code
sex19 <- dat2019 %>% 
  group_by(SEX) %>%
  summarize('n' = n(), 'N' = sum(WGTP), "Median" = weightedMedian(GRPIP, WGTP), "Mean" = weightedMean(GRPIP, WGTP))

# For SEX, 1 = male and 2 = female

gt(sex19)
SEX n N Median Mean
1 228 45730 0.23 0.3151623
2 233 49234 0.27 0.3706347

Change in cost burden for all households from 2014 to 2019, grouped by race/ethnicity.

First, consolidate race and ethnicity for the 2014 data.

Code
tabyl(dat2014$RAC1P)
 dat2014$RAC1P   n     percent
             1 306 0.663774403
             2  95 0.206073753
             3   5 0.010845987
             5   2 0.004338395
             6  20 0.043383948
             8  13 0.028199566
             9  20 0.043383948
Code
tabyl(dat2014$HISP)
 dat2014$HISP   n     percent
            1 433 0.939262473
           16   1 0.002169197
           17   2 0.004338395
            2  22 0.047722343
            3   1 0.002169197
            8   2 0.004338395
Code
dat2014 <- dat2014 %>%
  mutate(race_eth = case_when(.$HISP == 01 & .$RAC1P == 1 ~"Non-Hispanic White",
                              .$HISP == 01 & .$RAC1P == 2 ~"Non-Hispanic Black",
                              .$HISP == 01 & .$RAC1P == 6 ~"Non-Hispanic Asian",
                              .$HISP == 01 & .$RAC1P %in% c(3, 4, 5, 7, 8, 9) ~"Other",
                              .$HISP %in% c(2:16) & .$RAC1P %in% c(1:9) ~ "Hispanic"
                              # Note: HISP codes above 16 (here, the two HISP == 17 records) are not
                              # captured by 2:16 and fall through to NA in the table below.
                              )
        )

tabyl(dat2014$race_eth)
   dat2014$race_eth   n     percent valid_percent
           Hispanic  26 0.056399132    0.05664488
 Non-Hispanic Asian  20 0.043383948    0.04357298
 Non-Hispanic Black  95 0.206073753    0.20697168
 Non-Hispanic White 290 0.629067245    0.63180828
              Other  28 0.060737527    0.06100218
               <NA>   2 0.004338395            NA
Code
race14 <- dat2014 %>% 
  group_by(race_eth) %>%
  summarize('n' = n(), 'N' = sum(WGTP), "Median" = weightedMedian(GRPIP, WGTP), "Mean" = weightedMean(GRPIP, WGTP))

gt(race14)
race_eth n N Median Mean
Hispanic 26 8019 0.3621477 0.4397007
Non-Hispanic Asian 20 5326 0.2670468 0.4154056
Non-Hispanic Black 95 21178 0.3826991 0.5083152
Non-Hispanic White 290 47162 0.2700000 0.3720215
Other 28 4912 0.3547458 0.5245358
NA 2 426 0.2083099 0.2083099

Consolidate race and ethnicity for the 2019 data.

Code
tabyl(dat2019$RAC1P)
 dat2019$RAC1P   n     percent
             1 321 0.696312364
             2  72 0.156182213
             3   5 0.010845987
             5   2 0.004338395
             6  30 0.065075922
             8  10 0.021691974
             9  21 0.045553145
Code
tabyl(dat2019$HISP)
 dat2019$HISP   n     percent
           01 428 0.928416486
           02  17 0.036876356
           03   1 0.002169197
           04   1 0.002169197
           07   1 0.002169197
           10   1 0.002169197
           11   1 0.002169197
           15   1 0.002169197
           17   3 0.006507592
           18   1 0.002169197
           19   1 0.002169197
           20   1 0.002169197
           23   3 0.006507592
           24   1 0.002169197
Code
dat2019 <- dat2019 %>%
  mutate(race_eth = case_when(.$HISP == '01' & .$RAC1P == '1' ~"Non-hispanic White",
                              .$HISP == '01' & .$RAC1P == '2' ~"Non-hispanic Black",
                              .$HISP == '01' & .$RAC1P == '6' ~"Non-hispanic Asian",
                              .$HISP == '01' & .$RAC1P %in% c('3', '4', '5', '7', '8', '9') ~"other",
                              .$HISP != '01' ~ "Hispanic"
                              # Note: these labels use a lowercase "h" and "other", while the 2014 labels
                              # use "Non-Hispanic" and "Other"; this mismatch matters when the two years
                              # are combined for Tables 3 and 3a below.
                              )
        )

tabyl(dat2019$race_eth)
   dat2019$race_eth   n    percent
           Hispanic  33 0.07158351
 Non-hispanic Asian  30 0.06507592
 Non-hispanic Black  70 0.15184382
 Non-hispanic White 306 0.66377440
              other  22 0.04772234
Code
race19 <- dat2019 %>% 
  group_by(race_eth) %>%
  summarize('n' = n(), 'N' = sum(WGTP), "Median" = weightedMedian(GRPIP, WGTP), "Mean" = weightedMean(GRPIP, WGTP))

gt(race19)
race_eth n N Median Mean
Hispanic 33 8089 0.2000000 0.3097231
Non-hispanic Asian 30 4271 0.2600000 0.4302271
Non-hispanic Black 70 25355 0.2800000 0.3731761
Non-hispanic White 306 52468 0.2300000 0.3217897
other 22 4781 0.3727961 0.4124263

Change in cost burden for all households from 2014 to 2019, grouped by age groups.

First, make a new variable with age categories: 24 and under, 25 to 64, and 65 and over.

Code
dat2014 <- dat2014 %>% 
    mutate(AGEP = as.numeric(AGEP),  # AGEP is stored as character here; convert before the numeric comparisons
           agelvl = case_when(AGEP <= 24 ~ "24 and under",
                              AGEP > 24 & AGEP <= 64 ~ "25 to 64",
                              AGEP > 64 ~ "65 and over"
                              )
        )

tabyl(dat2014$agelvl)
 dat2014$agelvl   n    percent
   24 and under  67 0.14533623
       25 to 64 361 0.78308026
    65 and over  33 0.07158351
Code
age14 <- dat2014 %>% 
  group_by(agelvl) %>%
  summarize('n' = n(), 'N' = sum(WGTP), "Median" = weightedMedian(GRPIP, WGTP), "Mean" = weightedMean(GRPIP, WGTP))

gt(age14)
agelvl n N Median Mean
24 and under 67 12275 0.5600000 0.6082305
25 to 64 361 68358 0.2700000 0.3806627
65 and over 33 6390 0.3386926 0.5049577
Code
dat2019 <- dat2019 %>% 
    mutate(AGEP = as.numeric(AGEP),  # AGEP is stored as character here; convert before the numeric comparisons
           agelvl = case_when(AGEP <= 24 ~ "24 and under",
                              AGEP > 24 & AGEP <= 64 ~ "25 to 64",
                              AGEP > 64 ~ "65 and over"
                              )
        )

tabyl(dat2019$agelvl)
 dat2019$agelvl   n   percent
   24 and under  69 0.1496746
       25 to 64 338 0.7331887
    65 and over  54 0.1171367
Code
age19 <- dat2019 %>% 
  group_by(agelvl) %>%
  summarize('n' = n(), 'N' = sum(WGTP), "Median" = weightedMedian(GRPIP, WGTP), "Mean" = weightedMean(GRPIP, WGTP))

gt(age19)
agelvl n N Median Mean
24 and under 69 11530 0.32 0.4283608
25 to 64 338 73440 0.24 0.3343517
65 and over 54 9994 0.28 0.3168311

Change in cost burden for all households from 2014 to 2019, grouped by income levels.

Let’s see what the quartiles are.

Code
summary(as.numeric(dat2014$HINCP))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  -3700   13800   32000   46490   62500  459000 

Convert income to numeric.

Code
dat2014$HINCP <- as.numeric(dat2014$HINCP)
Code
dat2014 <- dat2014 %>% 
    mutate(inclvl = case_when(.$HINCP <= 13800 ~"1st quartile",
                              .$HINCP > 13800 & .$HINCP <= 32000 ~"2nd quartile",
                              .$HINCP > 32000 & .$HINCP <= 62500 ~"3rd quartile",
                              .$HINCP > 62500 ~"4th quartile"
                              )
        )

tabyl(dat2014$inclvl)
 dat2014$inclvl   n   percent
   1st quartile 116 0.2516269
   2nd quartile 115 0.2494577
   3rd quartile 115 0.2494577
   4th quartile 115 0.2494577
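
As an aside, the same quartile variable could be built without hard-coding the cut points by cutting at the sample quartiles directly. A sketch (inclvl2 is a hypothetical name, not used elsewhere in this document):

Code
dat2014 <- dat2014 %>%
  mutate(inclvl2 = cut(HINCP,
                       breaks = quantile(HINCP, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE),
                       labels = c("1st quartile", "2nd quartile", "3rd quartile", "4th quartile"),
                       include.lowest = TRUE))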
Code
inc14 <- dat2014 %>% 
  group_by(inclvl) %>%
  summarize('n' = n(), 'N' = sum(WGTP), "Median" = weightedMedian(GRPIP, WGTP), "Mean" = weightedMean(GRPIP, WGTP))

gt(inc14)
inclvl n N Median Mean
1st quartile 116 25262 0.8487671 0.7004726
2nd quartile 115 20363 0.4600000 0.4924068
3rd quartile 115 20667 0.2600000 0.2813534
4th quartile 115 20731 0.1500000 0.1532536
Code
summary(as.numeric(dat2019$HINCP))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0   23500   49000   65912   87850  476000 

Convert income to numeric.

Code
dat2019$HINCP <- as.numeric(dat2019$HINCP)
Code
dat2019 <- dat2019 %>% 
    mutate(inclvl = case_when(.$HINCP <= 23500 ~"1st quartile",
                              .$HINCP > 23500 & .$HINCP <= 49000 ~"2nd quartile",
                              .$HINCP > 49000 & .$HINCP <= 87850 ~"3rd quartile",
                              .$HINCP > 87850 ~"4th quartile"
                              )
        )

tabyl(dat2019$inclvl)
 dat2019$inclvl   n   percent
   1st quartile 116 0.2516269
   2nd quartile 115 0.2494577
   3rd quartile 115 0.2494577
   4th quartile 115 0.2494577
Code
inc19 <- dat2019 %>% 
  group_by(inclvl) %>%
  summarize('n' = n(), 'N' = sum(WGTP), "Median" = weightedMedian(GRPIP, WGTP), "Mean" = weightedMean(GRPIP, WGTP))

gt(inc19)

Change in cost burden for all households from 2014 to 2019, by presence of children.

Code
tabyl(dat2014$HUPAC)
 dat2014$HUPAC   n    percent
             1  30 0.06507592
             2  37 0.08026030
             3  18 0.03904555
             4 376 0.81561822
Code
dat2014 <- dat2014 %>% 
    # HUPAC = 4 means no children under 18 in the household; codes 1-3 indicate children are present
    mutate(child = (ifelse(.$HUPAC == 4, 'No children', 'Children')
                     )
           )

tabyl(dat2014$child)
 dat2014$child   n   percent
      Children  85 0.1843818
   No children 376 0.8156182
Code
child14 <- dat2014 %>% 
  group_by(child) %>%
  summarize('n' = n(), 'N' = sum(WGTP), "Median" = weightedMedian(GRPIP, WGTP), "Mean" = weightedMean(GRPIP, WGTP))

gt(child14)
child n N Median Mean
Children 85 18578 0.37 0.4749833
No children 376 68445 0.29 0.4074777
Code
tabyl(dat2019$HUPAC)
 dat2019$HUPAC   n    percent
             1  26 0.05639913
             2  26 0.05639913
             3  18 0.03904555
             4 391 0.84815618
Code
dat2019 <- dat2019 %>% 
    mutate(child = (ifelse(.$HUPAC == 4, 'No children', 'Children')
                     )
           )

tabyl(dat2019$child)
 dat2019$child   n   percent
      Children  70 0.1518438
   No children 391 0.8481562
Code
child19 <- dat2019 %>% 
  group_by(child) %>%
  summarize('n' = n(), 'N' = sum(WGTP), "Median" = weightedMedian(GRPIP, WGTP), "Mean" = weightedMean(GRPIP, WGTP))

gt(child19)
child n N Median Mean
Children 70 18656 0.3560757 0.4174496
No children 391 76308 0.2389815 0.3259456

Change in cost burden for all households from 2014 to 2019, by citizenship status.

Code
tabyl(dat2014$CIT)
 dat2014$CIT   n    percent
           1 380 0.82429501
           3   4 0.00867679
           4  35 0.07592191
           5  42 0.09110629
Code
dat2014 <- dat2014 %>% 
    # CIT = 5 means not a U.S. citizen; codes 1-4 are citizens by birth or naturalization
    mutate(citizen = (ifelse(.$CIT == 5, 'Non-citizen', 'Citizen')
                     )
           )

tabyl(dat2014$citizen)
 dat2014$citizen   n    percent
         Citizen 419 0.90889371
     Non-citizen  42 0.09110629
Code
citizen14 <- dat2014 %>%
  group_by(citizen) %>%
  summarize('n' = n(), 'N' = sum(WGTP), "Income" = weightedMedian(HINCP, WGTP),  "Median" = weightedMedian(GRPIP, WGTP), "Mean" = weightedMean(GRPIP, WGTP))

gt(citizen14)
citizen n N Income Median Mean
Citizen 419 74351 32000.00 0.3000000 0.4198246
Non-citizen 42 12672 24040.06 0.3743038 0.4340017
Code
tabyl(dat2019$CIT)
 dat2019$CIT   n     percent
           1 381 0.826464208
           2   1 0.002169197
           3   3 0.006507592
           4  43 0.093275488
           5  33 0.071583514
Code
dat2019 <- dat2019 %>% 
    mutate(citizen = (ifelse(.$CIT == 5, 'Non-citizen', 'Citizen')
                     )
           )

tabyl(dat2019$citizen)
 dat2019$citizen   n    percent
         Citizen 428 0.92841649
     Non-citizen  33 0.07158351
Code
citizen19 <- dat2019 %>%
  group_by(citizen) %>%
  summarize('n' = n(), 'N' = sum(WGTP), "Income" = weightedMedian(HINCP, WGTP),  "Median" = weightedMedian(GRPIP, WGTP), "Mean" = weightedMean(GRPIP, WGTP))

gt(citizen19)
citizen n N Income Median Mean
Citizen 428 87981 45000.00 0.25 0.3458514
Non-citizen 33 6983 46952.43 0.29 0.3196119

Independent samples t-test

https://stat-methods.com/home/independent-samples-t-test-r-2/

An independent samples t-test requires the following assumptions:

  • The response of interest is continuous and normally distributed for each treatment group.
  • Treatment groups are independent of one another. Experimental units only receive one treatment, and they do not overlap.
  • There are no major outliers.
  • A check for unequal variances will help determine which version of a t-test is most appropriate:
      • If variances are equal, then a pooled t-test is appropriate.
      • If variances are unequal, then a Satterthwaite (also known as Welch’s) t-test is appropriate.

The dependent response variable is cost burden.

The independent categorical variable is year.

Create a new year variable in each dataset.

Code
dat2014 <- dat2014 %>%
  mutate(year = "2014")

dat2019 <- dat2019 %>%
  mutate(year = "2019")

Rowbind the dataframes into one.

Code
dat <- rbind(dat2014, dat2019)
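
A slightly safer alternative here (a sketch) is dplyr’s bind_rows(), which matches columns by name rather than by position:

Code
dat <- bind_rows(dat2014, dat2019)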

Designate year as a categorical factor.

Code
dat$year <- as.factor(dat$year)

Ensure GRPIP is numeric.

Code
class(dat$GRPIP)
[1] "numeric"

Perform the Shapiro-Wilk Test for Normality on each group.

Code
dat %>%
  group_by(year) %>%
  summarise(`W Statistic` = shapiro.test(GRPIP)$statistic,
            `p-value` = shapiro.test(GRPIP)$p.value)
# A tibble: 2 × 3
  year  `W Statistic` `p-value`
  <fct>         <dbl>     <dbl>
1 2014          0.844  6.08e-21
2 2019          0.825  4.56e-22

Because the p-value is < 0.05 for both samples, we reject the null hypothesis of normality. The distributions are not normal.

Perform QQ plots by group.

Code
ggplot(data = dat, mapping = aes(sample = GRPIP, color = year, fill = year)) +
  stat_qq_band(alpha=0.5, conf=0.95, qtype=1, bandType = "ts") +
  stat_qq_line(identity=TRUE) +
  stat_qq_point(col="black") +
  facet_wrap(~ year, scales = "free") +
  labs(x = "Theoretical Quantiles", y = "Sample Quantiles") + theme_bw() 

For the QQ plots, the majority of points should follow each line and stay within the curved 95% bootstrapped confidence bands for the data to be considered normally distributed. Again, these plots show the two samples are not normally distributed.

Perform Levene’s Test of Equality of Variances.

Code
lev1<-leveneTest(GRPIP ~ year, data=dat, center="mean")
lev2<-leveneTest(GRPIP ~ year, data=dat, center="median")
print(lev1)
Levene's Test for Homogeneity of Variance (center = "mean")
       Df F value    Pr(>F)    
group   1  14.616 0.0001406 ***
      920                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
print(lev2)
Levene's Test for Homogeneity of Variance (center = "median")
       Df F value   Pr(>F)   
group   1  7.9994 0.004781 **
      920                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

A significant p-value (p < 0.05) indicates that Satterthwaite (also known as Welch’s) t-test results should be used instead of pooled t-test results.

Produce boxplots and visually check for outliers

Code
ggplot(dat, aes(x = year, y = GRPIP, fill = year)) +
  stat_boxplot(geom ="errorbar", width = 0.5) +
  geom_boxplot(fill = "light blue") + 
  stat_summary(fun=mean, geom="point", shape=10, size=3.5, color="black") + 
  ggtitle("Boxplots of rent cost burdent in 2014 and 2019") + 
  theme_bw() + theme(legend.position="none")

The boxplots show many outliers in 2019.

Produce descriptive statistics by group

Code
dat %>%
  select(GRPIP, year) %>%
  group_by(year) %>% 
  summarise(n = n(), 
            mean = mean(GRPIP, na.rm = TRUE), 
            sd = sd(GRPIP, na.rm = TRUE),
            stderr = sd/sqrt(n), 
            LCL = mean - qt(1 - (0.05 / 2), n - 1) * stderr,
            UCL = mean + qt(1 - (0.05 / 2), n - 1) * stderr,
            median=median(GRPIP, na.rm = TRUE),
            min=min(GRPIP, na.rm = TRUE), 
            max=max(GRPIP, na.rm = TRUE),
            IQR=IQR(GRPIP, na.rm = TRUE))
# A tibble: 2 × 11
  year      n  mean    sd stderr   LCL   UCL median   min   max   IQR
  <fct> <int> <dbl> <dbl>  <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>
1 2014    461 0.410 0.297 0.0138 0.383 0.437   0.3      0  1.01  0.35
2 2019    461 0.341 0.261 0.0122 0.317 0.365   0.26     0  1.01  0.25

In the table above:

  • year – This column identifies the levels of the grouping variable (2014 and 2019).
  • n – This column identifies how many data points (households) are in each group.
  • mean – The mean value for each group.
  • sd – The standard deviation of each group.
  • stderr – The standard error of each group.
  • LCL, UCL – The lower and upper confidence limits of the mean. That is to say, you can be 95% certain that the true mean falls between the lower and upper values specified for each group, assuming the data is normally distributed.
  • median – The median value for each group.
  • min, max – The minimum and maximum values observed for each group.
  • IQR – The interquartile range of each group. The interquartile range is the 75th percentile – 25th percentile.

Perform an Independent Samples T-test

Code
m1<-t.test(GRPIP ~ year, data=dat, var.equal=FALSE, na.rm=TRUE)
print(m1)

    Welch Two Sample t-test

data:  GRPIP by year
t = 3.7674, df = 905.39, p-value = 0.0001756
alternative hypothesis: true difference in means between group 2014 and group 2019 is not equal to 0
95 percent confidence interval:
 0.03321241 0.10544268
sample estimates:
mean in group 2014 mean in group 2019 
         0.4102820          0.3409544 

t – This is the t-statistic, the ratio of the difference in means to the standard error of the difference. df – The appropriate degrees of freedom; this varies between each type of independent samples t-test.

p-value – This is the p-value associated with the test. Here the p-value = 0.0001756, so we reject the null hypothesis that the mean cost burdens of 2014 and 2019 are equal and conclude that there is a difference in means between the groups. 95 percent confidence interval – These values bound the difference in means between the two groups. That is to say, you can be 95% certain that the true difference in mean cost burden between 2014 and 2019 falls between 0.033 and 0.105. sample estimates – The sample mean cost burden in each group: 0.410 in 2014 and 0.341 in 2019.
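
As a rough check on the arithmetic, the Welch statistic can be reproduced from the rounded descriptive statistics above (this is just a hand verification, not part of the analysis):

Code
mean14 <- 0.410; sd14 <- 0.297; n14 <- 461
mean19 <- 0.341; sd19 <- 0.261; n19 <- 461

se_diff <- sqrt(sd14^2 / n14 + sd19^2 / n19)  # standard error of the difference in means
(mean14 - mean19) / se_diff                   # about 3.75, in line with the t = 3.77 reported above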

What to do When Assumptions are Broken or Things Go Wrong

A lack of normality or a severe impact of outliers can violate the assumptions of the independent samples t-test and ultimately distort the results. If this happens, there are several available options:

The most popular alternative is the nonparametric Mann-Whitney U test, also known as the Mann-Whitney-Wilcoxon or the Wilcoxon Rank Sum test. This test is considered robust to violations of normality and outliers (among others) and tests for differences in mean ranks.

Additional options include considering permutation/randomization tests, bootstrap confidence intervals, and transforming the data but each option will have its own stipulations.

If you need to compare more than two independent groups, a one-way Analysis of Variance (ANOVA) or Kruskal-Wallis test may be appropriate.

An independent samples t-test is not appropriate if you have repeated measurements taken on the same experimental unit (subject). For example, if you have a pre-test post-test study, then each subject was measured at two different time intervals. If this is the case, then a paired t-test may be a more appropriate course of action.

Mann-Whitney U test or Wilcoxon Rank Sum test

The following assumptions must be met in order to run a Mann-Whitney U test:

  • Treatment groups are independent of one another. Experimental units only receive one treatment and they do not overlap.
  • The response variable of interest is ordinal or continuous.
  • Both samples are random.

The dependent response variable is cost burden.

The independent categorical variable is year.

Perform the Mann-Whitney U test

Code
m2<-wilcox.test(GRPIP ~ year, data=dat, na.rm=TRUE, paired=FALSE, exact=FALSE, conf.int=TRUE)
print(m2)

    Wilcoxon rank sum test with continuity correction

data:  GRPIP by year
W = 122212, p-value = 7.892e-05
alternative hypothesis: true location shift is not equal to 0
95 percent confidence interval:
 0.02007798 0.06992372
sample estimates:
difference in location 
            0.04005175 

Hodges Lehmann Estimator

Code
m2$estimate
difference in location 
            0.04005175 

Interpretation of Mann-Whitney U test

Rental cost burden in each year is not normally distributed. Outliers exist in each group. A Mann-Whitney U test is more appropriate than a traditional independent samples t-test to compare the rental cost burden between the two years.

The Mann-Whitney U test results in a two-sided p-value = 7.892e-05. This indicates that we should reject the null hypothesis that the distributions are equal and conclude that there is a significant difference in cost burden between the two years. Descriptive statistics indicate that the median cost burden was 30% in 2014 and 26% in 2019, a difference of about 4 percentage points; the Hodges-Lehmann estimate of the shift is the same, about 4 percentage points. We are 95% certain that the shift between 2014 and 2019 is between about 2 and 7 percentage points. Thus, the roughly 4 percentage point decline in rental cost burden from 2014 to 2019 is statistically significant.

Let’s try to make tables with the combined data using the mean.

Recode sex

Code
# Recode sex
dat <- dat %>%
  mutate(SEX = (ifelse(.$SEX == 1, 'Male', 'Female')))

Table 1, change in cost burden for all households from 2014 to 2019

Code
t1 <- dat %>%
  group_by("Year" = year) %>%
  summarise(`Cost burden` = weighted.mean(GRPIP, WGTP, na.rm = TRUE)) %>%
  pivot_wider(names_from = `Year`, values_from = `Cost burden`)

t1 <- t1 %>%
  mutate(diff = `2019` - `2014`)

t1
# A tibble: 1 × 3
  `2014` `2019`    diff
   <dbl>  <dbl>   <dbl>
1  0.422  0.344 -0.0780
Code
gt(t1)
2014 2019 diff
0.421889 0.3439219 -0.07796713
Code
write.csv(t1, "metro.csv")

Table 2, change in cost burden for all households from 2014 to 2019, grouped by sex

Code
t2 <- dat %>%
  group_by("Year" = year, "Sex" = SEX) %>%
  summarise(`Cost burden` = weighted.mean(GRPIP, WGTP, na.rm = TRUE)) %>%
  pivot_wider(names_from = `Year`, values_from = `Cost burden`)
`summarise()` has grouped output by 'Year'. You can override using the
`.groups` argument.
Code
t2 <- t2 %>%
  mutate(diff = `2019` - `2014`)

t2
# A tibble: 2 × 4
  Sex    `2014` `2019`    diff
  <chr>   <dbl>  <dbl>   <dbl>
1 Female  0.453  0.371 -0.0826
2 Male    0.389  0.315 -0.0735
Code
gt(t2)
Sex 2014 2019 diff
Female 0.4532549 0.3706347 -0.08262016
Male 0.3886364 0.3151623 -0.07347414
Code
write.csv(t2, "sex.csv")

Table 3, change in cost burden for all households from 2014 to 2019, grouped by race/ethnicity

Code
t3 <- dat %>%
  group_by("Year" = year, "Race/Eth" = race_eth) %>%
  summarise(`Cost burden` = weighted.mean(GRPIP, WGTP, na.rm = TRUE)) %>%
  pivot_wider(names_from = `Year`, values_from = `Cost burden`)
`summarise()` has grouped output by 'Year'. You can override using the
`.groups` argument.
Code
t3 <- t3 %>%
  mutate(diff = `2019` - `2014`)

t3
# A tibble: 10 × 4
   `Race/Eth`         `2014` `2019`   diff
   <chr>               <dbl>  <dbl>  <dbl>
 1 Hispanic            0.440  0.310 -0.130
 2 Non-Hispanic Asian  0.415 NA     NA    
 3 Non-Hispanic Black  0.508 NA     NA    
 4 Non-Hispanic White  0.372 NA     NA    
 5 Other               0.525 NA     NA    
 6 <NA>                0.208 NA     NA    
 7 Non-hispanic Asian NA      0.430 NA    
 8 Non-hispanic Black NA      0.373 NA    
 9 Non-hispanic White NA      0.322 NA    
10 other              NA      0.412 NA    
Code
gt(t3)
Race/Eth 2014 2019 diff
Hispanic 0.4397007 0.3097231 -0.1299776
Non-Hispanic Asian 0.4154056 NA NA
Non-Hispanic Black 0.5083152 NA NA
Non-Hispanic White 0.3720215 NA NA
Other 0.5245358 NA NA
NA 0.2083099 NA NA
Non-hispanic Asian NA 0.4302271 NA
Non-hispanic Black NA 0.3731761 NA
Non-hispanic White NA 0.3217897 NA
other NA 0.4124263 NA
Code
write.csv(t3, "race_eth.csv")

Table 4, change in cost burden for all households from 2014 to 2019, grouped by age

Code
t4 <- dat %>%
  group_by("Year" = year, "Age group" = agelvl) %>%
  summarise(`Cost burden` = weighted.mean(GRPIP, WGTP, na.rm = TRUE)) %>%
  pivot_wider(names_from = `Year`, values_from = `Cost burden`)
`summarise()` has grouped output by 'Year'. You can override using the
`.groups` argument.
Code
t4 <- t4 %>%
  mutate(diff = `2019` - `2014`)

t4
# A tibble: 3 × 4
  `Age group`  `2014` `2019`    diff
  <chr>         <dbl>  <dbl>   <dbl>
1 24 and under  0.608  0.428 -0.180 
2 25 to 64      0.381  0.334 -0.0463
3 65 and over   0.505  0.317 -0.188 
Code
gt(t4)
Age group 2014 2019 diff
24 and under 0.6082305 0.4283608 -0.17986975
25 to 64 0.3806627 0.3343517 -0.04631097
65 and over 0.5049577 0.3168311 -0.18812665
Code
write.csv(t4, "age.csv")

Table 5, change in cost burden for all households from 2014 to 2019, grouped by income quartile

Code
t5 <- dat %>%
  group_by("Year" = year, "Income quartile" = inclvl) %>%
  summarise(`Cost burden` = weighted.mean(GRPIP, WGTP, na.rm = TRUE)) %>%
  pivot_wider(names_from = `Year`, values_from = `Cost burden`)
`summarise()` has grouped output by 'Year'. You can override using the
`.groups` argument.
Code
t5 <- t5 %>%
  mutate(diff = `2019` - `2014`)

t5
# A tibble: 4 × 4
  `Income quartile` `2014` `2019`     diff
  <chr>              <dbl>  <dbl>    <dbl>
1 1st quartile       0.700  0.596 -0.104  
2 2nd quartile       0.492  0.331 -0.161  
3 3rd quartile       0.281  0.224 -0.0569 
4 4th quartile       0.153  0.148 -0.00551
Code
gt(t5)
Income quartile 2014 2019 diff
1st quartile 0.7004726 0.5962355 -0.104237104
2nd quartile 0.4924068 0.3309767 -0.161430079
3rd quartile 0.2813534 0.2244523 -0.056901098
4th quartile 0.1532536 0.1477416 -0.005511999
Code
write.csv(t5, "income.csv")

Table 6, change in cost burden for all households from 2014 to 2019, grouped by presence of children

Code
t6 <- dat %>%
  group_by("Year" = year, "Presence of children" = child) %>%
  summarise(`Cost burden` = weighted.mean(GRPIP, WGTP, na.rm = TRUE)) %>%
  pivot_wider(names_from = `Year`, values_from = `Cost burden`)
`summarise()` has grouped output by 'Year'. You can override using the
`.groups` argument.
Code
t6 <- t6 %>%
  mutate(diff = `2019` - `2014`)

t6
# A tibble: 2 × 4
  `Presence of children` `2014` `2019`    diff
  <chr>                   <dbl>  <dbl>   <dbl>
1 Children                0.475  0.417 -0.0575
2 No children             0.407  0.326 -0.0815
Code
gt(t6)
Presence of children 2014 2019 diff
Children 0.4749833 0.4174496 -0.05753370
No children 0.4074777 0.3259456 -0.08153204
Code
write.csv(t6, "children.csv")

Table 7, change in cost burden for all households from 2014 to 2019, grouped by citizenship status

Code
t7 <- dat %>%
  group_by("Year" = year, "Citizenship status" = citizen) %>%
  summarise(`Cost burden` = weighted.mean(GRPIP, WGTP, na.rm = TRUE)) %>%
  pivot_wider(names_from = `Year`, values_from = `Cost burden`)
`summarise()` has grouped output by 'Year'. You can override using the
`.groups` argument.
Code
t7 <- t7 %>%
  mutate(diff = `2019` - `2014`)

t7
# A tibble: 2 × 4
  `Citizenship status` `2014` `2019`    diff
  <chr>                 <dbl>  <dbl>   <dbl>
1 Citizen               0.420  0.346 -0.0740
2 Non-citizen           0.434  0.320 -0.114 
Code
gt(t7)
Citizenship status 2014 2019 diff
Citizen 0.4198246 0.3458514 -0.07397324
Non-citizen 0.4340017 0.3196119 -0.11438982
Code
write.csv(t7, "citizenship.csv")

Let’s try to make tables with the combined data using the median.

Table 1, change in cost burden for all households from 2014 to 2019

Code
t1a <- dat %>%
  group_by("Year" = year) %>%
  summarise(`Cost burden` = weightedMedian(GRPIP, WGTP, na.rm = TRUE)) %>%
  pivot_wider(names_from = `Year`, values_from = `Cost burden`)

t1a <- t1a %>%
  mutate(diff = `2019` - `2014`)

t1a
# A tibble: 1 × 3
  `2014` `2019`  diff
   <dbl>  <dbl> <dbl>
1    0.3   0.25 -0.05
Code
gt(t1a)
2014 2019 diff
0.3 0.25 -0.05
Code
write.csv(t1a, "metro_median.csv")

Table 2, change in cost burden for all households from 2014 to 2019, grouped by sex

Code
t2a <- dat %>%
  group_by("Year" = year, "Sex" = SEX) %>%
  summarise(`Cost burden` = weightedMedian(GRPIP, WGTP, na.rm = TRUE)) %>%
  pivot_wider(names_from = `Year`, values_from = `Cost burden`)
`summarise()` has grouped output by 'Year'. You can override using the
`.groups` argument.
Code
t2a <- t2a %>%
  mutate(diff = `2019` - `2014`)

t2a
# A tibble: 2 × 4
  Sex    `2014` `2019`  diff
  <chr>   <dbl>  <dbl> <dbl>
1 Female   0.33   0.27 -0.06
2 Male     0.27   0.23 -0.04
Code
gt(t2a)
Sex 2014 2019 diff
Female 0.33 0.27 -0.06
Male 0.27 0.23 -0.04
Code
write.csv(t2a, "sex_median.csv")

Table 3, change in cost burden for all households from 2014 to 2019, grouped by race/ethnicity

Code
t3a <- dat %>%
  group_by("Year" = year, "Race/Eth" = race_eth) %>%
  summarise(`Cost burden` = weightedMedian(GRPIP, WGTP, na.rm = TRUE)) %>%
  pivot_wider(names_from = `Year`, values_from = `Cost burden`)
`summarise()` has grouped output by 'Year'. You can override using the
`.groups` argument.
Code
t3a <- t3a %>%
  mutate(diff = `2019` - `2014`)

t3a
# A tibble: 10 × 4
   `Race/Eth`         `2014` `2019`   diff
   <chr>               <dbl>  <dbl>  <dbl>
 1 Hispanic            0.362  0.2   -0.162
 2 Non-Hispanic Asian  0.267 NA     NA    
 3 Non-Hispanic Black  0.383 NA     NA    
 4 Non-Hispanic White  0.27  NA     NA    
 5 Other               0.355 NA     NA    
 6 <NA>                0.208 NA     NA    
 7 Non-hispanic Asian NA      0.26  NA    
 8 Non-hispanic Black NA      0.28  NA    
 9 Non-hispanic White NA      0.23  NA    
10 other              NA      0.373 NA    
Code
gt(t3a)
Race/Eth 2014 2019 diff
Hispanic 0.3621477 0.2000000 -0.1621477
Non-Hispanic Asian 0.2670468 NA NA
Non-Hispanic Black 0.3826991 NA NA
Non-Hispanic White 0.2700000 NA NA
Other 0.3547458 NA NA
NA 0.2083099 NA NA
Non-hispanic Asian NA 0.2600000 NA
Non-hispanic Black NA 0.2800000 NA
Non-hispanic White NA 0.2300000 NA
other NA 0.3727961 NA
Code
write.csv(t3a, "race_eth_median.csv")

Table 4, change in cost burden for all households from 2014 to 2019, grouped by age

Code
t4a <- dat %>%
  group_by("Year" = year, "Age group" = agelvl) %>%
  summarise(`Cost burden` = weightedMedian(GRPIP, WGTP, na.rm = TRUE)) %>%
  pivot_wider(names_from = `Year`, values_from = `Cost burden`)
`summarise()` has grouped output by 'Year'. You can override using the
`.groups` argument.
Code
t4a <- t4a %>%
  mutate(diff = `2019` - `2014`)

t4a
# A tibble: 3 × 4
  `Age group`  `2014` `2019`    diff
  <chr>         <dbl>  <dbl>   <dbl>
1 24 and under  0.56    0.32 -0.24  
2 25 to 64      0.27    0.24 -0.0300
3 65 and over   0.339   0.28 -0.0587
Code
gt(t4a)
Age group 2014 2019 diff
24 and under 0.5600000 0.32 -0.24000000
25 to 64 0.2700000 0.24 -0.03000000
65 and over 0.3386926 0.28 -0.05869263
Code
write.csv(t4a, "age_median.csv")

Table 5, change in cost burden for all households from 2014 to 2019, grouped by income quartile

Code
t5a <- dat %>%
  group_by("Year" = year, "Income quartile" = inclvl) %>%
  summarise(`Cost burden` = weightedMedian(GRPIP, WGTP, na.rm = TRUE)) %>%
  pivot_wider(names_from = `Year`, values_from = `Cost burden`)
`summarise()` has grouped output by 'Year'. You can override using the
`.groups` argument.
Code
t5a <- t5a %>%
  mutate(diff = `2019` - `2014`)

t5a
# A tibble: 4 × 4
  `Income quartile` `2014` `2019`    diff
  <chr>              <dbl>  <dbl>   <dbl>
1 1st quartile       0.849  0.625 -0.224 
2 2nd quartile       0.46   0.29  -0.17  
3 3rd quartile       0.26   0.22  -0.04  
4 4th quartile       0.15   0.14  -0.0100
Code
gt(t5a)
Income quartile 2014 2019 diff
1st quartile 0.8487671 0.6245815 -0.2241856
2nd quartile 0.4600000 0.2900000 -0.1700000
3rd quartile 0.2600000 0.2200000 -0.0400000
4th quartile 0.1500000 0.1400000 -0.0100000
Code
write.csv(t5a, "income_median.csv")

Table 6, change in cost burden for all households from 2014 to 2019, grouped by presence of children

Code
t6a <- dat %>%
  group_by("Year" = year, "Presence of children" = child) %>%
  summarise(`Cost burden` = weightedMedian(GRPIP, WGTP, na.rm = TRUE)) %>%
  pivot_wider(names_from = `Year`, values_from = `Cost burden`)
`summarise()` has grouped output by 'Year'. You can override using the
`.groups` argument.
Code
t6a <- t6a %>%
  mutate(diff = `2019` - `2014`)

t6a
# A tibble: 2 × 4
  `Presence of children` `2014` `2019`    diff
  <chr>                   <dbl>  <dbl>   <dbl>
1 Children                 0.37  0.356 -0.0139
2 No children              0.29  0.239 -0.0510
Code
gt(t6a)
Presence of children 2014 2019 diff
Children 0.37 0.3560757 -0.01392427
No children 0.29 0.2389815 -0.05101847
Code
write.csv(t6a, "children_median.csv")

Table 7, change in cost burden for all households from 2014 to 2019, grouped by citizenship status

Code
t7a <- dat %>%
  group_by("Year" = year, "Citizenship status" = citizen) %>%
  summarise(`Cost burden` = weightedMedian(GRPIP, WGTP, na.rm = TRUE)) %>%
  pivot_wider(names_from = `Year`, values_from = `Cost burden`)
`summarise()` has grouped output by 'Year'. You can override using the
`.groups` argument.
Code
t7a <- t7a %>%
  mutate(diff = `2019` - `2014`)

t7a
# A tibble: 2 × 4
  `Citizenship status` `2014` `2019`    diff
  <chr>                 <dbl>  <dbl>   <dbl>
1 Citizen               0.3     0.25 -0.05  
2 Non-citizen           0.374   0.29 -0.0843
Code
gt(t7a)
Citizenship status 2014 2019 diff
Citizen 0.3000000 0.25 -0.0500000
Non-citizen 0.3743038 0.29 -0.0843038
Code
write.csv(t7a, "citizenship_median.csv")

Notes:

  • I need to add a count to the summaries. Done.
  • I need to add weights to summary statistics. Done.
  • Tests for statistical significance with t-test. Done.
  • Conduct Mann-Whitney test. Done.
  • Should I remove NAs, zeros, and outliers? Would it be justified? By what criteria would I exclude outliers?
  • Some things I need to do in the future:
      • Use IPUMS because it puts the data on your computer.
      • Combine years early.
      • Explore NAs and outliers early. Learn when to remove outliers.