In this assignment, I will be doing the following:

  1. Constructing a continuous outcome variable
  2. Constructing a continuous predictor
  3. Constructing an additional five predictors
  4. Estimate at least two (in this exercise, three) OLS regression models for the chosen outcome
  5. Compare the models to one another using F-test and the AIC
  6. Discuss the results of the best-fitting model

Loading ipums Data

Here, we load the ipums data and our required R packages:

library(haven)
library(lmtest)
## Warning: package 'lmtest' was built under R version 3.4.2
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.4.2
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(broom)
library(ggplot2)
library(car)
## Warning: package 'car' was built under R version 3.4.2
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
ipums<-read_dta("https://github.com/coreysparks/data/blob/master/usa_00045.dta?raw=true")

Constructing Variables

In the following code chunk, I’ll be constructing several variables.

First, I’ll construct my outcome variable: a household overcrowding index (“crowd” = number of household members / number of bedrooms), Then I’ll construct a continuous predictor variable: household head wages (“mywage”), Then I’ll construct five additional variables which I believe may affect household overcrowding:

  1. foreign-born status (“foreignborn” = 1 for foreign-born household heads, 0 for domestic-born),
  2. education level (“hs_or_below” = 1 if the household head is a High School grad or lower, 0 if they have any college experience or higher),
  3. minority status (“minority” = 1 if the household head is a minority, 0 if they are not),
  4. household ownership (renting) status (“renters” = 1 for renters, 0 for non-renters),
  5. poverty status (“poverty_status” = 1 if below the Federal poverty level, 0 if above)
ipums<-ipums %>% mutate(mywage= ifelse(incwage%in%c(999998,999999), NA, incwage))
ipums<-ipums %>% mutate(new_bedrooms=ifelse(bedrooms==0,NA,
                                            ifelse(bedrooms==1,0,bedrooms-1)))
new_ipums<-ipums %>% filter(relate==1 & !is.na(new_bedrooms) & !is.na(mywage)) %>%
  mutate(foreignborn=ifelse(bpl>=120,1,0)) %>% 
  mutate(edurec = case_when(.$educd %in% c(0:64)~"HS or Less", 
.$educd %in% c(65:116)~"HS Grad or More",
.$educd ==999 ~ "missing")) %>%
   mutate(race_eth = case_when(.$hispan %in% c(1:4) & .$race %in%c(1:9) ~ "hispanic", 
                          .$hispan ==0 & .$race==1 ~"0nh_white",
                         .$hispan ==0 & .$race==2 ~"nh_black",
                         .$hispan ==0 & .$race%in%c(3,7,8,9) ~"nh_other",
                         .$hispan ==0 & .$race%in%c(4:6) ~"nh_asian",
                          .$hispan==9 ~ "missing")) %>%
  mutate(minority=ifelse(race_eth=="0nh_white",0,1)) %>%
  mutate(hs_or_below=ifelse(edurec=="HS or Less",1,0))%>% 
  mutate(crowd=famsize/new_bedrooms) %>% 
  mutate(renters=ifelse(ownershp==2,1,ifelse(ownershp==1,0,NA))) %>%
  mutate(poverty_status=ifelse(poverty<100,1,ifelse(poverty>=100,0,NA)))

Now, I’ll construct my data frame (“bedroom_analysis”) by removing missing data.

bedroom_analysis<-new_ipums %>% filter(new_bedrooms>0 & race_eth!="missing" & edurec!="missing")

Now, we can construct models for predictng the “overcrowding index”:

mod1<-lm(crowd~mywage+age,data=bedroom_analysis)
mod1scale<-lm(crowd~scale(mywage)+scale(age),data=bedroom_analysis)
mod2<-lm(crowd~mywage+age+hs_or_below,data=bedroom_analysis)
mod2scale<-lm(crowd~scale(mywage)+scale(age)+scale(hs_or_below),data=bedroom_analysis)
mod3<-lm(crowd~mywage+age+hs_or_below+minority+foreignborn+renters+poverty_status,data=bedroom_analysis)
mod3scale<-lm(crowd~scale(mywage)+scale(age)+scale(hs_or_below)+scale(minority)+scale(foreignborn)+scale(renters)+scale(poverty_status),data=bedroom_analysis)
mod3transform<-lm(crowd~sqrt(mywage)+sqrt(age)+sqrt(hs_or_below)+sqrt(minority)+sqrt(foreignborn)+sqrt(renters)+sqrt(poverty_status),data=bedroom_analysis)

Now we can see the results of these different models.

First, the OLS regression models:

library(stargazer)
## 
## Please cite as:
##  Hlavac, Marek (2015). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2. http://CRAN.R-project.org/package=stargazer
stargazer(mod1,mod2,mod3,type="html",style="demography",ci=T)
crowd
Model 1 Model 2 Model 3
mywage -0.00000*** -0.00000*** -0.00000***
(-0.00000, -0.00000) (-0.00000, -0.00000) (-0.00000, -0.00000)
age -0.011*** -0.011*** -0.009***
(-0.011, -0.010) (-0.011, -0.011) (-0.009, -0.009)
hs_or_below 0.141*** 0.102***
(0.135, 0.147) (0.096, 0.108)
minority 0.107***
(0.100, 0.115)
foreignborn 0.233***
(0.224, 0.243)
renters 0.149***
(0.142, 0.156)
poverty_status -0.008
(-0.017, 0.001)
Constant 1.498*** 1.465*** 1.248***
(1.487, 1.508) (1.454, 1.476) (1.236, 1.260)
N 115,071 115,071 115,071
R2 0.106 0.120 0.178
Adjusted R2 0.106 0.120 0.178
Residual Std. Error 0.514 (df = 115068) 0.510 (df = 115067) 0.493 (df = 115063)
F Statistic 6,825.561*** (df = 2; 115068) 5,254.724*** (df = 3; 115067) 3,554.962*** (df = 7; 115063)
p < .05; p < .01; p < .001

Then, the scale (z-scored) models:

stargazer(mod1scale,mod2scale,mod3scale,type="html",style="demography",ci=T)
crowd
Model 1 Model 2 Model 3
scale(mywage) -0.035*** -0.022*** -0.013***
(-0.038, -0.032) (-0.025, -0.019) (-0.016, -0.010)
scale(age) -0.181*** -0.190*** -0.151***
(-0.184, -0.178) (-0.193, -0.187) (-0.154, -0.148)
scale(hs_or_below) 0.068*** 0.049***
(0.065, 0.071) (0.046, 0.052)
scale(minority) 0.047***
(0.044, 0.051)
scale(foreignborn) 0.079***
(0.076, 0.082)
scale(renters) 0.068***
(0.065, 0.071)
scale(poverty_status) -0.003
(-0.006, 0.0004)
Constant 0.899*** 0.899*** 0.899***
(0.896, 0.902) (0.896, 0.902) (0.896, 0.902)
N 115,071 115,071 115,071
R2 0.106 0.120 0.178
Adjusted R2 0.106 0.120 0.178
Residual Std. Error 0.514 (df = 115068) 0.510 (df = 115067) 0.493 (df = 115063)
F Statistic 6,825.561*** (df = 2; 115068) 5,254.724*** (df = 3; 115067) 3,554.962*** (df = 7; 115063)
p < .05; p < .01; p < .001

Model Fit

Which model fits better? We can use several criteria, including Anova and AIC:

anova(mod1,mod2,mod3)
## Analysis of Variance Table
## 
## Model 1: crowd ~ mywage + age
## Model 2: crowd ~ mywage + age + hs_or_below
## Model 3: crowd ~ mywage + age + hs_or_below + minority + foreignborn + 
##     renters + poverty_status
##   Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
## 1 115068 30362                                  
## 2 115067 29871  1     490.4 2020.7 < 2.2e-16 ***
## 3 115063 27924  4    1946.9 2005.5 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here we see that Model 3 has lower residual sum of squares (RSS) than either Model 1 or Model 2, with a high degree of significance.

The AIC function should bear out a similar result:

AIC(mod1,mod2,mod3)
##      df      AIC
## mod1  4 173248.2
## mod2  5 171376.4
## mod3  9 163629.1

Here we can see that the change in AIC score between models 1, 2, and 3 is quite significant (far greater than 10 in each case) and that model 3 has the lowest AIC score. Therefore, we can conclude that Model 3 is the best fitted model, assuming it’s assumptions are true.

What about the problem of colinearity? We can apply the VIF function to Model 3 to test for that:

vif(mod3)
##         mywage            age    hs_or_below       minority    foreignborn 
##       1.160148       1.234204       1.110328       1.281907       1.212581 
##        renters poverty_status 
##       1.220513       1.140873

Here we see that, for the most part, colinearity is not a problem for the variables in this model.

Initial Model Interpretations

It appears that Model 3 is the most well-fitted of models. Assuming that the model is without errors, we can make the following judgements:

1) There is a significant, although slight, inverse relationship between wages and the overcrowing index. That is, as wages increase, the overcrowding index tends to decrease.

2) There is a significant inverse relationship between age and the overcrowding index. As people get older, the overcrowding index tends to decrease.

3) There are significant positive relationships between minority status, foreign-born status, low-education-status and renter-status to the overcrowding index. That is, being a foreign-born person, a minority, a person with a high school education or less, or a renter tends to increase the overcrowding index.

4) When these factors are accounted for, poverty no longer exercises a significant predictive effect over the overcrowding index.

Now we can check the assumptions of these various models, starting with the first model:

plot(mod1,which=1)

qqnorm(rstudent(mod1))
qqline(as.numeric(rstudent(mod1)))

Model two:

plot(mod2,which=1)

qqnorm(rstudent(mod2))
qqline(as.numeric(rstudent(mod2)))

Model three, and model three scaled:

plot(mod3,which=1)

qqnorm(rstudent(mod3))
qqline(as.numeric(rstudent(mod3)))

plot(mod3scale,which=1)

qqnorm(rstudent(mod3scale))
qqline(as.numeric(rstudent(mod3scale)))

Transforming Model 3

Clearly, all of these are heteroskedastic and non-normal to at least some degree. Can this be compensated for? We can try doing a transformation on model three using the square roots of the variables:

summary(mod3transform)
## 
## Call:
## lm(formula = crowd ~ sqrt(mywage) + sqrt(age) + sqrt(hs_or_below) + 
##     sqrt(minority) + sqrt(foreignborn) + sqrt(renters) + sqrt(poverty_status), 
##     data = bedroom_analysis)
## 
## Residuals:
## <Labelled double>
##     Min      1Q  Median      3Q     Max 
## -1.4326 -0.3031 -0.0673  0.2215  6.9941 
## 
## Labels:
##  value                    label
##      1  1 family member present
##      2 2 family members present
##      3                        3
##      4                        4
##      5                        5
##      6                        6
##      7                        7
##      8                        8
##      9                        9
##     10                       10
##     11                       11
##     12                       12
##     13                       13
##     14                       14
##     15                       15
##     16                       16
##     17                       17
##     18                       18
##     19                       19
##     20                       20
##     21                       21
##     22                       22
##     23                       23
##     24                       24
##     25                       25
##     26                       26
##     27                       27
##     28                       28
##     29                       29
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.674e+00  1.187e-02 140.981  < 2e-16 ***
## sqrt(mywage)         -6.891e-05  1.289e-05  -5.347 8.97e-08 ***
## sqrt(age)            -1.245e-01  1.445e-03 -86.157  < 2e-16 ***
## sqrt(hs_or_below)     1.003e-01  3.205e-03  31.307  < 2e-16 ***
## sqrt(minority)        1.105e-01  3.736e-03  29.569  < 2e-16 ***
## sqrt(foreignborn)     2.350e-01  4.724e-03  49.744  < 2e-16 ***
## sqrt(renters)         1.466e-01  3.549e-03  41.297  < 2e-16 ***
## sqrt(poverty_status) -1.104e-02  4.878e-03  -2.263   0.0236 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4936 on 115063 degrees of freedom
## Multiple R-squared:  0.1747, Adjusted R-squared:  0.1746 
## F-statistic:  3479 on 7 and 115063 DF,  p-value: < 2.2e-16
plot(mod3transform,which=1)

qqnorm(rstudent(mod3transform))
qqline(as.numeric(rstudent(mod3transform)))

We see that the results (and the non-normality) of Model 3 remain mostly the same, even when transformed.

Correcting for heteroskedasticity:

We can correct for heteroskedasticity using our scaled Model 3 along with White’s Standard Errors:

 coeftest(mod3scale, vcov. = hccm(mod3scale, type = "hc0"))
## 
## t test of coefficients:
## 
##                         Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)            0.8988874  0.0014522 618.9841   <2e-16 ***
## scale(mywage)         -0.0127187  0.0013056  -9.7419   <2e-16 ***
## scale(age)            -0.1507104  0.0016432 -91.7156   <2e-16 ***
## scale(hs_or_below)     0.0488556  0.0015815  30.8927   <2e-16 ***
## scale(minority)        0.0473577  0.0017711  26.7393   <2e-16 ***
## scale(foreignborn)     0.0791800  0.0021042  37.6293   <2e-16 ***
## scale(renters)         0.0678270  0.0017590  38.5592   <2e-16 ***
## scale(poverty_status) -0.0025941  0.0019475  -1.3320   0.1829    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We find that even when heteroskedasticity is accounted for, the results (and consequently, the overall interpretation) are quite similar.

Discussion and Conclusions

From this exercise, we can conclude that

  1. There is a negative relationship between age, income and the overcrowding index
  2. There is a positive relationship between migrant-status, minority-status, renter-status, low-education-status and the overcrowding index.
  3. When all other variables are accounted for, poverty status is no longer significant.
  4. The best fitting model, Model 3, is heteroskedastic (fixed using White’s Standard Errors) and non-normal (attempted to transform, with no results).
  5. The best fitting predictive model of household overcrowding remains Model 3.