In this assignment, I will be doing the following:
Here, we load the ipums data and our required R packages:
library(haven)
library(lmtest)
## Warning: package 'lmtest' was built under R version 3.4.2
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.4.2
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(broom)
library(ggplot2)
library(car)
## Warning: package 'car' was built under R version 3.4.2
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
ipums<-read_dta("https://github.com/coreysparks/data/blob/master/usa_00045.dta?raw=true")
In the following code chunk, I’ll be constructing several variables.
First, I’ll construct my outcome variable: a household overcrowding index (“crowd” = number of household members / number of bedrooms), Then I’ll construct a continuous predictor variable: household head wages (“mywage”), Then I’ll construct five additional variables which I believe may affect household overcrowding:
ipums<-ipums %>% mutate(mywage= ifelse(incwage%in%c(999998,999999), NA, incwage))
ipums<-ipums %>% mutate(new_bedrooms=ifelse(bedrooms==0,NA,
ifelse(bedrooms==1,0,bedrooms-1)))
new_ipums<-ipums %>% filter(relate==1 & !is.na(new_bedrooms) & !is.na(mywage)) %>%
mutate(foreignborn=ifelse(bpl>=120,1,0)) %>%
mutate(edurec = case_when(.$educd %in% c(0:64)~"HS or Less",
.$educd %in% c(65:116)~"HS Grad or More",
.$educd ==999 ~ "missing")) %>%
mutate(race_eth = case_when(.$hispan %in% c(1:4) & .$race %in%c(1:9) ~ "hispanic",
.$hispan ==0 & .$race==1 ~"0nh_white",
.$hispan ==0 & .$race==2 ~"nh_black",
.$hispan ==0 & .$race%in%c(3,7,8,9) ~"nh_other",
.$hispan ==0 & .$race%in%c(4:6) ~"nh_asian",
.$hispan==9 ~ "missing")) %>%
mutate(minority=ifelse(race_eth=="0nh_white",0,1)) %>%
mutate(hs_or_below=ifelse(edurec=="HS or Less",1,0))%>%
mutate(crowd=famsize/new_bedrooms) %>%
mutate(renters=ifelse(ownershp==2,1,ifelse(ownershp==1,0,NA))) %>%
mutate(poverty_status=ifelse(poverty<100,1,ifelse(poverty>=100,0,NA)))
Now, I’ll construct my data frame (“bedroom_analysis”) by removing missing data.
bedroom_analysis<-new_ipums %>% filter(new_bedrooms>0 & race_eth!="missing" & edurec!="missing")
Now, we can construct models for predictng the “overcrowding index”:
mod1<-lm(crowd~mywage+age,data=bedroom_analysis)
mod1scale<-lm(crowd~scale(mywage)+scale(age),data=bedroom_analysis)
mod2<-lm(crowd~mywage+age+hs_or_below,data=bedroom_analysis)
mod2scale<-lm(crowd~scale(mywage)+scale(age)+scale(hs_or_below),data=bedroom_analysis)
mod3<-lm(crowd~mywage+age+hs_or_below+minority+foreignborn+renters+poverty_status,data=bedroom_analysis)
mod3scale<-lm(crowd~scale(mywage)+scale(age)+scale(hs_or_below)+scale(minority)+scale(foreignborn)+scale(renters)+scale(poverty_status),data=bedroom_analysis)
mod3transform<-lm(crowd~sqrt(mywage)+sqrt(age)+sqrt(hs_or_below)+sqrt(minority)+sqrt(foreignborn)+sqrt(renters)+sqrt(poverty_status),data=bedroom_analysis)
Now we can see the results of these different models.
First, the OLS regression models:
library(stargazer)
##
## Please cite as:
## Hlavac, Marek (2015). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2. http://CRAN.R-project.org/package=stargazer
stargazer(mod1,mod2,mod3,type="html",style="demography",ci=T)
| crowd | |||
| Model 1 | Model 2 | Model 3 | |
| mywage | -0.00000*** | -0.00000*** | -0.00000*** |
| (-0.00000, -0.00000) | (-0.00000, -0.00000) | (-0.00000, -0.00000) | |
| age | -0.011*** | -0.011*** | -0.009*** |
| (-0.011, -0.010) | (-0.011, -0.011) | (-0.009, -0.009) | |
| hs_or_below | 0.141*** | 0.102*** | |
| (0.135, 0.147) | (0.096, 0.108) | ||
| minority | 0.107*** | ||
| (0.100, 0.115) | |||
| foreignborn | 0.233*** | ||
| (0.224, 0.243) | |||
| renters | 0.149*** | ||
| (0.142, 0.156) | |||
| poverty_status | -0.008 | ||
| (-0.017, 0.001) | |||
| Constant | 1.498*** | 1.465*** | 1.248*** |
| (1.487, 1.508) | (1.454, 1.476) | (1.236, 1.260) | |
| N | 115,071 | 115,071 | 115,071 |
| R2 | 0.106 | 0.120 | 0.178 |
| Adjusted R2 | 0.106 | 0.120 | 0.178 |
| Residual Std. Error | 0.514 (df = 115068) | 0.510 (df = 115067) | 0.493 (df = 115063) |
| F Statistic | 6,825.561*** (df = 2; 115068) | 5,254.724*** (df = 3; 115067) | 3,554.962*** (df = 7; 115063) |
| p < .05; p < .01; p < .001 | |||
stargazer(mod1scale,mod2scale,mod3scale,type="html",style="demography",ci=T)
| crowd | |||
| Model 1 | Model 2 | Model 3 | |
| scale(mywage) | -0.035*** | -0.022*** | -0.013*** |
| (-0.038, -0.032) | (-0.025, -0.019) | (-0.016, -0.010) | |
| scale(age) | -0.181*** | -0.190*** | -0.151*** |
| (-0.184, -0.178) | (-0.193, -0.187) | (-0.154, -0.148) | |
| scale(hs_or_below) | 0.068*** | 0.049*** | |
| (0.065, 0.071) | (0.046, 0.052) | ||
| scale(minority) | 0.047*** | ||
| (0.044, 0.051) | |||
| scale(foreignborn) | 0.079*** | ||
| (0.076, 0.082) | |||
| scale(renters) | 0.068*** | ||
| (0.065, 0.071) | |||
| scale(poverty_status) | -0.003 | ||
| (-0.006, 0.0004) | |||
| Constant | 0.899*** | 0.899*** | 0.899*** |
| (0.896, 0.902) | (0.896, 0.902) | (0.896, 0.902) | |
| N | 115,071 | 115,071 | 115,071 |
| R2 | 0.106 | 0.120 | 0.178 |
| Adjusted R2 | 0.106 | 0.120 | 0.178 |
| Residual Std. Error | 0.514 (df = 115068) | 0.510 (df = 115067) | 0.493 (df = 115063) |
| F Statistic | 6,825.561*** (df = 2; 115068) | 5,254.724*** (df = 3; 115067) | 3,554.962*** (df = 7; 115063) |
| p < .05; p < .01; p < .001 | |||
Which model fits better? We can use several criteria, including Anova and AIC:
anova(mod1,mod2,mod3)
## Analysis of Variance Table
##
## Model 1: crowd ~ mywage + age
## Model 2: crowd ~ mywage + age + hs_or_below
## Model 3: crowd ~ mywage + age + hs_or_below + minority + foreignborn +
## renters + poverty_status
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 115068 30362
## 2 115067 29871 1 490.4 2020.7 < 2.2e-16 ***
## 3 115063 27924 4 1946.9 2005.5 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here we see that Model 3 has lower residual sum of squares (RSS) than either Model 1 or Model 2, with a high degree of significance.
The AIC function should bear out a similar result:
AIC(mod1,mod2,mod3)
## df AIC
## mod1 4 173248.2
## mod2 5 171376.4
## mod3 9 163629.1
Here we can see that the change in AIC score between models 1, 2, and 3 is quite significant (far greater than 10 in each case) and that model 3 has the lowest AIC score. Therefore, we can conclude that Model 3 is the best fitted model, assuming it’s assumptions are true.
What about the problem of colinearity? We can apply the VIF function to Model 3 to test for that:
vif(mod3)
## mywage age hs_or_below minority foreignborn
## 1.160148 1.234204 1.110328 1.281907 1.212581
## renters poverty_status
## 1.220513 1.140873
Here we see that, for the most part, colinearity is not a problem for the variables in this model.
It appears that Model 3 is the most well-fitted of models. Assuming that the model is without errors, we can make the following judgements:
plot(mod1,which=1)
qqnorm(rstudent(mod1))
qqline(as.numeric(rstudent(mod1)))
plot(mod2,which=1)
qqnorm(rstudent(mod2))
qqline(as.numeric(rstudent(mod2)))
plot(mod3,which=1)
qqnorm(rstudent(mod3))
qqline(as.numeric(rstudent(mod3)))
plot(mod3scale,which=1)
qqnorm(rstudent(mod3scale))
qqline(as.numeric(rstudent(mod3scale)))
Clearly, all of these are heteroskedastic and non-normal to at least some degree. Can this be compensated for? We can try doing a transformation on model three using the square roots of the variables:
summary(mod3transform)
##
## Call:
## lm(formula = crowd ~ sqrt(mywage) + sqrt(age) + sqrt(hs_or_below) +
## sqrt(minority) + sqrt(foreignborn) + sqrt(renters) + sqrt(poverty_status),
## data = bedroom_analysis)
##
## Residuals:
## <Labelled double>
## Min 1Q Median 3Q Max
## -1.4326 -0.3031 -0.0673 0.2215 6.9941
##
## Labels:
## value label
## 1 1 family member present
## 2 2 family members present
## 3 3
## 4 4
## 5 5
## 6 6
## 7 7
## 8 8
## 9 9
## 10 10
## 11 11
## 12 12
## 13 13
## 14 14
## 15 15
## 16 16
## 17 17
## 18 18
## 19 19
## 20 20
## 21 21
## 22 22
## 23 23
## 24 24
## 25 25
## 26 26
## 27 27
## 28 28
## 29 29
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.674e+00 1.187e-02 140.981 < 2e-16 ***
## sqrt(mywage) -6.891e-05 1.289e-05 -5.347 8.97e-08 ***
## sqrt(age) -1.245e-01 1.445e-03 -86.157 < 2e-16 ***
## sqrt(hs_or_below) 1.003e-01 3.205e-03 31.307 < 2e-16 ***
## sqrt(minority) 1.105e-01 3.736e-03 29.569 < 2e-16 ***
## sqrt(foreignborn) 2.350e-01 4.724e-03 49.744 < 2e-16 ***
## sqrt(renters) 1.466e-01 3.549e-03 41.297 < 2e-16 ***
## sqrt(poverty_status) -1.104e-02 4.878e-03 -2.263 0.0236 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4936 on 115063 degrees of freedom
## Multiple R-squared: 0.1747, Adjusted R-squared: 0.1746
## F-statistic: 3479 on 7 and 115063 DF, p-value: < 2.2e-16
plot(mod3transform,which=1)
qqnorm(rstudent(mod3transform))
qqline(as.numeric(rstudent(mod3transform)))
We see that the results (and the non-normality) of Model 3 remain mostly the same, even when transformed.
We can correct for heteroskedasticity using our scaled Model 3 along with White’s Standard Errors:
coeftest(mod3scale, vcov. = hccm(mod3scale, type = "hc0"))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.8988874 0.0014522 618.9841 <2e-16 ***
## scale(mywage) -0.0127187 0.0013056 -9.7419 <2e-16 ***
## scale(age) -0.1507104 0.0016432 -91.7156 <2e-16 ***
## scale(hs_or_below) 0.0488556 0.0015815 30.8927 <2e-16 ***
## scale(minority) 0.0473577 0.0017711 26.7393 <2e-16 ***
## scale(foreignborn) 0.0791800 0.0021042 37.6293 <2e-16 ***
## scale(renters) 0.0678270 0.0017590 38.5592 <2e-16 ***
## scale(poverty_status) -0.0025941 0.0019475 -1.3320 0.1829
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We find that even when heteroskedasticity is accounted for, the results (and consequently, the overall interpretation) are quite similar.
From this exercise, we can conclude that