Remember our model from last time:
\(y_i = \beta_0 + \sum_{k=1}^{p} \beta_k x_{ik} + \epsilon_i\)
We described how the model's coefficients are estimated (by least squares), how to test hypotheses about them, how to include qualitative predictors, and how to produce standardized coefficients.
This example illustrates how to deal with issues of model fit and how to compare the fits of nested regression models.
A very real feature of observed data is correlation between predictors. If two variables are perfectly correlated, they are said to be collinear. More generally, when several predictors are correlated with one another, we have a multicollinearity condition among our predictors. If the correlations are small, this should have little effect; if they are sizable, however, the consequences for the statistical interpretation of our models can be serious.
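To see why this matters, here is a quick simulation (hypothetical data, separate from the PRB example below) in which two nearly collinear predictors inflate both coefficients' standard errors:
set.seed(1115)
x1<-rnorm(100)
x2<-x1+rnorm(100, sd=0.1)  # x2 is nearly a copy of x1
y<-1+2*x1+rnorm(100)
# with both predictors in the model, the standard errors on x1 and x2
# are inflated roughly tenfold relative to a model with only x1
summary(lm(y~x1+x2))$coefficients
summary(lm(y~x1))$coefficients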
library(readr)
library(car)
library(dplyr)
prb<-read_csv(file = "https://raw.githubusercontent.com/coreysparks/data/master/PRB2008_All.csv", col_types = read_rds(url("https://raw.githubusercontent.com/coreysparks/r_courses/master/prbspec.rds")))
names(prb)<-tolower(names(prb))
prb<-prb%>%
mutate(africa=ifelse(continent=="Africa", 1, 0))
Now we fit a multiple regression model with continuous and categorical predictors:
fit0<-lm(tfr~scale(e0total)+scale(gnippppercapitausdollars)+scale(percpoplt15)+africa, data=prb)
fit<-lm(tfr~scale(e0total)+scale(gnippppercapitausdollars)+scale(cdr)+scale(percpoplt15)+africa, data=prb)
summary(fit)
##
## Call:
## lm(formula = tfr ~ scale(e0total) + scale(gnippppercapitausdollars) +
## scale(cdr) + scale(percpoplt15) + africa, data = prb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.35013 -0.23068 0.02261 0.23777 1.50355
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.91126 0.05042 57.742 < 2e-16 ***
## scale(e0total) 0.65596 0.14099 4.652 6.57e-06 ***
## scale(gnippppercapitausdollars) 0.16348 0.05368 3.046 0.00269 **
## scale(cdr) 0.67809 0.08765 7.736 8.64e-13 ***
## scale(percpoplt15) 1.81926 0.09167 19.845 < 2e-16 ***
## africa 0.32422 0.12537 2.586 0.01055 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.486 on 170 degrees of freedom
## (33 observations deleted due to missingness)
## Multiple R-squared: 0.9129, Adjusted R-squared: 0.9103
## F-statistic: 356.4 on 5 and 170 DF, p-value: < 2.2e-16
Now we examine basic correlation among predictors:
cor(prb[, c("e0total", "gnippppercapitausdollars", "africa", "cdr", "percpoplt15")], use = "pairwise")
## e0total gnippppercapitausdollars africa
## e0total 1.0000000 0.6390417 -0.7055534
## gnippppercapitausdollars 0.6390417 1.0000000 -0.4051724
## africa -0.7055534 -0.4051724 1.0000000
## cdr -0.7680018 -0.2622849 0.5333133
## percpoplt15 -0.7839367 -0.6791153 0.6040865
## cdr percpoplt15
## e0total -0.7680018 -0.7839367
## gnippppercapitausdollars -0.2622849 -0.6791153
## africa 0.5333133 0.6040865
## cdr 1.0000000 0.3102154
## percpoplt15 0.3102154 1.0000000
So we see that everything is pretty correlated, but does this correlation translate into a collinearity problem? We can use the vif() function in the car library to examine the variance inflation factors (VIFs) for the variables in the model.
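For reference, the VIF for the \(j\)-th predictor is \[\text{VIF}_j = \frac{1}{1 - R_j^2}\] where \(R_j^2\) is the \(R^2\) from regressing that predictor on all of the other predictors. Values near 1 indicate no collinearity, and a common rule of thumb flags VIFs above 10 as problematic: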
vif(fit)
## scale(e0total) scale(gnippppercapitausdollars)
## 14.741080 2.134517
## scale(cdr) scale(percpoplt15)
## 5.470222 6.349326
## africa
## 2.410415
So we see a problem: the VIF for e0total is 14.7, which suggests it is collinear with something else in the model. I suspect it's the crude death rate, since both variables measure mortality in some way.
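We can check this by hand (a sketch; the name aux is just for illustration): regress e0total on the other predictors and apply the formula above. The result should closely match the vif() output.
aux<-lm(scale(e0total)~scale(gnippppercapitausdollars)+scale(cdr)+scale(percpoplt15)+africa, data=prb)
1/(1-summary(aux)$r.squared)  # should be close to the 14.7 reported above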
One way to deal with this is to construct an index from these two variables. While there are lots of ways to do this, a simple technique is an additive index of their z-scores:
# flip the sign of cdr so both components point the same way: higher values
# of the index mean more favorable mortality (longer life, lower death rate)
prb$mort<-scale(prb$e0total)+scale(-1*prb$cdr)
fit2<-lm(tfr~mort+scale(gnippppercapitausdollars)+scale(percpoplt15)+africa, data=prb)
summary(fit2)
##
## Call:
## lm(formula = tfr ~ mort + scale(gnippppercapitausdollars) + scale(percpoplt15) +
## africa, data = prb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.11480 -0.29076 0.05765 0.28433 1.73985
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.94895 0.05500 53.619 < 2e-16 ***
## mort -0.17850 0.03240 -5.510 1.3e-07 ***
## scale(gnippppercapitausdollars) 0.26190 0.05623 4.658 6.4e-06 ***
## scale(percpoplt15) 1.39395 0.06471 21.541 < 2e-16 ***
## africa 0.24310 0.13703 1.774 0.0778 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5343 on 171 degrees of freedom
## (33 observations deleted due to missingness)
## Multiple R-squared: 0.8941, Adjusted R-squared: 0.8917
## F-statistic: 361.1 on 4 and 171 DF, p-value: < 2.2e-16
vif(fit2)
## mort scale(gnippppercapitausdollars)
## 2.206513 1.938628
## scale(percpoplt15) africa
## 2.618279 2.382869
So we no longer have that issue. The solution presented above is a crude form of variable reduction: when multiple variables in a regression model are highly correlated, you may choose to reduce the total number of variables in the model by combining them. The additive index above is one such approach; another popular method is principal components analysis.
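As a rough sketch of the principal components alternative using the same two mortality measures (the names z, pc, mort_pc1, and fit_pc are hypothetical, not part of the original analysis), we could use the first principal component as a single mortality index:
# z-score both mortality measures and run PCA on the complete rows
z<-scale(prb[, c("e0total", "cdr")])
pc<-prcomp(na.omit(z))
# score every row on the first component; rows with missing data get NA
prb$mort_pc1<-as.numeric(z%*%pc$rotation[,1])
fit_pc<-lm(tfr~mort_pc1+scale(gnippppercapitausdollars)+scale(percpoplt15)+africa, data=prb)
vif(fit_pc)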
Oftentimes we will not be interested in only one regression model; more often, we will consider multiple models in an analysis. Typically these are constructed in a nested fashion, meaning that we enter predictors into the analysis in blocks. The goal here is typically two-fold. First, we may wish to make a model more complete, so we add more variables, usually a cohesive block of them (variables that measure SES, for instance). Second, we may have a particular independent variable of interest and be trying to "control away" its effect by adjusting for other variables. You see a lot of this in the health literature.
Regardless of why you are making nested models, you need to test whether a more complicated model fits the data better than a simpler model (one with fewer parameters). For the linear model this is done with two tools. First is the F-test on the residual sums of squares of the two models. Second is the Akaike Information Criterion (AIC), which judges relative model fit while penalizing overly complex models.
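Concretely, for nested models where the smaller model has \(p_1\) parameters and the larger has \(p_2 > p_1\), the test statistic is \[F = \frac{(\text{RSS}_1 - \text{RSS}_2)/(p_2 - p_1)}{\text{RSS}_2/(n - p_2)}\] which is compared to an \(F\) distribution with \(p_2 - p_1\) and \(n - p_2\) degrees of freedom; this is what R's anova() function computes when given nested models.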
Let’s fit a couple of models using the IPUMS data:
library(haven)
library(dplyr)
ipums<-read_dta("https://github.com/coreysparks/data/blob/master/usa_00045.dta?raw=true")
newpums<-ipums%>%
filter(labforce==2, age>=18, incwage>0)%>%
mutate(mywage= ifelse(incwage%in%c(999998,999999), NA,incwage),
sexrecode=ifelse(sex==1, "male", "female"))%>%
mutate(race_eth = case_when(.$hispan %in% c(1:4) & .$race %in%c(1:9) ~ "hispanic",
.$hispan ==0 & .$race==1 ~"0nh_white",
.$hispan ==0 & .$race==2 ~"nh_black",
.$hispan ==0 & .$race%in%c(3,7,8,9) ~"nh_other",
.$hispan ==0 & .$race%in%c(4:6) ~"nh_asian",
.$hispan==9 ~ "missing"))%>%
mutate(edurec = case_when(.$educd %in% c(0:61)~"nohs",
.$educd %in% c(62:64)~"hs",
.$educd %in% c(65:100)~"somecoll",
.$educd %in% c(101:116)~"collgrad",
.$educd ==999 ~ "missing"))
# the "0" prefix on "0nh_white" makes that level sort first, so lm() uses it
# as the reference category by default; equivalently, we could relevel an
# explicit factor:
# newpums$race_eth<-relevel(as.factor(newpums$race_eth), ref = "0nh_white")
Here we will fit three models. First, we need to make sure we have complete cases for all of our variables; otherwise the models will not be fit on the same observations and we cannot compare them to one another.
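One way to do this (a sketch, assuming these are exactly the variables the three models use) is to keep only rows with no missing values:
newpums<-newpums%>%
filter(!is.na(mywage), !is.na(age), !is.na(sexrecode), !is.na(race_eth), !is.na(edurec))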
We then use the anova() function to compare the models to one another using the F-test.
mod1<-lm(log(mywage)~scale(age)+scale(I(age^2))+sexrecode, data=newpums)
mod2<-lm(log(mywage)~scale(age)+scale(I(age^2))+sexrecode+race_eth, data=newpums)
mod3<-lm(log(mywage)~scale(age)+scale(I(age^2))+sexrecode+race_eth+edurec, data=newpums)
anova(mod1,mod2, mod3)
## Analysis of Variance Table
##
## Model 1: log(mywage) ~ scale(age) + scale(I(age^2)) + sexrecode
## Model 2: log(mywage) ~ scale(age) + scale(I(age^2)) + sexrecode + race_eth
## Model 3: log(mywage) ~ scale(age) + scale(I(age^2)) + sexrecode + race_eth +
## edurec
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 132478 137527
## 2 132474 135374 4 2153 592.79 < 2.2e-16 ***
## 3 132471 120284 3 15089 5539.41 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The AIC (Akaike Information Criterion) is another method for judging relative model fit. In general it is constructed as: \[\text{AIC} = -2\log \hat{L} + 2k\] where \(\hat{L}\) is the model's maximized likelihood and \(k\) is the number of estimated parameters. For a linear model with normal errors this is equivalent, up to a constant, to \(n\log(\text{RSS}/n) + 2k\), where RSS is the residual sum of squares, \(\sum_i (y_i - \hat{y}_i)^2\). Lower values indicate better relative fit. The AIC() function can get this for us. Typically we would only consider one model to fit better than another if its AIC is lower by more than 10.
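As a quick check (a sketch), R's AIC() computes the general formula \(-2\log\hat{L} + 2k\) from the model's log-likelihood, which we can reproduce by hand for mod1:
ll<-logLik(mod1)
-2*as.numeric(ll)+2*attr(ll, "df")  # matches the value AIC() reports
AIC(mod1)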
AICs<-AIC(mod1, mod2, mod3)
AICs$diff<-AICs$AIC-AICs$AIC[1]
AICs
## df AIC diff
## mod1 5 380928.7 0.000
## mod2 9 378846.2 -2082.449
## mod3 12 363195.3 -17733.334
So we conclude that the third model fits best of the three considered here. That is not to say it is the best model we could fit, only that it is the best among those we have tried.
Finally, we can present all three models in a nice-looking table with a little code, using the stargazer package:
library(stargazer)
##
## Please cite as:
## Hlavac, Marek (2015). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2. http://CRAN.R-project.org/package=stargazer
stargazer(mod1, mod2, mod3, type = "html", style = "demography", ci = T)
Dependent variable: log(mywage)

|                     | Model 1 | Model 2 | Model 3 |
|---------------------|------------------|------------------|------------------|
| scale(age)          | 2.540*** | 2.531*** | 2.315*** |
|                     | (2.508, 2.572) | (2.499, 2.562) | (2.285, 2.345) |
| scale(I(age^2))     | -2.276*** | -2.282*** | -2.068*** |
|                     | (-2.308, -2.244) | (-2.314, -2.250) | (-2.098, -2.038) |
| sexrecodemale       | 0.390*** | 0.385*** | 0.425*** |
|                     | (0.379, 0.401) | (0.374, 0.396) | (0.414, 0.435) |
| race_ethhispanic    |  | -0.291*** | -0.082*** |
|                     |  | (-0.308, -0.275) | (-0.098, -0.066) |
| race_ethnh_asian    |  | 0.045*** | -0.057*** |
|                     |  | (0.021, 0.069) | (-0.080, -0.035) |
| race_ethnh_black    |  | -0.306*** | -0.203*** |
|                     |  | (-0.325, -0.287) | (-0.221, -0.185) |
| race_ethnh_other    |  | -0.211*** | -0.140*** |
|                     |  | (-0.246, -0.177) | (-0.173, -0.108) |
| edurechs            |  |  | -0.742*** |
|                     |  |  | (-0.756, -0.728) |
| edurecnohs          |  |  | -1.021*** |
|                     |  |  | (-1.042, -0.999) |
| edurecsomecoll      |  |  | -0.575*** |
|                     |  |  | (-0.587, -0.562) |
| Constant            | 10.097*** | 10.170*** | 10.563*** |
|                     | (10.089, 10.105) | (10.162, 10.179) | (10.553, 10.574) |
| N                   | 132,482 | 132,482 | 132,482 |
| R²                  | 0.211 | 0.224 | 0.310 |
| Adjusted R²         | 0.211 | 0.224 | 0.310 |
| Residual Std. Error | 1.019 (df = 132478) | 1.011 (df = 132474) | 0.953 (df = 132471) |
| F Statistic         | 11,842.670*** (df = 3; 132478) | 5,456.983*** (df = 7; 132474) | 5,960.811*** (df = 10; 132471) |

\*p < .05; \*\*p < .01; \*\*\*p < .001