Remember our model from last time:
\(y_i = \beta_0 + \sum_{k=1}^{p} \beta_k x_{ik} + \epsilon_i\)
We described how the model's coefficients are estimated (by least squares), how to test hypotheses about them, how to include qualitative predictors, and how to produce standardized coefficients.
This example illustrates how to deal with issues of model fit and how to compare the fits of nested regression models.
A very real feature of observed data is correlation between predictors. If two variables are perfectly correlated, they are said to be collinear. More generally, when several predictors are correlated with one another, we have a multicollinearity condition among our predictors. If the correlations are small, this should have little effect; if they are sizable, however, the consequences for the statistical interpretation of our models can be serious.
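To see why this matters, here is a quick simulation (hypothetical data, separate from the PRB example below) in which two nearly collinear predictors inflate both coefficients' standard errors:
set.seed(1115)
x1<-rnorm(100)
x2<-x1+rnorm(100, sd=0.1)  # x2 is nearly a copy of x1
y<-1+2*x1+rnorm(100)
# with both predictors in the model, the standard errors on x1 and x2
# are inflated roughly tenfold relative to a model with only x1
summary(lm(y~x1+x2))$coefficients
summary(lm(y~x1))$coefficients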
library(readr)
library(car)
library(dplyr)
prb<-read_csv(file = "https://raw.githubusercontent.com/coreysparks/data/master/PRB2008_All.csv", col_types = read_rds(url("https://raw.githubusercontent.com/coreysparks/r_courses/master/prbspec.rds")))
names(prb)<-tolower(names(prb))
prb<-prb%>%
mutate(africa=ifelse(continent=="Africa", 1, 0))
Now we fit a multiple regression model with continuous and categorical predictors:
fit0<-lm(tfr~scale(e0total)+scale(gnippppercapitausdollars)+scale(percpoplt15)+africa, data=prb)
fit<-lm(tfr~scale(e0total)+scale(gnippppercapitausdollars)+scale(cdr)+scale(percpoplt15)+africa, data=prb)
summary(fit)
##
## Call:
## lm(formula = tfr ~ scale(e0total) + scale(gnippppercapitausdollars) +
## scale(cdr) + scale(percpoplt15) + africa, data = prb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.35013 -0.23068 0.02261 0.23777 1.50355
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.91126 0.05042 57.742 < 2e-16 ***
## scale(e0total) 0.65596 0.14099 4.652 6.57e-06 ***
## scale(gnippppercapitausdollars) 0.16348 0.05368 3.046 0.00269 **
## scale(cdr) 0.67809 0.08765 7.736 8.64e-13 ***
## scale(percpoplt15) 1.81926 0.09167 19.845 < 2e-16 ***
## africa 0.32422 0.12537 2.586 0.01055 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.486 on 170 degrees of freedom
## (33 observations deleted due to missingness)
## Multiple R-squared: 0.9129, Adjusted R-squared: 0.9103
## F-statistic: 356.4 on 5 and 170 DF, p-value: < 2.2e-16
Now we examine basic correlation among predictors:
cor(prb[, c("e0total", "gnippppercapitausdollars", "africa", "cdr", "percpoplt15")], use = "pairwise")
## e0total gnippppercapitausdollars africa
## e0total 1.0000000 0.6390417 -0.7055534
## gnippppercapitausdollars 0.6390417 1.0000000 -0.4051724
## africa -0.7055534 -0.4051724 1.0000000
## cdr -0.7680018 -0.2622849 0.5333133
## percpoplt15 -0.7839367 -0.6791153 0.6040865
## cdr percpoplt15
## e0total -0.7680018 -0.7839367
## gnippppercapitausdollars -0.2622849 -0.6791153
## africa 0.5333133 0.6040865
## cdr 1.0000000 0.3102154
## percpoplt15 0.3102154 1.0000000
So we see that everything is pretty correlated, but does this correlation translate into a collinearity problem? We can use the vif() function in the car library to examine the variance inflation factors (VIFs) for the variables in the model.
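For reference, the VIF for the \(j\)-th predictor is \[\text{VIF}_j = \frac{1}{1 - R_j^2}\] where \(R_j^2\) is the \(R^2\) from regressing that predictor on all of the other predictors. Values near 1 indicate no collinearity, and a common rule of thumb flags VIFs above 10 as problematic: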
vif(fit)
## scale(e0total) scale(gnippppercapitausdollars)
## 14.741080 2.134517
## scale(cdr) scale(percpoplt15)
## 5.470222 6.349326
## africa
## 2.410415
So we see a problem: the VIF for e0total is 14.7, which suggests it is collinear with something else in the model. I suspect it's the crude death rate, since both variables measure mortality in some way.
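We can check this by hand (a sketch; the name aux is just for illustration): regress e0total on the other predictors and apply the formula above. The result should closely match the vif() output.
aux<-lm(scale(e0total)~scale(gnippppercapitausdollars)+scale(cdr)+scale(percpoplt15)+africa, data=prb)
1/(1-summary(aux)$r.squared)  # should be close to the 14.7 reported above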
One way to deal with this is to construct an index from these two variables. While there are lots of ways to do this, a simple technique is an additive index of their z-scores:
# flip the sign of cdr so both components point the same way: higher values
# of the index mean more favorable mortality (longer life, lower death rate)
prb$mort<-scale(prb$e0total)+scale(-1*prb$cdr)
fit2<-lm(tfr~mort+scale(gnippppercapitausdollars)+scale(percpoplt15)+africa, data=prb)
summary(fit2)
##
## Call:
## lm(formula = tfr ~ mort + scale(gnippppercapitausdollars) + scale(percpoplt15) +
## africa, data = prb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.11480 -0.29076 0.05765 0.28433 1.73985
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.94895 0.05500 53.619 < 2e-16 ***
## mort -0.17850 0.03240 -5.510 1.3e-07 ***
## scale(gnippppercapitausdollars) 0.26190 0.05623 4.658 6.4e-06 ***
## scale(percpoplt15) 1.39395 0.06471 21.541 < 2e-16 ***
## africa 0.24310 0.13703 1.774 0.0778 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5343 on 171 degrees of freedom
## (33 observations deleted due to missingness)
## Multiple R-squared: 0.8941, Adjusted R-squared: 0.8917
## F-statistic: 361.1 on 4 and 171 DF, p-value: < 2.2e-16
vif(fit2)
## mort scale(gnippppercapitausdollars)
## 2.206513 1.938628
## scale(percpoplt15) africa
## 2.618279 2.382869
So we no longer have that issue. The solution presented above is a crude form of variable reduction: when multiple variables in a regression model are highly correlated, you may choose to reduce the total number of variables in the model by combining them. The additive index above is one such approach; another popular method is principal components analysis.
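As a rough sketch of the principal components alternative using the same two mortality measures (the names z, pc, mort_pc1, and fit_pc are hypothetical, not part of the original analysis), we could use the first principal component as a single mortality index:
# z-score both mortality measures and run PCA on the complete rows
z<-scale(prb[, c("e0total", "cdr")])
pc<-prcomp(na.omit(z))
# score every row on the first component; rows with missing data get NA
prb$mort_pc1<-as.numeric(z%*%pc$rotation[,1])
fit_pc<-lm(tfr~mort_pc1+scale(gnippppercapitausdollars)+scale(percpoplt15)+africa, data=prb)
vif(fit_pc)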
Oftentimes we will not be interested in only one regression model; more often, we will consider multiple models in an analysis. Typically these are constructed in a nested fashion, meaning that we enter predictors into the analysis in blocks. The goal here is typically two-fold. First, we may wish to make a model more complete, so we add more variables, usually a cohesive block of them (variables that measure SES, for instance). Second, we may have a particular independent variable of interest and be trying to "control away" its effect by adjusting for other variables. You see a lot of this in the health literature.
Regardless of why you are making nested models, you need to test whether a more complicated model fits the data better than a simpler model (one with fewer parameters). For the linear model this is done with two tools. First is the F-test on the residual sums of squares of the two models. Second is the Akaike Information Criterion (AIC), which judges relative model fit while penalizing overly complex models.
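Concretely, for nested models where the smaller model has \(p_1\) parameters and the larger has \(p_2 > p_1\), the test statistic is \[F = \frac{(\text{RSS}_1 - \text{RSS}_2)/(p_2 - p_1)}{\text{RSS}_2/(n - p_2)}\] which is compared to an \(F\) distribution with \(p_2 - p_1\) and \(n - p_2\) degrees of freedom; this is what R's anova() function computes when given nested models.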
Let’s fit a couple of models using the IPUMS data:
library(haven)
library(dplyr)
ipums<-read_dta("https://github.com/coreysparks/data/blob/master/usa_00045.dta?raw=true")
newpums<-ipums%>%
filter(labforce==2, age>=18, incwage>0)%>%
mutate(mywage= ifelse(incwage%in%c(999998,999999), NA,incwage),
sexrecode=ifelse(sex==1, "male", "female"))%>%
mutate(race_eth = case_when(.$hispan %in% c(1:4) & .$race %in%c(1:9) ~ "hispanic",
.$hispan ==0 & .$race==1 ~"0nh_white",
.$hispan ==0 & .$race==2 ~"nh_black",
.$hispan ==0 & .$race%in%c(3,7,8,9) ~"nh_other",
.$hispan ==0 & .$race%in%c(4:6) ~"nh_asian",
.$hispan==9 ~ "missing"))%>%
mutate(edurec = case_when(.$educd %in% c(0:61)~"nohs",
.$educd %in% c(62:64)~"hs",
.$educd %in% c(65:100)~"somecoll",
.$educd %in% c(101:116)~"collgrad",
.$educd ==999 ~ "missing"))
# the "0" prefix on "0nh_white" makes that level sort first, so lm() uses it
# as the reference category by default; equivalently, we could relevel an
# explicit factor:
# newpums$race_eth<-relevel(as.factor(newpums$race_eth), ref = "0nh_white")
Here we will fit three models. First, we need to make sure we have complete cases for all of our variables; otherwise the models will not be fit on the same observations and we cannot compare them to one another.
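One way to do this (a sketch, assuming these are exactly the variables the three models use) is to keep only rows with no missing values:
newpums<-newpums%>%
filter(!is.na(mywage), !is.na(age), !is.na(sexrecode), !is.na(race_eth), !is.na(edurec))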
We then use the anova() function to compare the models to one another using the F-test.
mod1<-lm(log(mywage)~scale(age)+scale(I(age^2))+sexrecode, data=newpums)
mod2<-lm(log(mywage)~scale(age)+scale(I(age^2))+sexrecode+race_eth, data=newpums)
mod3<-lm(log(mywage)~scale(age)+scale(I(age^2))+sexrecode+race_eth+edurec, data=newpums)
anova(mod1,mod2, mod3)
## Analysis of Variance Table
##
## Model 1: log(mywage) ~ scale(age) + scale(I(age^2)) + sexrecode
## Model 2: log(mywage) ~ scale(age) + scale(I(age^2)) + sexrecode + race_eth
## Model 3: log(mywage) ~ scale(age) + scale(I(age^2)) + sexrecode + race_eth +
## edurec
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 132478 137527
## 2 132474 135374 4 2153 592.79 < 2.2e-16 ***
## 3 132471 120284 3 15089 5539.41 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The AIC (Akaike Information Criterion) is another method for judging relative model fit. In general it is constructed as: \[\text{AIC} = -2\log \hat{L} + 2k\] where \(\hat{L}\) is the model's maximized likelihood and \(k\) is the number of estimated parameters. For a linear model with normal errors this is equivalent, up to a constant, to \(n\log(\text{RSS}/n) + 2k\), where RSS is the residual sum of squares, \(\sum_i (y_i - \hat{y}_i)^2\). Lower values indicate better relative fit. The AIC() function can get this for us. Typically we would only consider one model to fit better than another if its AIC is lower by more than 10.
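As a quick check (a sketch), R's AIC() computes the general formula \(-2\log\hat{L} + 2k\) from the model's log-likelihood, which we can reproduce by hand for mod1:
ll<-logLik(mod1)
-2*as.numeric(ll)+2*attr(ll, "df")  # matches the value AIC() reports
AIC(mod1)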
AICs<-AIC(mod1, mod2, mod3)
AICs$diff<-AICs$AIC-AICs$AIC[1]
AICs
## df AIC diff
## mod1 5 380928.7 0.000
## mod2 9 378846.2 -2082.449
## mod3 12 363195.3 -17733.334
So we conclude that the third model fits best of the three considered here. That is not to say it is the best model we could fit, only that it is the best among those we have tried.
Finally, we can present all three models in a nice-looking table with a little code, using the stargazer package:
library(stargazer)
##
## Please cite as:
## Hlavac, Marek (2015). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2. http://CRAN.R-project.org/package=stargazer
stargazer(mod1, mod2, mod3, type = "html", style = "demography", ci = T)
Dependent variable: log(mywage)

|                     | Model 1 | Model 2 | Model 3 |
|---------------------|------------------|------------------|------------------|
| scale(age)          | 2.540*** | 2.531*** | 2.315*** |
|                     | (2.508, 2.572) | (2.499, 2.562) | (2.285, 2.345) |
| scale(I(age^2))     | -2.276*** | -2.282*** | -2.068*** |
|                     | (-2.308, -2.244) | (-2.314, -2.250) | (-2.098, -2.038) |
| sexrecodemale       | 0.390*** | 0.385*** | 0.425*** |
|                     | (0.379, 0.401) | (0.374, 0.396) | (0.414, 0.435) |
| race_ethhispanic    |  | -0.291*** | -0.082*** |
|                     |  | (-0.308, -0.275) | (-0.098, -0.066) |
| race_ethnh_asian    |  | 0.045*** | -0.057*** |
|                     |  | (0.021, 0.069) | (-0.080, -0.035) |
| race_ethnh_black    |  | -0.306*** | -0.203*** |
|                     |  | (-0.325, -0.287) | (-0.221, -0.185) |
| race_ethnh_other    |  | -0.211*** | -0.140*** |
|                     |  | (-0.246, -0.177) | (-0.173, -0.108) |
| edurechs            |  |  | -0.742*** |
|                     |  |  | (-0.756, -0.728) |
| edurecnohs          |  |  | -1.021*** |
|                     |  |  | (-1.042, -0.999) |
| edurecsomecoll      |  |  | -0.575*** |
|                     |  |  | (-0.587, -0.562) |
| Constant            | 10.097*** | 10.170*** | 10.563*** |
|                     | (10.089, 10.105) | (10.162, 10.179) | (10.553, 10.574) |
| N                   | 132,482 | 132,482 | 132,482 |
| R²                  | 0.211 | 0.224 | 0.310 |
| Adjusted R²         | 0.211 | 0.224 | 0.310 |
| Residual Std. Error | 1.019 (df = 132478) | 1.011 (df = 132474) | 0.953 (df = 132471) |
| F Statistic         | 11,842.670*** (df = 3; 132478) | 5,456.983*** (df = 7; 132474) | 5,960.811*** (df = 10; 132471) |

\*p < .05; \*\*p < .01; \*\*\*p < .001