This file contains the exercises for session 2 of PUBP-705, Advanced Statistical Methods for Policy Analysis.

This session covers basic regression and violations of the Gauss-Markov (GM) and classical linear model (CLM) assumptions.

First, load some packages which we will need to complete these exercises.

require(car)
## Loading required package: car
## Warning: package 'car' was built under R version 3.0.3
require(foreign)
## Loading required package: foreign
## Warning: package 'foreign' was built under R version 3.0.3

Read in a dataset in Stata format.

country <- read.dta("C:/Users/Administrator/Downloads/country.dta")

Start with a basic linear regression model.

model <- lm(lifeexpf ~ fertrate + docs, data=country)

summary(model)
## 
## Call:
## lm(formula = lifeexpf ~ fertrate + docs, data = country)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.704  -3.368  -0.054   4.376  14.082 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 75.76183    2.41039  31.431  < 2e-16 ***
## fertrate    -3.12587    0.41031  -7.618 7.61e-12 ***
## docs         0.38414    0.07165   5.361 4.27e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.799 on 116 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.7339, Adjusted R-squared:  0.7293 
## F-statistic:   160 on 2 and 116 DF,  p-value: < 2.2e-16
  1. Check for Multicollinearity.

There are several ways to do this.

# make a smaller dataset to use for the correlations
# this is probably not the most efficient way to do this, but good practice
subdata <- data.frame(docs = country$docs, 
                      lifeexpf = country$lifeexpf, 
                      fertrate = country$fertrate)

# make a correlation matrix to check for multicollinearity
cor(subdata, use = "pairwise.complete.obs")
##                docs   lifeexpf   fertrate
## docs      1.0000000  0.7793893 -0.7355467
## lifeexpf  0.7793893  1.0000000 -0.8206325
## fertrate -0.7355467 -0.8206325  1.0000000
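
An equivalent, slightly more direct way (a small sketch, not run in class) is to select the columns straight from country rather than building a separate data frame; it should produce the same matrix as above.

# same correlation matrix, selecting the columns directly from 'country'
cor(country[, c("docs", "lifeexpf", "fertrate")], use = "pairwise.complete.obs")
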
# run variance inflation factor test (VIF)
# this uses the 'car' package
vif(model)
## fertrate     docs 
## 2.178787 2.178787

Correlations and VIF tests both give a sense of whether independent variables might be excessively correlated.

Here’s why that’s a concern.

# look at what can happen to a p-value with multicollinearity
# here are three sample models

female <- lm(gdp ~ lifeexpf, data=country)
male <- lm(gdp ~ lifeexpm, data=country)
both <- lm(gdp ~ lifeexpf + lifeexpm, data=country)

summary(female)
## 
## Call:
## lm(formula = gdp ~ lifeexpf, data = country)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7404.8 -3340.8  -361.6  2416.5 13765.2 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -19605.88    2494.36  -7.860 1.85e-12 ***
## lifeexpf       358.36      37.09   9.663  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4604 on 120 degrees of freedom
## Multiple R-squared:  0.4376, Adjusted R-squared:  0.4329 
## F-statistic: 93.37 on 1 and 120 DF,  p-value: < 2.2e-16
summary(male)
## 
## Call:
## lm(formula = gdp ~ lifeexpm, data = country)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7487.2 -3432.7  -626.2  2441.7 14349.9 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -20130.94    2703.75  -7.446 1.61e-11 ***
## lifeexpm       392.37      43.13   9.098 2.41e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4723 on 120 degrees of freedom
## Multiple R-squared:  0.4082, Adjusted R-squared:  0.4033 
## F-statistic: 82.77 on 1 and 120 DF,  p-value: 2.408e-15
summary(both)
## 
## Call:
## lm(formula = gdp ~ lifeexpf + lifeexpm, data = country)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7754.4 -3267.5  -556.3  2149.1 13415.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -18099.0     2720.9  -6.652  9.3e-10 ***
## lifeexpf       676.7      236.7   2.858  0.00503 ** 
## lifeexpm      -365.3      268.4  -1.361  0.17601    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4588 on 119 degrees of freedom
## Multiple R-squared:  0.4462, Adjusted R-squared:  0.4369 
## F-statistic: 47.94 on 2 and 119 DF,  p-value: 5.354e-16
# the significance of lifeexpm goes away (and its sign even flips) when you add the
# multicollinear variable into the combined model
## This is misleading! 

# run another VIF
vif(both)
## lifeexpf lifeexpm 
## 41.03305 41.03305
# that's a really high number... this is bad...
# TYPICALLY, a VIF > 10 indicates problems; however, these are just heuristics
# you need to use your judgement on these things...
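
To see where those numbers come from: the VIF for a predictor is 1/(1 - R^2) from regressing that predictor on the other predictors in the model. Here is a quick sketch (not run in class) for lifeexpf in the combined model; it should come out close to the vif() value above (small differences can arise if lm() drops different missing rows).

# auxiliary regression: lifeexpf on the other predictor in 'both'
aux <- lm(lifeexpf ~ lifeexpm, data = country)

# VIF = 1 / (1 - R^2) of the auxiliary regression
1 / (1 - summary(aux)$r.squared)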
  2. Tests for Normality of Residuals
# make a new model to test on
model <- lm(lifeexpf ~ log(docs) + gdp, data=country)

# plot residuals 
hist(model$residuals, prob = T)

# overlay a normal curve for comparison
curve(dnorm(x, mean=mean(model$residuals), sd=sd(model$residuals)), 
        col="darkblue", lwd=2, add=TRUE, yaxt="n")

# looks reasonably normal

A more involved type of plot is a Q-Q plot. This plots the quantiles of the residuals against the quantiles of a theoretical normal distribution, which shows whether the residuals are approximately normal across their full range. In other words, are you making the same sorts of prediction errors across the full range of your data? We assume that we are.

# Q-Q plot - check to see if the residuals are consistent with a normal distribution at different quantiles
qqnorm(model$residuals)

# ideally, we want a perfectly straight line at a 45 degree angle
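
Base R can draw that reference line for you with qqline(), which adds a line through the first and third quartiles of the residuals (a small addition, not run in class):

# add a reference line through the 1st and 3rd quartiles of the residuals
qqline(model$residuals)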

# short-cut
plot(model)

# this will give you 4 common diagnostic plots
# 1. Residuals vs Fitted Values (are prediction errors constant across predicted values?)
# 2. Normal Q-Q plot (which we just did), with labeled outlier points
# 3. "Scale-Location" plot: same idea as 1, but with standardized residuals. Another
#    way to evaluate how stable prediction errors are across predicted values.
# 4. Residuals vs Leverage plot: gives you a sense of how much 'leverage' an
#    observation has on your regression line. Larger leverage indicates that an
#    observation has a large independent impact (it may be an outlier)
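
By default, plot(model) typically shows these one at a time (prompting you to hit Return between plots). If you want all four in a single window, a small sketch is to set up a 2x2 plotting grid first:

# show the four diagnostic plots in one 2x2 window
par(mfrow = c(2, 2))
plot(model)
par(mfrow = c(1, 1))  # reset the plotting grid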

We can also test the residuals for normality explicitly, using the Shapiro-Wilk test. This is a hypothesis test that evaluates a variable for normality.

# Shapiro-Wilk test for normality of residuals
# Null hypothesis is that the residuals ARE normally distributed
# rejecting the null at a conventional significance level is bad (it suggests the residuals are not normal)
shapiro.test(model$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model$residuals
## W = 0.9772, p-value = 0.0379
  3. Tests for Non-Linearity of Relationship

Generally, the best way to diagnose non-linearity is to plot relationships between your variables.

# the 'pairs' command will give you a scatterplot matrix of all your data
# it is best to subset to the variables you want to look at for better viewing
pairs(subdata)

# this should indicate that the relationship between some of these variables
# is non-linear

Here is another way to look at potential non-linearities in more depth.

Plot the relationship between docs and lifeexpf

plot(country$docs,country$lifeexpf)

# the relationship here doesn't look linear.... 
# HOWEVER: the regression formula will draw a straight line anyway

# load a better plotting package to get a sense of this
# NOTE: ggplot requires "Rcpp" to work
# if necessary, install it first
# install.packages("Rcpp")
require(ggplot2)
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.0.3

# the code below will plot our points AND plot a linear fit line through it
ggplot(data = country, aes(docs, lifeexpf)) + geom_point() + stat_smooth(method = "glm")
## Warning: Removed 1 rows containing missing values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).

# Note that the program has done its best to generate a straight line through
# the data - it is a terrible fit, but it did it anyway
# That's all a linear regression knows how to do
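
For contrast, a flexible smoother will bend with the data and makes the curvature obvious. A small sketch (not run in class) using ggplot2's loess smoother:

# same scatterplot, but with a loess smoother instead of a straight line
ggplot(data = country, aes(docs, lifeexpf)) + geom_point() + stat_smooth(method = "loess")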

What to do about non-linearity?

A common solution is to transform variables.

It looked like that relationship would benefit from taking the log of docs.

# perhaps taking the natural log of one of our variables will help

country$log_docs <- log(country$docs)

# the code above takes the natural log of 'docs' 
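
One caution (my note, not from class): log() returns -Inf for zeros and NaN for negative values, so it is worth checking the variable before transforming it.

# check for zero or negative values before taking the log
any(country$docs <= 0, na.rm = TRUE)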

# look at a scatter matrix

pairs(data.frame(country$docs, country$log_docs, country$lifeexpf))

# the log transformation looks like a much better fit
# look at models using the transformed and untransformed

model1 <- lm(lifeexpf ~ docs, data=country)
model2 <- lm(lifeexpf ~ log_docs, data=country)

summary(model1)
## 
## Call:
## lm(formula = lifeexpf ~ docs, data = country)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.3943  -4.8537   0.0452   5.1191  16.4264 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 58.24700    0.88114   66.10   <2e-16 ***
## docs         0.78324    0.05772   13.57   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.023 on 119 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.6074, Adjusted R-squared:  0.6041 
## F-statistic: 184.1 on 1 and 119 DF,  p-value: < 2.2e-16
summary(model2)
## 
## Call:
## lm(formula = lifeexpf ~ log_docs, data = country)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.7510  -2.9031   0.8198   3.9033  15.5621 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  57.1054     0.6749   84.61   <2e-16 ***
## log_docs      6.3241     0.3152   20.07   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.354 on 119 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.7719, Adjusted R-squared:  0.7699 
## F-statistic: 402.6 on 1 and 119 DF,  p-value: < 2.2e-16

If you’d only looked at the model, you might have thought that docs was a pretty good fit - the R^2 isn’t too bad.

However, by plotting, we got the sense that it’s probably not a great fit. The log_docs variable fits much better.

Interpretation:

Remember - we're now working on a different scale for the X. The relationship is still linear, but it is now linear between the DV and the log of the IV.

So, you have to think in log terms when interpreting your results.

Conveniently, log terms can be interpreted in percentage terms: a 1% increase in your IV (docs per 10,000) is associated with roughly a (Beta/100) unit increase in your DV (female life expectancy, in years).

I think I got this right, but check on this point…
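
As a quick sanity check with model2's fitted coefficient (a sketch, not run in class): the slope above was 6.3241, so a 1% increase in docs corresponds to roughly 6.3241/100, or about 0.06 additional years of female life expectancy.

# level-log interpretation using model2 (lifeexpf ~ log_docs):
# a 1% increase in docs is associated with roughly beta/100 more years of life expectancy
coef(model2)["log_docs"] / 100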

You can also reverse this - if you take the log of the DV and not the IV, a one-unit change in the IV is associated with approximately a (Beta*100)% change in the DV.

# Example of that
model3 <- lm((log(lifeexpf)) ~ docs, data=country)
summary(model3)
## 
## Call:
## lm(formula = (log(lifeexpf)) ~ docs, data = country)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.34473 -0.07652  0.00736  0.08789  0.25440 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.0560488  0.0146901  276.11   <2e-16 ***
## docs        0.0119759  0.0009623   12.45   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1171 on 119 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.5655, Adjusted R-squared:  0.5619 
## F-statistic: 154.9 on 1 and 119 DF,  p-value: < 2.2e-16
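
The same kind of quick check for model3 (a sketch, not run in class): with the coefficient above (0.0119759), one additional doctor per 10,000 people is associated with roughly a 1.2% increase in female life expectancy.

# log-level interpretation using model3 (log(lifeexpf) ~ docs):
# a one-unit increase in docs is associated with roughly 100*beta percent higher lifeexpf
coef(model3)["docs"] * 100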

What if we logged both sides? Then we get a log-log model, which allows us to interpret relationships in elasticity terms: a 1% change in X is associated with roughly a (Beta)% change in Y.
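
We did not fit one in class, but here is a minimal sketch of what that would look like (model4 is just an illustrative name):

# log-log model: the slope is an elasticity
# (a 1% change in docs is associated with roughly a beta% change in lifeexpf)
model4 <- lm(log(lifeexpf) ~ log(docs), data = country)
coef(model4)["log(docs)"]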

A student question: how do you know what transformation to use? Amit’s answer: generally, you have to have a sense of what different functions look like.

Here’s an example of some common functions and their graphical forms.

# What do transformations look like?

# create a random vector with 100 obs
x <- runif(100, min = 1, max = 100)

# transform it in several common ways
x2 <- x^2
log_x <- log(x)
x3 <- x^3
sqrt_x <- sqrt(x)

pairs(data.frame(x,x2,x3,log_x,sqrt_x))

This should help give a sense of what different transformations of variables look like, and what to keep an eye out for in your data.

NOTE: I believe class ended around this point; we picked up with log-log models at the beginning of session 3.