Problem 1

require(foreign)
## Loading required package: foreign
library(tidyverse)
## ── Attaching packages ────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.3
## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## Warning: package 'ggplot2' was built under R version 3.6.2
## Warning: package 'tibble' was built under R version 3.6.2
## Warning: package 'tidyr' was built under R version 3.6.2
## Warning: package 'dplyr' was built under R version 3.6.2
## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(texreg)
## Version:  1.36.23
## Date:     2017-03-03
## Author:   Philip Leifeld (University of Glasgow)
## 
## Please cite the JSS article in your publications -- see citation("texreg").
## 
## Attaching package: 'texreg'
## The following object is masked from 'package:tidyr':
## 
##     extract
library(AER)
## Loading required package: car
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
## Loading required package: lmtest
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
## Loading required package: survival
## Warning: package 'survival' was built under R version 3.6.2
data = read.csv("xr523sim.csv", header = T, sep = ",", dec = ".")

1.

formula1 = CRIME ~ POLICE

mod1 = lm(formula1, data = data)

test1 = coeftest(mod1, vcov. = vcovHC(mod1, type = 'HC0'))

screenreg(test1)
## 
## ======================
##              Model 1  
## ----------------------
## (Intercept)  -7.56 ***
##              (0.94)   
## POLICE        1.68 ***
##              (0.10)   
## ======================
## *** p < 0.001, ** p < 0.01, * p < 0.05

We obtain a significant estimate: when POLICE increases by 1 unit, CRIME increases by 1.68 units. A possible explanation is that a higher level of police enforcement raises the measured crime rate because more crime is detected.

However, one would expect this estimate to be negative. The model is therefore potentially exposed to endogeneity.

2. The election dummy might serve as a good instrument for two reasons:

  1. ELECTION is plausibly uncorrelated with the error term \(\varepsilon\), since random variation in CRIME will in general have no effect on the dates of elections (in the U.S., at least, where election days are fixed).

  2. ELECTION can plausibly be correlated with POLICE, because the size of the police force could be part of a political party’s policy platform; this relevance condition is checked directly below.
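
Relevance can be checked with a simple first-stage regression; a minimal sketch (exogeneity, by contrast, cannot be tested here, and the object name stage1 is ours):

# First-stage relevance check: ELECTION must actually move POLICE
stage1 = lm(POLICE ~ ELECTION, data = data)

coeftest(stage1, vcov. = vcovHC(stage1, type = 'HC0'))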

3.

formula2 = CRIME ~ POLICE | ELECTION

mod2 = ivreg(formula2, data = data)

test2 = coeftest(mod2, vcov. = vcovHC(mod2, type = 'HC0'))

list('OLS' = test1, 'IV' = test2) %>% screenreg()
## 
## ===============================
##              OLS        IV     
## -------------------------------
## (Intercept)  -7.56 ***  18.22 *
##              (0.94)     (8.72) 
## POLICE        1.68 ***  -0.87  
##              (0.10)     (0.86) 
## ===============================
## *** p < 0.001, ** p < 0.01, * p < 0.05

The IV estimate implies that enlarging the police force by 1 unit decreases average crime by 0.87 units, although the estimate is not statistically significant. \(POLICE_{IV}\) is thus the average change in crime per extra unit of police.
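
To make explicit what ivreg() computes, here is a minimal manual 2SLS sketch (the variable name POLICE_hat is ours; the second-stage lm() standard errors are invalid, so ivreg() should be used for inference):

# Manual 2SLS: first stage, then second stage on the fitted values
stage1 = lm(POLICE ~ ELECTION, data = data)

data$POLICE_hat = fitted(stage1)

stage2 = lm(CRIME ~ POLICE_hat, data = data)

coef(stage2) # point estimates coincide with mod2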

4. We use the Durbin-Wu-Hausman test for regressor exogeneity. The test is based on the difference between the IV and OLS estimates.

\(H_0: \mathbb{E}[x_i \varepsilon_i] = 0\) vs. \(H_1: \mathbb{E}[x_i \varepsilon_i] \neq 0\)

options(scipen = 100, digits = 6)

# Apply OLS to the first-stage (reduced-form) model
formula3 = POLICE ~ ELECTION

mod3 = lm(formula3, data = data)

# Obtain the first-stage residuals
data$v = residuals(mod3)

# Obtain the residuals from the structural OLS regression
data$e = residuals(mod1)

# Apply OLS to the auxiliary regression; e is orthogonal to POLICE by
# construction, so the R-squared is driven by v alone
formula4 = e ~ POLICE + v

mod4 = lm(formula4, data = data)

# Compute LM stat
rsq = summary(mod4)$r.squared

LM = nrow(data)*rsq

pchisq(LM,df=1, lower.tail = F)
## [1] 0.00000000000000000000161085

We reject the null of the Hausman test and conclude that POLICE is indeed endogenous in the regression.
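
An equivalent control-function version of the test (a sketch, mirroring the approach used in Problem 2 below; the object name mod_cf is ours) adds the first-stage residual to the structural equation and checks its significance:

# Control-function variant: a significant coefficient on v indicates
# that POLICE is endogenous
mod_cf = lm(CRIME ~ POLICE + v, data = data)

coeftest(mod_cf, vcov. = vcovHC(mod_cf, type = 'HC0'))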

Problem 2

load("card.RData")

1. This is essentially a manual Hausman test: we test whether the first-stage residual has any explanatory power left in the wage equation, which would indicate that educ is endogenous.

# 1st stage regression
formula1 = educ ~ nearc4 + exper + expersq + black + smsa + south + smsa66 + reg662 + reg663 + reg664 + reg665 + reg666 + reg667 + reg668 + reg669

mod1 = lm(formula1, data = data)

# Obtain residuals from 1st stage regression
data$v2 = residuals(mod1)

# Apply to the OLS model
formula2 = lwage ~ educ + exper + expersq + black + smsa + south + smsa66 + reg662 + reg663 + reg664 + reg665 + reg666 + reg667 + reg668 + reg669 + v2

mod2 = lm(formula2, data = data)

test2 = coeftest(mod2, vcov. = vcovHC(mod2, type = "HC0"))

test2 %>% screenreg(digits = 3)
## 
## =======================
##              Model 1   
## -----------------------
## (Intercept)   3.666 ***
##              (0.865)   
## educ          0.132 *  
##              (0.051)   
## exper         0.108 ***
##              (0.022)   
## expersq      -0.002 ***
##              (0.000)   
## black        -0.147 ** 
##              (0.050)   
## smsa          0.112 ***
##              (0.029)   
## south        -0.145 ***
##              (0.028)   
## smsa66        0.019    
##              (0.020)   
## reg662        0.101 ** 
##              (0.035)   
## reg663        0.148 ***
##              (0.034)   
## reg664        0.050    
##              (0.041)   
## reg665        0.146 ** 
##              (0.046)   
## reg666        0.163 ***
##              (0.049)   
## reg667        0.135 ** 
##              (0.048)   
## reg668       -0.083    
##              (0.055)   
## reg669        0.108 ** 
##              (0.040)   
## v2           -0.057    
##              (0.052)   
## =======================
## *** p < 0.001, ** p < 0.01, * p < 0.05

The coefficient on v2 is insignificant, so the difference between the IV and OLS estimates of the return to education is not statistically significant. We cannot reject the null, and as far as this test can tell, OLS is consistent and efficient.

Note that the manual Hausman test does not by itself say anything about how different the coefficients are.

2.

# OLS
formula3 = lwage ~ educ + exper + expersq + black + smsa + south + smsa66 + reg662 + reg663 + reg664 + reg665 + reg666 + reg667 + reg668 + reg669

mod3 = lm(formula3, data = data)

test3 = coeftest(mod3, vcov. = vcovHC(mod3, type = "HC0"))

# 2SLS: IVREG

formula4 = lwage ~ educ + exper + expersq + black + smsa + south + smsa66 + reg662 + reg663 + reg664 + reg665 + reg666 + reg667 + reg668 + reg669 | nearc2 + nearc4 + exper + expersq + black + smsa + south + smsa66 + reg662 + reg663 + reg664 + reg665 + reg666 + reg667 + reg668 + reg669 

mod4 = ivreg(formula4, data = data)

test4 = coeftest(mod4, vcov. = vcovHC(mod4, type = "HC0"))

list("OLS" = test3, "2SLS" = test4) %>% screenreg(digits = 3)
## 
## ===================================
##              OLS         2SLS      
## -----------------------------------
## (Intercept)   4.621 ***   3.237 ***
##              (0.074)     (0.882)   
## educ          0.075 ***   0.157 ** 
##              (0.004)     (0.052)   
## exper         0.085 ***   0.119 ***
##              (0.007)     (0.023)   
## expersq      -0.002 ***  -0.002 ***
##              (0.000)     (0.000)   
## black        -0.199 ***  -0.123 *  
##              (0.018)     (0.051)   
## smsa          0.136 ***   0.101 ** 
##              (0.019)     (0.031)   
## south        -0.148 ***  -0.143 ***
##              (0.028)     (0.030)   
## smsa66        0.026       0.015    
##              (0.019)     (0.021)   
## reg662        0.096 **    0.103 ** 
##              (0.035)     (0.038)   
## reg663        0.145 ***   0.150 ***
##              (0.034)     (0.037)   
## reg664        0.055       0.048    
##              (0.041)     (0.046)   
## reg665        0.128 **    0.154 ** 
##              (0.043)     (0.051)   
## reg666        0.141 **    0.173 ** 
##              (0.045)     (0.053)   
## reg667        0.118 **    0.142 ** 
##              (0.045)     (0.052)   
## reg668       -0.056      -0.095    
##              (0.050)     (0.059)   
## reg669        0.119 **    0.103 *  
##              (0.039)     (0.043)   
## ===================================
## *** p < 0.001, ** p < 0.01, * p < 0.05

The 2SLS estimate is now 0.157, substantially larger than the OLS estimate of 0.075.

Note that in the just-identified case we cannot test instrument exogeneity with a Hausman-type (overidentification) test.

3. To test the single overidentifying restriction from part 2, we can apply the Sargan test, which tests instrument exogeneity, \(\mathbb{E}[z_i \varepsilon_i] = 0\).

The idea is to replace \(\varepsilon_i\) with the IV residuals \(e_{IV,i}\) and test whether \(z_i\) is correlated with \(e_{IV,i}\).

# Obtain IV residuals
data$v_iv = residuals(mod4)

# Apply to OLS
formula5 = v_iv ~ nearc2 + nearc4 + exper + expersq + black + smsa + south + smsa66 + reg662 + reg663 + reg664 + reg665 + reg666 + reg667 + reg668 + reg669

mod5 = lm(formula5, data = data)

# Compute the LM statistics:

rsq = summary(mod5)$r.squared

# Our test stat is equal to
J = NROW(data)*rsq

# Under H_0, J is asymptotically chi-squared with m - k = 1 dof
pchisq(J, 1) # CDF value; the implied p-value is 1 - 0.736 = 0.264
## [1] 0.736095

We fail to reject the overidentifying restriction: maintaining that one of the instruments is valid, we cannot reject that the other is valid as well.

The Sargan test is limited by the fact that it does not tell us which of the IVs fails the exogeneity condition.
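
An informal follow-up (a heuristic sketch, not a formal test; the object names are ours) is to re-estimate the just-identified model with each instrument on its own and compare the educ coefficients. A large disagreement between the two estimates would point at the suspect instrument:

# Just-identified 2SLS with each instrument separately
formula_c2 = lwage ~ educ + exper + expersq + black + smsa + south + smsa66 + reg662 + reg663 + reg664 + reg665 + reg666 + reg667 + reg668 + reg669 | nearc2 + exper + expersq + black + smsa + south + smsa66 + reg662 + reg663 + reg664 + reg665 + reg666 + reg667 + reg668 + reg669

formula_c4 = lwage ~ educ + exper + expersq + black + smsa + south + smsa66 + reg662 + reg663 + reg664 + reg665 + reg666 + reg667 + reg668 + reg669 | nearc4 + exper + expersq + black + smsa + south + smsa66 + reg662 + reg663 + reg664 + reg665 + reg666 + reg667 + reg668 + reg669

mod_c2 = ivreg(formula_c2, data = data)

mod_c4 = ivreg(formula_c4, data = data)

list('nearc2 only' = mod_c2, 'nearc4 only' = mod_c4) %>% screenreg(digits = 3)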

# Note: diagnostics = TRUE is silently ignored by summary() on an lm
# object. To obtain the weak-instruments, Wu-Hausman and Sargan tests
# directly, call summary(mod4, diagnostics = TRUE) on the ivreg object.
# Below, the plain summary of the Sargan auxiliary regression:
summary(mod5)
## 
## Call:
## lm(formula = formula5, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9482 -0.2514  0.0206  0.2640  1.4646 
## 
## Coefficients:
##                Estimate  Std. Error t value Pr(>|t|)
## (Intercept) -0.00032243  0.05045168   -0.01     0.99
## nearc2       0.01651888  0.01617379    1.02     0.31
## nearc4      -0.00808347  0.01834983   -0.44     0.66
## exper        0.00003122  0.00703795    0.00     1.00
## expersq     -0.00000343  0.00034468   -0.01     0.99
## black       -0.00088524  0.01961674   -0.05     0.96
## smsa         0.00066831  0.02188919    0.03     0.98
## south        0.00114482  0.02831183    0.04     0.97
## smsa66      -0.00059421  0.02234012   -0.03     0.98
## reg662      -0.00287246  0.03918066   -0.07     0.94
## reg663      -0.00008449  0.03830577    0.00     1.00
## reg664       0.00119971  0.04540818    0.03     0.98
## reg665      -0.00066915  0.04561868   -0.01     0.99
## reg666      -0.00644477  0.04967384   -0.13     0.90
## reg667      -0.00083879  0.04896140   -0.02     0.99
## reg668       0.00220791  0.05597169    0.04     0.97
## reg669      -0.00610726  0.04263188   -0.14     0.89
## 
## Residual standard error: 0.405 on 2993 degrees of freedom
## Multiple R-squared:  0.000415,   Adjusted R-squared:  -0.00493 
## F-statistic: 0.0776 on 16 and 2993 DF,  p-value: 1

Problem 3

load("geneticdata.rdata")

1.

# 1st stage regression
formula1 = alcohol ~ snp1

mod1 = lm(formula1, data = geneticdata)

test1 = coeftest(mod1, vcov. = vcovHC(mod1, "HC0"))

screenreg(test1)
## 
## ======================
##              Model 1  
## ----------------------
## (Intercept)  14.19 ***
##              (0.22)   
## snp1          2.51 ***
##              (0.31)   
## ======================
## *** p < 0.001, ** p < 0.01, * p < 0.05

snp1 is statistically significant at all conventional levels. We can interpret the result as follows: carriers of the snp1 variant consume an additional 2.51 units of alcohol on average.

For this interpretation to be valid, the OLS assumptions must hold; in particular, snp1 must be (as good as) randomly assigned.
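
A quick balance check for this (a sketch, using the gender and age covariates that appear later in this problem; the object name balance is ours):

# Balance check: snp1 should be unrelated to observed covariates
balance = lm(snp1 ~ gender + age, data = geneticdata)

coeftest(balance, vcov. = vcovHC(balance, 'HC0'))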

2.

#IV
formula2 = depression ~ alcohol | snp1

mod2 = ivreg(formula2, data = geneticdata)

test2 = coeftest(mod2, vcov. = vcovHC(mod2, "HC0"))

# OLS
formula3 = depression ~ alcohol

mod3 = lm(formula3, data = geneticdata)

test3 = coeftest(mod3, vcov. = vcovHC(mod3, "HC0"))

list(IV = test2, OLS = test3) %>% screenreg()
## 
## =================================
##              IV         OLS      
## ---------------------------------
## (Intercept)  27.62 ***  17.05 ***
##              (4.05)     (1.09)   
## alcohol       0.79 **    1.47 ***
##              (0.26)     (0.07)   
## =================================
## *** p < 0.001, ** p < 0.01, * p < 0.05

The model is just identified, since the number of instruments and coefficients are equal.

The OLS estimate being larger than the IV estimate might be due to endogeneity in the form of reverse causality.

OLS (potentially) captures two effects:

If you are depressed, you tend to consume more alcohol; and if you consume a lot of alcohol, you tend to become more depressed.

IV (if done correctly) captures only the effect of interest, namely: if you consume a lot of alcohol, you tend to become more depressed.

Since OLS is exposed to reverse causality, it tends to be biased upward here. The IV estimate is accordingly smaller, because a valid instrument blocks the reverse-causality channel, and we therefore trust the IV estimate of approximately 0.79 per extra unit of alcohol.

3.

# Relevance check
formula4 = alcohol ~ snp1 + snp2

mod4 = lm(formula4, data = geneticdata)

summary(mod4)$fstatistic[1]
##   value 
## 55.0085

We assess the relevance criterion with the first-stage F-statistic. With an F-statistic of 55, the instruments jointly satisfy the relevance criterion.

The ‘rule of thumb’ states that the first-stage F-statistic should be at least 10.
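
Since the F-statistic from summary() assumes homoskedasticity, a heteroskedasticity-robust Wald test on the excluded instruments is a useful complement (a sketch using lmtest's waldtest(); the object name mod0 is ours):

# Robust joint test of snp1 and snp2 in the first stage
mod0 = lm(alcohol ~ 1, data = geneticdata)

waldtest(mod0, mod4, vcov = function(x) vcovHC(x, type = 'HC0'))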

# 2SLS w/ two instruments
formula5 = depression ~ alcohol | snp1 + snp2

mod5 = ivreg(formula5, data = geneticdata)

test5 = coeftest(mod5, vcov. = vcovHC(mod5))

list(OLS = test3, IV1 =  test2, IV2 = test5) %>% screenreg()
## 
## ============================================
##              OLS        IV1        IV2      
## --------------------------------------------
## (Intercept)  17.05 ***  27.62 ***  30.32 ***
##              (1.09)     (4.05)     (3.35)   
## alcohol       1.47 ***   0.79 **    0.61 ** 
##              (0.07)     (0.26)     (0.22)   
## ============================================
## *** p < 0.001, ** p < 0.01, * p < 0.05
# 2SLS w/ additional regressors
formula6 = depression ~ alcohol + gender + age | snp1 + snp2 + gender + age

mod6 = ivreg(formula6, data = geneticdata)

test6 = coeftest(mod6, vcov. = vcovHC(mod6))

list(OLS = test3, IV1 =  test2, IV2 = test5, IV3 = test6) %>% screenreg()
## 
## =======================================================
##              OLS        IV1        IV2        IV3      
## -------------------------------------------------------
## (Intercept)  17.05 ***  27.62 ***  30.32 ***  21.06 ***
##              (1.09)     (4.05)     (3.35)     (1.63)   
## alcohol       1.47 ***   0.79 **    0.61 **    0.56 *  
##              (0.07)     (0.26)     (0.22)     (0.22)   
## gender                                         3.81 ** 
##                                               (1.31)   
## age                                            0.19 ***
##                                               (0.04)   
## =======================================================
## *** p < 0.001, ** p < 0.01, * p < 0.05

Since IV estimates are sensitive to the choice of instruments, it is not surprising that the estimate is lower in the over-identified model than in the just-identified one.

The IV estimate is also more precise in the over-identified model (lower standard error), mainly because the instruments jointly add explanatory power in the first stage.

In general, more (valid) instruments and a better model specification (relevant control variables) yield more precise estimates of the effect of interest.

4.

# 2SLS w/ three instruments
formula7 = depression ~ alcohol | snp1 + snp2 + snp3

mod7 = ivreg(formula7, data = geneticdata)

# mod7 %>% summary(diagnostics = TRUE)

# 2SLS w/ additional regressors and three instruments
formula8 = depression ~ alcohol + gender + age | snp1 + snp2 + snp3 + gender + age

mod8 = ivreg(formula8, data = geneticdata)

mod8 %>% summary(diagnostics = TRUE)
## 
## Call:
## ivreg(formula = formula8, data = geneticdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.599  -6.817  -0.811   6.257  44.968 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)   21.008      1.652   12.72 < 0.0000000000000002 ***
## alcohol        0.565      0.223    2.53               0.0115 *  
## gender         3.759      1.333    2.82               0.0049 ** 
## age            0.185      0.043    4.30             0.000018 ***
## 
## Diagnostic tests:
##                  df1 df2 statistic              p-value    
## Weak instruments   3 994      69.5 < 0.0000000000000002 ***
## Wu-Hausman         1 995      22.8             0.000002 ***
## Sargan             2  NA       2.9                 0.23    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.3 on 996 degrees of freedom
## Multiple R-Squared: 0.297,   Adjusted R-squared: 0.295 
## Wald test: 90.5 on 3 and 996 DF,  p-value: <0.0000000000000002

The null hypothesis of the Sargan test is that the instruments are valid (exogenous).

With a p-value of 0.23, we fail to reject the null hypothesis that the instruments are exogenous.

Throughout Problem 3, using Mendelian randomization to examine the causal effect of alcohol on depression appears to have been successful.

With the Hausman test, we found that the variable alcohol was endogenous. Therefore, we used the proposed instruments to estimate the effect of alcohol on depression by IV.

Further, we failed to reject exogeneity of our instruments with the Sargan test. Note, however, that the test has limited power: failing to reject does not establish strict exogeneity.

In conclusion, we did not manage to reject the estimated causal effect: if an individual consumes an extra unit of alcohol, depression a year later increases by 0.565 units.