We want to estimate the effect of owning a PC on GPA. PC is a binary variable.
1. A nonzero corr(PC, u) causes an endogeneity problem, which violates the exogeneity assumption of OLS. If we do not control for variables that affect both \(GPA\) and \(PC\), they end up in the error term \(u\) and bias our results.
It is reasonable to suspect that socioeconomic status affects student performance, which creates exactly this problem.
The error term \(u\) contains, among other things, family income, which has a positive effect on \(GPA\) and is also very likely to be correlated with \(PC\) ownership.
corr(SOCIOECONOMIC STATUS, PC) != 0 => corr(PC, u) != 0
READ –> LINK SOCIOECONOMIC STATUS w/ STUDENT PERFORMANCE
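The omitted-variable argument can be illustrated with a small simulation. This is only a sketch on made-up data: the variables `income`, `pc` and `gpa`, and all effect sizes, are hypothetical, and the true effect of PC ownership is set to zero on purpose.

```r
# Simulated data only: 'income', 'pc' and 'gpa' are hypothetical variables.
set.seed(1)
n <- 10000
income <- rnorm(n)                            # unobserved family income (standardised)
pc     <- as.numeric(income + rnorm(n) > 0)   # richer families own a PC more often
gpa    <- 2.5 + 0.0 * pc + 0.3 * income + rnorm(n, sd = 0.5)  # true PC effect is zero

# Short regression: income stays in the error term, so corr(pc, u) != 0
b_short <- unname(coef(lm(gpa ~ pc))["pc"])
# Long regression: controlling for income removes the bias
b_long  <- unname(coef(lm(gpa ~ pc + income))["pc"])

print(c(short = b_short, long = b_long))
```

The short regression attributes the income effect to PC ownership, while the long regression recovers an estimate close to zero.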
2. Higher parental income (PI) makes it easier to afford a PC.
A good instrument must satisfy two criteria:
The instrument must be relevant, such that corr(PI, PC) != 0.
The instrument must be exogenous, such that it affects the outcome \(GPA\) only through \(PC\) ownership: corr(PI, u) = 0.
Criterion 1 is almost surely fulfilled: corr(PI, PC) != 0, so PI is correlated with the endogenous variable.
Criterion 2 is not:
corr(SOCIOECONOMIC STATUS, PI) != 0 => corr(PI, u) != 0
CONCLUSION: PI is probably not a good instrument, as it affects GPA directly and not only through PC. A violation of exogeneity.
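The exogeneity failure can be made explicit. A sketch of the standard probability limit of the simple IV estimator, assuming a single instrument (PI) and writing \(\beta_1\) for the coefficient on \(PC\) in the GPA equation (a symbol not used in the notes above):

\[ \operatorname{plim} \hat{\beta}_{1,IV} = \frac{cov(PI, GPA)}{cov(PI, PC)} = \beta_1 + \frac{cov(PI, u)}{cov(PI, PC)} \]

The second term vanishes only if corr(PI, u) = 0, so any part of socioeconomic status that is left in \(u\) and correlated with PI makes the IV estimate inconsistent.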
3.
CRITERION 1 (RELEVANCE): First, we assume that grants make computers more affordable.
Thus, we have a relevant variable:
corr(grant, PC) > 0.
So, how do we construct an IV for \(PC\)? Define a dummy, GRANT, equal to one if the student received a grant. Relevance then means that
the probability of owning a PC should be significantly higher for students receiving grants.
If grants are relevant, we can proceed with criterion 2.
CRITERION 2 (EXOGENEITY): If grants are randomly assigned, corr(grant, u) = 0. The grant does not affect parents' income or a student's GPA directly, only through PC ownership.
Conclusion: If, say, the university gave grant priority to low-income students, grant would be negatively correlated with u, and IV would be inconsistent. BIAS.
But if the assumption holds, we get a trustworthy estimate. Beware that it is not easy to come up with an instrument that is completely exogenous. Even with a valid instrument, IV is biased in small samples, but the estimator is consistent, so the bias shrinks as the sample grows.
We can therefore set up a 2-stage regression with Grant as our instrument.
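A minimal sketch of such a two-stage regression, run on simulated data. Everything here is an assumption made for illustration: the variable names, the random assignment of `grant`, and a true PC effect of 0.2.

```r
# Simulated data only: all variable names and effect sizes are hypothetical.
set.seed(2)
n <- 20000
income <- rnorm(n)                                   # unobserved, part of u
grant  <- rbinom(n, 1, 0.5)                          # randomly assigned instrument
pc     <- as.numeric(0.8 * grant + income + rnorm(n) > 0.4)
gpa    <- 2.5 + 0.2 * pc + 0.3 * income + rnorm(n, sd = 0.5)  # true effect is 0.2

# Stage 1: predict PC ownership from the instrument
stage1 <- lm(pc ~ grant)
pc_hat <- fitted(stage1)
# Stage 2: regress GPA on predicted PC ownership
b_iv  <- unname(coef(lm(gpa ~ pc_hat))["pc_hat"])
# Naive OLS for comparison (biased upwards by income)
b_ols <- unname(coef(lm(gpa ~ pc))["pc"])

print(c(ols = b_ols, iv = b_iv))
```

The second-stage coefficient is the 2SLS point estimate; its standard errors from `lm()` are not valid 2SLS standard errors, so inference should rely on a dedicated IV routine.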
We are asked to design an experiment to find out how well girls from girls' high schools do in math.
1. score <– girlshs:
faminc, meduc, feduc are variables that would affect both score and girlshs. Other candidates: SOCIOECONOMIC STATUS, RELIGION, DEMOGRAPHIC FACTORS, household size, (ABILITY - NOT MEASURABLE)?
2. \[ score = \beta_0 + \beta_1 girlshs + \beta_2 faminc + \beta_3 meduc + \beta_4 feduc + u \]
Performing an OLS regression with these variables would bias the estimate on girlshs.
3. Note that factors not included in the equation implicitly lie in the IDIOSYNCRATIC error term \(u\) and potentially BIAS our estimates, for all we know.
Parental support and motivation (psm) are likely to affect both score and girlshs, since
motivating parents make you do better in math,
and motivating parents could also be clustered in the same neighbourhood (SOCIOECONOMIC STATUS)?
–> if true, we have an endogeneity problem. Exogeneity (criterion 2) is violated.
4. What if we count the number of girls' high schools within a 20-mile radius of a girl's home, \(numgs\)? Valid IV?
The relevance assumption has a good chance of holding: living in a place with many girls' high schools is probably positively correlated with attending one.
Whether the exogeneity assumption holds depends on whether \(numgs\) is correlated with the error term \(u\),
e.g., parental support and motivation must not influence numgs (if psm affects \(score\)).
5. The negative coefficient is counterintuitive, and thus concerning.
Before continuing, we would need to clarify what is causing the negative coefficient.
This applies to any counterintuitive correlation.
library(tidyverse)
## ── Attaching packages ────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.3
## ✓ tibble 3.0.3 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.4.0
## Warning: package 'ggplot2' was built under R version 3.6.2
## Warning: package 'tibble' was built under R version 3.6.2
## Warning: package 'tidyr' was built under R version 3.6.2
## Warning: package 'dplyr' was built under R version 3.6.2
## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(texreg)
## Version: 1.36.23
## Date: 2017-03-03
## Author: Philip Leifeld (University of Glasgow)
##
## Please cite the JSS article in your publications -- see citation("texreg").
##
## Attaching package: 'texreg'
## The following object is masked from 'package:tidyr':
##
## extract
library(AER) #applied econometrics tools
## Loading required package: car
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
## Loading required package: lmtest
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
## Loading required package: survival
## Warning: package 'survival' was built under R version 3.6.2
library(sandwich)
library(estimatr)
load("fertil2.RData")
1.
the_formula1 = children ~ educ + age + agesq
mod1 = lm(the_formula1, data = data)
mod1 %>% screenreg(digits = 6)
##
## ============================
## Model 1
## ----------------------------
## (Intercept) -4.138307 ***
## (0.240594)
## educ -0.090575 ***
## (0.005921)
## age 0.332449 ***
## (0.016549)
## agesq -0.002631 ***
## (0.000273)
## ----------------------------
## R^2 0.568724
## Adj. R^2 0.568427
## Num. obs. 4361
## RMSE 1.459746
## ============================
## *** p < 0.001, ** p < 0.01, * p < 0.05
educ shows a QUANTITY-QUALITY TRADE-OFF: the ceteris paribus effect of an additional year of education is 0.09 fewer children.
Or 9 fewer children if 100 women each receive one additional year of education.
2. An instrument uncorrelated with \(u\) is an exogenous instrument. To be a good instrument it must also be relevant, i.e. correlated with educ, which we check in the first stage.
frst_st_form = educ ~ frsthalf + age + agesq
frst_mod = lm(frst_st_form, data = data)
frst_mod %>% screenreg(digits = 6, include.f = T)
##
## ============================
## Model 1
## ----------------------------
## (Intercept) 9.692864 ***
## (0.598069)
## frsthalf -0.852285 ***
## (0.112830)
## age -0.107950 *
## (0.042040)
## agesq -0.000506
## (0.000693)
## ----------------------------
## R^2 0.107651
## Adj. R^2 0.107037
## Num. obs. 4361
## F statistic 175.206760
## RMSE 3.710957
## ============================
## *** p < 0.001, ** p < 0.01, * p < 0.05
In general, to test for weak instruments, we test the joint significance of our instruments' coefficients in the first stage with an F-test. The RULE OF THUMB is that an F-statistic above 10 is fine. However, this is not a theorem, so you want it to be well above 10 to be safe.
If you only have one instrument, this F-statistic equals the square of the t-statistic of the instrument's coefficient in the first stage. Note that the F statistic of 175.2 printed above is the overall F for all regressors (frsthalf, age and agesq), not the test on the instrument alone. For frsthalf, t = -0.852285 / 0.112830 ≈ -7.55, so F ≈ 57.
Conclusion: The F-statistic for the instrument is well above 10, thus we conclude that our instrument frsthalf is relevant.
https://amstat.tandfonline.com/doi/abs/10.1198/073500102288618658#.YH31oi0YlQI
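The relation between the instrument's F statistic and its t statistic can be checked directly. A minimal sketch on simulated data, where `z` (instrument), `w` (exogenous control) and `x` (endogenous regressor) are hypothetical names:

```r
set.seed(3)
n <- 1000
w <- rnorm(n)                          # exogenous control
z <- rbinom(n, 1, 0.5)                 # single instrument
x <- 1 - 0.8 * z + 0.5 * w + rnorm(n)  # endogenous regressor (first stage)

unrestricted <- lm(x ~ z + w)          # first stage with the instrument
restricted   <- lm(x ~ w)              # first stage without the instrument

# F test for excluding the instrument (not the overall regression F)
f_instrument <- anova(restricted, unrestricted)$F[2]
t_instrument <- summary(unrestricted)$coefficients["z", "t value"]

print(c(F = f_instrument, t_squared = t_instrument^2))
```

Comparing the restricted and unrestricted first stages isolates the instrument, whereas the F statistic printed by a regression table also tests the exogenous controls.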
3.
# USING PREDICTIONS FROM OUR 1st STAGE REGRESSION
pred_educ = predict(frst_mod)
data$pred_educ = pred_educ
# 2nd STAGE REGRESSION
the_formula_iv1 = children ~ pred_educ + age + agesq
mod_iv1 = lm(the_formula_iv1, data = data)
list(mod1, mod_iv1) %>% screenreg()
##
## =====================================
## Model 1 Model 2
## -------------------------------------
## (Intercept) -4.14 *** -3.39 ***
## (0.24) (0.55)
## educ -0.09 ***
## (0.01)
## age 0.33 *** 0.32 ***
## (0.02) (0.02)
## agesq -0.00 *** -0.00 ***
## (0.00) (0.00)
## pred_educ -0.17 **
## (0.05)
## -------------------------------------
## R^2 0.57 0.55
## Adj. R^2 0.57 0.55
## Num. obs. 4361 4361
## RMSE 1.46 1.50
## =====================================
## *** p < 0.001, ** p < 0.01, * p < 0.05
# ivreg() RUNS BOTH STAGES IN ONE CALL AND REPORTS CORRECT 2SLS STANDARD ERRORS
the_formula_iv2 = children ~ educ + age + agesq | frsthalf + age + agesq
mod_iv2 = ivreg(the_formula_iv2, data = data)
list(mod1, mod_iv1, mod_iv2) %>% screenreg()
##
## ==================================================
## Model 1 Model 2 Model 3
## --------------------------------------------------
## (Intercept) -4.14 *** -3.39 *** -3.39 ***
## (0.24) (0.55) (0.55)
## educ -0.09 *** -0.17 **
## (0.01) (0.05)
## age 0.33 *** 0.32 *** 0.32 ***
## (0.02) (0.02) (0.02)
## agesq -0.00 *** -0.00 *** -0.00 ***
## (0.00) (0.00) (0.00)
## pred_educ -0.17 **
## (0.05)
## --------------------------------------------------
## R^2 0.57 0.55 0.55
## Adj. R^2 0.57 0.55 0.55
## Num. obs. 4361 4361 4361
## RMSE 1.46 1.50 1.49
## ==================================================
## *** p < 0.001, ** p < 0.01, * p < 0.05
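With the IV specification, the estimate on educ is 0.08 larger in absolute value than the OLS estimate (-0.17 vs. -0.09). So the difference is 8 fewer children born for every 100 women who received an additional year of education.
Bias-variance trade-off: IV removes the endogeneity bias, but its standard error (0.05) is much larger than that of OLS (0.01).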
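Running the second stage by hand gives the right coefficient but not the right standard error, because `lm()` computes residuals from the predicted rather than the observed regressor. A sketch of the correction in base R, on simulated data with hypothetical names (`z`, `v`, `x`, `y`):

```r
set.seed(4)
n <- 20000
z <- rnorm(n)                       # instrument
v <- rnorm(n)                       # unobserved confounder
x <- 0.5 * z + v + rnorm(n)         # endogenous regressor
y <- 1 - 0.2 * x + v + rnorm(n)     # true coefficient on x is -0.2

x_hat  <- fitted(lm(x ~ z))         # stage 1
stage2 <- lm(y ~ x_hat)             # stage 2: correct coefficients
b      <- unname(coef(stage2))

# Stage 2 uses residuals based on x_hat; 2SLS needs residuals based on x
res_naive <- y - b[1] - b[2] * x_hat
res_2sls  <- y - b[1] - b[2] * x
se_naive  <- unname(sqrt(diag(vcov(stage2)))["x_hat"])
se_2sls   <- se_naive * sqrt(sum(res_2sls^2) / sum(res_naive^2))

print(c(coef = b[2], se_naive = se_naive, se_2sls = se_2sls))
```

The rescaling works because both variance estimates share the same \((\hat{X}'\hat{X})^{-1}\) term and differ only in the residual variance.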
4.
the_formula_iv3 = children ~ educ + age + agesq + electric + tv + bicycle | frsthalf + age + agesq + electric + tv + bicycle
mod_iv3 = ivreg(the_formula_iv3, data = data)
list(mod_iv2, mod_iv3) %>% screenreg(digits = 6)
##
## =============================================
## Model 1 Model 2
## ---------------------------------------------
## (Intercept) -3.387805 *** -3.591332 ***
## (0.548150) (0.645089)
## educ -0.171499 ** -0.163981 *
## (0.053180) (0.065527)
## age 0.323605 *** 0.328145 ***
## (0.017860) (0.019059)
## agesq -0.002672 *** -0.002722 ***
## (0.000280) (0.000277)
## electric -0.106531
## (0.165965)
## tv -0.002555
## (0.209230)
## bicycle 0.332072 ***
## (0.051526)
## ---------------------------------------------
## R^2 0.550233 0.557662
## Adj. R^2 0.549923 0.557052
## Num. obs. 4361 4356
## RMSE 1.490712 1.478886
## =============================================
## *** p < 0.001, ** p < 0.01, * p < 0.05
The coefficient on tv is negative because
a tv is a luxury good in Botswana in 1988, which implies that owning a tv is positively correlated with having more money, and
having more money implies that people on average have fewer children.
This effect is far from being significant, however. Notice how much greater the standard error (0.209) is than the estimated coefficient (-0.003).
ivreg (AER) vs. iv_robust (estimatr):
ivreg reports standard errors that assume homoskedasticity; iv_robust reports heteroskedasticity-robust standard errors (more correct).
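To see why the choice of standard errors matters, here is a sketch that computes HC0 standard errors by hand for an OLS regression and compares them with the classical ones. The data are simulated with an error variance that grows with |x|, and all names are hypothetical. The IV case follows the same sandwich idea, but is not shown here.

```r
set.seed(5)
n <- 2000
x <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n, sd = abs(x))   # error variance grows with |x|

fit <- lm(y ~ x)
X   <- model.matrix(fit)
e   <- residuals(fit)

# Classical standard errors (assume a constant error variance)
se_classical <- sqrt(diag(vcov(fit)))

# HC0 sandwich: (X'X)^-1 X' diag(e^2) X (X'X)^-1
bread  <- solve(crossprod(X))
meat   <- crossprod(X * e)           # equals t(X) %*% diag(e^2) %*% X
se_hc0 <- sqrt(diag(bread %*% meat %*% bread))

print(rbind(classical = se_classical, HC0 = se_hc0))
```

In this design the classical standard error for the slope is clearly too small, which is the situation the robust option guards against.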
load("htv.RData")
1.
the_formula1 = lwage ~ educ
mod1 = lm(the_formula1, data = data)
coeftest1 = coeftest(mod1, vcov. = vcovHC(mod1, type = "HC0"))
screenreg(coeftest1)
##
## ======================
## Model 1
## ----------------------
## (Intercept) 1.09 ***
## (0.10)
## educ 0.10 ***
## (0.01)
## ======================
## *** p < 0.001, ** p < 0.01, * p < 0.05
# 95 percent confidence interval
confint(coeftest1)
## 2.5 % 97.5 %
## (Intercept) 0.89793993 1.286699
## educ 0.08666259 0.116060
2.
the_formula2 = educ ~ ctuit
mod2 = lm(the_formula2, data = data)
screenreg(mod2, include.f = T)
##
## ========================
## Model 1
## ------------------------
## (Intercept) 13.04 ***
## (0.07)
## ctuit -0.05
## (0.08)
## ------------------------
## R^2 0.00
## Adj. R^2 -0.00
## Num. obs. 1230
## F statistic 0.35
## RMSE 2.35
## ========================
## *** p < 0.001, ** p < 0.01, * p < 0.05
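The coefficient on ctuit is insignificant and the F statistic (0.35) is far below 10, so in this simple regression ctuit is not a relevant instrument.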
3.
# south and south18 are omitted as reference categories to avoid perfect multicollinearity
the_formula3 = lwage ~ educ + exper + expersq + ne + nc + west + urban + ne18 + nc18 + west18 + urban18
mod3 = lm(the_formula3, data = data)
coeftest3 = coeftest(mod3, vcov. = vcovHC(mod3, type = "HC0"))
list(coeftest1, coeftest3) %>% screenreg()
##
## =================================
## Model 1 Model 2
## ---------------------------------
## (Intercept) 1.09 *** -0.51 *
## (0.10) (0.25)
## educ 0.10 *** 0.14 ***
## (0.01) (0.01)
## exper 0.11 ***
## (0.03)
## expersq -0.00 **
## (0.00)
## ne -0.02
## (0.09)
## nc -0.02
## (0.07)
## west 0.02
## (0.09)
## urban 0.20 ***
## (0.04)
## ne18 0.16
## (0.09)
## nc18 0.01
## (0.07)
## west18 -0.03
## (0.10)
## urban18 0.13 *
## (0.05)
## =================================
## *** p < 0.001, ** p < 0.01, * p < 0.05
When we control for experience, region and urban status, the estimated effect of education increases from 0.10 to 0.14.
The result is robust to regional variation. We can thus say that it is not regional variation that drives the effect.
Receiving another year of education increases the wage by approximately 14 percent.
With approximately 95% confidence the effect is between 12 percent and 16 percent (approx. 2 x se).
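The arithmetic behind this interpretation, as a short sketch. The inputs are the rounded values from the table above, so the results are only as precise as that rounding.

```r
b_educ  <- 0.14   # coefficient on educ from Model 2 (as rounded in the table)
se_educ <- 0.01   # robust standard error from Model 2 (as rounded in the table)

approx_pct <- 100 * b_educ                 # usual log-level approximation
exact_pct  <- 100 * (exp(b_educ) - 1)      # exact percentage change
ci_approx  <- b_educ + c(-2, 2) * se_educ  # rough 95% interval: estimate +/- 2 se

print(round(c(approx = approx_pct, exact = exact_pct), 2))
print(ci_approx)
```

Because lwage is in logs, the exact percentage change is about 15.0 percent, slightly above the 14 percent approximation.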
4. THE STRUCTURAL FORM is what your economic theory says the economic relations between the variables are (like consumption and income in the linked Keynesian example), e.g., GDP = CONSUMPTION + SAVINGS + NET EXPORT.
However, getting the estimates of the model coefficients requires jumping through multiple hoops to make sure these estimates are not biased because of endogeneity problems when one endogenous variable is regressed on another. So structural form is good for intuitive explanation, and terrible to work with when the numbers come in.
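THE REDUCED FORM complements the structural form. The reduced form solves for the endogenous variables (if at all possible), expressing each as a function of the exogenous variables only. Here that means regressing lwage on the instrument ctuit and the exogenous controls.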
the_formula4 = lwage ~ ctuit + exper + expersq + ne + nc + west + urban + ne18 + nc18 + west18 + urban18
mod4 = lm(the_formula4, data = data)
coeftest4 = coeftest(mod4, vcov. = vcovHC(mod4, type = "HC0"))
list(coeftest1, coeftest3, coeftest4) %>% screenreg()
##
## ============================================
## Model 1 Model 2 Model 3
## --------------------------------------------
## (Intercept) 1.09 *** -0.51 * 2.42 ***
## (0.10) (0.25) (0.15)
## educ 0.10 *** 0.14 ***
## (0.01) (0.01)
## exper 0.11 *** -0.01
## (0.03) (0.03)
## expersq -0.00 ** -0.00
## (0.00) (0.00)
## ne -0.02 -0.06
## (0.09) (0.10)
## nc -0.02 -0.03
## (0.07) (0.07)
## west 0.02 0.10
## (0.09) (0.10)
## urban 0.20 *** 0.20 ***
## (0.04) (0.04)
## ne18 0.16 0.24 *
## (0.09) (0.10)
## nc18 0.01 0.04
## (0.07) (0.08)
## west18 -0.03 -0.09
## (0.10) (0.10)
## urban18 0.13 * -0.01
## (0.05) (0.06)
## ctuit -0.04 *
## (0.02)
## ============================================
## *** p < 0.001, ** p < 0.01, * p < 0.05
In the reduced form, ctuit is significant only at the 5 percent level (-0.04, se 0.02), so we are less confident in this effect.
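With one instrument and one endogenous regressor, the reduced form and the first stage together pin down the IV estimate: it is the ratio of the two coefficients on the instrument. A sketch on simulated data, with hypothetical names and coefficients:

```r
set.seed(6)
n <- 5000
w <- rnorm(n)                               # exogenous control
z <- rnorm(n)                               # instrument
v <- rnorm(n)                               # unobserved confounder
x <- -0.4 * z + 0.3 * w + v + rnorm(n)      # endogenous regressor
y <- 1 + 0.25 * x + 0.2 * w + v + rnorm(n)  # structural equation

first_stage  <- coef(lm(x ~ z + w))["z"]    # effect of z on x
reduced_form <- coef(lm(y ~ z + w))["z"]    # effect of z on y
ratio        <- unname(reduced_form / first_stage)

# 2SLS by hand for comparison
x_hat  <- fitted(lm(x ~ z + w))
b_2sls <- unname(coef(lm(y ~ x_hat + w))["x_hat"])

print(c(ratio = ratio, two_sls = b_2sls))
```

Read against the tables in these notes, a reduced-form coefficient of -0.04 and an IV estimate of 0.25 would imply a first-stage coefficient on ctuit of roughly -0.16, conditional on the controls. That figure is only indicative because both inputs are rounded.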
5.
the_formula5 = lwage ~ educ + exper + expersq + ne + nc + west + urban + ne18 + nc18 + west18 + urban18 | ctuit + exper + expersq + ne + nc + west + urban + ne18 + nc18 + west18 + urban18
mod5 = iv_robust(the_formula5, data = data)
list(coeftest1, coeftest3, coeftest4, mod5) %>% screenreg()
##
## ============================================================
## Model 1 Model 2 Model 3 Model 4
## ------------------------------------------------------------
## (Intercept) 1.09 *** -0.51 * 2.42 *** -2.89
## (0.10) (0.25) (0.15) [-7.91; 2.13]
## educ 0.10 *** 0.14 *** 0.25 *
## (0.01) (0.01) [ 0.01; 0.49]
## exper 0.11 *** -0.01 0.21 *
## (0.03) (0.03) [ 0.00; 0.42]
## expersq -0.00 ** -0.00 -0.00 *
## (0.00) (0.00) [-0.01; -0.00]
## ne -0.02 -0.06 0.03
## (0.09) (0.10) [-0.21; 0.27]
## nc -0.02 -0.03 0.00
## (0.07) (0.07) [-0.17; 0.18]
## west 0.02 0.10 -0.05
## (0.09) (0.10) [-0.30; 0.19]
## urban 0.20 *** 0.20 *** 0.21 *
## (0.04) (0.04) [ 0.13; 0.30]
## ne18 0.16 0.24 * 0.08
## (0.09) (0.10) [-0.20; 0.36]
## nc18 0.01 0.04 -0.02
## (0.07) (0.08) [-0.21; 0.17]
## west18 -0.03 -0.09 0.02
## (0.10) (0.10) [-0.21; 0.26]
## urban18 0.13 * -0.01 0.24
## (0.05) (0.06) [-0.02; 0.49]
## ctuit -0.04 *
## (0.02)
## ------------------------------------------------------------
## R^2 0.12
## Adj. R^2 0.11
## Num. obs. 1230
## RMSE 0.56
## ============================================================
## *** p < 0.001, ** p < 0.01, * p < 0.05 (or 0 outside the confidence interval).
With IV we get a very wide confidence interval for educ: [0.01; 0.49].
6. Because our instrument is weak, the estimate is far from convincing. A small sample is another fundamental problem for IV: the estimator is only consistent, so with little data the estimates can be badly biased and imprecise.
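The weak-instrument problem can be made concrete with a small Monte Carlo experiment. This is a sketch under assumed values: the sample size, the instrument strengths and the true effect of 0.1 are all hypothetical.

```r
set.seed(7)
iv_estimate <- function(n, strength) {
  z <- rnorm(n)                        # instrument
  v <- rnorm(n)                        # unobserved confounder
  x <- strength * z + v + rnorm(n)     # endogenous regressor
  y <- 0.1 * x + v + rnorm(n)          # true effect is 0.1
  cov(z, y) / cov(z, x)                # simple IV estimator
}

reps   <- 2000
strong <- replicate(reps, iv_estimate(n = 500, strength = 0.50))
weak   <- replicate(reps, iv_estimate(n = 500, strength = 0.05))

# Medians and interquartile ranges (robust to the heavy tails of weak-IV estimates)
print(rbind(strong = c(median = median(strong), iqr = IQR(strong)),
            weak   = c(median = median(weak),   iqr = IQR(weak))))
```

Medians and interquartile ranges are used instead of means and standard deviations because the weak-instrument estimates occasionally take extreme values.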