Class 4
In Data Analytics there are often three types of answers
Descriptive - Aim is to aggregate and describe your current data (a snapshot)
Tables, Charts, Maps, Tableau
Predictive - Aim is to predict the dependent variable. How will it change in the near future?
Prescriptive - Aim is to explain the dependent variable. What is the effect of your advertising campaign? Why are workers leaving your firm?
In these notes we will cover:
Who wants to sign up to be randomly assigned in one of these experiments?
Randomized control trials can be very expensive!
“one study found 28 Phase III RCTs funded by the National Institute of Neurological Disorders and Stroke prior to 2000 with a total cost of US \(\$\) 335 million, for a mean cost of US \(\$\) 12 million per RCT.”
The most famous case is the Google Flu Trends (GFT) Algorithm.
GFT was meant to be an early warning system for flu season, at times outperforming the CDC.
But, then it went bad—and failed spectacularly—missing at the peak of the 2013 flu season by 140 percent. GFT went quickly from the poster child of big data to the poster child of the foibles of big data – of big data hubris.
As Tim Harford concluded in his article:
“Big data” has arrived, but big insights have not. The challenge now is to solve new problems and gain new answers – without making the same old statistical mistakes on a grander scale than ever.
For example, I can run a supervised machine learning algorithm that shows the computer a series of cats and flowers.
The program does a good job, predicting cats and flowers with 98% accuracy.
Then I show it a picture of a dog. What happens?
Two types of causal questions (Gelman and Imbens 2013):
We are motivated by "why" questions but, when conducting our analysis, we tend to proceed by addressing "what if" questions.
Examples:
Potentially confusing examples:
In the physical sciences:
In the social sciences:
Suppose we are interested in the effect of health insurance on a person’s health
Let’s think of a treatment (getting insurance) for individual \(i\) as a binary random variable \(D_i \in \{0, 1\}\), with potential outcomes (counterfactuals) \(Y_{0i}\) and \(Y_{1i}\):
\(Y_{1i}\) = A measure of person \(i\)’s health given they have insurance (\(D_i=1\)).
\(Y_{0i}\) = A measure of person \(i\)’s health given they do not have insurance (\(D_i=0\)).
The individual treatment effect is \(Y_{1i} - Y_{0i}\).
Unfortunately, for individual \(i\), we only observe \(Y_{1i}\) if \(D_i = 1\) and \(Y_{0i}\) if \(D_i = 0\).
For any individual \(i\), we only observe \(Y_i=D_iY_{1i}+(1-D_i)Y_{0i}\)
The problem is we cannot observe you as both having and not having insurance.
The solution is to look for the AVERAGE TREATMENT EFFECT (ATE): \[E[Y_{1i}] - E[Y_{0i}]\]
A naive comparison of averages does not tell us what we want to know: \[E[Y_{1i}|D_{i} = 1] - E[Y_{0i}|D_{i} = 0] \] \[ =\underbrace{E[Y_{1i}|D_{i} = 1] - E[Y_{0i}|D_{i} = 1]}_{\text{ATT}}+\underbrace{E[Y_{0i}|D_{i} = 1] - E[Y_{0i}|D_{i} = 0]}_{\text{Sample selection bias}} \]
The first term is the average treatment effect on the treated (ATT). The second term is the difference in untreated outcomes between those who received the treatment and those who did not.
The average treatment effect (ATE) and the average treatment effect on the treated (ATT) need not be the same, and the distinction is sometimes important.
They are the same only if the treatment effect is homogeneous across groups: \[E[Y_{1i} - Y_{0i}|D_i = 1] = E[Y_{1i} - Y_{0i}|D_i = 0] = E[Y_{1i} - Y_{0i}]\]
Random assignment of the treatment guarantees this.
We want to understand what would have happened to the treated in the absence of treatment and thus overcome the selection problem…
Solution : Random assignment
Random assignment makes \(D_i\) independent of potential outcomes, hence:
the selection effect is zeroed out and
the treatment effect on the treated is equal to the ATE.
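The decomposition and the role of random assignment can be checked numerically. The following is a minimal sketch on simulated data, not part of the original notes: the potential outcomes, the constant effect of 5, and the selection rule (healthier people are more likely to insure) are all assumptions made for illustration.

```r
set.seed(1)
n <- 100000

# Hypothetical potential outcomes: health without and with insurance.
y0 <- rnorm(n, mean = 50, sd = 10)
y1 <- y0 + 5                       # assumed constant treatment effect of 5

# Case 1: self-selection. People with better baseline health are more
# likely to take the treatment (an assumed selection rule).
d_sel <- as.numeric(y0 + rnorm(n, sd = 10) > 50)
naive_sel <- mean(y1[d_sel == 1]) - mean(y0[d_sel == 0])
att_sel   <- mean(y1[d_sel == 1] - y0[d_sel == 1])
bias_sel  <- mean(y0[d_sel == 1]) - mean(y0[d_sel == 0])

# Case 2: random assignment, independent of the potential outcomes.
d_rct <- rbinom(n, size = 1, prob = 0.5)
naive_rct <- mean(y1[d_rct == 1]) - mean(y0[d_rct == 0])
bias_rct  <- mean(y0[d_rct == 1]) - mean(y0[d_rct == 0])

round(c(naive_sel = naive_sel, att = att_sel, bias = bias_sel), 2)
round(c(naive_rct = naive_rct, bias_rct = bias_rct), 2)
```

Under self-selection the naive difference equals the ATT plus a large selection term. Under random assignment the selection term is close to zero, so the naive difference is close to the assumed effect.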
A randomized experiment is consciously designed and implemented by social scientists. It entails the conscious use of treatment and control groups with random assignment.
Tech Companies like Google, Facebook, and Amazon are positioned to use experiments.
They embraced the idea of “data-driven management”, where the results of experiments are taken over the advice of HiPPOs (the Highest Paid Person’s Opinion).
The A/B Test: Inside the Technology That’s Changing the Rules of Business, Wired Magazine, April 2012
In Praise of Data-Driven Management (AKA “Why You Should be Skeptical of HiPPO’s”)
Experiments provide a very transparent and simple empirical strategy, and they solve the selection bias problem. However, there are a number of potential problems:
The RAND Health Insurance Experiment was conducted between 1974 and 1982
Randomly assigned thousands of non-elderly individuals and families to different insurance plan designs
Plans ranged from free care to $1,000 deductible (basically) with variations in between
Comparable deductible today is at least $4,000
Studied effects on health spending and health outcomes
plantype | n | pct |
---|---|---|
Catastrophic | 759 | 0.1918120 |
Deductible | 881 | 0.2226434 |
Coinsurance | 1022 | 0.2582765 |
Free | 1295 | 0.3272681 |
variable | Mean | Std. Dev. |
---|---|---|
age | 32.36 | 12.92 |
blackhisp | 0.17 | 0.38 |
educper | 12.10 | 2.88 |
female | 0.56 | 0.50 |
ghindx | 70.86 | 14.91 |
hosp | 0.12 | 0.32 |
income1cpi | 31603.21 | 18148.25 |
mhix | 75.50 | 14.75 |
response | (Intercept) | Coinsurance | Deductible | Free |
---|---|---|---|---|
age | 32.4 (0.485) | 0.966 (0.655) | 0.561 (0.676) | 0.435 (0.614) |
blackhisp | 0.172 (0.0199) | -0.0269 (0.025) | -0.0188 (0.0266) | -0.0281 (0.0245) |
educper | 12.1 (0.14) | -0.0613 (0.186) | -0.157 (0.191) | -0.263 (0.183) |
female | 0.56 (0.0118) | -0.0247 (0.0153) | -0.0231 (0.016) | -0.0379 (0.015) |
ghindx | 70.9 (0.694) | 0.211 (0.922) | -1.44 (0.952) | -1.31 (0.872) |
hosp | 0.115 (0.0117) | -0.00249 (0.0152) | 0.00449 (0.016) | 0.00117 (0.0146) |
income1cpi | 31,603 (1,073) | 970 (1,391) | -2,104 (1,386) | -976 (1,346) |
mhix | 75.5 (0.696) | 1.07 (0.872) | 0.454 (0.911) | 0.433 (0.826) |
response | (Intercept) | Cost Sharing | Deductible | Free |
---|---|---|---|---|
ftf | 2.78 (0.178) | 0.481 (0.24) | 0.193 (0.247) | 1.66 (0.248) |
inpdol_inf | 388 (44.9) | 92.5 (72.8) | 72.2 (68.6) | 116 (59.8) |
out_inf | 248 (14.8) | 59.8 (20.7) | 41.8 (20.8) | 169 (19.9) |
tot_inf | 636 (54.5) | 152 (84.6) | 114 (79.1) | 285 (72.4) |
totadm | 0.0991 (0.00785) | 0.0023 (0.0108) | 0.0159 (0.0109) | 0.0288 (0.0105) |
Experiments and Potential Outcomes MM, Chapter 1
J. Angrist, D. Lang, and P. Oreopoulos, “Incentives and Services for College Achievement: Evidence from a Randomized Trial”, American Economic Journal: Applied Economics, Jan. 2009.
A. Aron-Dine, L. Einav, and A. Finkelstein, “The RAND Health Insurance Experiment Three Decades Later”, J. of Economic Perspectives 27 (Winter 2013), 197-222.
R.H. Brook, et al., “Does Free Care Improve Adults’ Health?”, New England J. of Medicine 309 (Dec. 8, 1983), 1426-1434.
S. Taubman, et al., “Medicaid Increases Emergency-Department Use: Evidence from Oregon’s Health Insurance Experiment”, Science, Jan 2, 2014.
We saw previously that RCTs are the ideal empirical study.
When an RCT is unavailable, then provided we observe enough covariates to eliminate all forms of selection and omitted variable bias, we can use regression to estimate accurate causal effects.
But sometimes we find ourselves in a situation where an RCT is not feasible, and it is impossible to observe all the important ways in which the treated and control units differ.
In this case, three additional empirical strategies are typically used:
- Difference in Differences
- Instrumental Variables
- Regression Discontinuity
Today, we will look at dif-in-dif.
Recall the potential outcome framework. When we estimate a treatment-control contrast, what we get is: \[E(Y|D=1)-E(Y|D=0)=\delta+E(Y_0|D=1)-E(Y_0|D=0)\] where \(\delta\) is the average treatment effect (strictly, the effect on the treated when effects vary across units).
This equation says that the average of the treated group minus the average of the control group is the average treatment effect plus selection bias (AKA omitted variable bias in the regression framework).
We will now explore another way to get rid of the selection bias.
Suppose we have data on the outcome variable for our treatment and control group from the previous period. Call this \(Y_{pre}\).
Now suppose further that: \[E(Y_0|D=1)-E(Y_{pre}|D=1)=E(Y_0|D=0)-E(Y_{pre}|D=0)\] This assumption is known as the parallel trends assumption and is crucial for getting compelling estimates in the dif-in-dif framework.
What does this assumption mean?
It says that if the treatment group had never been treated, the average change in the outcome variable would have been identical to the average change in the outcome variable for the control group.
How plausible this assumption is depends upon the given study you are examining.
For now, let’s assume it is true, and see how this can help us kill the selection bias.
Suppose instead of just comparing the average treatment outcome to the average control outcome, we use the pre-period data to compare the average change in the treatment group to the average change in the control group.
That is, we calculate: \[E(Y|D=1)-E(Y_{pre}|D=1)-[E(Y|D=0)-E(Y_{pre}|D=0)]\] This is a difference in difference or dif-in-dif estimate.
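To see why this removes the selection bias, substitute the potential outcomes into the estimate and apply the parallel trends assumption. This is a sketch that assumes, as the setup implies, that nobody is treated in the pre-period, so that \(E(Y|D=1)=\delta+E(Y_0|D=1)\) and \(E(Y|D=0)=E(Y_0|D=0)\):

\[\begin{align*}
&E(Y|D=1)-E(Y_{pre}|D=1)-[E(Y|D=0)-E(Y_{pre}|D=0)] \\
&= \delta + [E(Y_0|D=1)-E(Y_{pre}|D=1)] - [E(Y_0|D=0)-E(Y_{pre}|D=0)] \\
&= \delta
\end{align*}\]

The last step uses parallel trends: the two bracketed changes are equal, so they cancel and only \(\delta\) remains.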
\[\begin{align*}Y&= \beta_0 + \beta_1*[Time] + \beta_2*[Intervention] \\
&+ \beta_3*[Time*Intervention] + \beta_4*[Covariates]+\epsilon\end{align*}\]
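In this regression, \(\beta_3\) is the dif-in-dif estimate. A minimal sketch on simulated data illustrates that, without covariates, the interaction coefficient reproduces the difference in differences of the four cell means. The variable names (`treat`, `post`), the effect size of 2, and the data-generating process are assumptions made for illustration.

```r
set.seed(42)
n <- 4000

# Hypothetical two-group, two-period data (all names are illustrative).
treat <- rbinom(n, 1, 0.5)          # 1 = intervention group
post  <- rbinom(n, 1, 0.5)          # 1 = observed after the intervention
true_effect <- 2                    # assumed treatment effect

# Groups differ in level (3) and share a common time trend (1),
# so parallel trends holds by construction.
y <- 10 + 1 * post + 3 * treat + true_effect * post * treat + rnorm(n)

# Difference in differences from the four cell means.
cell <- function(t, p) mean(y[treat == t & post == p])
did_means <- (cell(1, 1) - cell(1, 0)) - (cell(0, 1) - cell(0, 0))

# The same number as the interaction coefficient in the regression.
fit <- lm(y ~ post * treat)
did_reg <- unname(coef(fit)["post:treat"])

c(did_means = did_means, did_reg = did_reg)
```

The regression form is convenient because it delivers a standard error for the estimate and allows covariates to be added.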
The parallel trends assumption is the most critical of the DID assumptions for ensuring the internal validity of DID models, and it is the hardest to fulfill.
It requires that in the absence of treatment, the difference between the ‘treatment’ and ‘control’ group is constant over time.
Although there is no statistical test for this assumption, visual inspection is useful when you have observations over many time points.
It has also been proposed that the smaller the time period tested, the more likely the assumption is to hold.
A violation of the parallel trends assumption leads to biased estimates of the causal effect.
Case study: who pays for mandated childbirth coverage?
When governments mandate that employers provide benefits, who is really footing the bill?
This analysis was first conducted in 1994 by Jonathan Gruber, an MIT professor who serves as the director of the Health Care Program at the National Bureau of Economic Research (NBER). To date, The Incidence of Mandated Benefits remains one of the most influential papers in health care economics.
Understanding the timeline is important for identifying the causal effect:
Before 1978: there was limited health care coverage for childbirth.
1975-1979: a subset of states passed laws, mandating the health care coverage of childbirth.
Starting in 1978: federal legislation mandates the health care coverage of childbirth for all states.
    library(foreign)
    eitc <- read.dta("https://github.com/CausalReinforcer/Stata/raw/master/eitc.dta")

    # Create two additional dummy variables to indicate before/after
    # and treatment/control groups.

    # the EITC went into effect in the year 1994
    eitc$post93 = as.numeric(eitc$year >= 1994)

    # The EITC only affects women with at least one child, so the
    # treatment group will be all women with children.
    eitc$anykids = as.numeric(eitc$children >= 1)

    # Compute the four data points needed in the DID calculation:
    a = sapply(subset(eitc, post93 == 0 & anykids == 0, select=work), mean)
    b = sapply(subset(eitc, post93 == 0 & anykids == 1, select=work), mean)
    c = sapply(subset(eitc, post93 == 1 & anykids == 0, select=work), mean)
    d = sapply(subset(eitc, post93 == 1 & anykids == 1, select=work), mean)

    # Compute the effect of the EITC on the employment of women with children:
    (d-c)-(b-a)

    ##       work
    ## 0.04687313
\[work=\beta_0+\delta_0 post93+\beta_1 anykids+\delta_1(anykids*post93)+\epsilon\]
    reg1 = lm(work ~ post93 + anykids + post93*anykids, data = eitc)
    summary(reg1)

    ##
    ## Call:
    ## lm(formula = work ~ post93 + anykids + post93 * anykids, data = eitc)
    ##
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max
    ## -0.5755 -0.4908  0.4245  0.5092  0.5540
    ##
    ## Coefficients:
    ##                 Estimate Std. Error t value Pr(>|t|)
    ## (Intercept)     0.575460   0.008845  65.060  < 2e-16 ***
    ## post93         -0.002074   0.012931  -0.160  0.87261
    ## anykids        -0.129498   0.011676 -11.091  < 2e-16 ***
    ## post93:anykids  0.046873   0.017158   2.732  0.00631 **
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## Residual standard error: 0.4967 on 13742 degrees of freedom
    ## Multiple R-squared:  0.0126, Adjusted R-squared:  0.01238
    ## F-statistic: 58.45 on 3 and 13742 DF,  p-value: < 2.2e-16
    # Take average value of 'work' by year, conditional on anykids
    minfo = aggregate(eitc$work, list(eitc$year,eitc$anykids == 1), mean)

    # rename column headings (variables)
    names(minfo) = c("YR","Treatment","LFPR")

    # Attach a new column with labels
    minfo$Group[1:6] = "Single women, no children"
    minfo$Group[7:12] = "Single women, children"
    #minfo

    require(ggplot2) #package for creating nice plots

    qplot(YR, LFPR, data=minfo, geom=c("point","line"), colour=Group,
          xlab="Year", ylab="Labor Force Participation Rate")+geom_vline(xintercept = 1994)
\[ln(wage)= \beta_0+\beta_1*EDUC+\beta_2*EXP +\beta_3*EXP^2+...+u\]
What is \(u\)?
Further, we can think of Education as a function of other variables. \[ EDUC=f(GENDER,S-E, location, Ability, Motivation)\]
\[\begin{align*} ln(wage) &=\beta_0+\beta_1*EDUC(X,Ability,Motivation)+\beta_2*EXP \\ &+\beta_3*EXP^2+...+u(Ability,Motivation)\end{align*}\]
Increased Ability is associated with increases in Education and \(u\).
What looks like an effect due to an increase in Education may be an increase in Ability.
The estimate of \(\beta_1\) picks up the effect of Education and the hidden effect of Ability.
\[ln(wage)=\beta_0+\beta_1*EDUC(Z,X,Ability,Motivation) \\ +\beta_2*EXP+\beta_3*EXP^2+...+u(Ability,Motivation)\]
A variable Z is associated with an increase in Education, but does not affect the error \(u\).
A change in wages that follows from a Z-induced change in Education reflects only the effect of Education. So, for changes in Z, we can estimate the changes in log wages that are caused by education.
Three important threats to internal validity are:
Instrumental variables regression can eliminate bias when \(E(u|X) \ne 0\) by using an instrumental variable, \(Z\).
Let our population regression model be
\[ Y_i = \beta_0 + \beta_1 X_i + u_i,~~i=1,\dots,n \]
and let the variable \(Z_i\) be an instrumental variable that isolates the part of \(X_i\) that is uncorrelated with \(u_i\).
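If a valid instrument, \(Z\), is available, we can estimate the coefficient \(\beta_1\) using two-stage least squares (TSLS).
In the first stage, we regress \(X_i\) on \(Z_i\) using OLS and keep the fitted values \(\hat{X}_i\), the component of \(X_i\) that is uncorrelated with \(u_i\). In the second stage, we use this uncorrelated component to estimate \(\beta_1\): we regress \(Y_i\) on \(\hat{X}_i\) using OLS to estimate \(\beta_0^{TSLS}\) and \(\beta_1^{TSLS}\).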
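The two stages can be carried out by hand with `lm()`. The following is a minimal sketch on simulated data, not part of the original notes: the data-generating process, the true coefficient of 0.5, and the variable named `ability` are assumptions chosen so that OLS is visibly biased.

```r
set.seed(123)
n <- 20000

# Hypothetical data-generating process (all names and values are assumed).
z <- rnorm(n)                        # instrument: moves x, unrelated to u
ability <- rnorm(n)                  # unobserved confounder
u <- ability + rnorm(n)              # error term contains the confounder
x <- 0.8 * z + ability + rnorm(n)    # regressor is correlated with u
y <- 1 + 0.5 * x + u                 # true beta_1 = 0.5

# OLS is inconsistent because Cov(x, u) != 0.
beta_ols <- unname(coef(lm(y ~ x))["x"])

# Stage 1: regress x on z and keep the fitted values.
x_hat <- fitted(lm(x ~ z))

# Stage 2: regress y on the fitted values.
beta_tsls <- unname(coef(lm(y ~ x_hat))["x_hat"])

# With one instrument, TSLS equals the ratio of sample covariances.
beta_ratio <- cov(z, y) / cov(z, x)

round(c(ols = beta_ols, tsls = beta_tsls, ratio = beta_ratio), 3)
```

The point estimate from the manual second stage is the TSLS estimate, but the standard errors reported by that `lm()` fit are not valid, because they ignore that \(\hat{X}_i\) was itself estimated. A dedicated routine such as `ivreg()` from the AER package handles this.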
Suppose we wanted to estimate the price elasticity in the log-log model
\[ \ln(Q_i^{butter}) = \beta_0 + \beta_1 \ln(P_i^{butter}) + u_i \]
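If we had a sample of \(n\) observations of quantity demanded and the equilibrium price, we could run an OLS regression to estimate the elasticity coefficient \(\beta_1\). However, because price and quantity are determined jointly by supply and demand, that estimate would suffer from simultaneous causality bias.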
We have cross-sectional data from 48 US states in 1995 with the variables
Before we can carry out TSLS estimation we must investigate the relevance of our instrument.
Statistical software conceals the various steps needed for IV regression, but it can be useful to demonstrate the steps here.
The first-stage regression yields \[ \begin{alignat*}{2} \widehat{\ln(P_i^{cigarettes})} = &4.62 + &&~0.031 SalesTax_i \\ &(0.03) &&~(0.005) \end{alignat*} \]
with \(R^2 = 0.47\).
    suppressMessages(library("AER"))
    suppressMessages(library("plm"))
    data("CigarettesSW")

    cig.data=CigarettesSW
    cig.data$packpc=cig.data$packs
    cig.data$ravgprs <- cig.data$price/cig.data$cpi # real average price
    cig.data$rtax <- cig.data$tax/cig.data$cpi # real average cig tax
    cig.data$rtaxs <- cig.data$taxs/cig.data$cpi # real average total tax
    cig.data$rtaxso <- cig.data$rtaxs - cig.data$rtax # instrument

    first.stage.res <- lm(log(ravgprs) ~ rtaxso, data=cig.data, subset=(year == 1995))
    coeftest(first.stage.res, vcov.=vcovHAC(first.stage.res))

    ##
    ## t test of coefficients:
    ##
    ##              Estimate Std. Error  t value  Pr(>|t|)
    ## (Intercept) 4.6165463  0.0285440 161.7343 < 2.2e-16 ***
    ## rtaxso      0.0307289  0.0048623   6.3198 9.588e-08 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In the second stage, \(\ln(Q_i^{cigarettes})\) is regressed on \(\widehat{\ln(P_i^{cigarettes})}\) \[ \begin{alignat*}{3} \widehat{\ln(Q_i^{cigarettes})} = &9.72 - &&~1.08 \widehat{\ln(P_i^{cigarettes})} \\ &(1.53) &&~(0.32) \end{alignat*} \]
    library(AER)
    iv.res <- ivreg(log(packpc) ~ log(ravgprs) | rtaxso, data=cig.data, subset=(year == 1995))
    coeftest(iv.res, vcov.=vcovHAC(iv.res))

    ##
    ## t test of coefficients:
    ##
    ##              Estimate Std. Error t value  Pr(>|t|)
    ## (Intercept)   9.71988    1.52719  6.3646 8.211e-08 ***
    ## log(ravgprs) -1.08359    0.31869 -3.4002  0.001401 **
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
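This is a strong relationship between prices and demand: the estimated elasticity of about \(-1.08\) means that a 1% increase in price is associated with roughly a 1.08% decrease in cigarette consumption.
But our assumption of instrument exogeneity may not hold.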
Consider income: states with higher income might not need to rely on taxes for revenue and there is presumably an effect of income on consumption of cigarettes.
As stated before, when we regressed quantity demanded on prices using sales taxes as an instrument, the instrument may be correlated with the error term because state income may be correlated with sales taxes. So now let’s add income as an exogenous regressor:
\[ \begin{alignat*}{5} \widehat{\ln(Q_i^{cigarettes})} = &9.43 - &&1.14 \widehat{\ln(P_i^{cigarettes})} + &&0.21\ln(Inc_i) \\ &(1.26) &&(0.37) && (0.31) \end{alignat*} \]
    cig.data$perinc <- cig.data$income/(cig.data$population * cig.data$cpi)
    iv.res.2 <- ivreg(log(packpc) ~ log(ravgprs) + log(perinc) | rtaxso + log(perinc),
                      data=cig.data, subset=(year == 1995))
    coeftest(iv.res.2, vcov.=vcovHAC(iv.res.2))

    ##
    ## t test of coefficients:
    ##
    ##              Estimate Std. Error t value  Pr(>|t|)
    ## (Intercept)   9.43066    1.25464  7.5166 1.757e-09 ***
    ## log(ravgprs) -1.14338    0.37064 -3.0848  0.003477 **
    ## log(perinc)   0.21452    0.31145  0.6888  0.494509
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now instead of using only one instrument we can use two: \(SalesTax_i\) and \(CigTax_i\), so \(m = 2\), making this model overidentified.
\[ \begin{alignat*}{5} \widehat{\ln(Q_i^{cigarettes})} = &9.89 - &&1.28 \widehat{\ln(P_i^{cigarettes})} + &&0.28\ln(Inc_i) \\ &(0.97) &&(0.25) &&(0.25) \end{alignat*} \]
    iv.res.3 <- ivreg(log(packpc) ~ log(ravgprs) + log(perinc) | rtaxso + rtax + log(perinc),
                      data=cig.data, subset=(year == 1995))
    coeftest(iv.res.3, vcov.=vcovHAC(iv.res.3)) # For Robust Standard Errors

    ##
    ## t test of coefficients:
    ##
    ##              Estimate Std. Error t value  Pr(>|t|)
    ## (Intercept)   9.89496    0.96599 10.2434 2.435e-13 ***
    ## log(ravgprs) -1.27742    0.25299 -5.0493 7.805e-06 ***
    ## log(perinc)   0.28040    0.25461  1.1013    0.2766
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
If the model is overidentified and we have both weak and strong instruments, it is best to drop the weak instruments.
However, if the model is exactly identified, it is not possible to drop any instruments.
In this case, we should try to find stronger instruments (not an easy task) or use the weak instruments with methods other than TSLS that are less sensitive to weak instruments.
Suppose that the instrument is completely irrelevant so that \(Cov(Z_i, X_i) = 0\), then \[ s_{ZX} \overset{p}{\longrightarrow} Cov(Z_i, X_i) = 0 \]
This drives the denominator of \(Cov(Z_i,Y_i)/Cov(Z_i,X_i)\) toward zero, so the sampling distribution of \(\beta_1^{TSLS}\) is not normal, even in large samples.
A similar problem would be encountered with instruments that are not completely irrelevant but are weak.
To check for weak instruments, compute the \(F\)-statistic testing the hypothesis that the coefficients on all the instruments are zero in the first stage.
A rule of thumb is not to worry about weak instruments if the first-stage \(F\)-statistic is greater than 10.
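A minimal sketch of this check on simulated data, assuming a single instrument and no other regressors, so that the first-stage \(F\)-statistic is simply the overall \(F\) reported by `summary()`. The instrument names and coefficients are hypothetical, and this version uses the default homoskedasticity-only \(F\) rather than a robust one.

```r
set.seed(7)
n <- 500

# Hypothetical instruments: one strongly and one barely related to x.
z_strong <- rnorm(n)
z_weak   <- rnorm(n)
x <- 0.5 * z_strong + 0.02 * z_weak + rnorm(n)

# First-stage F-statistic for a single instrument: the overall F of the
# regression of x on that instrument.
first_stage_F <- function(instrument) {
  unname(summary(lm(x ~ instrument))$fstatistic["value"])
}

F_strong <- first_stage_F(z_strong)
F_weak   <- first_stage_F(z_weak)

c(F_strong = F_strong, F_weak = F_weak)
```

With these assumed coefficients the strong instrument clears the threshold of 10 comfortably, while the weak one will typically fall well below it.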
If the instruments are not exogenous then TSLS estimators will suffer from inconsistency.
We can test for exogeneity using the \(J\)-statistic. We do this by estimating the following regression
\[ \begin{align*} \hat{u}_i^{TSLS} = &\delta_0 + \delta_1 Z_{1i} + \cdots + \delta_m Z_{mi}\\ {}+ &\delta_{m+1} W_{1i} + \cdots + \delta_{m+r} W_{ri} + e_i \end{align*} \]
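and using an \(F\)-test for \(\delta_1 = \cdots = \delta_m = 0\). The \(J\)-statistic is \(J = mF\); under the null hypothesis that all instruments are exogenous (and with homoskedastic errors), it is distributed \(\chi^2_{m-k}\) in large samples, where \(k\) is the number of endogenous regressors.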
In our previous TSLS we used two instruments: \(SalesTax_i\) and \(CigTax_i\), and one exogenous regressor: state income.
There are still concerns about the exogeneity of \(CigTax_i\): there could be state specific characteristics that influence both cigarette taxes and cigarette consumption.
To simplify matters we will focus on the differences between 1985 and 1995.
We regress \([\ln(Q_{i,1995}^{cigarettes}) - \ln(Q_{i,1985}^{cigarettes})]\) on \([\ln(P_{i,1995}^{cigarettes}) - \ln(P_{i,1985}^{cigarettes})]\) and \([\ln(Inc_{i,1995}) - \ln(Inc_{i,1985})]\)
    # plm() selects the estimator through its `model` argument
    panel.iv.res.1 <- plm(log(packpc) ~ log(ravgprs) + log(perinc) | rtaxso + log(perinc),
                          data=cig.data, model="within", effect="individual",
                          index=c("state", "year"))
    coeftest(panel.iv.res.1, vcov.=vcovHC(panel.iv.res.1))

    ##
    ## t test of coefficients:
    ##
    ##               Estimate Std. Error t value  Pr(>|t|)
    ## log(ravgprs) -1.072460   0.168316 -6.3717 8.011e-08 ***
    ## log(perinc)  -0.079004   0.254929 -0.3099     0.758
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    panel.1st.stage.res.1 <- lm(log(ravgprs) ~ rtaxso, data=cig.data)
    lht(panel.1st.stage.res.1, "rtaxso = 0", vcov=vcovHAC)

    ## Linear hypothesis test
    ##
    ## Hypothesis:
    ## rtaxso = 0
    ##
    ## Model 1: restricted model
    ## Model 2: log(ravgprs) ~ rtaxso
    ##
    ## Note: Coefficient covariance matrix supplied.
    ##
    ##   Res.Df Df      F    Pr(>F)
    ## 1     95
    ## 2     94  1 112.64 < 2.2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    # plm() selects the estimator through its `model` argument
    panel.iv.res.2 <- plm(log(packpc) ~ log(ravgprs) + log(perinc) | rtax + log(perinc),
                          data=cig.data, model="within", effect="individual",
                          index=c("state", "year"))
    coeftest(panel.iv.res.2, vcov.=vcovHC(panel.iv.res.2))

    ##
    ## t test of coefficients:
    ##
    ##              Estimate Std. Error t value  Pr(>|t|)
    ## log(ravgprs) -1.36316    0.16975 -8.0303 2.668e-10 ***
    ## log(perinc)   0.34247    0.24179  1.4164    0.1634
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    panel.1st.stage.res.2 <- lm(log(ravgprs) ~ rtax, data=cig.data)
    lht(panel.1st.stage.res.2, "rtax = 0", vcov=vcovHAC)

    ## Linear hypothesis test
    ##
    ## Hypothesis:
    ## rtax = 0
    ##
    ## Model 1: restricted model
    ## Model 2: log(ravgprs) ~ rtax
    ##
    ## Note: Coefficient covariance matrix supplied.
    ##
    ##   Res.Df Df      F    Pr(>F)
    ## 1     95
    ## 2     94  1 302.12 < 2.2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    # plm() selects the estimator through its `model` argument
    panel.iv.res.3 <- plm(log(packpc) ~ log(ravgprs) + log(perinc) | rtaxso + rtax + log(perinc),
                          data=cig.data, model="within", effect="individual",
                          index=c("state", "year"))
    coeftest(panel.iv.res.3, vcov.=vcovHC(panel.iv.res.3))

    ##
    ## t test of coefficients:
    ##
    ##              Estimate Std. Error t value  Pr(>|t|)
    ## log(ravgprs) -1.26750    0.15872 -7.9858 3.103e-10 ***
    ## log(perinc)   0.20378    0.23261  0.8761    0.3855
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    panel.1st.stage.res.3 <- lm(log(ravgprs) ~ rtaxso + rtax, data=cig.data)
    lht(panel.1st.stage.res.3, c("rtaxso = 0", "rtax = 0"), vcov=vcovHAC)

    ## Linear hypothesis test
    ##
    ## Hypothesis:
    ## rtaxso = 0
    ## rtax = 0
    ##
    ## Model 1: restricted model
    ## Model 2: log(ravgprs) ~ rtaxso + rtax
    ##
    ## Note: Coefficient covariance matrix supplied.
    ##
    ##   Res.Df Df     F    Pr(>F)
    ## 1     95
    ## 2     93  2 165.2 < 2.2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    j.test.reg <- lm(panel.iv.res.3$residuals ~ rtaxso + rtax + log(perinc), data=cig.data)
    lht(j.test.reg, c("rtaxso = 0", "rtax = 0"), vcov.=vcovHAC)

    ## Linear hypothesis test
    ##
    ## Hypothesis:
    ## rtaxso = 0
    ## rtax = 0
    ##
    ## Model 1: restricted model
    ## Model 2: panel.iv.res.3$residuals ~ rtaxso + rtax + log(perinc)
    ##
    ## Note: Coefficient covariance matrix supplied.
    ##
    ##   Res.Df Df      F Pr(>F)
    ## 1     94
    ## 2     92  2 0.0012 0.9988
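The p-value printed by this test is based on 2 degrees of freedom, whereas the overidentification test has only \(m - k = 2 - 1 = 1\) degree of freedom. A minimal sketch of the adjustment, taking the reported \(F = 0.0012\) from the output above and assuming the textbook relation \(J = mF\) (which strictly requires homoskedastic errors, so treat the result as approximate here):

```r
# F-statistic reported by the linear hypothesis test above.
F_stat <- 0.0012
m <- 2   # number of instruments (rtaxso, rtax)
k <- 1   # number of endogenous regressors (log price)

# J-statistic and its p-value from a chi-squared with m - k degrees of freedom.
J <- m * F_stat
p_value <- pchisq(J, df = m - k, lower.tail = FALSE)

c(J = J, p_value = p_value)
```

The conclusion does not change: the p-value remains far above conventional significance levels, so the test does not reject the hypothesis that both instruments are exogenous.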
J. Angrist, “Lifetime Earnings and the Vietnam Era Draft Lottery: Evidence from Social Security Administrative Records,” American Economic Review, June 1990.
J. Angrist and A. Krueger, “Does Compulsory School Attendance Affect Schooling and Earnings?”, Quarterly Journal of Economics 106, November 1991.
J. Angrist, et al., “Who benefits from KIPP?”, J. of Policy Analysis and Management, Fall 2012.
J. Angrist, V. Lavy, and A. Schlosser, “Multiple Experiments for the Quantity and Quality of Children”, Journal of Labor Economics 28, October 2010.
This document replicates Table 4.1 and Figures 4.2, 4.4, and 4.5 from Mastering Metrics (based on data from Carpenter and Dobkin 2009).
Will adding controls affect diff-in-diff estimates if treatment assignment was random?
When you’ve done this, you’re no longer estimating the causal effect of treatment
What are some standard falsification tests you might want to run with diff-in-diff?
Answer
If you find ex-ante differences between treated and untreated, is internal validity gone?
Does the absence of a pre-trend in diff-in-diff ensure that the parallel trends assumption holds and that causal inferences can be made?
Answer = Sadly, no. We can never prove causality with 100% confidence. It could be that the trend was going to change after treatment for reasons unrelated to treatment.
How are triple differences helpful in reducing concerns about violations of the parallel trends assumption?
Answer = Before, an “identification policeman” would just need a story about why the treated might be trending differently after the event for other reasons. Now, he/she would need a story about why that different trend would be particularly true for the subset of firms that are more sensitive to treatment.
The basic idea of regression discontinuity design (RDD) is the following:
The researcher is interested in how this treatment affects the outcome variable of interest, \(y\).
Sharp RDD
Fuzzy RDD
This subtle distinction affects exactly how you estimate the causal effect of treatment
With fuzzy RDD, the average change in \(y\) around the threshold understates the causal effect [why?]
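One way to see this is to write the fuzzy RDD estimand as a ratio. The notation here is assumed for illustration: \(x\) is the variable that determines assignment, \(x'\) is the threshold, and \(d\) is the treatment indicator.

\[ \tau_{fuzzy} = \frac{\lim_{x \downarrow x'} E[y|x] - \lim_{x \uparrow x'} E[y|x]}{\lim_{x \downarrow x'} E[d|x] - \lim_{x \uparrow x'} E[d|x]} \]

In a sharp RDD the probability of treatment jumps from 0 to 1 at the threshold, so the denominator equals 1 and the jump in \(y\) is the treatment effect. In a fuzzy RDD the probability of treatment jumps by less than 1, so the jump in \(y\) alone is smaller in magnitude than the effect and has to be rescaled by the jump in treatment probability.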
    library(AER)
    library(foreign)
    library(rdd)
    library(stargazer)

    AEJfigs=read.dta("AEJfigs.dta")

    # All = all deaths
    AEJfigs$age = AEJfigs$agecell - 21
    AEJfigs$over21 = ifelse(AEJfigs$agecell >= 21,1,0)

    reg.1=RDestimate(all~agecell,data=AEJfigs,cutpoint = 21)
    plot(reg.1)
    title(main="All Causes of Death", xlab="AGE",
          ylab="Mortality rate from all causes (per 100,000)")
    ##
    ## Call:
    ## RDestimate(formula = all ~ agecell, data = AEJfigs, cutpoint = 21)
    ##
    ## Type:
    ## sharp
    ##
    ## Estimates:
    ##            Bandwidth  Observations  Estimate  Std. Error  z value
    ## LATE       1.6561     40            9.001     1.480       6.080
    ## Half-BW    0.8281     20            9.579     1.914       5.004
    ## Double-BW  3.3123     48            7.953     1.278       6.223
    ##            Pr(>|z|)
    ## LATE       1.199e-09  ***
    ## Half-BW    5.609e-07  ***
    ## Double-BW  4.882e-10  ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## F-statistics:
    ##            F      Num. DoF  Denom. DoF  p
    ## LATE       33.08  3         36          3.799e-10
    ## Half-BW    29.05  3         16          2.078e-06
    ## Double-BW  32.54  3         44          6.129e-11

    ##
    ## Call:
    ## RDestimate(formula = mva ~ agecell, data = AEJfigs, cutpoint = 21)
    ##
    ## Type:
    ## sharp
    ##
    ## Estimates:
    ##            Bandwidth  Observations  Estimate  Std. Error  z value
    ## LATE       1.2109     30            4.977     1.0590      4.700
    ## Half-BW    0.6054     14            4.956     1.3767      3.600
    ## Double-BW  2.4218     48            4.566     0.7086      6.444
    ##            Pr(>|z|)
    ## LATE       2.607e-06  ***
    ## Half-BW    3.182e-04  ***
    ## Double-BW  1.162e-10  ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## F-statistics:
    ##            F      Num. DoF  Denom. DoF  p
    ## LATE       13.32  3         26          3.692e-05
    ## Half-BW    12.76  3         10          1.879e-03
    ## Double-BW  26.99  3         44          9.322e-10

    ##
    ## Call:
    ## RDestimate(formula = internal ~ agecell, data = AEJfigs, cutpoint = 21)
    ##
    ## Type:
    ## sharp
    ##
    ## Estimates:
    ##            Bandwidth  Observations  Estimate  Std. Error  z value
    ## LATE       0.8809     22            1.4128    0.8206      1.722
    ## Half-BW    0.4405     10            1.8691    1.0203      1.832
    ## Double-BW  1.7618     42            0.7652    0.6179      1.239
    ##            Pr(>|z|)
    ## LATE       0.08513    .
    ## Half-BW    0.06698    .
    ## Double-BW  0.21553
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## F-statistics:
    ##            F       Num. DoF  Denom. DoF  p
    ## LATE        6.830  3         18          5.734e-03
    ## Half-BW     1.765  3          6          5.068e-01
    ## Double-BW  22.695  3         38          2.750e-08
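The idea behind these sharp RDD estimates can be sketched in base R with a local linear regression. This is a simplified illustration on simulated data, not a replication of `RDestimate()`, which by default uses kernel weights and a data-driven bandwidth. The cutoff of 21 mirrors the example above, while the jump of 8, the bandwidth `h`, and the data-generating process are assumptions.

```r
set.seed(21)

# Hypothetical age cells around a cutoff at 21 (simulated, not the AEJfigs data).
age  <- seq(19, 23, length.out = 400)
over <- as.numeric(age >= 21)
jump <- 8                                  # assumed true discontinuity
y <- 95 - 2 * (age - 21) + jump * over + rnorm(length(age), sd = 2)

# Local linear regression: separate slopes on each side of the cutoff,
# using only observations within a bandwidth h of the cutoff.
h <- 1.5
centered <- age - 21
fit <- lm(y ~ over * centered, subset = abs(centered) <= h)

# The coefficient on `over` is the estimated jump at the cutoff.
rd_estimate <- unname(coef(fit)["over"])
rd_estimate
```

Centering the running variable at the cutoff is what makes the coefficient on `over` interpretable as the jump at age 21. Re-running with half and double the bandwidth, as in the output above, is a common robustness check.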
[Figures not reproduced: RDtable5aaa.png, RDtable5aa.png, RDtable5a.png, RDtable5.png]
C. Carpenter and C. Dobkin, “The Effect of Alcohol Consumption on Mortality: Regression Discontinuity Evidence from the MLDA”, American Economic Journal: Applied Economics 1 (2009), 164-182.
A. Abdulkadiroglu, et al., “The Elite Illusion: Achievement Effects at Boston and New York Exam Schools”, Econometrica, 2014.