- A new econometrics course
- Designed to go at the start of the sequence
- Focusing on two things: an introduction to programming, and causal inference/research design
2/7/2020
Notable exclusions:
Everyone can do it their own way, of course. What I do:
- “Causal Inference in the News”
- “Research Design”
- Kinds of homework and exam questions?
The first ~half of the class (a little less) focuses on learning to use R, with an emphasis on working close to the data. I won’t teach R in this workshop and will assume you attended the first workshop. So, class goals:
- The pipe `%>%`
- `for` loops, and using them to create fake data and perform a Monte Carlo (!)
- Finding data: run `data()` and see what pops up. Or see here. Ecdat is great, as are wooldridge and AER
- `read.csv`
- dplyr: `mutate()`, `select()`, `filter()`, `arrange()`
- `group_by()` with `mutate()` or `summarize()` ← how we do a lot of our causal inference
- `cut()`, `var()`, `mean()`, `quantile()`, `stargazer()`, `pull()`, `sample()`, `rnorm()`, `runif()`
- A common programming teaching approach is “I do it, we do it, you do it”
Example:
Use `cut()` with `breaks = 5`, plus `group_by()` and `mutate()`, to get the proportion of variance of `percbelowpoverty` explained by `popdensity` (me), `percollege` (us), and `perchsd` (you):

```r
library(tidyverse)
data(midwest)
midwest <- midwest %>%
  mutate(den_cuts = cut(popdensity, breaks = 5)) %>%
  group_by(den_cuts) %>%
  mutate(pov.r = percbelowpoverty - mean(percbelowpoverty)) %>%
  ungroup()
# Proportion of variance explained
1 - var(midwest$pov.r)/var(midwest$percbelowpoverty)
## [1] 0.01590158
```
The pipe `%>%` takes the thing on the left and makes it the first argument of the thing on the right. Students have difficulty with nested parentheses, and this HELPS.
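As a quick illustration (my own sketch, not from the slides), the same calculation written nested and then piped:

```r
library(tidyverse)
data(midwest)
# Nested: read from the inside out
round(mean(pull(midwest, popdensity)))
# Piped: read left to right
midwest %>% pull(popdensity) %>% mean() %>% round()
```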
Me-we-you:
- (me) Get the mean of `popdensity` in `inmetro` areas
- (we) Create a variable indicating whether `popdensity` is above its median, then get the median `percollege` by above/below median, sorted to be below then above
- (you) `select()` `percollege` and `popdensity`. Just among observations with `popdensity` above its 75th percentile, what is the mean `percollege`?

```r
data("midwest")
# (me)
midwest %>%
  filter(inmetro == 1) %>%
  pull(popdensity) %>%
  mean()
# (we)
midwest %>%
  mutate(abovemed = popdensity > median(popdensity)) %>%
  group_by(abovemed) %>%
  summarize(perc = median(percollege)) %>%
  arrange(abovemed)
# (you)
midwest %>%
  select(percollege, popdensity) %>%
  filter(popdensity > quantile(popdensity, .75)) %>%
  pull(percollege) %>%
  mean()
```
- Beyond `cor()`, in general we look at the relationship between two variables with means within bins
- `mutate(x.cut = cut(x, breaks = ...))` any continuous explanatory variable, `group_by(x.cut)`, and then either `mutate(y = mean(y))` to get a new variable of explained values, or `mutate(y = y - mean(y))` for residuals, or `summarize(y = mean(y))` to get aggregate means-within-bins
- Don’t forget to `ungroup()`!
- A `geom_density()` graph, overlaying summary stats to demonstrate them (see the sketch just below)
- `stargazer()` in the stargazer package for a table of summary stats
- `ggplot` in ggplot2, plus `labs()`, that’s it
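One version of the `geom_density()` idea (a minimal sketch of my own, not from the original slides): overlay the mean as a vertical line so students see where the summary statistic sits in the distribution.

```r
library(tidyverse)
data(midwest)
ggplot(midwest, aes(x = percollege)) +
  geom_density() +
  # Overlay the mean on the density so the summary stat has context
  geom_vline(xintercept = mean(midwest$percollege), color = 'red') +
  labs(x = "Percent College-Educated in County", y = "Density")
```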
Potential homework assignment:

- Take the `log()` of `popdensity` and create a new variable: `lpop`
- Use `cut()` with `breaks = 5` to cut `lpop` and make a new variable `den_cuts`
- Create `pov.exp`, which is `percbelowpoverty` explained by `den_cuts`
- Plot `percbelowpoverty` (y-axis) against `lpop` (x-axis) on a scatterplot. Add a second scatterplot on top which has `pov.exp` as the y-axis and is in red. Label the axes appropriately.

```r
data(midwest)
midwest <- midwest %>%
  mutate(lpop = log(popdensity)) %>%
  mutate(den_cuts = cut(lpop, breaks = 5)) %>%
  group_by(den_cuts) %>%
  mutate(pov.exp = mean(percbelowpoverty)) %>%
  ungroup()
ggplot(midwest, aes(x = lpop, y = percbelowpoverty)) +
  geom_point() +
  geom_point(aes(x = lpop, y = pov.exp), color = 'red') +
  labs(x = "Log Population Density", y = "Percent Below Poverty")
```
And `for` loops let us create fake data for a Monte Carlo demonstration of confounding:

```r
store_results <- c()
for (i in 1:500) {
  # w -> x and w -> y and x -> y
  df <- tibble(w = rnorm(1000)) %>%
    mutate(x = rnorm(1000) + 2*w > 0) %>%
    mutate(y = x - 3*w + rnorm(1000))
  # Get effect of x on y
  analysis <- df %>%
    group_by(x) %>%
    summarize(y = mean(y))
  effect <- analysis$y[2] - analysis$y[1]
  store_results[i] <- effect
}
# True effect is +1. But what do we estimate?
mean(store_results)
## [1] -3.27781
```
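A possible follow-up (my own sketch, not from the original slides): rerun the same simulation, but first remove the variation in `y` explained by bins of `w`, using the class’s `cut()`-and-`group_by()` approach. The estimate should land much closer to the true +1 than -3.28, though coarse bins leave some residual bias.

```r
library(tidyverse)
store_results2 <- c()
for (i in 1:500) {
  df <- tibble(w = rnorm(1000)) %>%
    mutate(x = rnorm(1000) + 2*w > 0) %>%
    mutate(y = x - 3*w + rnorm(1000)) %>%
    # Control for w: subtract out the mean of y within bins of w
    mutate(w.cut = cut(w, breaks = 10)) %>%
    group_by(w.cut) %>%
    mutate(y.res = y - mean(y)) %>%
    ungroup()
  analysis <- df %>%
    group_by(x) %>%
    summarize(y = mean(y.res))
  store_results2[i] <- analysis$y[2] - analysis$y[1]
}
# Much closer to the true +1 (coarse bins leave a little bias)
mean(store_results2)
```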
Goals:
- `X` causes `Y` if…
- …were we to change `X` without changing anything else…
- …`Y` would also change as a result

Examples of causal relationships!
Some obvious:
Some less obvious:
Examples of non-zero correlations that are not causal (or may be causal in the wrong direction!)
Some obvious:
Some less obvious:
- Causal inference is all about what `Y` *would have been* if `X` had been different
- Imagine comparing two otherwise-identical observations, where one has `X=0` and one has `X=1`
- In fake data we can actually do this! Here, `X` causes `Y` to increase by 1

```r
df <- data.frame(Y.without.X = rnorm(1000),
                 X = sample(c(0,1), 1000, replace = T)) %>%
  mutate(Y.with.X = Y.without.X + 1) %>%
  # Now assign who actually gets X
  mutate(Observed.Y = ifelse(X == 1, Y.with.X, Y.without.X))
df %>% group_by(X) %>% summarize(Y = mean(Observed.Y))
## # A tibble: 2 x 2
##       X      Y
##   <dbl>  <dbl>
## 1     0 0.0450
## 2     1 1.03
```
But if `X` is not randomly assigned (here `Z` drives both who gets `X` and `Y` itself), the simple comparison of means is biased:

```r
df <- data.frame(Z = runif(10000)) %>%
  mutate(Y.without.X = rnorm(10000) + Z,
         Y.with.X = Y.without.X + 1) %>%
  # Now assign who actually gets X
  mutate(X = Z > .7,
         Observed.Y = ifelse(X == 1, Y.with.X, Y.without.X))
df %>% group_by(X) %>% summarize(Y = mean(Observed.Y))
## # A tibble: 2 x 2
##   X         Y
##   <lgl> <dbl>
## 1 FALSE 0.369
## 2 TRUE  1.84
```
```r
# But if we properly model the process and compare apples to apples...
df2 <- df %>%
  filter(abs(Z - .7) < .01) %>%
  group_by(X) %>%
  summarize(Y = mean(Observed.Y))
c(df2$Y[1], df2$Y[2])
## [1] 0.7084716 1.7060520
```
Some class concepts:
- “Control for `W`” doesn’t mean “add a control `W` in a regression model.” It means “remove the variation in treatment and outcome explained by `W`”
- Not every `W` should be controlled for ← this can lead us to control for colliders by accident! (“selection bias”)

I won’t go over causal diagrams, and will assume you attended the last workshop. But some helpful pedagogical tips:
- The raw relationship between `X` and `Y` contains the front-door causal associations we want, but also the back-door associations we don’t
- To control for `W` in `Y ~ X`: get means of `Y` and `X` within categories/bins of `W`, and subtract them out
- That removes the part of the `X`/`Y` relationship explained by `W`

```r
data(mtcars)
# Relationship between hp and wt, controlling for disp
mtcars <- mtcars %>%
  mutate(disp_bins = cut(disp, breaks = 3)) %>%
  group_by(disp_bins) %>%
  mutate(hp.res = hp - mean(hp),
         wt.res = wt - mean(wt)) %>%
  ungroup()
cor(mtcars$hp, mtcars$wt)
## [1] 0.6587479
cor(mtcars$hp.res, mtcars$wt.res)
## [1] 0.1641105
```
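The collider warning can also be demonstrated with fake data, using the same bins approach (a sketch of my own, not from the original slides): `x` and `y` are truly unrelated, but both cause `c`, so “controlling” for `c` manufactures a relationship.

```r
library(tidyverse)
collider <- tibble(x = rnorm(5000),
                   y = rnorm(5000)) %>%
  # c is a collider: caused by both x and y
  mutate(c = x + y + rnorm(5000)) %>%
  # "Control" for c with bins, just like we did for real confounders
  mutate(c_bins = cut(c, breaks = 5)) %>%
  group_by(c_bins) %>%
  mutate(x.res = x - mean(x),
         y.res = y - mean(y)) %>%
  ungroup()
cor(collider$x, collider$y)         # ~ 0: truly unrelated
cor(collider$x.res, collider$y.res) # clearly negative: selection bias at work
```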
Matching: compare treated (union) observations only to untreated (non-union) observations with the same values of the coarsened covariates:

```r
library(Ecdat); data(Wages)
Wages <- Wages %>%
  mutate(ed.coarse = cut(ed, breaks = 3),
         exp.coarse = cut(exp, breaks = 3))
# Split up the treated and untreated
union <- Wages %>% filter(union == 'yes')
nonunion <- Wages %>% filter(union == 'no') %>%
  # For every potential complete-match, let's get the average Y
  group_by(ed.coarse, exp.coarse, bluecol,
           ind, south, smsa, married, sex, black) %>%
  summarize(untreated.lwage = mean(lwage))
# inner_join matches on all shared columns, i.e. the covariates above
results <- union %>% inner_join(nonunion) %>%
  summarize(union.mean = mean(lwage),
            nonunion.mean = mean(untreated.lwage))
results
##   union.mean nonunion.mean
## 1   6.687606      6.571178
```
Difference-in-differences, computed entirely from group means:

```r
# Create our data
diddata <- tibble(year = sample(2002:2010, 10000, replace = T),
                  group = sample(c('TreatedGroup','UntreatedGroup'), 10000, replace = T)) %>%
  mutate(after = (year >= 2007)) %>%
  # Only let the treatment be applied to the treated group
  mutate(D = after*(group == 'TreatedGroup')) %>%
  mutate(Y = 2*D + .5*year + rnorm(10000))
# Now, get before-after differences for both groups
means <- diddata %>% group_by(group, after) %>% summarize(Y = mean(Y))
# Before-after difference for untreated, has time effect only
bef.aft.untreated <- means$Y[4] - means$Y[3]
# Before-after for treated, has time and treatment effect
bef.aft.treated <- means$Y[2] - means$Y[1]
# Difference-in-differences! Take the Time + Treatment effect,
# and remove the Time effect
DID <- bef.aft.treated - bef.aft.untreated
DID
## [1] 2.002812
```
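A possible follow-up graph (my addition, not from the original slides): plotting group means by year makes the parallel-trends logic visible.

```r
# Mean of Y by group and year, with the treatment date marked
diddata %>%
  group_by(group, year) %>%
  summarize(Y = mean(Y)) %>%
  ggplot(aes(x = year, y = Y, color = group)) +
  geom_line() +
  geom_vline(xintercept = 2007, linetype = 'dashed') +
  labs(x = "Year", y = "Mean of Y")
```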
Regression discontinuity: being above the cutoff is an IV for treatment (no actual IV yet, so discuss this conceptually):

```r
rdd.data <- tibble(test = runif(1000)*100) %>%
  mutate(GATE = test >= 75) %>%
  mutate(earn = runif(1000)*40 + 10*GATE + test/2)
# Choose a "bandwidth" of how wide around the cutoff to look
# (arbitrary in our example)
# Bandwidth of 2 with a cutoff of 75 means we look from 75-2 to 75+2
bandwidth <- 2
# Just look within the bandwidth
rdd <- rdd.data %>%
  filter(abs(75 - test) < bandwidth) %>%
  # Create a variable indicating we're above the cutoff
  mutate(above = test >= 75) %>%
  # And compare our outcome just below the cutoff to just above
  group_by(above) %>%
  summarize(earn = mean(earn))
# Our effect looks just about right (10 is the truth)
rdd$earn[2] - rdd$earn[1]
## [1] 14.09107
```
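A binned scatterplot makes the jump at the cutoff visible (my sketch, not from the original slides, reusing the means-within-bins approach from earlier):

```r
# Raw data plus means within bins of the running variable
rdd.data %>%
  mutate(test.cut = cut(test, breaks = 20)) %>%
  group_by(test.cut) %>%
  mutate(earn.exp = mean(earn)) %>%
  ggplot(aes(x = test, y = earn)) +
  geom_point(alpha = .3) +
  geom_point(aes(y = earn.exp), color = 'red') +
  geom_vline(xintercept = 75, linetype = 'dashed') +
  labs(x = "Test Score", y = "Earnings")
```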
- Instrumental variables: we need the only path from `Z` to `Y` to go through `X`, after closing back doors with controls
- We need a strong relationship between `Z` and `X` (weak instrument problem)
- Instead of explaining `X` and `Y` with `Z` and throwing that part out, ONLY KEEP that part!
- `group_by()` and `summarize()` for the Wald estimator with a binary instrument
- `cor()` for results with a continuous instrument

```r
df <- tibble(R = sample(c(0,1), 500, replace = T)) %>%
  # We tell them whether or not to get treated
  mutate(X = R) %>%
  # But some of them don't listen! 20% do the OPPOSITE!
  mutate(X = ifelse(runif(500) > .8, 1 - R, R)) %>%
  mutate(Y = 5*X + rnorm(500))
```
Actually calculate it now:
```r
iv <- df %>%
  group_by(R) %>%
  summarize(Y = mean(Y), X = mean(X))
iv
## # A tibble: 2 x 3
##       R     Y     X
##   <dbl> <dbl> <dbl>
## 1     0 0.828 0.160
## 2     1 3.94  0.779
```
```r
# Remember, since our instrument is binary, we want the slope
(iv$Y[2] - iv$Y[1])/(iv$X[2] - iv$X[1])
## [1] 5.03231
```
With a continuous instrument and real data (cigarette taxes instrumenting for price):

```r
library(AER)
data(CigarettesSW)
CigarettesSW <- CigarettesSW %>%
  # The instrument: the sales-tax portion of the cigarette tax
  mutate(cigtax = taxs - tax) %>%
  # Put prices and taxes in real terms
  mutate(price = price/cpi,
         cigtax = cigtax/cpi) %>%
  # Keep only the variation in price and packs explained by the instrument
  group_by(cut(cigtax, breaks = 7)) %>%
  summarize(priceexp = mean(price),
            packsexp = mean(packs)) %>%
  ungroup()
cor(CigarettesSW$priceexp, CigarettesSW$packsexp)
## [1] -0.9711096
```
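If you want a slope rather than just a correlation, paralleling the binary-instrument Wald estimator above, a minimal sketch of my own:

```r
# Slope of explained packs on explained price: a by-hand IV-style estimate
cov(CigarettesSW$packsexp, CigarettesSW$priceexp) /
  var(CigarettesSW$priceexp)
```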