Problem 1

\[ Y_{i,t} = T_{i,t} \beta + \alpha_i + \delta_t + \varepsilon_{i,t} \]

Taking the first difference:

\[ \Delta Y_{i} = Y_{i,2} - Y_{i,1} \]

\[ \Delta Y_{i} = Y_{i,2} - Y_{i,1} = (T_{i,2} \beta + \alpha_i + \delta_2 + \varepsilon_{i,2}) - (T_{i,1} \beta + \alpha_i + \delta_1 + \varepsilon_{i,1 }) \]

Simplification:

\[ \Delta Y_{i} = (T_{i,2} - T_{i,1})\beta + (\delta_2 - \delta_1) + (\varepsilon_{i,2} - \varepsilon_{i,1}) \]

\[ \Delta Y_{i} = D_i\beta + \delta + \varepsilon_i \]

Interpretation:

\(D_i = T_{i,2} - T_{i,1}\), the change in treatment status between the two periods (whether unit \(i\) switched into or out of treatment).

\(\delta = \delta_2 - \delta_1\), the common time trend (the aggregate change from t = 1 to t = 2).

\(\varepsilon_i = \varepsilon_{i,2} - \varepsilon_{i,1}\), the differenced idiosyncratic error.

The first-difference regression provides consistent estimates if the explanatory variables are strictly exogenous: once the unobserved effect \(\alpha_i\) has been removed by differencing, the equation can be estimated consistently by OLS.

–> The key identifying assumption is strict exogeneity of \(T_{i,t}\) with respect to \(\varepsilon_{i,t}\).
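
To make the mechanics concrete, here is a minimal simulation sketch (hypothetical data, not part of the problem): regressing \(\Delta Y_i\) on \(\Delta T_i\) recovers \(\beta\) even when \(\alpha_i\) is correlated with treatment, because differencing removes \(\alpha_i\).

set.seed(1)
n     = 1000
alpha = rnorm(n)                            # unobserved unit effect
T1    = rbinom(n, 1, plogis(alpha))         # treatment correlated with alpha
T2    = rbinom(n, 1, plogis(alpha + 1))
Y1    = T1 * 2 + alpha + 0.0 + rnorm(n)     # beta = 2, delta_1 = 0
Y2    = T2 * 2 + alpha + 0.5 + rnorm(n)     # beta = 2, delta_2 = 0.5
lm(I(Y2 - Y1) ~ I(T2 - T1))                 # slope should be close to beta = 2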

Problem 2

1.

Let FL be a binary variable equal to one if a person lives in Florida, and zero otherwise. Let y90 be a year dummy variable for 1990. Then we have the linear probability model

\[ arrest = \beta_0 + \delta_{0}\,y90 + \beta_{1}FL + \delta_{1}\,y90 \cdot FL + \varepsilon \]

\(\beta_0\) is the intercept.

\(\delta_{0}y90\) allows for aggregate trends in drunk driving arrests that would affect both states.

\(FL\) allows for systematic differences between Florida and Georgia in either drunk driving behavior or law enforcement.

\(\delta_1\) measures the effect of the law: the change in the probability of a drunk driving arrest due to the new law in Florida.
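
As a minimal estimation sketch (the data frame dui and its variables are hypothetical; no data are given in the problem), the linear probability model could be fit as:

mod_did = lm(arrest ~ y90 * FL, data = dui)  # y90*FL expands to y90 + FL + y90:FL
summary(mod_did)                             # the coefficient on y90:FL estimates delta_1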

2. RELEVANCE! Any factor that produces different overall trends in the two states could be relevant (e.g. different changes in the respective driving populations).

It could be that the populations of drivers in the two states change in different ways over time; e.g., the age, race, or gender distributions may have shifted, or the levels of education across the two states may have changed. As these factors might affect whether someone is arrested for drunk driving, it could be important to control for them.

However, we are not concerned about factors that differ across the states but are constant over time, since \(FL\) accounts for these differences.

At a minimum, there is the possibility of obtaining a more precise estimator of \(\delta_1\) by reducing the error variance. Essentially, any explanatory variable that affects arrest can be used for this purpose.
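
As a sketch of such an augmented specification (the controls age, male, and educ are hypothetical placeholders, as is the data frame dui):

mod_did_x = lm(arrest ~ y90 * FL + age + male + educ, data = dui)  # hypothetical controls
summary(mod_did_x)  # delta_1 is still the coefficient on y90:FL, ideally estimated more precisely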

Problem 3

library(tidyverse)
## ── Attaching packages ────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.3
## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## Warning: package 'ggplot2' was built under R version 3.6.2
## Warning: package 'tibble' was built under R version 3.6.2
## Warning: package 'tidyr' was built under R version 3.6.2
## Warning: package 'dplyr' was built under R version 3.6.2
## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(texreg)
## Version:  1.36.23
## Date:     2017-03-03
## Author:   Philip Leifeld (University of Glasgow)
## 
## Please cite the JSS article in your publications -- see citation("texreg").
## 
## Attaching package: 'texreg'
## The following object is masked from 'package:tidyr':
## 
##     extract
require(foreign)
## Loading required package: foreign
data = read.csv("Michigan.csv", header = TRUE, sep = ",")

1.

# Michigan dummy
data = data %>%
  mutate(MI = ifelse(state == 26, 1, 0))

# HIKE dummy
data = data %>%
  mutate(HIKE = ifelse(month >= 33, 1, 0))
# Mean calculation
tmp = data %>%
  group_by(HIKE, MI)  %>% #instructing in which group to calculate mean
  summarise(mean_smoking_rate = mean(smoked))
## `summarise()` regrouping output by 'HIKE' (override with `.groups` argument)
# DiD calculation. Rows of tmp are ordered (HIKE=0, MI=0), (HIKE=0, MI=1), (HIKE=1, MI=0), (HIKE=1, MI=1).
DiMI = tmp[4,3] - tmp[3,3]  # post-hike difference: Michigan minus the other states
DiOTH = tmp[2,3] - tmp[1,3] # pre-hike difference: Michigan minus the other states

DiD = DiMI - DiOTH

The DiD estimate is about -0.005: after the tax hike, the share of smokers in Michigan fell by roughly 0.5 percentage points more than in the other states.
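
Equivalently (a sketch reusing the tmp table and the row ordering noted above), one can first difference over time within each group and then across groups:

d_MI  = tmp$mean_smoking_rate[4] - tmp$mean_smoking_rate[2]  # change in Michigan, post minus pre
d_OTH = tmp$mean_smoking_rate[3] - tmp$mean_smoking_rate[1]  # change in the other states
d_MI - d_OTH                                                 # the same DiD estimate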

2.

the_formula1 = smoked ~ MI + HIKE + I(MI*HIKE)
mod1 = lm(the_formula1, data = data)
mod1 %>% screenreg(digits = 6)
## 
## ==============================
##               Model 1         
## ------------------------------
## (Intercept)       0.195000 ***
##                  (0.002409)   
## MI                0.000728    
##                  (0.003798)   
## HIKE             -0.012208 ** 
##                  (0.003734)   
## I(MI * HIKE)     -0.005200    
##                  (0.005867)   
## ------------------------------
## R^2               0.000339    
## Adj. R^2          0.000299    
## Num. obs.     76026           
## RMSE              0.391694    
## ==============================
## *** p < 0.001, ** p < 0.01, * p < 0.05

The interaction coefficient from the regression equals the manual DiD calculation, but the estimate is not statistically significant.
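
As a quick sanity check (a sketch, not part of the original chunk), the equality can be verified programmatically:

all.equal(unname(coef(mod1)["I(MI * HIKE)"]), DiD[[1]])  # should return TRUE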

3.

the_formula2 = smoked ~ MI + HIKE + I(MI*HIKE) + as.factor(month)
mod2 = lm(the_formula2, data = data)

list(mod1, mod2) %>% screenreg(digits = 6, omit.coef = "month")
## 
## ================================================
##               Model 1           Model 2         
## ------------------------------------------------
## (Intercept)       0.195000 ***      0.178957 ***
##                  (0.002409)        (0.010397)   
## MI                0.000728          0.000881    
##                  (0.003798)        (0.003799)   
## HIKE             -0.012208 **       0.000556    
##                  (0.003734)        (0.015527)   
## I(MI * HIKE)     -0.005200         -0.005460    
##                  (0.005867)        (0.005867)   
## ------------------------------------------------
## R^2               0.000339          0.001398    
## Adj. R^2          0.000299          0.000649    
## Num. obs.     76026             76026           
## RMSE              0.391694          0.391625    
## ================================================
## *** p < 0.001, ** p < 0.01, * p < 0.05
# use class(data$month) to check the type 

Controlling for the monthly trend does not change the conclusion: the treatment effect is still not statistically significant.

4. We will now check whether our estimate is robust by controlling for observable factors that might affect smoking and that may have changed between the two periods.

the_formula3 = smoked ~ MI + HIKE + I(MI*HIKE) + as.factor(month) + as.factor(mrace3) + as.factor(ageg) + as.factor(meduc6) + as.factor(parity) + hispanic + married

mod3 = lm(the_formula3, data = data)

list(mod1, mod2, mod3) %>% screenreg(digits = 6, omit.coef = "month")
## 
## ========================================================================
##                     Model 1           Model 2           Model 3         
## ------------------------------------------------------------------------
## (Intercept)             0.195000 ***      0.178957 ***      0.225215 ***
##                        (0.002409)        (0.010397)        (0.013215)   
## MI                      0.000728          0.000881          0.007352 *  
##                        (0.003798)        (0.003799)        (0.003573)   
## HIKE                   -0.012208 **       0.000556          0.004773    
##                        (0.003734)        (0.015527)        (0.014511)   
## I(MI * HIKE)           -0.005200         -0.005460         -0.014675 ** 
##                        (0.005867)        (0.005867)        (0.005484)   
## as.factor(mrace3)2                                         -0.156572 ***
##                                                            (0.004187)   
## as.factor(mrace3)3                                         -0.083296 ***
##                                                            (0.009018)   
## as.factor(ageg)2                                            0.123477 ***
##                                                            (0.005334)   
## as.factor(ageg)3                                            0.139785 ***
##                                                            (0.005761)   
## as.factor(ageg)4                                            0.129834 ***
##                                                            (0.006206)   
## as.factor(ageg)5                                            0.122455 ***
##                                                            (0.007177)   
## as.factor(ageg)6                                            0.107228 ***
##                                                            (0.012236)   
## as.factor(meduc6)2                                          0.148552 ***
##                                                            (0.008985)   
## as.factor(meduc6)3                                          0.012825    
##                                                            (0.008584)   
## as.factor(meduc6)4                                         -0.081404 ***
##                                                            (0.008846)   
## as.factor(meduc6)5                                         -0.158022 ***
##                                                            (0.009039)   
## as.factor(meduc6)6                                          0.000546    
##                                                            (0.014307)   
## as.factor(parity)2                                          0.036279 ***
##                                                            (0.003270)   
## as.factor(parity)3                                          0.067525 ***
##                                                            (0.004102)   
## as.factor(parity)4                                          0.089832 ***
##                                                            (0.005099)   
## hispanic                                                   -0.165301 ***
##                                                            (0.007253)   
## married                                                    -0.200761 ***
##                                                            (0.003708)   
## ------------------------------------------------------------------------
## R^2                     0.000339          0.001398          0.128706    
## Adj. R^2                0.000299          0.000649          0.127857    
## Num. obs.           76026             76026             76026           
## RMSE                    0.391694          0.391625          0.365852    
## ========================================================================
## *** p < 0.001, ** p < 0.01, * p < 0.05

Adding control variables in a DiD framework makes the result more robust, ensuring that observable factors did not change between the two periods without being controlled for.

Adding the control variables has increased the magnitude of the treatment effect (to about -0.015) and it has become statistically significant. We can interpret the estimate as the higher tax reducing the smoking rate in Michigan by roughly 1.5 percentage points.

5. We are essentially testing whether there were parallel trends before the hike. Hence, we test whether the MI-specific month effects (the coefficients on the MI × month interactions) are jointly zero in the pre-hike period.

# It's cumbersome to filter factors. Hence, we reload the data.
data = read.csv("Michigan.csv", header = TRUE, sep = ",")
data = data %>%
  mutate(MI = ifelse(state == 26, 1, 0))
data = data %>%
  mutate(HIKE = ifelse(month >= 33, 1, 0))

# Restricting the sample to the 32 months prior to the tax hike
data0 = data %>% filter(month < 33)


# Formulas for the parallel-trends test. The restricted model includes MI (levels may differ),
# the controls, and month dummies; the unrestricted model adds MI*month interactions, which
# allow Michigan's monthly pattern to differ from the other states'.

# Restricted model: MI level shift plus common month effects.
the_formula4 = smoked ~ MI + as.factor(month) + as.factor(mrace3) + as.factor(ageg) + as.factor(meduc6) + as.factor(parity) + hispanic + married

# Unrestricted model: MI*month interactions allow Michigan-specific month effects.
the_formula5 = smoked ~ MI*as.factor(month) + as.factor(mrace3) + as.factor(ageg) + as.factor(parity) + as.factor(meduc6) + hispanic + married

mod4 = lm(the_formula4, data = data0)
mod5 = lm(the_formula5, data = data0)

# F-test
anova(mod4, mod5)
## Analysis of Variance Table
## 
## Model 1: smoked ~ MI + as.factor(month) + as.factor(mrace3) + as.factor(ageg) + 
##     as.factor(meduc6) + as.factor(parity) + hispanic + married
## Model 2: smoked ~ MI * as.factor(month) + as.factor(mrace3) + as.factor(ageg) + 
##     as.factor(parity) + as.factor(meduc6) + hispanic + married
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1  44181 6069.0                              
## 2  44150 6062.9 31    6.0541 1.4221 0.06007 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F-test rejects the restriction at the 10 percent significance level (p ≈ 0.06), but not at the 5 percent level. Hence, there is only weak evidence against common pre-treatment trends.

# Calculating the 5 percent critical value of the F distribution with the quantile function.
# The numerator df is 31 (the number of MI*month interactions) and the denominator df is
# the residual df of the unrestricted model, 44150 (not the RSS).

qf(0.95, 31, 44150)  # approximately 1.45

The test has 31 numerator degrees of freedom, one for each MI × month interaction. Our F-statistic of 1.4221 does not exceed the 5 percent critical value of roughly 1.45 (consistent with the p-value of 0.06), so we cannot reject common pre-treatment trends at the 5 percent level, only at the 10 percent level.

6. Finally, we compare Michigan with Iowa and Michigan with Pennsylvania separately. We denote: A: MI & Iowa; B: MI & Pennsylvania.

# Data A
dataA = data0 %>% filter(state != 42)

# Estimating A
mod4A = lm(the_formula4, data = dataA)
mod5A = lm(the_formula5, data = dataA)
testA = anova(mod4A, mod5A)
# Data B
dataB = data0 %>% filter(state != 19)

# Estimating B
mod4B = lm(the_formula4, data = dataB)
mod5B = lm(the_formula5, data = dataB)
testB = anova(mod4B, mod5B)
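
The stored test objects are not printed above; a small sketch for inspecting the two pre-trend tests:

testA$`Pr(>F)`[2]  # p-value of the pre-trend F-test, MI vs Iowa
testB$`Pr(>F)`[2]  # p-value of the pre-trend F-test, MI vs Pennsylvania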

Why can we not reject the null in these pairwise comparisons? One plausible reason is that the smaller samples make the estimates less precise, so the tests have less power. Perhaps more data would help to clarify whether the trends differ.