Suppose we wish to understand the impact of a program on an outcome Y. For example: does an increase in wages increase a worker's productivity?

One of the best ways to measure how much of an outcome can be attributed to a decision or a program is the difference-in-differences (DiD) estimator. DiD attempts to mimic an experimental design. The researcher takes two groups measured at two points in time, e.g. baseline and endline. One group is exposed to a treatment while the other group is not. The treatment effect is then estimated by comparing the change in the outcome variable between the two points in time across the two groups.

In my example, I generated dummy data for a vocational youth training program. The program offered youths business and life skills. At the end of one year, I wish to know whether the program had an effect on the income of the youths who attended the training. One way to measure the causal effect is a Before-After evaluation, where you use the same people (the beneficiaries of the program) and attribute any increase in their income to the program. However, this is not ideal because many other factors might be at play, e.g. a boom in the economy that creates temporary job opportunities. Another way would be to design a randomized controlled trial (RCT) with measurements at baseline and endline. RCTs avoid confounding and isolate the treatment effect. However, RCTs are sometimes difficult to implement, and ethical issues are often raised. Without RCTs, other alternatives are used, e.g. DiD, regression discontinuity, instrumental variables, propensity score matching, etc.
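To see why a Before-After comparison can mislead, here is a minimal sketch on simulated data (not the training data used in this post). The sample size, the economy-wide income rise of 500 and the program effect of 2000 are assumptions chosen purely for illustration.

```r
set.seed(1)
n      <- 200    # hypothetical number of youths per group and period
trend  <- 500    # assumed economy-wide rise in income (e.g. a boom)
effect <- 2000   # assumed true program effect

# One row per person, period and group; income gets the trend at endline
# and the program effect only for attendees at endline
sim <- expand.grid(id = 1:n, Time = 0:1, Attend.Training = 0:1)
sim$Income <- 6000 + trend * sim$Time +
  effect * sim$Time * sim$Attend.Training +
  rnorm(nrow(sim), sd = 300)

# Mean income in one group-period cell
cell <- function(time, grp) {
  mean(sim$Income[sim$Time == time & sim$Attend.Training == grp])
}

# Before-After on attendees only: picks up trend + effect
before_after <- cell(1, 1) - cell(0, 1)

# DiD: subtract the change in the comparison group to remove the trend
did <- before_after - (cell(1, 0) - cell(0, 0))

before_after
did
```

In this simulation the Before-After figure lands near the trend plus the effect, while the DiD figure lands near the effect alone, because the comparison group's change absorbs the common trend.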


DiD Model

The DiD model can be written as the following OLS equation: \[\begin{equation} Y_{i}=\alpha+\beta T_{i}+\gamma t_{i}+\delta (T_{i}\cdot t_{i})+\epsilon_{i} \end{equation}\]

Where:
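\(T_{i}\)= Treatment indicator (1 if in the treatment group, 0 if in the control group)
\(t_{i}\)= Time indicator (1 at endline, 0 at baseline)
\(\alpha\)= Constant term (mean outcome of the control group at baseline)
\(\beta\)= Treatment group effect (baseline difference between the treatment and control groups)
\(\gamma\)= Time trend common to the control and treatment groups
\(\delta\)= True effect of the treatment (the DiD estimate)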
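
To see why \(\delta\) is the quantity of interest, it may help to write out the expected outcome in each of the four group-period cells implied by the equation above. This is a short derivation, assuming the error term has mean zero within each cell:

\[\begin{aligned}
E[Y_{i}\mid T_{i}=0,\,t_{i}=0] &= \alpha && \text{(control, baseline)} \\
E[Y_{i}\mid T_{i}=0,\,t_{i}=1] &= \alpha+\gamma && \text{(control, endline)} \\
E[Y_{i}\mid T_{i}=1,\,t_{i}=0] &= \alpha+\beta && \text{(treatment, baseline)} \\
E[Y_{i}\mid T_{i}=1,\,t_{i}=1] &= \alpha+\beta+\gamma+\delta && \text{(treatment, endline)}
\end{aligned}\]

The change over time is \(\gamma+\delta\) in the treatment group and \(\gamma\) in the control group, so the difference between the two changes is \((\gamma+\delta)-\gamma=\delta\).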

From my data, I wish to know whether youths who attended the programme experienced a larger increase in income than those who did not attend.

We can first load the data and necessary packages.

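library(readxl)     # read_excel() for importing the .xlsx file
library(tidyverse)  # also attaches dplyr, so a separate library(dplyr) call is not needed
data <- read_excel("D:/My projects/data.xlsx")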

We then display the first 10 cases. The data has three variables: Time, Attend.Training and Income. Time is binary (0, 1), where 0 is baseline and 1 is the end of one year. Attend.Training is also binary, where 0 is No and 1 is Yes. Income is continuous.

head(data,10)
## # A tibble: 10 x 3
##     Time Attend.Training Income
##    <dbl>           <dbl>  <dbl>
##  1     1               1   8000
##  2     1               1  10500
##  3     1               1   8000
##  4     0               0   7000
##  5     0               0   6000
##  6     1               0   6000
##  7     1               0   6000
##  8     1               1   8000
##  9     1               1   6000
## 10     0               1   5500

We create an interaction term between the time indicator and training attendance. We will call this interaction DiD.

data$DiD <- data$Time * data$Attend.Training  # interaction term, stored as a column of data
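
As an alternative sketch, R's formula interface can build the interaction itself: `Time * Attend.Training` expands to both main effects plus the `Time:Attend.Training` interaction. The data frame `toy` below is simulated stand-in data (the column names mirror the ones in this post, but the values are made up), so its coefficients will not match the real output.

```r
set.seed(42)

# Hypothetical balanced design: 10 observations in each group-period cell
toy <- data.frame(
  Time            = rep(c(0, 1), each = 20),
  Attend.Training = rep(c(0, 1), times = 20)
)
toy$Income <- 6000 + 100 * toy$Time + 150 * toy$Attend.Training +
  2000 * toy$Time * toy$Attend.Training + rnorm(nrow(toy), sd = 500)

# Manual interaction term, as in the text
toy$DiD <- toy$Time * toy$Attend.Training
fit_manual <- lm(Income ~ Time + Attend.Training + DiD, data = toy)

# Same model, letting the formula operator create the interaction
fit_formula <- lm(Income ~ Time * Attend.Training, data = toy)

# The two interaction estimates are the same number under different names
coef(fit_manual)["DiD"]
coef(fit_formula)["Time:Attend.Training"]
```

Either form fits the same model; the manual column simply gives the coefficient a shorter label in the output.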

Now run an OLS model with income as the dependent variable and time, training attendance and the interaction term as independent variables. Based on the regression output, the DiD estimate is 2387.6 and is significant at the 1% level (p-value = 0.00108). In other words, over the one-year period the income of those who attended the training rose by about KES 2,388 more than the income of those who did not attend. Reading this as the effect of the program assumes that, without the training, both groups would have followed the same income trend.

fit <- lm(Income ~ Time + Attend.Training + DiD, data = data)
summary(fit)
## 
## Call:
## lm(formula = Income ~ Time + Attend.Training + DiD, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2833.3  -773.8  -214.3   476.2  3166.7 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       5714.3      354.8  16.107  < 2e-16 ***
## Time               119.0      493.3   0.241  0.81019    
## Attend.Training    152.4      493.3   0.309  0.75856    
## DiD               2387.6      691.6   3.452  0.00108 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1327 on 55 degrees of freedom
## Multiple R-squared:  0.4325, Adjusted R-squared:  0.4016 
## F-statistic: 13.97 on 3 and 55 DF,  p-value: 6.902e-07
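
The coefficients can also be read back as group means, which is a useful sanity check. The sketch below simply re-types the estimates printed above (it does not refit the model), rebuilds the four cell means and takes the double difference. The variable names are my own, and the 95% interval is an approximation based on the reported standard error and residual degrees of freedom.

```r
# Estimates copied from the summary(fit) output above
b0     <- 5714.3   # (Intercept): control group at baseline
b_time <- 119.0    # Time
b_trt  <- 152.4    # Attend.Training
b_did  <- 2387.6   # DiD
se_did <- 691.6    # standard error of DiD
df_res <- 55       # residual degrees of freedom

# Implied mean income in each group-period cell
control_base <- b0
control_end  <- b0 + b_time
treated_base <- b0 + b_trt
treated_end  <- b0 + b_time + b_trt + b_did

# Change among attendees minus change among non-attendees
did <- (treated_end - treated_base) - (control_end - control_base)
did

# Approximate 95% confidence interval for the DiD coefficient
ci <- b_did + c(-1, 1) * qt(0.975, df_res) * se_did
ci
```

The double difference reproduces the DiD coefficient, as expected. With the fitted model at hand, `confint(fit, "DiD")` should return the corresponding interval directly.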