library(MatchIt)
library(tidyverse)
library(stargazer)
library(ggeffects)
data(lalonde)

The following exercise assumes that age and re75 confound the relationship between the treatment and the dependent variable. If that is the case, mean age and mean re75 should differ between the treatment and the control group, so I check this first. Because age and re75 are measured on different scales, they cannot be compared in the same graph as they are. I therefore standardize both variables to have mean 0 and standard deviation 1. The first step is to record the mean and standard deviation of each variable in the full sample, so that the same values can be applied to both the original and the matched dataset.

Age.Mean=mean(lalonde$age)
Age.SD=sd(lalonde$age)
re75.Mean=mean(lalonde$re75)
re75.SD=sd(lalonde$re75)
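As a quick sanity check on the standardization, here is a minimal sketch showing that the hand-computed z-score of age has mean 0 and standard deviation 1 in the full sample. The name `age_z` is only for illustration and is not used elsewhere in this exercise.

```r
library(MatchIt)
data("lalonde", package = "MatchIt")

# z-score by hand, using the full-sample mean and SD
age_z <- (lalonde$age - mean(lalonde$age)) / sd(lalonde$age)

# Mean and SD of the standardized variable (0 and 1 up to rounding error)
round(c(mean = mean(age_z), sd = sd(age_z)), 10)

# Base R's scale() applies the same centering and scaling
all.equal(age_z, as.numeric(scale(lalonde$age)))
```

Note that the mean is 0 only in the full sample. Within the treatment and control groups the standardized means will generally differ from 0, and that difference is what the plots compare.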

Now I will calculate the mean and the confidence intervals for age and re75 based on the treatment status.

Prematch = lalonde %>% 
  select(treat, age, re75) %>% 
  mutate(age = (age - Age.Mean) / Age.SD,
         re75 = (re75 - re75.Mean) / re75.SD) %>% # Step 1: Standardize the variables
  pivot_longer(
    cols = !treat, # Reshape every column except the treatment indicator
    names_to = "Variable",
    values_to = "Value") %>% 
  group_by(treat, Variable) %>% # Group by treatment status and variable name
  summarize(Mean = mean(Value), # Step 2: Calculate the mean and the lower and upper 95% confidence limits
            LCI = mean(Value) - 1.96 * plotrix::std.error(Value), 
            HCI = mean(Value) + 1.96 * plotrix::std.error(Value),
            .groups = "drop") %>% # Drop the grouping so later steps work on a plain data frame
  mutate(type = "1. Pre-Matched Data")
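The interval used above is the usual normal approximation: the mean plus or minus 1.96 standard errors. The sketch below reproduces it in base R for the age of the treated group, assuming no missing values. The helper `se()` is a hypothetical name introduced here. To my understanding, `plotrix::std.error()` returns the same quantity for a complete numeric vector.

```r
library(MatchIt)
data("lalonde", package = "MatchIt")

# Hypothetical helper: standard error of the mean (sample SD over sqrt(n))
se <- function(x) sd(x) / sqrt(length(x))

# Raw (unstandardized) age of the treated units
treated_age <- lalonde$age[lalonde$treat == 1]

# Mean with lower and upper 95% confidence limits
ci <- c(Mean = mean(treated_age),
        LCI  = mean(treated_age) - 1.96 * se(treated_age),
        HCI  = mean(treated_age) + 1.96 * se(treated_age))
ci
```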

Let’s see if we need to match:

ggplot(data=Prematch) +
  geom_linerange(aes(x=Variable, ymin=LCI, ymax=HCI,color=factor(treat)), linewidth=1,position = position_dodge(width = 0.1))+
  geom_point(size=2, aes(x=Variable,y=Mean,color=factor(treat)),position = position_dodge(width = 0.1))+
  coord_flip()+
  theme_bw()+
  labs(title="Covariate Balance", x="", y="Mean Standardized Value", color="Treatment")

As we can see, the confidence intervals for re75 do not overlap. People with lower past income are more likely to receive the treatment than those with higher past income. This is a problem because past income (re75) is related to treatment status and is also likely to affect current income, so a naive comparison of the two groups would give a biased estimate of the treatment effect. Similarly, the confidence intervals for age do not overlap: younger people are more likely to receive the treatment than older people. To fix these problems, the next step is to run the matching model and get the matched dataset.

match_model = matchit(treat ~ age  + re75,
                      method="nearest",
                      data = lalonde)
newdata = match.data(match_model)
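Before plotting, the matching result can also be inspected directly. The following is a minimal sketch using MatchIt's `summary()` method on the same model specification. The object names `balance` and `matched` are introduced here for illustration only.

```r
library(MatchIt)
data("lalonde", package = "MatchIt")

# Same specification as above: 1:1 nearest-neighbor matching on age and re75
match_model <- matchit(treat ~ age + re75, method = "nearest", data = lalonde)

# Balance statistics for the full sample and for the matched sample
balance <- summary(match_model)
balance

# Sample sizes: all, matched, unmatched, and discarded units
balance$nn

# With 1:1 matching, the matched data holds equally many treated and control units
matched <- match.data(match_model)
table(matched$treat)
```

The standardized mean differences reported by `summary()` are a numeric counterpart to the balance plots in this exercise.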

Now I will calculate the mean and the confidence intervals for age and re75 based on the treatment status in the matched data as follows:

PostMatch = newdata %>% 
  select(treat, age, re75) %>% 
  mutate(age = (age - Age.Mean) / Age.SD,
         re75 = (re75 - re75.Mean) / re75.SD) %>% # Step 1: Standardize the variables so that they can be shown in the same graph
  pivot_longer(
    cols = !treat, # Reshape every column except the treatment indicator
    names_to = "Variable",
    values_to = "Value") %>% 
  group_by(treat, Variable) %>% # Group by treatment status and variable name
  summarize(Mean = mean(Value), # Step 2: Calculate the mean and the lower and upper 95% confidence limits
            LCI = mean(Value) - 1.96 * plotrix::std.error(Value), 
            HCI = mean(Value) + 1.96 * plotrix::std.error(Value),
            .groups = "drop") %>% # Drop the grouping so later steps work on a plain data frame
  mutate(type = "Matched Data")

Let’s see if matching fixed the problem:

ggplot(data=PostMatch) +
  geom_linerange(aes(x=Variable, ymin=LCI, ymax=HCI,color=factor(treat)), linewidth=1,position = position_dodge(width = 0.1))+
  geom_point(size=2, aes(x=Variable,y=Mean,color=factor(treat)),position = position_dodge(width = 0.1))+
  coord_flip()+
  theme_bw()+
  labs(title="Covariate Balance", x="", y="Mean Standardized Value", color="Treatment")

Yes, it did. The confidence intervals now overlap, and there is no statistically detectable difference in age or re75 between the treatment and the control group.
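To compare the two situations at a glance, both summaries can be stacked and drawn in one faceted plot. This is a sketch under a few assumptions: `balance_summary()` and `both` are hypothetical names introduced here, the standard error is computed in base R rather than with plotrix, the panel label "2. Matched Data" is my own choice so that the panels sort in order, and `linewidth` assumes ggplot2 3.4.0 or later.

```r
library(MatchIt)
library(tidyverse)
data("lalonde", package = "MatchIt")

# Hypothetical helper: standardize with the full-sample mean and SD,
# then summarize each variable by treatment status.
balance_summary <- function(df, label, ref = lalonde) {
  df %>%
    select(treat, age, re75) %>%
    mutate(age  = (age  - mean(ref$age))  / sd(ref$age),
           re75 = (re75 - mean(ref$re75)) / sd(ref$re75)) %>%
    pivot_longer(cols = !treat, names_to = "Variable", values_to = "Value") %>%
    group_by(treat, Variable) %>%
    summarize(Mean = mean(Value),
              SE   = sd(Value) / sqrt(n()),
              .groups = "drop") %>%
    mutate(LCI  = Mean - 1.96 * SE,
           HCI  = Mean + 1.96 * SE,
           type = label)
}

match_model <- matchit(treat ~ age + re75, method = "nearest", data = lalonde)

# Stack the pre-match and post-match summaries
both <- bind_rows(
  balance_summary(lalonde, "1. Pre-Matched Data"),
  balance_summary(match.data(match_model), "2. Matched Data")
)

# One panel per dataset, same axes in both panels
ggplot(both, aes(x = Variable, color = factor(treat))) +
  geom_linerange(aes(ymin = LCI, ymax = HCI), linewidth = 1,
                 position = position_dodge(width = 0.1)) +
  geom_point(aes(y = Mean), size = 2,
             position = position_dodge(width = 0.1)) +
  coord_flip() +
  facet_wrap(~type) +
  theme_bw() +
  labs(title = "Covariate Balance", x = "",
       y = "Mean Standardized Value", color = "Treatment")
```

Because both panels share the same standardization and the same axis, any change in the gap between the treatment and control group is directly comparable across them.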