Propensity Score Matching for Early Childhood Longitudinal Study

Required Libraries

#install.packages(c("MatchIt","dplyr","ggplot2"))
#install.packages("skimr")
library(MatchIt)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(skimr)

Setting work directory

setwd("C:/Users/charl/OneDrive/Desktop/Data Analysis For Social Science in R")

Loading Data into R

library(readr)
Labdata <- read_csv("Labdata.csv")

## New names:
## Rows: 614 Columns: 10
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (2): ...1, race dbl (8): treat, age, educ, married, nodegree, re74, re75, re78
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`

View(Labdata)
dim(Labdata)

## [1] 614  10

names(Labdata)

##  [1] "...1"     "treat"    "age"      "educ"     "race"     "married" 
##  [7] "nodegree" "re74"     "re75"     "re78"

Data Description

The initial dataset used for this labwork consist of 10 columns and 614 observations.There are 2 character type variables and 8 numeric variables in the dataset.The mean age reported from the initial dataset is 27.36 and a standard devaition of 9.8811872.As shown below, there are no missing cases within the initial dataset. Below is a detailed summary generated using the R statistical software.

skim(Labdata)

Data summary
Name	Labdata
Number of rows	614
Number of columns	10
_______________________
Column type frequency:
character	2
numeric	8
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
…1	0	1	4	7	0	614	0
race	0	1	5	6	0	3	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
treat	1	0.30	0.46	0	0.00	0.00	1.00	1.00	▇▁▁▁▃
age	1	27.36	9.88	16	20.00	25.00	32.00	55.00	▇▅▂▂▁
educ	1	10.27	2.63	0	9.00	11.00	12.00	18.00	▁▂▆▇▁
married	1	0.42	0.49	0	0.00	0.00	1.00	1.00	▇▁▁▁▆
nodegree	1	0.63	0.48	0	0.00	1.00	1.00	1.00	▅▁▁▁▇
re74	1	4557.55	6477.96	0	0.00	1042.33	7888.50	35040.07	▇▂▁▁▁
re75	1	2184.94	3295.68	0	0.00	601.55	3248.99	25142.24	▇▁▁▁▁
re78	1	6792.83	7470.73	0	238.28	4759.02	10893.59	60307.93	▇▂▁▁▁

Exploratory Data Analysis

Visualization of the the variables such as treat, age ,married, nodegree and other variables are conducted to see how they are distributed.

##treat and race
ggplot(Labdata, aes(x=treat, fill= race )) + geom_bar()

##age
ggplot(Labdata, aes(x=age )) + geom_bar()

##married
ggplot(Labdata, aes(x=married )) + geom_bar()

## nodegree
ggplot(Labdata, aes(x=nodegree, fill= race )) + geom_bar()

### re74
ggplot(Labdata, aes(x=re74)) + geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

### re75
ggplot(Labdata, aes(x=re75)) + geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

### re78
ggplot(Labdata, aes(x=re78)) + geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## race
Labdata |> ggplot(aes(race)) + geom_bar()

Data Wrangling

## Change the raw data into a wide data frame based on "race"
Labdata$white <- ifelse(Labdata$race == 'white', 1, 0)
Labdata$black <- ifelse(Labdata$race == 'black', 1, 0)
Labdata$hispan <-ifelse(Labdata$race == 'hispan', 1, 0)

Labdata <- subset(Labdata,select=-c(race))

colnames(Labdata)[1]= "X"
head(Labdata)

## # A tibble: 6 × 12
##   X     treat   age  educ married nodegree  re74  re75   re78 white black hispan
##   <chr> <dbl> <dbl> <dbl>   <dbl>    <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>  <dbl>
## 1 NSW1      1    37    11       1        1     0     0  9930.     0     1      0
## 2 NSW2      1    22     9       0        1     0     0  3596.     0     0      1
## 3 NSW3      1    30    12       0        0     0     0 24909.     0     1      0
## 4 NSW4      1    27    11       0        1     0     0  7506.     0     1      0
## 5 NSW5      1    33     8       0        1     0     0   290.     0     1      0
## 6 NSW6      1    22     9       0        1     0     0  4056.     0     1      0

skim(Labdata)

Data summary
Name	Labdata
Number of rows	614
Number of columns	12
_______________________
Column type frequency:
character	1
numeric	11
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
X	0	1	4	7	0	614	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
treat	1	0.30	0.46	0	0.00	0.00	1.00	1.00	▇▁▁▁▃
age	1	27.36	9.88	16	20.00	25.00	32.00	55.00	▇▅▂▂▁
educ	1	10.27	2.63	0	9.00	11.00	12.00	18.00	▁▂▆▇▁
married	1	0.42	0.49	0	0.00	0.00	1.00	1.00	▇▁▁▁▆
nodegree	1	0.63	0.48	0	0.00	1.00	1.00	1.00	▅▁▁▁▇
re74	1	4557.55	6477.96	0	0.00	1042.33	7888.50	35040.07	▇▂▁▁▁
re75	1	2184.94	3295.68	0	0.00	601.55	3248.99	25142.24	▇▁▁▁▁
re78	1	6792.83	7470.73	0	238.28	4759.02	10893.59	60307.93	▇▂▁▁▁
white	1	0.49	0.50	0	0.00	0.00	1.00	1.00	▇▁▁▁▇
black	1	0.40	0.49	0	0.00	0.00	1.00	1.00	▇▁▁▁▅
hispan	1	0.12	0.32	0	0.00	0.00	0.00	1.00	▇▁▁▁▁

Pre-analysis using non-matched data with boxplot or summary

## Standardizing re78 (outcome variable)
Labdata$re78_std <- scale(Labdata$re78)
head(Labdata)

## # A tibble: 6 × 13
##   X     treat   age  educ married nodegree  re74  re75   re78 white black hispan
##   <chr> <dbl> <dbl> <dbl>   <dbl>    <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>  <dbl>
## 1 NSW1      1    37    11       1        1     0     0  9930.     0     1      0
## 2 NSW2      1    22     9       0        1     0     0  3596.     0     0      1
## 3 NSW3      1    30    12       0        0     0     0 24909.     0     1      0
## 4 NSW4      1    27    11       0        1     0     0  7506.     0     1      0
## 5 NSW5      1    33     8       0        1     0     0   290.     0     1      0
## 6 NSW6      1    22     9       0        1     0     0  4056.     0     1      0
## # … with 1 more variable: re78_std <dbl[,1]>

boxplot(re78_std ~ treat, data= Labdata,
        xlab= "treat",
        ylab= "Standardized Real Earning in 78")

Labdata %>%
  group_by(treat) %>%
  summarise(treatment_status= n(),
            mean_re78 = mean(re78_std),
            std_error = sd(re78_std)/ sqrt(treatment_status))

## # A tibble: 2 × 4
##   treat treatment_status mean_re78 std_error
##   <dbl>            <int>     <dbl>     <dbl>
## 1     0              429    0.0256    0.0471
## 2     1              185   -0.0594    0.0774

Labdata %>%  
  mutate(test= (re78 - mean(re78)) / sd(re78)) %>%
  group_by(treat) %>%
  summarise(mean_re78 = mean(test))

## # A tibble: 2 × 2
##   treat mean_re78
##   <dbl>     <dbl>
## 1     0    0.0256
## 2     1   -0.0594

#Use t.test to check the difference-in-means: pre-treatment #covariates (confounding variables) with treatment (variable)

with(Labdata, t.test(re78_std ~ treat))

## 
##  Welch Two Sample t-test
## 
## data:  re78_std by treat
## t = 0.93773, df = 326.41, p-value = 0.3491
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -0.09332313  0.26332690
## sample estimates:
## mean in group 0 mean in group 1 
##      0.02561132     -0.05939057

Labdata_cov <- c('age','educ','black','hispan','married','nodegree')

##Difference in mean pre-treatment covariates
library(dplyr)
Labdata %>%
  group_by(treat) %>%
select(one_of(Labdata_cov)) %>%
  summarise_all(funs(mean(., na.rm = T)))

## Adding missing grouping variables: `treat`

## Warning: `funs()` was deprecated in dplyr 0.8.0.
## ℹ Please use a list of either functions or lambdas:
## 
## # Simple named list: list(mean = mean, median = median)
## 
## # Auto named with `tibble::lst()`: tibble::lst(mean, median)
## 
## # Using lambdas list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))

## # A tibble: 2 × 7
##   treat   age  educ black hispan married nodegree
##   <dbl> <dbl> <dbl> <dbl>  <dbl>   <dbl>    <dbl>
## 1     0  28.0  10.2 0.203 0.142    0.513    0.597
## 2     1  25.8  10.3 0.843 0.0595   0.189    0.708

#outcome variable with treatment (variable)
with(Labdata, t.test(re78_std ~ treat))

## 
##  Welch Two Sample t-test
## 
## data:  re78_std by treat
## t = 0.93773, df = 326.41, p-value = 0.3491
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -0.09332313  0.26332690
## sample estimates:
## mean in group 0 mean in group 1 
##      0.02561132     -0.05939057

From the above statistical output, it can be suggested that the p-values is greater than 0.05 and not close to zero , hence making the null hypothesis not to be rejected. This means that the mean of standard real earning is not significantly different.

#pre-treatment covariates (confounding variables) with treatment variable
with(Labdata, t.test(age~ treat))

## 
##  Welch Two Sample t-test
## 
## data:  age by treat
## t = 2.9911, df = 510.57, p-value = 0.002914
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  0.7598127 3.6683610
## sample estimates:
## mean in group 0 mean in group 1 
##        28.03030        25.81622

with(Labdata, t.test(educ~ treat))

## 
##  Welch Two Sample t-test
## 
## data:  educ by treat
## t = -0.54676, df = 485.37, p-value = 0.5848
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -0.5076687  0.2866393
## sample estimates:
## mean in group 0 mean in group 1 
##        10.23543        10.34595

with(Labdata, t.test(black~ treat))

## 
##  Welch Two Sample t-test
## 
## data:  black by treat
## t = -19.344, df = 382.86, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -0.7055419 -0.5753502
## sample estimates:
## mean in group 0 mean in group 1 
##       0.2027972       0.8432432

with(Labdata, t.test(hispan~ treat))

## 
##  Welch Two Sample t-test
## 
## data:  hispan by treat
## t = 3.4091, df = 501.34, p-value = 0.0007042
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  0.03505288 0.13041048
## sample estimates:
## mean in group 0 mean in group 1 
##      0.14219114      0.05945946

with(Labdata, t.test(nodegree~ treat))

## 
##  Welch Two Sample t-test
## 
## data:  nodegree by treat
## t = -2.7127, df = 374.01, p-value = 0.006982
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -0.19210040 -0.03064263
## sample estimates:
## mean in group 0 mean in group 1 
##       0.5967366       0.7081081

with(Labdata, t.test(married~ treat))

## 
##  Welch Two Sample t-test
## 
## data:  married by treat
## t = 8.5961, df = 439.29, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  0.2496379 0.3976248
## sample estimates:
## mean in group 0 mean in group 1 
##       0.5128205       0.1891892

##Use glm() to estimate propensity score (probability)
m_ps <- glm(treat ~ age + educ + black + hispan + nodegree + married, family=binomial(), data= Labdata)

summary(m_ps)

## 
## Call:
## glm(formula = treat ~ age + educ + black + hispan + nodegree + 
##     married, family = binomial(), data = Labdata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7709  -0.4606  -0.2963   0.7766   2.6384  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.67874    1.02120  -4.582 4.61e-06 ***
## age          0.01030    0.01329   0.775  0.43857    
## educ         0.15161    0.06568   2.308  0.02098 *  
## black        3.12657    0.28514  10.965  < 2e-16 ***
## hispan       0.99947    0.42191   2.369  0.01784 *  
## nodegree     0.78719    0.33507   2.349  0.01881 *  
## married     -0.92969    0.27128  -3.427  0.00061 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 751.49  on 613  degrees of freedom
## Residual deviance: 494.70  on 607  degrees of freedom
## AIC: 508.7
## 
## Number of Fisher Scoring iterations: 5

prs_df <- data.frame(pr_score = predict(m_ps, type= "response"),
                     treat = m_ps$model$treat)

prs_df### propensity score and the treatment status

##       pr_score treat
## 1   0.58760225     1
## 2   0.21399311     1
## 3   0.64016828     1
## 4   0.76508840     1
## 5   0.68734337     1
## 6   0.69552346     1
## 7   0.62340207     1
## 8   0.77421538     1
## 9   0.75029124     1
## 10  0.03079057     1
## 11  0.68894224     1
## 12  0.65362946     1
## 13  0.65323034     1
## 14  0.52484866     1
## 15  0.61570543     1
## 16  0.72046942     1
## 17  0.66748129     1
## 18  0.72868884     1
## 19  0.66353055     1
## 20  0.63062650     1
## 21  0.75760489     1
## 22  0.10583814     1
## 23  0.10567147     1
## 24  0.75949086     1
## 25  0.71839092     1
## 26  0.56750619     1
## 27  0.76136678     1
## 28  0.25012930     1
## 29  0.71630304     1
## 30  0.75949086     1
## 31  0.71630304     1
## 32  0.58315483     1
## 33  0.55736985     1
## 34  0.61612236     1
## 35  0.62822468     1
## 36  0.73162583     1
## 37  0.56230268     1
## 38  0.39515851     1
## 39  0.49797505     1
## 40  0.73071983     1
## 41  0.61855490     1
## 42  0.08855667     1
## 43  0.65089413     1
## 44  0.08139694     1
## 45  0.75188687     1
## 46  0.55736985     1
## 47  0.65089413     1
## 48  0.68451172     1
## 49  0.56230268     1
## 50  0.62340207     1
## 51  0.67617353     1
## 52  0.57255379     1
## 53  0.74802499     1
## 54  0.62822468     1
## 55  0.57003181     1
## 56  0.71630304     1
## 57  0.72721338     1
## 58  0.33938497     1
## 59  0.51829034     1
## 60  0.11948986     1
## 61  0.65089413     1
## 62  0.67842415     1
## 63  0.75241758     1
## 64  0.76508840     1
## 65  0.53496517     1
## 66  0.70844720     1
## 67  0.76508840     1
## 68  0.06956293     1
## 69  0.38385095     1
## 70  0.67203638     1
## 71  0.70631574     1
## 72  0.70939834     1
## 73  0.58177011     1
## 74  0.62822468     1
## 75  0.73308619     1
## 76  0.12277845     1
## 77  0.62098150     1
## 78  0.69333841     1
## 79  0.78828971     1
## 80  0.75570889     1
## 81  0.62822468     1
## 82  0.61123991     1
## 83  0.07812550     1
## 84  0.66748129     1
## 85  0.67391473     1
## 86  0.78483229     1
## 87  0.19118372     1
## 88  0.76323263     1
## 89  0.06641267     1
## 90  0.66937289     1
## 91  0.57255379     1
## 92  0.71630304     1
## 93  0.76136678     1
## 94  0.61855490     1
## 95  0.79508211     1
## 96  0.06901255     1
## 97  0.68673124     1
## 98  0.66811285     1
## 99  0.73274119     1
## 100 0.22999382     1
## 101 0.73071983     1
## 102 0.73475288     1
## 103 0.76136678     1
## 104 0.74802499     1
## 105 0.74996090     1
## 106 0.73929621     1
## 107 0.66748129     1
## 108 0.68451172     1
## 109 0.77059500     1
## 110 0.52228006     1
## 111 0.69114465     1
## 112 0.20546062     1
## 113 0.61612236     1
## 114 0.74802499     1
## 115 0.40504435     1
## 116 0.06641267     1
## 117 0.63302188     1
## 118 0.61612236     1
## 119 0.72046942     1
## 120 0.62340207     1
## 121 0.70454221     1
## 122 0.71839092     1
## 123 0.68894224     1
## 124 0.03358729     1
## 125 0.11522373     1
## 126 0.48696898     1
## 127 0.62098150     1
## 128 0.51456773     1
## 129 0.16478070     1
## 130 0.75188687     1
## 131 0.68451172     1
## 132 0.76693406     1
## 133 0.55990876     1
## 134 0.75188687     1
## 135 0.55482794     1
## 136 0.71478659     1
## 137 0.03319803     1
## 138 0.71839092     1
## 139 0.63779287     1
## 140 0.12389182     1
## 141 0.69986714     1
## 142 0.62822468     1
## 143 0.73071983     1
## 144 0.71536662     1
## 145 0.07308941     1
## 146 0.74996090     1
## 147 0.65555919     1
## 148 0.76508840     1
## 149 0.55990876     1
## 150 0.72253851     1
## 151 0.73874713     1
## 152 0.62581650     1
## 153 0.65555919     1
## 154 0.62340207     1
## 155 0.52555857     1
## 156 0.66289466     1
## 157 0.68673124     1
## 158 0.61855490     1
## 159 0.73675488     1
## 160 0.66019446     1
## 161 0.69552346     1
## 162 0.54008477     1
## 163 0.51713946     1
## 164 0.74072962     1
## 165 0.63779287     1
## 166 0.10158152     1
## 167 0.12364229     1
## 168 0.71478659     1
## 169 0.51199523     1
## 170 0.69333841     1
## 171 0.71630304     1
## 172 0.40256532     1
## 173 0.21054951     1
## 174 0.72046942     1
## 175 0.73475288     1
## 176 0.76693406     1
## 177 0.07155338     1
## 178 0.77601032     1
## 179 0.06705399     1
## 180 0.25401196     1
## 181 0.42001629     1
## 182 0.47455702     1
## 183 0.50756003     1
## 184 0.46969468     1
## 185 0.57758632     1
## 186 0.02988193     0
## 187 0.02871074     0
## 188 0.05092178     0
## 189 0.06174451     0
## 190 0.48182539     0
## 191 0.51270652     0
## 192 0.03048476     0
## 193 0.61612236     0
## 194 0.11244981     0
## 195 0.05197976     0
## 196 0.02555614     0
## 197 0.04044053     0
## 198 0.03703217     0
## 199 0.02958486     0
## 200 0.46686071     0
## 201 0.03263214     0
## 202 0.21923502     0
## 203 0.04037220     0
## 204 0.03821442     0
## 205 0.04325918     0
## 206 0.11633379     0
## 207 0.06697793     0
## 208 0.03172595     0
## 209 0.50054936     0
## 210 0.06056214     0
## 211 0.02983088     0
## 212 0.51013354     0
## 213 0.07512548     0
## 214 0.03767030     0
## 215 0.02958486     0
## 216 0.18516118     0
## 217 0.04531759     0
## 218 0.02899929     0
## 219 0.04129022     0
## 220 0.02871074     0
## 221 0.02899929     0
## 222 0.03666671     0
## 223 0.02842497     0
## 224 0.04295187     0
## 225 0.04949892     0
## 226 0.39270005     0
## 227 0.02175863     0
## 228 0.02842497     0
## 229 0.02730928     0
## 230 0.10741283     0
## 231 0.38779952     0
## 232 0.04936658     0
## 233 0.40009126     0
## 234 0.02958486     0
## 235 0.02703707     0
## 236 0.06827646     0
## 237 0.03292519     0
## 238 0.05156078     0
## 239 0.05197976     0
## 240 0.03079057     0
## 241 0.03523926     0
## 242 0.05034209     0
## 243 0.03018188     0
## 244 0.42755967     0
## 245 0.02924059     0
## 246 0.03230959     0
## 247 0.03804537     0
## 248 0.03919108     0
## 249 0.05105954     0
## 250 0.02018487     0
## 251 0.19600623     0
## 252 0.03464083     0
## 253 0.09000014     0
## 254 0.79152742     0
## 255 0.04667302     0
## 256 0.06417421     0
## 257 0.07102397     0
## 258 0.07505017     0
## 259 0.04531759     0
## 260 0.03852935     0
## 261 0.16907643     0
## 262 0.09084705     0
## 263 0.09818370     0
## 264 0.13418061     0
## 265 0.03319803     0
## 266 0.02786170     0
## 267 0.07363299     0
## 268 0.04407389     0
## 269 0.30083235     0
## 270 0.03301595     0
## 271 0.10158152     0
## 272 0.03676740     0
## 273 0.06514711     0
## 274 0.08609474     0
## 275 0.04713334     0
## 276 0.12520676     0
## 277 0.05083667     0
## 278 0.10443570     0
## 279 0.57576753     0
## 280 0.07363299     0
## 281 0.77471264     0
## 282 0.39762229     0
## 283 0.38779952     0
## 284 0.63541070     0
## 285 0.14933355     0
## 286 0.01330297     0
## 287 0.06452278     0
## 288 0.06641267     0
## 289 0.06479540     0
## 290 0.06641267     0
## 291 0.05714097     0
## 292 0.02676751     0
## 293 0.03975742     0
## 294 0.04237076     0
## 295 0.61612236     0
## 296 0.15464186     0
## 297 0.08170128     0
## 298 0.10364014     0
## 299 0.08943579     0
## 300 0.04985207     0
## 301 0.02958486     0
## 302 0.02786170     0
## 303 0.75188687     0
## 304 0.03464083     0
## 305 0.03401656     0
## 306 0.08093203     0
## 307 0.05034064     0
## 308 0.02842497     0
## 309 0.06452278     0
## 310 0.05399191     0
## 311 0.02899929     0
## 312 0.11522373     0
## 313 0.02894970     0
## 314 0.02786170     0
## 315 0.03646633     0
## 316 0.06514711     0
## 317 0.04443488     0
## 318 0.03682987     0
## 319 0.74996090     0
## 320 0.04314150     0
## 321 0.26534086     0
## 322 0.06641267     0
## 323 0.05197976     0
## 324 0.05936878     0
## 325 0.68894224     0
## 326 0.07730027     0
## 327 0.06758988     0
## 328 0.02786170     0
## 329 0.03141114     0
## 330 0.06452278     0
## 331 0.04048263     0
## 332 0.04290731     0
## 333 0.04487416     0
## 334 0.08832762     0
## 335 0.26937482     0
## 336 0.06514711     0
## 337 0.03604389     0
## 338 0.06641267     0
## 339 0.06835388     0
## 340 0.03740114     0
## 341 0.07572989     0
## 342 0.61570543     0
## 343 0.71839092     0
## 344 0.63541070     0
## 345 0.09718052     0
## 346 0.19030572     0
## 347 0.02080498     0
## 348 0.04253056     0
## 349 0.16033560     0
## 350 0.02988193     0
## 351 0.03073802     0
## 352 0.05083667     0
## 353 0.02899929     0
## 354 0.03919222     0
## 355 0.08787020     0
## 356 0.09971737     0
## 357 0.08101273     0
## 358 0.68673124     0
## 359 0.02842497     0
## 360 0.03222070     0
## 361 0.07501228     0
## 362 0.11627767     0
## 363 0.06705399     0
## 364 0.74802499     0
## 365 0.10347652     0
## 366 0.02814197     0
## 367 0.11522373     0
## 368 0.09971737     0
## 369 0.09971737     0
## 370 0.10064561     0
## 371 0.30360421     0
## 372 0.74802499     0
## 373 0.23131591     0
## 374 0.63062650     0
## 375 0.02504825     0
## 376 0.61123991     0
## 377 0.07792070     0
## 378 0.07632946     0
## 379 0.04667302     0
## 380 0.02012864     0
## 381 0.74802499     0
## 382 0.05248953     0
## 383 0.08787020     0
## 384 0.25401196     0
## 385 0.06521319     0
## 386 0.03604495     0
## 387 0.09523658     0
## 388 0.07525015     0
## 389 0.03469982     0
## 390 0.09879674     0
## 391 0.06452278     0
## 392 0.19030572     0
## 393 0.08609474     0
## 394 0.09879674     0
## 395 0.05866913     0
## 396 0.06577707     0
## 397 0.03190115     0
## 398 0.07489008     0
## 399 0.04985207     0
## 400 0.04998710     0
## 401 0.07026804     0
## 402 0.76908274     0
## 403 0.68451172     0
## 404 0.06503986     0
## 405 0.11154668     0
## 406 0.06514711     0
## 407 0.03301595     0
## 408 0.03750377     0
## 409 0.13916818     0
## 410 0.18334855     0
## 411 0.76001051     0
## 412 0.25936276     0
## 413 0.54193484     0
## 414 0.06514711     0
## 415 0.10654530     0
## 416 0.74802499     0
## 417 0.03626679     0
## 418 0.02703707     0
## 419 0.64895225     0
## 420 0.03353013     0
## 421 0.02899929     0
## 422 0.08855667     0
## 423 0.51341775     0
## 424 0.26134570     0
## 425 0.22949003     0
## 426 0.15781646     0
## 427 0.08347734     0
## 428 0.08787020     0
## 429 0.03777364     0
## 430 0.07645381     0
## 431 0.03429813     0
## 432 0.06630348     0
## 433 0.10664857     0
## 434 0.01520582     0
## 435 0.03529923     0
## 436 0.08609474     0
## 437 0.06503986     0
## 438 0.69333841     0
## 439 0.02758415     0
## 440 0.12817496     0
## 441 0.65089413     0
## 442 0.03254335     0
## 443 0.10064561     0
## 444 0.02814197     0
## 445 0.39762229     0
## 446 0.08609474     0
## 447 0.03222070     0
## 448 0.09568108     0
## 449 0.18181174     0
## 450 0.18489542     0
## 451 0.68451172     0
## 452 0.08787020     0
## 453 0.68228375     0
## 454 0.65827716     0
## 455 0.08609474     0
## 456 0.25936276     0
## 457 0.08654379     0
## 458 0.04854032     0
## 459 0.11522373     0
## 460 0.07224049     0
## 461 0.04229931     0
## 462 0.72046942     0
## 463 0.61326608     0
## 464 0.02953430     0
## 465 0.09879674     0
## 466 0.09971737     0
## 467 0.09971737     0
## 468 0.03319803     0
## 469 0.02958486     0
## 470 0.09971737     0
## 471 0.04295187     0
## 472 0.04314150     0
## 473 0.05892590     0
## 474 0.10158152     0
## 475 0.11522373     0
## 476 0.57675089     0
## 477 0.03286911     0
## 478 0.41500782     0
## 479 0.04985207     0
## 480 0.25936276     0
## 481 0.03646633     0
## 482 0.04827830     0
## 483 0.04867197     0
## 484 0.14137427     0
## 485 0.43653545     0
## 486 0.02899929     0
## 487 0.04759797     0
## 488 0.03719690     0
## 489 0.03401656     0
## 490 0.10163136     0
## 491 0.08727488     0
## 492 0.03469982     0
## 493 0.54973563     0
## 494 0.02758415     0
## 495 0.05133584     0
## 496 0.03940719     0
## 497 0.09208529     0
## 498 0.07521005     0
## 499 0.11948986     0
## 500 0.07308941     0
## 501 0.06514711     0
## 502 0.04487416     0
## 503 0.03172595     0
## 504 0.03729878     0
## 505 0.03319803     0
## 506 0.05769828     0
## 507 0.06770104     0
## 508 0.11137213     0
## 509 0.03880630     0
## 510 0.10064561     0
## 511 0.71326527     0
## 512 0.18361259     0
## 513 0.03426223     0
## 514 0.10080525     0
## 515 0.75029124     0
## 516 0.39762229     0
## 517 0.08704837     0
## 518 0.58427345     0
## 519 0.20884301     0
## 520 0.61368399     0
## 521 0.07572989     0
## 522 0.61368399     0
## 523 0.02197889     0
## 524 0.22100271     0
## 525 0.06770104     0
## 526 0.21054951     0
## 527 0.10252515     0
## 528 0.05826069     0
## 529 0.03268878     0
## 530 0.68451172     0
## 531 0.05363842     0
## 532 0.03489085     0
## 533 0.15918990     0
## 534 0.03295973     0
## 535 0.03584764     0
## 536 0.09971737     0
## 537 0.62581650     0
## 538 0.11627767     0
## 539 0.39609852     0
## 540 0.68894224     0
## 541 0.43962921     0
## 542 0.11522373     0
## 543 0.15895414     0
## 544 0.05034209     0
## 545 0.03405221     0
## 546 0.07738708     0
## 547 0.07433847     0
## 548 0.03794153     0
## 549 0.03646633     0
## 550 0.07489008     0
## 551 0.20378472     0
## 552 0.71839092     0
## 553 0.59506735     0
## 554 0.07489008     0
## 555 0.68228375     0
## 556 0.09208529     0
## 557 0.69114465     0
## 558 0.61612236     0
## 559 0.74802499     0
## 560 0.60986329     0
## 561 0.65089413     0
## 562 0.08609474     0
## 563 0.04759797     0
## 564 0.07945325     0
## 565 0.42252678     0
## 566 0.65362946     0
## 567 0.05442951     0
## 568 0.06514711     0
## 569 0.12277517     0
## 570 0.15918990     0
## 571 0.61612236     0
## 572 0.60056478     0
## 573 0.74996090     0
## 574 0.74802499     0
## 575 0.04136004     0
## 576 0.09174614     0
## 577 0.71630304     0
## 578 0.03236474     0
## 579 0.03703217     0
## 580 0.09000014     0
## 581 0.04780736     0
## 582 0.08609474     0
## 583 0.65089413     0
## 584 0.70023721     0
## 585 0.69612601     0
## 586 0.71630304     0
## 587 0.07489008     0
## 588 0.03574957     0
## 589 0.09415642     0
## 590 0.10080525     0
## 591 0.71420585     0
## 592 0.06770104     0
## 593 0.60986329     0
## 594 0.09971737     0
## 595 0.13937946     0
## 596 0.71839092     0
## 597 0.69273401     0
## 598 0.04188412     0
## 599 0.05695681     0
## 600 0.03021364     0
## 601 0.64960054     0
## 602 0.18028497     0
## 603 0.02899929     0
## 604 0.59064436     0
## 605 0.27094058     0
## 606 0.04556456     0
## 607 0.08609474     0
## 608 0.73675488     0
## 609 0.09122799     0
## 610 0.11522373     0
## 611 0.03158373     0
## 612 0.15014405     0
## 613 0.35272116     0
## 614 0.08609474     0

labs <- paste("Actual Treatment Status:", c("treat1", "treat2"))
test <- prs_df %>% 
  mutate(treat= ifelse(treat == 1, labs[1], labs[2]))
head(test,2)

##    pr_score                           treat
## 1 0.5876022 Actual Treatment Status: treat1
## 2 0.2139931 Actual Treatment Status: treat1

library(ggplot2)

labs <- paste("Actual Treat Status:", c("treat1", "treat2"))
prs_df %>% 
  mutate(treat= ifelse(treat == 1, labs[1], labs[2])) %>%
  ggplot(aes(x= pr_score)) +
  geom_histogram(color= "white") +
  facet_wrap(~treat) + 
  xlab("Probability of getting treatment status")+ 
  theme_bw()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#Use matchit() to execute a matching algorithm
Labdata_nomiss <- Labdata %>%
  select(re78_std, treat, one_of(Labdata_cov)) %>%
           na.omit()

mod_match <- matchit(treat ~ age + educ + black + hispan + nodegree + married, method = "nearest", data = Labdata_nomiss)

dta_m <- match.data(mod_match)

dim(dta_m)

## [1] 370  11

fn_bal <- function(dta, variable) {
  dta$variable <- dta[ , variable]
  dta$treat <- as.factor(dta$treat)
  ggplot(dta, aes(x= distance, y= variable, color= treat)) +
    geom_point(alpha = 0.2, size= 1.5) +
    geom_smooth(method= "loess", se= F) +
    xlab("Propensity score") +
    ylab(variable) +
    theme_bw()
}

## checking balance
library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

grid.arrange(grobs = list(
  fn_bal(as.data.frame(dta_m), "age") ,
fn_bal(as.data.frame(dta_m), "educ") + theme(legend.position = "none"),
fn_bal(as.data.frame(dta_m), "married") + theme(legend.position = "none"),
fn_bal(as.data.frame(dta_m), "black"),
fn_bal(as.data.frame(dta_m),"hispan") + theme(legend.position = "none"),
 fn_bal(as.data.frame(dta_m),"nodegree")
),
fn_bal = 4, widths= c(1,0.85)
)

## `geom_smooth()` using formula = 'y ~ x'

## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

dta_m  <- subset(dta_m,select=-c(subclass))
#Use t.test to check the difference-in-means based on the matched #data: outcome variable with treatment (variable)
dta_m %>%
  group_by(treat) %>%
  summarise_all(funs(mean))

## Warning: `funs()` was deprecated in dplyr 0.8.0.
## ℹ Please use a list of either functions or lambdas:
## 
## # Simple named list: list(mean = mean, median = median)
## 
## # Auto named with `tibble::lst()`: tibble::lst(mean, median)
## 
## # Using lambdas list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))

## # A tibble: 2 × 10
##   treat re78_std   age  educ black hispan married nodegree distance weights
##   <dbl>    <dbl> <dbl> <dbl> <dbl>  <dbl>   <dbl>    <dbl>    <dbl>   <dbl>
## 1     0  -0.126   25.1  10.4 0.470 0.227    0.205    0.686    0.361       1
## 2     1  -0.0594  25.8  10.3 0.843 0.0595   0.189    0.708    0.573       1

#Use t.test to check the difference-in-means based on the matched data
with(dta_m, t.test(re78_std ~ treat))

## 
##  Welch Two Sample t-test
## 
## data:  re78_std by treat
## t = -0.66533, df = 353.47, p-value = 0.5063
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -0.2627845  0.1299301
## sample estimates:
## mean in group 0 mean in group 1 
##     -0.12581781     -0.05939057

#lapply(Labdata_cov, function (v) {t.test(unlist(dta_m[,v]) ~unlist(dta_m[,"married"]))})

with(dta_m, t.test(age ~ treat))

## 
##  Welch Two Sample t-test
## 
## data:  age by treat
## t = -0.78123, df = 334.6, p-value = 0.4352
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -2.472045  1.066640
## sample estimates:
## mean in group 0 mean in group 1 
##        25.11351        25.81622

with(dta_m, t.test(educ ~ treat))

## 
##  Welch Two Sample t-test
## 
## data:  educ by treat
## t = 0.24865, df = 348.64, p-value = 0.8038
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -0.4108521  0.5297710
## sample estimates:
## mean in group 0 mean in group 1 
##        10.40541        10.34595

with(dta_m, t.test(black ~ treat))

## 
##  Welch Two Sample t-test
## 
## data:  black by treat
## t = -8.1932, df = 336.37, p-value = 5.38e-15
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -0.4625173 -0.2834286
## sample estimates:
## mean in group 0 mean in group 1 
##       0.4702703       0.8432432

with(dta_m, t.test(hispan ~ treat))

## 
##  Welch Two Sample t-test
## 
## data:  hispan by treat
## t = 4.7251, df = 290.46, p-value = 3.588e-06
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  0.0977695 0.2373656
## sample estimates:
## mean in group 0 mean in group 1 
##      0.22702703      0.05945946

with(dta_m, t.test(nodegree ~ treat))

## 
##  Welch Two Sample t-test
## 
## data:  nodegree by treat
## t = -0.45153, df = 367.85, p-value = 0.6519
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -0.11578506  0.07254182
## sample estimates:
## mean in group 0 mean in group 1 
##       0.6864865       0.7081081

with(dta_m, t.test(married ~ treat))

## 
##  Welch Two Sample t-test
## 
## data:  married by treat
## t = 0.39093, df = 367.65, p-value = 0.6961
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -0.06535445  0.09778688
## sample estimates:
## mean in group 0 mean in group 1 
##       0.2054054       0.1891892

The above shows 6 t-test results which reports whether the used variables lead to a significant difference in treat. From the above results only black and hispan shows significant difference in treat since their p-value is less 0.05 and very close to zero.

##Use lm() to estimate treatment effects 
lm_treat1 <- lm(re78_std ~ treat, data= dta_m)
summary(lm_treat1)

## 
## Call:
## lm(formula = re78_std ~ treat, data = dta_m)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8499 -0.7758 -0.2867  0.4345  7.2227 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -0.12582    0.07060  -1.782   0.0755 .
## treat        0.06643    0.09984   0.665   0.5063  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9602 on 368 degrees of freedom
## Multiple R-squared:  0.001201,   Adjusted R-squared:  -0.001513 
## F-statistic: 0.4427 on 1 and 368 DF,  p-value: 0.5063

The above result suggest that the p-value of treat is 0.5063, which is more than 0.05, making it statistically insignificant. There is also a report on the standard error and the t-values of the intercept and variable treat.

#(2) lm() used for pre-treatment covariates (confounding variables) with treatment (variable)
lm_treat2 <- lm(treat ~ age + educ + black + hispan + nodegree + married, data= dta_m)
summary(lm_treat2)

## 
## Call:
## lm(formula = treat ~ age + educ + black + hispan + nodegree + 
##     married, data = dta_m)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.72194 -0.29375  0.07263  0.36464  0.88678 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.031832   0.231541  -0.137    0.891    
## age          0.001949   0.002961   0.658    0.511    
## educ         0.015931   0.014858   1.072    0.284    
## black        0.417256   0.063062   6.617 1.32e-10 ***
## hispan      -0.013738   0.085281  -0.161    0.872    
## nodegree     0.085579   0.072981   1.173    0.242    
## married     -0.075270   0.063768  -1.180    0.239    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4623 on 363 degrees of freedom
## Multiple R-squared:  0.1614, Adjusted R-squared:  0.1476 
## F-statistic: 11.65 on 6 and 363 DF,  p-value: 6.134e-12

From the above statistical results, it can be suggested that only variable black is statistically significant at 0.001 level with a p-value approximately close to zero. All other variables are however deem to statisitically insignificant.

QUESTION1

besides the confounding variables(‘age’,‘educ’,‘black’,‘hispan’,‘married’,‘nodegree’,‘), can variables’ white’,‘re74’,‘re75’ be involved in the propensity score matching?

Answer:

The Inclusion of ’ white’,‘re74’,‘re75’ in the propensity score matching will hinder the chances of reducing bias in our results since propensity score matching are more effective at reducing bias with less covariates(confounding variables)

QUESTION 2

lm() is used to estimate treatment effects based on the matched data, can we use glm() instead of lm()?