#Reading in the data (haven provides read_dta/write_dta; dplyr and fastDummies are used below)
library(haven); library(dplyr); library(fastDummies)
msmoking_df <- read_dta("msmoking (2).dta")

##Part 1

###1
#####The first order of business is to go through the codebook, decide on the relevant variables, and process the data. This involves several steps:

(a) Fix missing values. In the data set several variables take on a value of, say, 9999 if missing. I have already checked for missing observations for about 2/3 of the variables. You need to check the last 15 in the variable list (i.e. from ‘cardiac’ to ‘wgain’) and fix them by replacing all missing values with dots (‘.’). Refer to the codebook for missing value codes.

Using R instead of Stata, I replaced missing values with NA rather than “.”. From the codebook, the following responses need to be assigned as missing for the remaining variables.

- 8 & 9 – medical risk factors (cardiac:preterm); in this data only herpes takes these codes
- 99 – wgain
- 9 – alcohol, tobacco
- 99 – cigar (cigarettes per day)
- 6 – cigar6 (recoded cigarettes per day)
- 99 – drink (unknown)
- 5 – drink5 (unknown)

#count value of unknowns
med_risk_df <- msmoking_df %>% select(cardiac:preterm)
sum(med_risk_df == 8)
## [1] 6
sum(med_risk_df == 9)
## [1] 0
sum(msmoking_df$tobacco == 9 )
## [1] 65
sum(msmoking_df$alcohol == 9 )
## [1] 105
sum(msmoking_df$cigar == 99)
## [1] 1077
sum(msmoking_df$drink == 99)
## [1] 2116
sum(msmoking_df$wgain == 99)
## [1] 3048
sum(msmoking_df$cigar6 == 6 )
## [1] 1077
sum(msmoking_df$drink5 == 5 )
## [1] 2116
#Replace unknowns with 'NA' to reflect as missing data
#(medical risk factors use missing codes 8 and 9; no 9s appear in this data, so only 8 is recoded)
msmoking_missing_df <- msmoking_df %>% 
  mutate(
    across(cardiac:preterm, ~ ifelse(. == 8, NA, .)),
    across(c(tobacco, alcohol), ~ ifelse(. == 9, NA, .)),
    across(c(cigar, drink, wgain), ~ ifelse(. == 99, NA, .)),
    cigar6 = ifelse(cigar6 == 6, NA, cigar6),
    drink5 = ifelse(drink5 == 5, NA, drink5)
  )
         
#Validate data change
sum(msmoking_missing_df$herpes == 8)
## [1] NA
sum(is.na(msmoking_missing_df$herpes))
## [1] 6
sum(msmoking_missing_df$tobacco == 9 )
## [1] NA
sum(msmoking_missing_df$alcohol == 9 )
## [1] NA
sum(is.na(msmoking_missing_df$tobacco))
## [1] 65
sum(is.na(msmoking_missing_df$alcohol))
## [1] 105
sum(msmoking_missing_df$cigar == 99)
## [1] NA
sum(msmoking_missing_df$drink == 99)
## [1] NA
sum(msmoking_missing_df$wgain == 99)
## [1] NA
sum(is.na(msmoking_missing_df$cigar))
## [1] 1077
sum(is.na(msmoking_missing_df$drink))
## [1] 2116
sum(is.na(msmoking_missing_df$wgain))
## [1] 3048
sum(msmoking_missing_df$cigar6 == 6 )
## [1] NA
sum(is.na(msmoking_missing_df$cigar6))
## [1] 1077
sum(msmoking_missing_df$drink5 == 5 )
## [1] NA
sum(is.na(msmoking_missing_df$drink5))
## [1] 2116
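A more compact way to run the same validation is to count NAs for all of the recoded variables at once; a minimal sketch:

#Compact check: NA counts for every recoded variable
colSums(is.na(msmoking_missing_df[c("cardiac", "lung", "diabetes", "herpes", "chyper",
                                    "phyper", "pre4000", "preterm", "tobacco", "alcohol",
                                    "cigar", "drink", "wgain", "cigar6", "drink5")]))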

###1(b)

There are several noncardinal variables that need to be recoded before including them in a regression. They are: cntocpop, stresfip, mrace3, adequacy, birmon, weekday, delmeth5, ormoth, orfath. Generate dummy variables for each categorical value. For example, for cntocpop, generate a dummy variable named cntocpop_0 that takes the value of one if cntocpop == 0 (this means the observation is in county 0, while the actual name of the county is filtered for confidentiality reasons). Save the data with changes both in (a) and (b) as msmoking_clean.dta.

#Create dummies from the cleaned data, then keep only the newly generated dummy columns
msmoking_dummy_df <- dummy_cols(msmoking_missing_df, select_columns = c("cntocpop", "stresfip", "mrace3", "adequacy", "birmon", "weekday", "delmeth5", "ormoth", "orfath"))
msmoking_dummy_df_only <- msmoking_dummy_df[, setdiff(names(msmoking_dummy_df), names(msmoking_missing_df))]
str(msmoking_dummy_df_only)
## tibble [120,461 × 82] (S3: tbl_df/tbl/data.frame)
##  $ cntocpop_0 : int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ cntocpop_1 : int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ cntocpop_2 : int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ cntocpop_3 : int [1:120461] 1 1 1 1 1 1 1 1 1 1 ...
##  $ stresfip_0 : int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ stresfip_4 : int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ stresfip_6 : int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ stresfip_8 : int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ stresfip_9 : int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ stresfip_10: int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ stresfip_11: int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ stresfip_12: int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ stresfip_13: int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ stresfip_17: int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ stresfip_19: int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ stresfip_21: int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ stresfip_23: int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ stresfip_24: int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ stresfip_25: int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ stresfip_26: int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ stresfip_27: int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ stresfip_29: int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ stresfip_31: int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ stresfip_32: int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ stresfip_34: int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ stresfip_36: int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ stresfip_37: int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ stresfip_38: int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ stresfip_39: int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ stresfip_40: int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ stresfip_42: int [1:120461] 0 1 1 1 1 1 1 1 1 1 ...
##  $ stresfip_44: int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ stresfip_45: int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ stresfip_46: int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ stresfip_47: int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ stresfip_48: int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ stresfip_51: int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ stresfip_53: int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ stresfip_54: int [1:120461] 1 0 0 0 0 0 0 0 0 0 ...
##  $ stresfip_55: int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ mrace3_1   : int [1:120461] 1 1 1 1 1 1 1 1 1 1 ...
##  $ mrace3_2   : int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ mrace3_3   : int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ adequacy_1 : int [1:120461] 1 1 1 1 1 1 0 1 1 1 ...
##  $ adequacy_2 : int [1:120461] 0 0 0 0 0 0 1 0 0 0 ...
##  $ adequacy_3 : int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ birmon_1   : int [1:120461] 1 1 1 1 1 1 1 1 1 1 ...
##  $ birmon_2   : int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ birmon_3   : int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ birmon_4   : int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ birmon_5   : int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ birmon_6   : int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ birmon_7   : int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ birmon_8   : int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ birmon_9   : int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ birmon_10  : int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ birmon_11  : int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ birmon_12  : int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_1  : int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_2  : int [1:120461] 0 1 0 0 0 1 0 0 0 0 ...
##  $ weekday_3  : int [1:120461] 1 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_4  : int [1:120461] 0 0 0 1 0 0 1 1 0 0 ...
##  $ weekday_5  : int [1:120461] 0 0 1 0 0 0 0 0 0 0 ...
##  $ weekday_6  : int [1:120461] 0 0 0 0 1 0 0 0 1 1 ...
##  $ weekday_7  : int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ delmeth5_1 : int [1:120461] 0 1 0 1 0 0 1 0 1 1 ...
##  $ delmeth5_2 : int [1:120461] 1 0 0 0 0 0 0 0 0 0 ...
##  $ delmeth5_3 : int [1:120461] 0 0 1 0 1 0 0 1 0 0 ...
##  $ delmeth5_4 : int [1:120461] 0 0 0 0 0 1 0 0 0 0 ...
##  $ delmeth5_5 : int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ ormoth_0   : int [1:120461] 1 1 1 1 1 1 1 1 1 1 ...
##  $ ormoth_1   : int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ ormoth_2   : int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ ormoth_3   : int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ ormoth_4   : int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ ormoth_5   : int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ orfath_0   : int [1:120461] 1 1 1 1 1 1 1 1 1 1 ...
##  $ orfath_1   : int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ orfath_2   : int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ orfath_3   : int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ orfath_4   : int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  $ orfath_5   : int [1:120461] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, ".internal.selfref")=<externalptr>
#combine data frames into clean df
msmoking_clean_df <- cbind(msmoking_missing_df,msmoking_dummy_df_only)

#Write files for submission
write.csv(msmoking_clean_df, "msmoking_clean.csv", row.names = FALSE)
write_dta(msmoking_clean_df, "msmoking_clean.dta")

###1(c)

If this were a real research project, you would want to consider which observations you would have to drop because of missing data.

a. If missingness is correlated with observed or unobserved characteristics, then dropping those observations may underrepresent certain explanatory characteristics in the model of impact; in other words, it could introduce bias. For example, if births at a residence (pldel3) are more likely to have unreported smoking behavior (tobacco, cigar, and cigar6), then dropping observations with missing smoking data would underrepresent home births and bias any estimate of their effect.

#b.1
msmoking_clean_df$tobacco_missing <- ifelse(is.na(msmoking_clean_df$tobacco),1,0) 
sum(msmoking_clean_df$tobacco_missing==1)
## [1] 65
#b.2

tobacco_missing_correlation <- glm(tobacco_missing ~ ., data = msmoking_clean_df, family = binomial)
## Warning: glm.fit: algorithm did not converge
summary(tobacco_missing_correlation)
## 
## Call:
## glm(formula = tobacco_missing ~ ., family = binomial, data = msmoking_clean_df)
## 
## Deviance Residuals: 
##        Min          1Q      Median          3Q         Max  
## -2.409e-06  -2.409e-06  -2.409e-06  -2.409e-06  -2.409e-06  
## 
## Coefficients: (17 not defined because of singularities)
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.135e+02  6.328e+15       0        1
## rectype     -4.664e-12  2.604e+03       0        1
## pldel3       1.195e-13  9.226e+03       0        1
## birattnd    -7.376e-13  2.123e+03       0        1
## cntocpop    -8.962e-13  2.978e+03       0        1
## stresfip     2.382e+00  1.392e+14       0        1
## dmage        2.629e-14  3.410e+02       0        1
## ormoth      -6.750e-14  2.578e+04       0        1
## mrace3      -2.935e-14  8.371e+03       0        1
## dmeduc      -4.103e-14  6.670e+02       0        1
## dmar         3.778e-14  3.187e+03       0        1
## adequacy     4.936e-14  6.223e+03       0        1
## nlbnl       -2.139e-13  6.771e+03       0        1
## dlivord      1.721e-13  6.777e+03       0        1
## dtotord      3.225e-14  5.436e+03       0        1
## totord9     -5.052e-14  5.559e+03       0        1
## monpre      -3.112e-14  1.100e+03       0        1
## nprevist     4.027e-15  4.154e+02       0        1
## disllb      -2.291e-16  6.853e+00       0        1
## isllb10     -1.173e-14  6.810e+02       0        1
## dfage        8.862e-16  2.600e+02       0        1
## orfath       4.941e-15  2.499e+04       0        1
## dfeduc       1.397e-16  6.255e+02       0        1
## birmon       7.954e-15  5.361e+03       0        1
## weekday      1.656e-14  4.082e+03       0        1
## dgestat     -5.137e-15  6.771e+02       0        1
## csex         1.432e-14  2.136e+03       0        1
## dbrwt        2.050e-17  2.550e+00       0        1
## dplural     -1.586e-13  6.674e+03       0        1
## omaps       -2.766e-14  1.076e+03       0        1
## fmaps        9.478e-15  1.962e+03       0        1
## clingest    -5.308e-15  9.061e+02       0        1
## delmeth5    -3.140e-13  2.275e+04       0        1
## anemia       1.486e-14  1.065e+04       0        1
## cardiac      1.618e-15  1.275e+04       0        1
## lung         6.829e-17  1.245e+04       0        1
## diabetes    -9.508e-15  6.624e+03       0        1
## herpes      -1.805e-15  1.351e+04       0        1
## chyper      -2.577e-15  1.212e+04       0        1
## phyper       8.529e-15  6.185e+03       0        1
## pre4000     -2.538e-13  8.973e+03       0        1
## preterm      2.001e-14  9.082e+03       0        1
## tobacco      2.215e-14  1.004e+04       0        1
## cigar       -2.675e-14  1.017e+03       0        1
## cigar6       1.544e-13  9.204e+03       0        1
## alcohol      5.847e-14  2.295e+04       0        1
## drink        1.119e-15  2.368e+03       0        1
## drink5       1.300e-14  1.141e+04       0        1
## wgain        1.359e-15  9.379e+01       0        1
## cntocpop_0   2.815e-13  7.581e+03       0        1
## cntocpop_1   1.972e-13  5.226e+03       0        1
## cntocpop_2          NA         NA      NA       NA
## cntocpop_3          NA         NA      NA       NA
## stresfip_0   8.691e+01  6.328e+15       0        1
## stresfip_4   7.738e+01  6.509e+15       0        1
## stresfip_6   7.262e+01  6.616e+15       0        1
## stresfip_8   6.785e+01  6.732e+15       0        1
## stresfip_9   6.547e+01  6.793e+15       0        1
## stresfip_10  6.309e+01  6.857e+15       0        1
## stresfip_11  6.071e+01  6.923e+15       0        1
## stresfip_12  5.833e+01  6.992e+15       0        1
## stresfip_13  5.595e+01  7.062e+15       0        1
## stresfip_17  4.642e+01  7.363e+15       0        1
## stresfip_19  4.166e+01  7.524e+15       0        1
## stresfip_21  3.689e+01  7.692e+15       0        1
## stresfip_23  3.213e+01  7.867e+15       0        1
## stresfip_24  2.975e+01  7.956e+15       0        1
## stresfip_25  2.737e+01  8.047e+15       0        1
## stresfip_26  2.499e+01  8.139e+15       0        1
## stresfip_27  2.260e+01  8.233e+15       0        1
## stresfip_29  1.784e+01  8.424e+15       0        1
## stresfip_31  1.308e+01  8.619e+15       0        1
## stresfip_32  1.070e+01  8.719e+15       0        1
## stresfip_34  5.934e+00  8.921e+15       0        1
## stresfip_36  1.171e+00  9.127e+15       0        1
## stresfip_37 -1.210e+00  9.232e+15       0        1
## stresfip_38 -3.592e+00  9.337e+15       0        1
## stresfip_39 -5.973e+00  9.443e+15       0        1
## stresfip_40 -8.355e+00  9.551e+15       0        1
## stresfip_42 -1.312e+01  9.767e+15       0        1
## stresfip_44 -1.788e+01  9.987e+15       0        1
## stresfip_45 -2.026e+01  1.010e+16       0        1
## stresfip_46 -2.264e+01  1.021e+16       0        1
## stresfip_47 -2.503e+01  1.032e+16       0        1
## stresfip_48         NA         NA      NA       NA
## stresfip_51 -3.455e+01  1.078e+16       0        1
## stresfip_53 -3.931e+01  1.101e+16       0        1
## stresfip_54 -4.170e+01  1.113e+16       0        1
## stresfip_55 -4.408e+01  1.124e+16       0        1
## mrace3_1    -3.574e-14  1.555e+04       0        1
## mrace3_2            NA         NA      NA       NA
## mrace3_3            NA         NA      NA       NA
## adequacy_1   1.381e-15  6.951e+03       0        1
## adequacy_2          NA         NA      NA       NA
## adequacy_3          NA         NA      NA       NA
## birmon_1     1.894e-13  5.650e+04       0        1
## birmon_2     8.279e-14  5.117e+04       0        1
## birmon_3     7.112e-14  4.582e+04       0        1
## birmon_4     5.437e-14  4.049e+04       0        1
## birmon_5     5.317e-14  3.516e+04       0        1
## birmon_6     4.330e-14  2.986e+04       0        1
## birmon_7     3.510e-14  2.456e+04       0        1
## birmon_8     3.324e-14  1.931e+04       0        1
## birmon_9     1.702e-14  1.416e+04       0        1
## birmon_10    2.097e-14  9.260e+03       0        1
## birmon_11           NA         NA      NA       NA
## birmon_12           NA         NA      NA       NA
## weekday_1    1.046e-13  2.248e+04       0        1
## weekday_2    1.465e-13  1.839e+04       0        1
## weekday_3    6.086e-14  1.437e+04       0        1
## weekday_4    4.598e-14  1.044e+04       0        1
## weekday_5    4.339e-14  6.703e+03       0        1
## weekday_6           NA         NA      NA       NA
## weekday_7           NA         NA      NA       NA
## delmeth5_1  -7.064e-13  6.904e+04       0        1
## delmeth5_2  -3.695e-13  4.663e+04       0        1
## delmeth5_3  -3.426e-13  2.399e+04       0        1
## delmeth5_4          NA         NA      NA       NA
## delmeth5_5          NA         NA      NA       NA
## ormoth_0    -4.103e-14  1.183e+05       0        1
## ormoth_1    -2.292e-14  9.506e+04       0        1
## ormoth_2    -3.326e-14  6.762e+04       0        1
## ormoth_3    -1.624e-14  5.794e+04       0        1
## ormoth_4            NA         NA      NA       NA
## ormoth_5            NA         NA      NA       NA
## orfath_0     4.806e-14  1.138e+05       0        1
## orfath_1     2.507e-14  9.077e+04       0        1
## orfath_2     2.300e-14  6.452e+04       0        1
## orfath_3    -1.862e-15  5.683e+04       0        1
## orfath_4            NA         NA      NA       NA
## orfath_5            NA         NA      NA       NA
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 0.0000e+00  on 114609  degrees of freedom
## Residual deviance: 6.6492e-07  on 114496  degrees of freedom
##   (5851 observations deleted due to missingness)
## AIC: 228
## 
## Number of Fisher Scoring iterations: 25
As specified, this regression is uninformative: because tobacco itself (which is NA exactly when tobacco_missing equals one) sits on the right-hand side, every observation with missing tobacco is dropped from the estimation sample. The outcome is then constant at zero, which is why the null deviance is zero, the algorithm fails to converge, and every p-value equals one. To actually test whether missingness in tobacco is related to observed characteristics, tobacco (and any variable that is missing whenever it is) must be excluded from the predictors.
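A minimal sketch of that corrected check (the data-frame and model names are illustrative); the other cigarette-use variables are also dropped since their missingness plausibly coincides with tobacco:

#Sketch: exclude tobacco and related cigarette-use variables from the predictor set
#so that observations with tobacco_missing == 1 are not all removed by listwise deletion
miss_check_df <- msmoking_clean_df %>% select(-tobacco, -cigar, -cigar6)
tobacco_missing_logit <- glm(tobacco_missing ~ ., data = miss_check_df, family = binomial)
summary(tobacco_missing_logit)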

###1(d) Produce a summary table describing the final analysis data set, i.e., create a table like below for all other variables. (Hint: You can do this in various ways by, for example, using “sum” or “tabstat”. An easy way to directly export the table to .csv file is to use estpost and esttab. You may need to add these commands by “ssc install esttab” or “findit esttab”.). Upload the file to this question.

#First calculate key summary statistics and create vectors by column
N_msmoking <- colSums(!is.na(msmoking_clean_df))
Mean_msmoking <- round(colMeans(msmoking_clean_df, na.rm = TRUE),3)
SD_msmoking <- round(apply(msmoking_clean_df,2, sd, na.rm = TRUE),3)
Min_msmoking <- apply(msmoking_clean_df,2, min, na.rm = TRUE)
Max_msmoking <- apply(msmoking_clean_df,2, max, na.rm = TRUE)
Label_msmoking <- colnames(msmoking_clean_df)

Summary_msmoking <- data.frame(
  Labels = Label_msmoking,
  N = N_msmoking,
  Mean = Mean_msmoking,
  SD = SD_msmoking,
  Min = Min_msmoking,
  Max = Max_msmoking
)
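Since the question asks for a file to upload, the table can be written out directly; the file name below is just an illustrative choice:

#Export the summary table to CSV for submission (file name is illustrative)
write.csv(Summary_msmoking, "msmoking_summary.csv", row.names = FALSE)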

##2. The next part of the assignment is to try to estimate the “causal” effect of maternal smoking during pregnancy on infant birth weight. Use msmoking_clean.dta that you created in Question 1. Let’s start out using techniques that are familiar, and think about whether they are likely to work in this context. Answer the following questions.

###2(a) Compute the mean difference in APGAR scores (both five and one minute versions, fmaps and omaps, respectively) as well as birthweight (dbrwt) by smoking status.

msmoking_clean_df %>%  
  group_by(tobacco) %>% 
  summarise( avg_fmaps = mean(fmaps, na.rm = TRUE))
## # A tibble: 3 × 2
##   tobacco avg_fmaps
##     <dbl>     <dbl>
## 1       1      8.99
## 2       2      9.01
## 3      NA      8.82
#Mean difference in the five-minute APGAR score (smokers minus non-smokers) is -0.013 points
AMD_fmaps = mean(msmoking_clean_df$fmaps[msmoking_clean_df$tobacco == 1],na.rm = TRUE) - mean(msmoking_clean_df$fmaps[msmoking_clean_df$tobacco == 2],na.rm = TRUE) 
round(AMD_fmaps,3)
## [1] -0.013
#Mean difference in the one-minute APGAR score (smokers minus non-smokers) is -0.043 points
#A larger gap in the minute immediately after birth is expected
AMD_omaps = mean(msmoking_clean_df$omaps[msmoking_clean_df$tobacco == 1],na.rm = TRUE) - mean(msmoking_clean_df$omaps[msmoking_clean_df$tobacco == 2],na.rm = TRUE) 
round(AMD_omaps,3)
## [1] -0.043
#Mean difference in birthweight is -246.9 g for infants of smoking vs. non-smoking mothers
AMD_dbrwt = mean(msmoking_clean_df$dbrwt[msmoking_clean_df$tobacco == 1],na.rm = TRUE) - mean(msmoking_clean_df$dbrwt[msmoking_clean_df$tobacco == 2],na.rm = TRUE) 
round(AMD_dbrwt,3)
## [1] -246.917
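The same three smoker-minus-non-smoker differences can also be computed in one pass (a sketch; group_means is an illustrative name):

#Group means of both APGAR scores and birthweight by smoking status, then the differences
group_means <- msmoking_clean_df %>%
  filter(tobacco %in% c(1, 2)) %>%
  group_by(tobacco) %>%
  summarise(across(c(fmaps, omaps, dbrwt), \(x) mean(x, na.rm = TRUE)))
group_means[group_means$tobacco == 1, -1] - group_means[group_means$tobacco == 2, -1]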

###2(b) Under what circumstances can one identify the average treatment effect of maternal smoking by comparing the unadjusted (meaning no other control variables) difference in mean birthweight of infants of smoking and non-smoking mothers? Provide and comment on some evidence for or against the validity of the assumption.

(b) Comparing the unadjusted difference in birthweight between children of smoking and non-smoking mothers would only identify the average treatment effect without bias if mothers had been randomly allocated to the smoking "treatment" and the non-smoking control. Because the decision to smoke is made endogenously, this data set does not come from an RCT; in fact, it would be unethical to run a true RCT given the health risks involved. So it is reasonable to assume that there is some selection bias in the observed mean differences between the smoking and non-smoking groups.

The summary of observed characteristic means below shows that clear differences persist between the maternal smoker and non-smoker groups. Specifically, mothers who smoked during pregnancy are more likely to be from higher-population counties, average about 1.4 fewer years of education, and have on average delivered more babies before the observed birth. It is also worth noting that smoking mothers gained less gestational weight, possibly because smoking is believed to suppress hunger; lower weight gain could mean fewer nutrients are passed on to the child.

#Compares several mean observed characteristics split by tobacco usage
msmoking_clean_df %>%  
  group_by(tobacco) %>% 
  summarize(across(c(dlivord, dfage, dmeduc, cntocpop, alcohol, wgain, nprevist),
                   \(x) mean(x, na.rm = TRUE)))
## # A tibble: 3 × 8
##   tobacco dlivord dfage dmeduc cntocpop alcohol wgain nprevist
##     <dbl>   <dbl> <dbl>  <dbl>    <dbl>   <dbl> <dbl>    <dbl>
## 1       1    2.22  29.0   12.0     1.54    1.92  29.5    10.4 
## 2       2    1.95  30.3   13.4     1.42    1.98  30.5    11.2 
## 3      NA    1.97  27.2   12.3     1.66    2     32.2     8.58

###2(c) Do you think the estimated effect is likely to be overestimating or underestimating the true effect of maternal smoking on birthweight?

(c) My instinct is that the mean differences in the observed APGAR scores (fmaps and omaps) and birthweight (dbrwt) between smoking and non-smoking mothers, calculated in 2(a), overstate the true effect. I suspect that individuals who choose to smoke during pregnancy may also share other behaviors that are confounding factors for lower birthweight. For example, the summary table of mean characteristics showed that smoking mothers averaged almost one fewer prenatal doctor visit and had lower educational attainment, which may mean they are less well-versed in proper diet and care during pregnancy.

##Part 2

###I. Summary Statistics

###1.
Conduct a test to see whether baseline characteristics are statistically different between the treatment and control groups based on the model:

Characteristic = a + b*(Any Cash Incentive) + c*(Cash Amount) + d*(Cash Amount)^2 + e*(Distance in km) + f*(Distance)^2 + u

where the characteristics are: male (male), age (age), age squared (age2), HIV positive (hiv2004), years of education (educ2004), owns land (land2004), had sex in 2004 (hadsex12), and used a condom in 2004 (usecondom04).

Do you see any differences? Are you concerned by any of the differences? How could they affect the analysis?

#read in data
hiv_df <- read_dta("thornton-HIV-testing-Data.dta")

#Run 8 models regressing each baseline characteristic on the treatment variables
#Note: inside an R formula, tinc^2 and distvct^2 do not add quadratic terms (the ^ operator
#denotes interaction order), so these fits effectively include only the linear terms;
#see the corrected sketch after the eighth model
model_2.1.1 <- hiv_df %>% lm(male ~ any + tinc + tinc^2 + distvct + distvct^2,.)
model_2.1.1 %>%  summary()
## 
## Call:
## lm(formula = male ~ any + tinc + tinc^2 + distvct + distvct^2, 
##     data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4921 -0.4667 -0.4458  0.5320  0.5618 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.4729842  0.0238454  19.835   <2e-16 ***
## any         -0.0359419  0.0275240  -1.306    0.192    
## tinc         0.0001198  0.0001222   0.980    0.327    
## distvct      0.0036832  0.0073368   0.502    0.616    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.499 on 2897 degrees of freedom
##   (1919 observations deleted due to missingness)
## Multiple R-squared:  0.0006798,  Adjusted R-squared:  -0.000355 
## F-statistic: 0.6569 on 3 and 2897 DF,  p-value: 0.5786
model_2.1.2 <- hiv_df %>% lm(age ~ any + tinc + tinc^2 + distvct + distvct^2,.)
model_2.1.2 %>%  summary()
## 
## Call:
## lm(formula = age ~ any + tinc + tinc^2 + distvct + distvct^2, 
##     data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.797 -11.664  -1.757   9.845  47.131 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 31.519246   0.651462  48.382   <2e-16 ***
## any          1.767921   0.751593   2.352   0.0187 *  
## tinc        -0.004072   0.003339  -1.220   0.2227    
## distvct      0.507070   0.200530   2.529   0.0115 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.61 on 2892 degrees of freedom
##   (1924 observations deleted due to missingness)
## Multiple R-squared:  0.004279,   Adjusted R-squared:  0.003246 
## F-statistic: 4.143 on 3 and 2892 DF,  p-value: 0.006119
model_2.1.3 <- hiv_df %>% lm(age2 ~ any + tinc + tinc^2 + distvct + distvct^2,.)
model_2.1.3 %>%  summary()
## 
## Call:
## lm(formula = age2 ~ any + tinc + tinc^2 + distvct + distvct^2, 
##     data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##   -8060   -2775   -1005     323 3664613 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  2582.50    3263.03   0.791    0.429
## any         -1515.20    3764.56  -0.402    0.687
## tinc           24.49      16.72   1.465    0.143
## distvct      -701.73    1004.35  -0.699    0.485
## 
## Residual standard error: 68190 on 2893 degrees of freedom
##   (1923 observations deleted due to missingness)
## Multiple R-squared:  0.001042,   Adjusted R-squared:  5.662e-06 
## F-statistic: 1.005 on 3 and 2893 DF,  p-value: 0.3893
model_2.1.4 <- hiv_df %>% lm(hiv2004 ~ any + tinc + tinc^2 + distvct + distvct^2,.)
model_2.1.4 %>%  summary()
## 
## Call:
## lm(formula = hiv2004 ~ any + tinc + tinc^2 + distvct + distvct^2, 
##     data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.06857 -0.06274 -0.05860 -0.05178  0.96137 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.815e-02  1.248e-02   5.460 5.17e-08 ***
## any          3.475e-03  1.424e-02   0.244    0.807    
## tinc        -2.205e-05  6.200e-05  -0.356    0.722    
## distvct     -5.294e-03  3.734e-03  -1.418    0.156    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2525 on 2830 degrees of freedom
##   (1986 observations deleted due to missingness)
## Multiple R-squared:  0.0007483,  Adjusted R-squared:  -0.0003109 
## F-statistic: 0.7065 on 3 and 2830 DF,  p-value: 0.5481
model_2.1.5 <- hiv_df %>% lm(educ2004 ~ any + tinc + tinc^2 + distvct + distvct^2,.)
model_2.1.5 %>%  summary()
## 
## Call:
## lm(formula = educ2004 ~ any + tinc + tinc^2 + distvct + distvct^2, 
##     data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9684 -3.4091 -0.7319  3.2863  9.3085 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.9768554  0.1874200  26.555  < 2e-16 ***
## any         -1.4589139  0.2150064  -6.785 1.43e-11 ***
## tinc         0.0029017  0.0009548   3.039   0.0024 ** 
## distvct     -0.2575272  0.0573100  -4.494 7.31e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.693 on 2603 degrees of freedom
##   (2213 observations deleted due to missingness)
## Multiple R-squared:  0.02632,    Adjusted R-squared:  0.0252 
## F-statistic: 23.46 on 3 and 2603 DF,  p-value: 5.594e-15
model_2.1.6 <- hiv_df %>% lm(land2004 ~ any + tinc + tinc^2 + distvct + distvct^2,.)
model_2.1.6 %>%  summary()
## 
## Call:
## lm(formula = land2004 ~ any + tinc + tinc^2 + distvct + distvct^2, 
##     data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8247 -0.6418  0.2455  0.2714  0.3779 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.214e-01  2.216e-02  28.040  < 2e-16 ***
## any          9.913e-02  2.538e-02   3.906 9.61e-05 ***
## tinc        -6.202e-05  1.123e-04  -0.552  0.58085    
## distvct      2.030e-02  6.728e-03   3.018  0.00257 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4405 on 2669 degrees of freedom
##   (2147 observations deleted due to missingness)
## Multiple R-squared:  0.0112, Adjusted R-squared:  0.01009 
## F-statistic: 10.08 on 3 and 2669 DF,  p-value: 1.331e-06
model_2.1.7 <- hiv_df %>% lm(hadsex12 ~ any + tinc + tinc^2 + distvct + distvct^2,.)
model_2.1.7 %>%  summary()
## 
## Call:
## lm(formula = hadsex12 ~ any + tinc + tinc^2 + distvct + distvct^2, 
##     data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8022  0.2012  0.2302  0.2459  0.2781 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.7650040  0.0227016  33.698   <2e-16 ***
## any         -0.0086908  0.0257923  -0.337    0.736    
## tinc        -0.0001187  0.0001139  -1.042    0.297    
## distvct      0.0076749  0.0068193   1.125    0.261    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4261 on 2412 degrees of freedom
##   (2404 observations deleted due to missingness)
## Multiple R-squared:  0.001554,   Adjusted R-squared:  0.0003125 
## F-statistic: 1.252 on 3 and 2412 DF,  p-value: 0.2894
model_2.1.8 <- hiv_df %>% lm(usecondom04 ~ any + tinc + tinc^2 + distvct + distvct^2,.)
model_2.1.8 %>%  summary()
## 
## Call:
## lm(formula = usecondom04 ~ any + tinc + tinc^2 + distvct + distvct^2, 
##     data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.2676 -0.2108 -0.2004 -0.1910  0.8140 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.465e-01  2.162e-02  11.400   <2e-16 ***
## any         -6.174e-02  2.461e-02  -2.509   0.0122 *  
## tinc         6.363e-05  1.086e-04   0.586   0.5582    
## distvct      4.064e-03  6.512e-03   0.624   0.5327    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4096 on 2444 degrees of freedom
##   (2372 observations deleted due to missingness)
## Multiple R-squared:  0.0032, Adjusted R-squared:  0.001976 
## F-statistic: 2.615 on 3 and 2444 DF,  p-value: 0.04956
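Because the squared terms are absorbed into the linear terms by the formula parser, the quadratic specification in the prompt was never actually estimated, which is why no squared-term coefficients appear in the output above. A minimal sketch of the intended model for the first characteristic (model_2.1.1_quad is an illustrative name) would be:

#Sketch: quadratic terms protected with I() so they enter the model as true squares
model_2.1.1_quad <- lm(male ~ any + tinc + I(tinc^2) + distvct + I(distvct^2), data = hiv_df)
summary(model_2.1.1_quad)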

In these regressions we would expect no significant differences in observed characteristics across treatment status, but there are a few noteworthy ones. For example, those receiving cash incentives are somewhat older, more likely to own land, and less likely to have used a condom in 2004. The imbalance I am most concerned about is the higher likelihood of land ownership in the cash-incentive group, because land ownership may come with greater costs or indicate a greater responsiveness to cash. More broadly, there is a significant difference in education by treatment status: those offered any cash incentive average roughly 1.5 fewer years of schooling. I think the land ownership difference could lead to a very slight overstatement of the results, while the education difference could actually make the results more robust, assuming education itself has some positive influence on getting tested.

###II. Analysis using linear regression

OLS regression is one of the most common tools used to estimate the effect of a treatment or randomly assigned intervention.

###2.
Run the following OLS regression, where getting your HIV test result (got) is the dependent variable and receiving a cash incentive (any) is the covariate:

Got Test Result = a + b1*(Any Cash Incentive) + u

What is your estimate of b1? Is it statistically significant? If so, at what level? What happens when you include additional control variables: age (age), age squared (age2), male (male), education (educ2004), marriage (mar), HIV status (hiv2004)? Does your estimate of b1 change? Does this suggest that there is covert or overt bias? Now interpret your findings. Based on your estimates, what can you say about the effect of offering cash incentives on people learning their HIV status? Would you say that this is a big or small effect?

#Variables required for initial model -> got and any
model_2.2 <- hiv_df %>% lm(got ~ any,.)
model_2.2 %>%  summary()
## 
## Call:
## lm(formula = got ~ any, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7892 -0.3387  0.2108  0.2108  0.6613 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.33868    0.01696   19.97   <2e-16 ***
## any          0.45055    0.01920   23.47   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4232 on 2832 degrees of freedom
##   (1986 observations deleted due to missingness)
## Multiple R-squared:  0.1628, Adjusted R-squared:  0.1625 
## F-statistic: 550.8 on 1 and 2832 DF,  p-value: < 2.2e-16

The cash incentive is shown to have a positive effect on the likelihood that an individual got their HIV test result. The estimate b1 = 0.45 indicates a 45 percentage-point increase in the probability of getting the result when a cash incentive was offered, and the result is highly significant (p < 0.001). Relative to the 34% baseline among those offered no incentive, the cash incentive more than doubles the likelihood of getting the result.
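Because the regression has a single binary regressor, b1 is simply the difference in the share getting their results between the incentive and no-incentive groups; a quick cross-check (sketch):

#Cross-check: share who got results, by incentive status (the difference should equal b1)
hiv_df %>%
  group_by(any) %>%
  summarise(share_got = mean(got, na.rm = TRUE))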

#Added age, age^2, is male, education level and marriage status + hiv status
model_2.2.alt <- hiv_df %>% lm(got ~ any + age + age2 + male + educ2004 + mar + hiv2004,.)
model_2.2.alt %>%  summary()
## 
## Call:
## lm(formula = got ~ any + age + age2 + male + educ2004 + mar + 
##     hiv2004, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9078 -0.3159  0.1608  0.2189  0.7409 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.064e-01  6.637e-02   4.616  4.1e-06 ***
## any          4.503e-01  2.014e-02  22.361  < 2e-16 ***
## age          4.024e-03  3.552e-03   1.133  0.25730    
## age2        -4.219e-05  4.399e-05  -0.959  0.33754    
## male        -1.150e-02  1.756e-02  -0.655  0.51278    
## educ2004    -8.084e-03  2.762e-03  -2.927  0.00346 ** 
## mar          4.836e-03  2.404e-02   0.201  0.84059    
## hiv2004     -5.031e-02  3.285e-02  -1.531  0.12580    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4133 on 2537 degrees of freedom
##   (2275 observations deleted due to missingness)
## Multiple R-squared:  0.1808, Adjusted R-squared:  0.1785 
## F-statistic: 79.98 on 7 and 2537 DF,  p-value: < 2.2e-16

The additional control variables raise the share of variance explained slightly (R^2 moves from about 16% to 18%) and education emerges as an additional significant factor, but the presence of a cash incentive keeps virtually the same coefficient and level of significance as an explanatory variable. This updated model reaffirms how important offering a cash incentive was for getting individuals tested, and the stability of b1 suggests there is minimal bias.

###3.
Now you additionally include “Cash Amount” (tinc):

Got Test Result = a + b1*(Any Cash Incentive) + b2*(Cash Amount) + u

What is your estimate of b2? Is it statistically significant? What happens when you include additional control variables: age (age), age squared (age2), male (male), education (educ2004), marriage (mar), HIV status (hiv2004)? Does your estimate of b2 change? Does doubling the cash incentive from $1 to $2 (1 ~ 20 kwacha) have a big effect on people’s willingness to get their HIV test results? Does this surprise you?

#Cash amount layered onto any incentive
model_2.3 <- hiv_df %>% lm(got ~ any + tinc,.)
model_2.3 %>%  summary()
## 
## Call:
## lm(formula = got ~ any + tinc, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9187 -0.3387  0.1604  0.2395  0.6613 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.3386838  0.0167857  20.177  < 2e-16 ***
## any         0.3427218  0.0236189  14.510  < 2e-16 ***
## tinc        0.0007910  0.0001029   7.688 2.04e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.419 on 2831 degrees of freedom
##   (1986 observations deleted due to missingness)
## Multiple R-squared:  0.1799, Adjusted R-squared:  0.1794 
## F-statistic: 310.6 on 2 and 2831 DF,  p-value: < 2.2e-16

The estimate of b2 is positive, suggesting that each one-unit increase in the cash amount (tinc) raises the likelihood of getting tested by a little under 0.1 percentage points, while any cash incentive remains a strong explanatory variable. Both any and tinc are significant at the 0.1% level. It still appears that the mere presence of a cash incentive matters more than its amount. It follows rational expectations that a larger cash incentive would lead to more testing, but I am not surprised that simply offering any cash at all carries more weight than the size of the offer.

#Added age, age squared, male, education, marital status, and HIV status controls
model_2.3.alt <- hiv_df %>% lm(got ~ any + tinc + age + age2 + male + educ2004 + mar + hiv2004,.)
model_2.3.alt %>%  summary()
## 
## Call:
## lm(formula = got ~ any + tinc + age + age2 + male + educ2004 + 
##     mar + hiv2004, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9850 -0.3156  0.1533  0.2389  0.7522 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.341e-01  6.575e-02   5.081 4.02e-07 ***
## any          3.403e-01  2.467e-02  13.795  < 2e-16 ***
## tinc         8.049e-04  1.065e-04   7.559 5.64e-14 ***
## age          2.770e-03  3.517e-03   0.788 0.430973    
## age2        -2.756e-05  4.355e-05  -0.633 0.526953    
## male        -1.593e-02  1.738e-02  -0.916 0.359543    
## educ2004    -9.422e-03  2.738e-03  -3.441 0.000588 ***
## mar          8.997e-03  2.379e-02   0.378 0.705269    
## hiv2004     -4.662e-02  3.250e-02  -1.435 0.151536    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4088 on 2536 degrees of freedom
##   (2275 observations deleted due to missingness)
## Multiple R-squared:  0.1988, Adjusted R-squared:  0.1963 
## F-statistic: 78.67 on 8 and 2536 DF,  p-value: < 2.2e-16

When the additional controls are added, b1 and b2 remain robustly significant, the estimated impact of an incremental increase in the cash amount actually climbs slightly, and education is again the only additional factor that is significant.

###4.
Now you additionally include “Near”, an indicator variable for being under 1.5 km from the test center (under):

Got Test Result = a + b1*(Any Cash Incentive) + b2*(Cash Amount) + b3*(Under) + u

What is your estimate of b3? Is it statistically significant? How does it compare to the effect of cash incentives?

#Proximity indicator (under 1.5 km from a test center) added to the incentive variables
model_2.4 <- hiv_df %>% lm(got ~ any + tinc + under,.)
model_2.4 %>%  summary()
## 
## Call:
## lm(formula = got ~ any + tinc + under, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9439 -0.3204  0.1778  0.2568  0.6796 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.3204169  0.0181037  17.699  < 2e-16 ***
## any         0.3438446  0.0235970  14.572  < 2e-16 ***
## tinc        0.0007895  0.0001028   7.683 2.13e-14 ***
## under       0.0427829  0.0159871   2.676  0.00749 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4185 on 2830 degrees of freedom
##   (1986 observations deleted due to missingness)
## Multiple R-squared:  0.182,  Adjusted R-squared:  0.1811 
## F-statistic: 209.9 on 3 and 2830 DF,  p-value: < 2.2e-16

Not surprisingly, being in closer proximity to a testing center also has a positive effect on the likelihood of getting tested. The presence of a cash incentive, its amount, and proximity are all significant, together explaining about 18% of the variance. Proximity has the lowest significance level of the three, but its impact (about 4.3 percentage points) is larger than that of an incremental increase in the incentive amount. Ultimately, the biggest influence among these options is the offer of any cash incentive, whose coefficient is roughly eight times the size of the proximity effect.

##III. Conditional (Heterogeneous) Treatment Effects

We might be interested in whether the treatment has different effects for sub-populations. For example, does giving cash incentives have a different effect for men and women?

###5
Create an interaction term which interacts the treatment (any) with the gender variable (male). Run the following regression:

Got Test Result = a + b*(Any Cash Incentive) + c*(Male) + d*(Any Cash Incentive × Male) + u

What is your estimate of d? Is it statistically significant? How do you interpret this result? How does the interpretation of d differ from the interpretation of c?

#Interaction of the cash incentive with gender
model_2.5 <- hiv_df %>% lm(got ~ any + male + any*male,.)
model_2.5 %>%  summary()
## 
## Call:
## lm(formula = got ~ any + male + any * male, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7934 -0.3299  0.2066  0.2157  0.6701 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.346505   0.023341  14.846   <2e-16 ***
## any          0.446946   0.026368  16.950   <2e-16 ***
## male        -0.016573   0.033977  -0.488    0.626    
## any:male     0.007435   0.038479   0.193    0.847    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4234 on 2830 degrees of freedom
##   (1986 observations deleted due to missingness)
## Multiple R-squared:  0.163,  Adjusted R-squared:  0.1621 
## F-statistic: 183.7 on 3 and 2830 DF,  p-value: < 2.2e-16

The any × male interaction term estimates the additional effect of the cash incentive for men relative to women. The point estimate d = 0.007 implies a 0.7 percentage-point larger increase in the likelihood of getting tested for men who are offered an incentive, but it is not statistically significant (p = 0.85). Absent a cash incentive, the negative (also insignificant) coefficient on male suggests men are slightly less likely to get tested than women. So d captures how the incentive effect differs for men versus women, whereas c captures the male-female difference in testing when no incentive is offered.
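To make the distinction between c and d concrete, the fitted coefficients can be combined into the four predicted testing rates by incentive and gender (a sketch; the object b is just an illustrative name):

#Predicted probability of getting the result for each incentive-by-gender cell
b <- coef(model_2.5)
rbind(
  female_no_incentive = b["(Intercept)"],
  female_incentive    = b["(Intercept)"] + b["any"],
  male_no_incentive   = b["(Intercept)"] + b["male"],
  male_incentive      = b["(Intercept)"] + b["any"] + b["male"] + b["any:male"]
)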

###6
Create an interaction term that interacts treatment (any) with education (educ2004). Run the following regression:

Got Test Result = a + b*(Any Cash Incentive) + c*(Education) + d*(Any Cash Incentive × Education) + u

What is your estimate of d? Is it statistically significant? How do you interpret this result?

#Interaction of the cash incentive with education
model_2.6 <- hiv_df %>% lm(got ~ any + educ2004 + any*educ2004,.)
model_2.6 %>%  summary()
## 
## Call:
## lm(formula = got ~ any + educ2004 + any * educ2004, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8372 -0.3097  0.1628  0.2205  0.7290 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.872e-01  2.685e-02  14.418   <2e-16 ***
## any           4.500e-01  2.968e-02  15.163   <2e-16 ***
## educ2004     -9.688e-03  4.508e-03  -2.149   0.0317 *  
## any:educ2004  6.371e-05  5.177e-03   0.012   0.9902    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4135 on 2546 degrees of freedom
##   (2270 observations deleted due to missingness)
## Multiple R-squared:  0.1796, Adjusted R-squared:  0.1787 
## F-statistic: 185.8 on 3 and 2546 DF,  p-value: < 2.2e-16

The estimate of d is tiny, practically zero (0.00006), and is not statistically significant, which suggests that education does not change how much the cash incentive moves the needle on HIV testing, while the cash incentive remains a robust explanatory factor with a strong positive effect. Interestingly, the coefficient c on education, absent a cash incentive, is slightly negative. My interpretation of this result is that the cash incentive is the main driver of behavior change in this study, not education.

###IV. Policy Implications

###7.
Based on your findings from Q4 above, if the goal of the government were to increase the number of people who know their HIV status, what type of policy would you recommend?

If I were advising the government on how to increase knowledge of HIV status through testing, I would encourage channelling resources toward offering cash incentives, particularly to men. I would also invest in more testing centers so that more individuals are proximate (within 1.5 km) to a center. At the margin, I would offer higher incentives while spending less on education efforts. However, before committing to that last piece of guidance, I would want to know the impact of education on men versus women: it may be that larger cash incentives drive male testing, while education increases testing among women.

##Part 3 Evaluation of Articles

###1 Discuss the internal and/or external validity of two studies presented in the lecture video in about 300 words per article. Note that your discussions can be solely based on the information presented in the lecture video, and I do not expect you to read the entire articles, although the articles are posted on Canvas and you are welcome to read them to obtain further information.

#####1. Miguel and Kremer (2004)

The Miguel and Kremer (2004) study evaluated the impact of a school deworming program in Kenya on health and education outcomes. Randomization into treatment and comparison groups was done at the school level rather than the individual level, which is key to the study's ability to account for externalities that would otherwise threaten the internal validity of the Average Treatment Effect (ATE). While budget limitations can sometimes raise external validity concerns, in this case the phased-in rollout across groups of schools allowed Miguel and Kremer to measure cross-school spillovers. First, they established the reduction in infection rates between treatment and control groups generally, then compared control schools near treated schools with those farther away to show a spillover of better health in nearby schools. Within-school externalities were also demonstrated by comparing health impacts on boys and girls in treatment schools, where some girls over 13 were less likely to receive treatment because of health concerns yet still showed better outcomes than their counterparts in non-treated schools. The point is that a naive treatment-control comparison would understate the health benefit of the deworming program, because positive spillovers meant that healthier treated individuals also made their comparison cohorts healthier; Miguel and Kremer were able to capture these externalities through clever research design.

#####2. Bloom et al. (2015)

The Bloom et al. study compares employee performance, satisfaction, and career prospects for call center workers given the option to work from home. Because participation was voluntary, the experiment does not represent a true RCT across the full workforce. While the results sound impressive (13% productivity gains), the self-selected nature of the sample raises concerns about selection bias. The low attrition rate (80% compliance in the treatment group) is a strength, but it could also be evidence that those who chose to work from home were the individuals most motivated to do so, with the most suitable at-home setups, and the most confident in their ability to maintain performance. If, by contrast, less strong or less motivated workers remained in the control group, then the treatment effects could be overstated. I would especially expect the positive employee response to working from home to be higher among those self-selecting into it. Finally, an observer-expectancy effect may be in play: participants knew that their productivity was the question of interest, which likely led to heightened effort during the treatment period.

The external validity is also questionable. Call centers are not representative of most workplaces as they do not require physical presence (like construction for example) nor are they highly collaborative on an internal scope (like strategy consulting). Further, the Chinese work culture is very different from America or Europe. Therefore, the degree to which these results can be extrapolated to all workplaces or cultures is narrow.