library("tidyverse") # attach the packages before using their functions
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library("knitr")
library("psych") # describe() and describeBy() functions
## 
## Attaching package: 'psych'
## 
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library("dplyr") # already attached by tidyverse; provides filter() and %>%

library("wooldridge")
data("injury")
?injury

dataky <- filter(injury, ky == 1)
lm1 <- lm(durat ~ afchnge + highearn + afchnge*highearn ,data = dataky)
summary(lm1)
## 
## Call:
## lm(formula = durat ~ afchnge + highearn + afchnge * highearn, 
##     data = dataky)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.644  -6.787  -4.272  -0.272 175.728 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        6.2716     0.5229  11.994  < 2e-16 ***
## afchnge            0.7658     0.7607   1.007    0.314    
## highearn           4.9050     0.8071   6.077  1.3e-09 ***
## afchnge:highearn   0.9513     1.1654   0.816    0.414    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.59 on 5622 degrees of freedom
## Multiple R-squared:  0.01577,    Adjusted R-squared:  0.01524 
## F-statistic: 30.02 on 3 and 5622 DF,  p-value: < 2.2e-16
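
For reference when reading the coefficients, the regression can be written out together with the four group means it implies. This is only a sketch of the standard two-group, two-period setup; the symbols \beta_0 to \beta_3 and u are notation introduced here and do not appear in the output.

```latex
\begin{aligned}
\text{durat} &= \beta_0 + \beta_1\,\text{afchnge} + \beta_2\,\text{highearn}
              + \beta_3\,(\text{afchnge}\times\text{highearn}) + u \\[4pt]
E[\text{durat}\mid \text{afchnge}=0,\ \text{highearn}=0] &= \beta_0 \\
E[\text{durat}\mid \text{afchnge}=1,\ \text{highearn}=0] &= \beta_0 + \beta_1 \\
E[\text{durat}\mid \text{afchnge}=0,\ \text{highearn}=1] &= \beta_0 + \beta_2 \\
E[\text{durat}\mid \text{afchnge}=1,\ \text{highearn}=1] &= \beta_0 + \beta_1 + \beta_2 + \beta_3 \\[4pt]
\beta_3 &= \bigl(E[\text{durat}\mid 1,1] - E[\text{durat}\mid 0,1]\bigr)
         - \bigl(E[\text{durat}\mid 1,0] - E[\text{durat}\mid 0,0]\bigr)
\end{aligned}
```

In the last line the conditioning values are written in the order (afchnge, highearn).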

The intercept is the average duration for low earners (the control group, highearn = 0) before the policy change (afchnge = 0). The estimate is 6.272 and it is statistically significant at the 5% level. This means that the duration is 6.272 weeks on average for low earners before the policy change.

The coefficient of afchnge is 0.7658. This is the expected change in duration for low-earning individuals after the policy change: 0.7658 weeks longer than before. However, the coefficient is statistically insignificant at the 5% level of significance, as p-value = 0.314 > 0.05. This means that we cannot reject the null hypothesis of no change in durat for low-earning individuals after the policy change. Failing to reject does not show that the change is exactly zero, only that the data cannot distinguish it from zero.

The coefficient of highearn is 4.9050. This is the expected difference in duration between high-earning and low-earning individuals before the policy change (afchnge = 0). This coefficient is statistically significant at the 5% level of significance, with a p-value of 1.3e-09. Hence we can reject the null hypothesis of no difference and conclude that, before the policy change, durat was 4.9050 weeks higher for high-earning individuals than for low-earning individuals. After the policy change the estimated gap is the sum of this coefficient and the interaction term, 4.9050 + 0.9513 = 5.8563 weeks.

The coefficient of afchnge*highearn is 0.9513. This is the difference-in-differences estimate: after the policy change, durat for high-earning individuals rose by 0.9513 weeks more than it did for low-earning individuals. However, the coefficient is statistically insignificant at the 5% level of significance, as its p-value is 0.414. Hence we cannot reject the null hypothesis that the policy change had no effect on durat of high-earning individuals in this specification.
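
One way to check these interpretations is to rebuild the coefficients from the four group means. The sketch below assumes the wooldridge package is installed; the names `cell_means`, `change_low`, `change_high` and `did` are illustrative and not part of the original analysis. Because the model contains only the two dummies and their interaction, the difference-in-differences of the means should reproduce the interaction coefficient.

```r
library(wooldridge)  # assumed installed; provides the injury data
data("injury")

# Kentucky subsample, as in the regression above
dataky <- subset(injury, ky == 1)

# Mean duration in each of the four afchnge x highearn cells
cell_means <- tapply(
  dataky$durat,
  list(afchnge = dataky$afchnge, highearn = dataky$highearn),
  mean
)

# Change over time within each earnings group
change_low  <- cell_means["1", "0"] - cell_means["0", "0"]  # low earners
change_high <- cell_means["1", "1"] - cell_means["0", "1"]  # high earners

# Difference-in-differences: extra change for high earners
did <- change_high - change_low

# Compare with the regression coefficients
lm1 <- lm(durat ~ afchnge * highearn, data = dataky)
all.equal(unname(change_low), unname(coef(lm1)["afchnge"]))
all.equal(unname(did), unname(coef(lm1)["afchnge:highearn"]))
```

Both comparisons should return TRUE, which confirms that afchnge measures the change for low earners only and that the interaction term is the extra change for high earners.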

lm2 <- lm(ldurat ~ afchnge + highearn + afchnge*highearn ,data = dataky)
summary(lm2)
## 
## Call:
## lm(formula = ldurat ~ afchnge + highearn + afchnge * highearn, 
##     data = dataky)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9666 -0.8872  0.0042  0.8126  4.0784 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1.125615   0.030737  36.621  < 2e-16 ***
## afchnge          0.007657   0.044717   0.171  0.86404    
## highearn         0.256479   0.047446   5.406 6.72e-08 ***
## afchnge:highearn 0.190601   0.068509   2.782  0.00542 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.269 on 5622 degrees of freedom
## Multiple R-squared:  0.02066,    Adjusted R-squared:  0.02014 
## F-statistic: 39.54 on 3 and 5622 DF,  p-value: < 2.2e-16

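The intercept is the average of ldurat for the control group (low-earning individuals) before the policy change. The estimate is 1.1256, which corresponds to a geometric mean duration of exp(1.1256) ≈ 3.08 weeks. The other coefficients are read the same way as in the levels model, except that they now measure differences in log duration, which are approximately proportional changes. The main difference from the levels model is the interaction term: the coefficient of afchnge*highearn is 0.1906 and is statistically significant at the 1% level (p-value = 0.00542), so after the policy change duration for high earners rose by roughly 19% more than for low earners.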
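
Because the dependent variable is in logs, reading a coefficient directly as a percentage is only an approximation. The short sketch below converts the interaction coefficient from summary(lm2) into the exact implied percentage change; the variable names are illustrative, and the coefficient is copied by hand from the output above.

```r
# Interaction coefficient copied from summary(lm2) above
b_did <- 0.190601

# The usual approximation reads the coefficient directly as a percentage
pct_approx <- 100 * b_did

# Exact proportional change implied by a difference in logs
pct_exact <- 100 * (exp(b_did) - 1)

round(c(approx = pct_approx, exact = pct_exact), 1)
```

The exact figure is about 21%, somewhat larger than the 19% approximation, because the approximation becomes less accurate as the coefficient moves away from zero.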

lm3 <- lm(ldurat ~ afchnge + highearn + afchnge*highearn + male + married + head + neck + upextr + trunk + lowback + lowextr + occdis + manuf + construc,data = dataky)
summary(lm3)
## 
## Call:
## lm(formula = ldurat ~ afchnge + highearn + afchnge * highearn + 
##     male + married + head + neck + upextr + trunk + lowback + 
##     lowextr + occdis + manuf + construc, data = dataky)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3436 -0.8541  0.0989  0.7856  4.4372 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       1.245922   0.106168  11.735  < 2e-16 ***
## afchnge           0.010627   0.044917   0.237 0.812974    
## highearn          0.175760   0.051746   3.397 0.000687 ***
## male             -0.097941   0.044550  -2.198 0.027959 *  
## married           0.122099   0.039123   3.121 0.001812 ** 
## head             -0.513900   0.129278  -3.975 7.13e-05 ***
## neck              0.269913   0.161490   1.671 0.094703 .  
## upextr           -0.178539   0.101179  -1.765 0.077692 .  
## trunk             0.126451   0.109016   1.160 0.246129    
## lowback          -0.008597   0.101527  -0.085 0.932524    
## lowextr          -0.120291   0.102326  -1.176 0.239821    
## occdis            0.272712   0.210769   1.294 0.195760    
## manuf            -0.160671   0.040904  -3.928 8.67e-05 ***
## construc          0.110197   0.051806   2.127 0.033458 *  
## afchnge:highearn  0.230877   0.069525   3.321 0.000904 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.251 on 5334 degrees of freedom
##   (277 observations deleted due to missingness)
## Multiple R-squared:  0.0412, Adjusted R-squared:  0.03868 
## F-statistic: 16.37 on 14 and 5334 DF,  p-value: < 2.2e-16

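The coefficient of afchnge*highearn increased from 0.190601 to 0.230877 and remains statistically significant (p-value = 0.000904). One likely reason for this change is the added control variables: if they are correlated with both the interaction term and ldurat, omitting them biases the estimate, and including them should reduce that bias. Note also that lm3 is estimated on a smaller sample, since 277 observations are dropped due to missing values, so part of the change may come from the different sample rather than from the controls.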
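
To separate the effect of the added controls from the effect of the smaller sample, both models can be refit on the same complete-case rows. This is a sketch rather than part of the original assignment; it assumes the wooldridge package is installed, and `control_vars`, `dataky_cc`, `lm2_cc` and `lm3_cc` are illustrative names.

```r
library(wooldridge)  # assumed installed; provides the injury data
data("injury")
dataky <- subset(injury, ky == 1)

control_vars <- c("male", "married", "head", "neck", "upextr", "trunk",
                  "lowback", "lowextr", "occdis", "manuf", "construc")

# Keep only rows with no missing values in any variable used by lm3
model_vars <- c("ldurat", "afchnge", "highearn", control_vars)
dataky_cc  <- dataky[complete.cases(dataky[, model_vars]), ]

# Same sample, without and with the controls
lm2_cc <- lm(ldurat ~ afchnge * highearn, data = dataky_cc)
lm3_cc <- lm(
  reformulate(c("afchnge * highearn", control_vars), response = "ldurat"),
  data = dataky_cc
)

nobs(lm2_cc)  # number of rows shared by both models
c(without_controls = unname(coef(lm2_cc)["afchnge:highearn"]),
  with_controls    = unname(coef(lm3_cc)["afchnge:highearn"]))
```

If the interaction coefficient without controls on this restricted sample is still close to 0.19, the increase is mainly due to the controls; if it is already close to 0.23, the change is mainly due to the dropped observations.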

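Q1d) The R-squared value is 0.0412, and the adjusted R-squared is 0.0387. The low R-squared value means that the covariates in the regression model account for only about 4% of the variance in the dependent variable, ldurat. However, a low R-squared does not make the estimate of the policy effect biased. R-squared measures how well the model predicts individual outcomes, while unbiasedness of the difference-in-differences estimate depends on the identifying assumption, not on the fit. The interaction coefficient is also statistically significant in this model. Thus, the low R-squared by itself does not invalidate the causal interpretation.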

Q1e) The key assumption is parallel trends: without the policy change, ldurat for the treatment group (high earners) would have followed the same trend as for the control (baseline) group (low earners). To assess this, we would normally create a line graph of ldurat for both groups over several periods before the policy change. However, the current dataset does not provide enough information to generate such a graph, as it lacks details on how ldurat fluctuates over time. We only have adequate data to compare ldurat before and after the policy implementation, so the assumption cannot be checked with these data.