Vào năm 1980, Kentucky đã tăng giới hạn thu nhập hàng tuần được bù đắp bởi tiền bồi thường của người lao động. Chúng tôi muốn biết liệu chính sách mới này có khiến người lao động mất nhiều thời gian hơn để thất nghiệp hay không. Nếu phúc lợi không đủ hào phóng, thì người lao động có thể kiện các công ty về các chấn thương tại chỗ, trong khi phúc lợi quá hào phóng có thể gây ra các vấn đề về rủi ro đạo đức và khiến người lao động thiếu thận trọng hơn trong công việc hoặc yêu cầu bồi thường rằng các chấn thương ngoài công việc đã được phát sinh trong khi làm việc.
Biến kết quả chính mà chúng tôi quan tâm là log_duration (trong dữ liệu ban đầu là ldurat, nhưng chúng tôi đổi tên nó để con người dễ đọc hơn) hoặc thời lượng ghi lại (tính bằng tuần) về quyền lợi bồi thường của người lao động. Chúng tôi ghi lại nó vì biến số khá lệch - hầu hết mọi người thất nghiệp trong vài tuần, với một số thất nghiệp trong thời gian dài. Chính sách được thiết kế để việc tăng trần không ảnh hưởng đến người lao động có thu nhập thấp, nhưng ảnh hưởng đến người lao động có thu nhập cao, vì vậy chúng tôi sử dụng người lao động có thu nhập thấp làm nhóm đối chứng và người lao động có thu nhập cao làm nhóm đối xử.
Dữ liệu được bao gồm trong gói wooldridge R dưới dạng tập dữ liệu chấn thương và nếu bạn cài đặt gói, hãy tải gói này bằng thư viện (wooldridge) và chạy? Thương trong bảng điều khiển, bạn có thể xem chi tiết đầy đủ về những gì có trong đó. Để giúp bạn thực hành nhiều hơn với việc tải dữ liệu từ các tệp bên ngoài, tôi đã xuất dữ liệu chấn thương dưới dạng tệp CSV (sử dụng write_csv (thương tích, “thương tích.csv”)) và đưa vào đây.
Đây là những cột chính mà chúng tôi sẽ lo lắng lúc này:
durat (chúng tôi sẽ đổi tên thành thời hạn): Thời gian trợ cấp thất nghiệp, tính bằng tuần
ldurat (mà chúng tôi sẽ đổi tên thành log_duration): Phiên bản đã ghi của durat (log (durat))
after_1980 (mà chúng tôi sẽ đổi tên thành after_1980): Đánh dấu biến chỉ báo nếu quan sát xảy ra trước (0) hoặc sau (1) thay đổi chính sách vào năm 1980. Đây là thời điểm của chúng tôi (hoặc trước / sau biến)
highearn: Đánh dấu biến chỉ số nếu quan sát là người có thu nhập thấp (0) hoặc cao (1). Đây là biến nhóm (hoặc điều trị / kiểm soát) của chúng tôi
library(tidyverse) # ggplot(), %>%, mutate(), and friends
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.1 v dplyr 1.0.6
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(broom) # Convert models to data frames
library(scales) # Format numbers with functions like comma(), percent(), and dollar()
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
library(modelsummary) # Create side-by-side regression tables
injury_raw=read.csv("C:/Users/binht/Dropbox/Hue/data/R/DID/injury.csv")
head(injury_raw)
## durat afchnge highearn male married hosp indust injtype age prewage
## 1 1 1 1 1 0 1 3 1 26 404.9500
## 2 1 1 1 1 1 0 3 1 31 643.8250
## 3 84 1 1 1 1 1 3 1 37 398.1250
## 4 4 1 1 1 1 1 3 1 31 527.8000
## 5 1 1 1 1 1 0 3 1 23 528.9375
## 6 1 1 1 1 1 0 3 1 34 614.2500
## totmed injdes benefit ky mi ldurat afhigh lprewage lage ltotmed
## 1 1187.5732 1010 246.8375 1 0 0.000000 1 6.003764 3.258096 7.079667
## 2 361.0786 1404 246.8375 1 0 0.000000 1 6.467427 3.433987 5.889095
## 3 8963.6572 1032 246.8375 1 0 4.430817 1 5.986766 3.610918 9.100934
## 4 1099.6483 1940 246.8375 1 0 1.386294 1 6.268717 3.433987 7.002746
## 5 372.8019 1940 211.5750 1 0 0.000000 1 6.270870 3.135494 5.921047
## 6 211.0199 1425 176.3125 1 0 0.000000 1 6.420402 3.526361 5.351953
## head neck upextr trunk lowback lowextr occdis manuf construc highlpre
## 1 1 0 0 0 0 0 0 0 0 6.003764
## 2 1 0 0 0 0 0 0 0 0 6.467427
## 3 1 0 0 0 0 0 0 0 0 5.986766
## 4 1 0 0 0 0 0 0 0 0 6.268717
## 5 1 0 0 0 0 0 0 0 0 6.270870
## 6 1 0 0 0 0 0 0 0 0 6.420402
injury <- injury_raw %>%
filter(ky == 1) %>%
# The syntax for rename is `new_name = original_name`
rename(duration = durat, log_duration = ldurat,
after_1980 = afchnge)
ggplot(data = injury, aes(x = duration)) +
# binwidth = 8 makes each column represent 2 months (8 weeks)
# boundary = 0 make it so the 0-8 bar starts at 0 and isn't -4 to 4
geom_histogram(binwidth = 8, color = "white", boundary = 0) +
facet_wrap(vars(highearn))
ggplot(data = injury, mapping = aes(x = log_duration)) +
geom_histogram(binwidth = 0.5, color = "white", boundary = 0) +
# Uncomment this line if you want to exponentiate the logged values on the
# x-axis. Instead of showing 1, 2, 3, etc., it'll show e^1, e^2, e^3, etc. and
# make the labels more human readable
# scale_x_continuous(labels = trans_format("exp", format = round)) +
facet_wrap(vars(highearn))
ggplot(data = injury, mapping = aes(x = log_duration)) +
geom_histogram(binwidth = 0.5, color = "white", boundary = 0) +
facet_wrap(vars(after_1980))
The distributions look normal-ish, but we can’t really easily see anything different between the before/after and treatment/control groups. We can plot the averages, though. There are a few different ways we can do this.
You can use a stat_summary() layer to have ggplot calculate summary statistics like averages. Here we just calculate the mean:
ggplot(injury, aes(x = factor(highearn), y = log_duration)) +
geom_point(size = 0.5, alpha = 0.2) +
stat_summary(geom = "point", fun = "mean", size = 5, color = "red") +
facet_wrap(vars(after_1980))
But we can also calculate the mean and 95% confidence interval:
ggplot(injury, aes(x = factor(highearn), y = log_duration)) +
stat_summary(geom = "pointrange", size = 1, color = "red",
fun.data = "mean_se", fun.args = list(mult = 1.96)) +
facet_wrap(vars(after_1980))
We can already start to see the classical diff-in-diff plot! It looks like high earners after 1980 had longer unemployment on average.
We can also use group_by() and summarize() to figure out group means before sending the data to ggplot. I prefer doing this because it gives me more control over the data that I’m plotting:
plot_data <- injury %>%
# Make these categories instead of 0/1 numbers so they look nicer in the plot
mutate(highearn = factor(highearn, labels = c("Low earner", "High earner")),
after_1980 = factor(after_1980, labels = c("Before 1980", "After 1980"))) %>%
group_by(highearn, after_1980) %>%
summarize(mean_duration = mean(log_duration),
se_duration = sd(log_duration) / sqrt(n()),
upper = mean_duration + (1.96 * se_duration),
lower = mean_duration + (-1.96 * se_duration))
## `summarise()` has grouped output by 'highearn'. You can override using the `.groups` argument.
ggplot(plot_data, aes(x = highearn, y = mean_duration)) +
geom_pointrange(aes(ymin = lower, ymax = upper),
color = "darkgreen", size = 1) +
facet_wrap(vars(after_1980))
Or, plotted in the more standard diff-in-diff format:
ggplot(plot_data, aes(x = after_1980, y = mean_duration, color = highearn)) +
geom_pointrange(aes(ymin = lower, ymax = upper), size = 1) +
# The group = highearn here makes it so the lines go across categories
geom_line(aes(group = highearn))
We can find that exact difference by filling out the 2x2 before/after treatment/control table:
A combination of group_by() and summarize() makes this really easy:
diffs <- injury %>%
group_by(after_1980, highearn) %>%
summarize(mean_duration = mean(log_duration),
# Calculate average with regular duration too, just for fun
mean_duration_for_humans = mean(duration))
## `summarise()` has grouped output by 'after_1980'. You can override using the `.groups` argument.
diffs
## # A tibble: 4 x 4
## # Groups: after_1980 [2]
## after_1980 highearn mean_duration mean_duration_for_humans
## <int> <int> <dbl> <dbl>
## 1 0 0 1.13 6.27
## 2 0 1 1.38 11.2
## 3 1 0 1.13 7.04
## 4 1 1 1.58 12.9
We can pull each of these numbers out of the table with some filter()s and pull():
before_treatment <- diffs %>%
filter(after_1980 == 0, highearn == 1) %>%
pull(mean_duration)
before_control <- diffs %>%
filter(after_1980 == 0, highearn == 0) %>%
pull(mean_duration)
after_treatment <- diffs %>%
filter(after_1980 == 1, highearn == 1) %>%
pull(mean_duration)
after_control <- diffs %>%
filter(after_1980 == 1, highearn == 0) %>%
pull(mean_duration)
diff_treatment_before_after <- after_treatment - before_treatment
diff_treatment_before_after
## [1] 0.1982585
diff_control_before_after <- after_control - before_control
diff_control_before_after
## [1] 0.007657313
diff_diff <- diff_treatment_before_after - diff_control_before_after
diff_diff
## [1] 0.1906012
The diff-in-diff estimate is 0.19, which means that the program causes an increase in unemployment duration of 0.19 logged weeks. Logged weeks is nonsensical, though, so we have to interpret it with percentages (here’s a handy guide!; this is Example B, where the dependent/outcome variable is logged). Receiving the treatment (i.e. being a high earner after the change in policy) causes a 19% increase in the length of unemployment.
ggplot(diffs, aes(x = as.factor(after_1980),
y = mean_duration,
color = as.factor(highearn))) +
geom_point() +
geom_line(aes(group = as.factor(highearn))) +
# If you use these lines you'll get some extra annotation lines and
# labels. The annotate() function lets you put stuff on a ggplot that's not
# part of a dataset. Normally with geom_line, geom_point, etc., you have to
# plot data that is in columns. With annotate() you can specify your own x and
# y values.
annotate(geom = "segment", x = "0", xend = "1",
y = before_treatment, yend = after_treatment - diff_diff,
linetype = "dashed", color = "grey50") +
annotate(geom = "segment", x = "1", xend = "1",
y = after_treatment, yend = after_treatment - diff_diff,
linetype = "dotted", color = "blue") +
annotate(geom = "label", x = "1", y = after_treatment - (diff_diff / 2),
label = "Program effect", size = 3)
# Here, all the as.factor changes are directly in the ggplot code. I generally
# don't like doing this and prefer to do that separately so there's less typing
# in the ggplot code, like this:
#
# diffs <- diffs %>%
# mutate(after_1980 = as.factor(after_1980), highearn = as.factor(highearn))
#
# ggplot(diffs, aes(x = after_1980, y = avg_durat, color = highearn)) +
# geom_line(aes(group = highearn))
Calculating all the pieces by hand like that is tedious, so we can use regression to do it instead! Remember that we need to include indicator variables for treatment/control and for before/after, as well as the interaction of the two. Here’s what the math equation looks like:
log(duration)=α+β highearn+γ after_1980+δ (highearn×after_1980)+ϵ
The δ coefficient is the effect we care about in the end—that’s the diff-in-diff estimator.
model_small <- lm(log_duration ~ highearn + after_1980 + highearn * after_1980,
data = injury)
tidy(model_small)
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.13 0.0307 36.6 1.62e-263
## 2 highearn 0.256 0.0474 5.41 6.72e- 8
## 3 after_1980 0.00766 0.0447 0.171 8.64e- 1
## 4 highearn:after_1980 0.191 0.0685 2.78 5.42e- 3
The coefficient for highearn:after_1980 is the same as what we found by hand, as it should be! It is statistically significant, so we can be fairly confident that it is not 0.
Diff-in-diff with regression + controls
One advantage to using regression for diff-in-diff is that we can include control variables to help isolate the effect. For example, perhaps claims made by construction or manufacturing workers tend to have longer duration than claims made workers in other industries. Or maybe those claiming back injuries tend to have longer claims than those claiming head injuries. We might also want to control for worker demographics such as gender, marital status, and age.
Let’s estimate an expanded version of the basic regression model with the following additional variables:
male
married
age
hosp (1 = hospitalized)
indust (1 = manuf, 2 = construc, 3 = other)
injtype (1-8; categories for different types of injury)
lprewage (log of wage prior to filing a claim)
Important: indust and injtype are in the dataset as numbers (1-3 and 1-8), but they’re actually categories. We have to tell R to treat them as categories (or factors), otherwise it’ll assume that you can have an injury type of 3.46 or something impossible.
# Convert industry and injury type to categories/factors
injury_fixed <- injury %>%
mutate(indust = as.factor(indust),
injtype = as.factor(injtype))
model_big <- lm(log_duration ~ highearn + after_1980 + highearn * after_1980 +
male + married + age + hosp + indust + injtype + lprewage,
data = injury_fixed)
tidy(model_big)
## # A tibble: 18 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -1.53 0.422 -3.62 2.98e- 4
## 2 highearn -0.152 0.0891 -1.70 8.86e- 2
## 3 after_1980 0.0495 0.0413 1.20 2.31e- 1
## 4 male -0.0843 0.0423 -1.99 4.64e- 2
## 5 married 0.0567 0.0373 1.52 1.29e- 1
## 6 age 0.00651 0.00134 4.86 1.19e- 6
## 7 hosp 1.13 0.0370 30.5 5.20e-189
## 8 indust2 0.184 0.0541 3.40 6.87e- 4
## 9 indust3 0.163 0.0379 4.32 1.60e- 5
## 10 injtype2 0.935 0.144 6.51 8.29e- 11
## 11 injtype3 0.635 0.0854 7.44 1.19e- 13
## 12 injtype4 0.555 0.0928 5.97 2.49e- 9
## 13 injtype5 0.641 0.0854 7.51 7.15e- 14
## 14 injtype6 0.615 0.0863 7.13 1.17e- 12
## 15 injtype7 0.991 0.191 5.20 2.03e- 7
## 16 injtype8 0.434 0.119 3.65 2.64e- 4
## 17 lprewage 0.284 0.0801 3.55 3.83e- 4
## 18 highearn:after_1980 0.169 0.0640 2.64 8.38e- 3
# Extract just the diff-in-diff estimate
diff_diff_controls <- tidy(model_big) %>%
filter(term == "highearn:after_1980") %>%
pull(estimate)
After controlling for a host of demographic controls, the diff-in-diff estimate is smaller (0.17), indicating that the policy caused a 17% increase in the duration of weeks unemployed following a workplace injury. It is smaller because the other independent variables now explain some of the variation in log_duration.
Comparison of results
We can put the model coefficients side-by-side to compare the value for highearn:after_1980 as we change the model.
modelsummary(list("Simple" = model_small, "Full" = model_big))
| Simple | Full | |
|---|---|---|
| (Intercept) | 1.126 | -1.528 |
| (0.031) | (0.422) | |
| highearn | 0.256 | -0.152 |
| (0.047) | (0.089) | |
| after_1980 | 0.008 | 0.050 |
| (0.045) | (0.041) | |
| highearn × after_1980 | 0.191 | 0.169 |
| (0.069) | (0.064) | |
| male | -0.084 | |
| (0.042) | ||
| married | 0.057 | |
| (0.037) | ||
| age | 0.007 | |
| (0.001) | ||
| hosp | 1.130 | |
| (0.037) | ||
| indust2 | 0.184 | |
| (0.054) | ||
| indust3 | 0.163 | |
| (0.038) | ||
| injtype2 | 0.935 | |
| (0.144) | ||
| injtype3 | 0.635 | |
| (0.085) | ||
| injtype4 | 0.555 | |
| (0.093) | ||
| injtype5 | 0.641 | |
| (0.085) | ||
| injtype6 | 0.615 | |
| (0.086) | ||
| injtype7 | 0.991 | |
| (0.191) | ||
| injtype8 | 0.434 | |
| (0.119) | ||
| lprewage | 0.284 | |
| (0.080) | ||
| Num.Obs. | 5626 | 5347 |
| R2 | 0.021 | 0.190 |
| R2 Adj. | 0.020 | 0.187 |
| AIC | 18654.0 | 16684.8 |
| BIC | 18687.2 | 16809.9 |
| Log.Lik. | -9321.997 | -8323.388 |
| F | 39.540 | 73.459 |
| RMSE | 1.27 | 1.15 |