Objective: This notebook replicates the main findings of Abrevaya (2002) and provide code for quantile regressions in RStudio. In his paper, Jason Abrevaya studies the impact of demographics and maternal behavior on various quantiles of the birthweight distribution. The question is relevant because high costs and long-term effects -medical and economic- are associated with low-birthweight babies. Thus, the identification strategy allows to focus on the lower end of the birthweight distribution to quantify drivers of low-birthweight. Using data on births in the US in 1992 and 1996, he mainly highlights that many covariates have larger effects at the lower quantiles and lower effects at the higher quantiles. According to the author, the OLS can seriously under-estimate the effects at lower quantiles. Finally, results don’t seem to be driven by state effect.
Dependencies & Data
Abrevaya relies on the 1992 and 1996 editions of the Natality Data Set. It gathers information on the universe of US births, including babies -weight, sex, disorders, if any- and maternal outcomes -age, race, marital status, education, prenatal care, smoking behavior, among others.
For the purpose of this notebook, I only use the 1992 to run the
analysis and show the main functions implemented in
RStudio.
# Clear the workplace
rm(list = ls())
# Working directory
setwd(dirname(dirname(rstudioapi::getActiveDocumentContext()$path)))
# Run a script with packages, functions and aesthetic settings
source("./code/00_dependencies.R")
# Load data
data <- read_dta("./data/natality1992us_clean.dta")
Descriptives
Following the author’s procedure, I cleaned beforehand the 1992 data
set to make sure I get the right results (see
01_cleaning.dta). Below, the descriptive statistics that
replicates Table 1. of the paper.
# Descriptives (mean & std.dev.)
sd_vars <- colnames(data)
sd_list <- list()
for (i in sd_vars){
sd_list[[i]] <-
sd_univ_mean(data, data[[i]], label(data[[i]])) %>%
dplyr::select(Variables, `Mean`, `Std.dev.`)
}
sd <- do.call(rbind, sd_list)
# Descriptives. HTML table.
sd %>% kable(., booktabs = TRUE,
caption = "Descriptive statistics numeric variables.") %>%
kable_styling(font_size = 12) %>%
scroll_box(width = "100%")
| Variables | Mean | Std.dev. |
|---|---|---|
| Birth Weight | 3,388.68 | (571.51) |
| Black Ethnicity Indicator | 0.16 | (0.37) |
| Marital Status (Married) | 0.75 | (0.43) |
| Mother’s Age | 26.96 | (5.43) |
| Mother’s Age Squared | 756.39 | (303.88) |
| High School Graduate | 0.39 | (0.49) |
| Some College Education | 0.23 | (0.42) |
| College Graduate | 0.21 | (0.41) |
| Weight Gain During Pregnancy | 30.76 | (12.25) |
| Second Trimester Prenatal Care | 0.16 | (0.36) |
| Third Trimester Prenatal Care | 0.03 | (0.16) |
| No Prenatal Visits | 0.01 | (0.1) |
| Non-Smoker During Pregnancy | 0.83 | (0.38) |
| Cigarette Usage | 2.14 | (5.74) |
| Male Child Indicator | 0.51 | (0.5) |
Quantile regressions
# Quantile reg. analysis
y <- "dbirwt"
x <- paste(sd_vars[-1], collapse = " + ")
formula_qr <- as.formula(paste(y, " ~ ", x))
# Estimate in separate R objects
q <- c(0.1, 0.25, 0.5, 0.75, 0.9)
qr <- list()
# 1. Quantile estimates
for (tau in q) {
qr[[as.character(tau)]] <-
# TYPE "help(rq)" FOR MORE INFORMATION!
rq(formula_qr, # Put y ~ x type of formula
tau, # Scalar or vector of quantiles in [0,1]
data, # Data frame
method = 'fn') # Algorithm used to compute the quantile estimates in
# {'br':(default); 'fn':(recommended for large df); ...}
}
# 2. Bootstrap std.err. computations ('BOOT' INSTEAD OF 'KER': TIME-CONSUMING /!\)
qr_rob <- list()
for (tau in q) {
qr_rob[[as.character(tau)]] <-
summary(qr[[as.character(tau)]], # Model previously estimated
se = "ker", # Standard errors in {'iid':(default);
# 'nid':(non-iid), 'ker':(kernel-smoothed),
# 'boot':(bootstrapped)}
R = 1000)$coefficients[,2]
}
# Estimates
names(qr) <- paste0("qr_", names(qr))
invisible(list2env(qr, envir = .GlobalEnv))
# Std. err.
names(qr_rob) <- paste0("qr_rob_", names(qr_rob))
invisible(list2env(qr_rob, envir = .GlobalEnv))
# To get a tidy data frame from reg. output. Useful for further manipulation.
df_qr_0.5 <- tidy(qr_0.5)
# Baseline comparison with OLS
ols <- lm(formula_qr, data)
ols_rob <- coeftest(ols, vcov = vcovHC(ols, type = "HC1")) # Full model
ols_rob <- sqrt(diag(vcovHC(ols, fit="HC1"))) # Only se. in a list
Table
Below the regression tables for the 10-th, 25-th, 50-th (LAD), 75-th
and 90-th quantiles, together with the OLS estimates. It reproduces
Table 2. of the paper. Several notes are worth
mentioning: i) the point estimates are slightly different -most likely
due to both a slightly different cleaning, I have some more observations
here, and differences in statistical softwares computations-; ii) the
standard errors also differ a bit because I didn’t go for bootstrapped
computations (too time-consuming). It is possible to compute the
bootstrapped errors if you change the se option by
'boot' in summary().
There are several threat to identification. As highlighted by the authors, there is likely an OVB issue since they can not control for all maternal behavior that are likely correlated with other regressors in the model, like alcohol use or prenatal diet for instance. Another issue is the likely failure of the rank invariance condition: it seems implausible that the babies would remain on the same position in the birthweight distribution, independently of regressors values. The interpretations are thus only descriptive here, but still informative. For instance,
If anything, the black-white differential in birthweight at the 10-th quantile is 253 grams. In other words, the 10-th quantile of birthweight for a baby born to a black mother is 253 grams below than the 10-th quantile of birthweight for a baby born to a white mother;
If anything, the differential in birthweight between college graduate and “dropouts” shrink completely from 82 grams at the 10-th quantile compared to the 90-th quantile. The effect of college on birthweight is sizeable for low-birthweight babies, while not being significant at the high-end of the distribution;
The intercept can be interpreted as the estimated conditional quantile function of the birthweight distribution of a girl born to an unmaried, white mother with less than a high school education, who is 27yo and had a weight gain of 30 pounds, didn’t smoke, and had her first prenatal visit in the first trimester of the pregnancy (the number reflects mean values). In this example, given the high number of covariates, this baseline comparison seems hard to grasp but it might be more informative in simpler settings.
# Regression table output
stargazer(qr_0.1,
qr_0.25,
qr_0.5,
qr_0.75,
qr_0.9,
ols,
se = c(list(qr_rob_0.1),
list(qr_rob_0.25),
list(qr_rob_0.5),
list(qr_rob_0.75),
list(qr_rob_0.9),
list(ols_rob)),
type = "html",
dep.var.labels = "Birth weight",
model.names = FALSE,
column.labels = c("0.1","0.25","0.5","0.75","0.9", "OLS"),
covariate.labels = sapply(data, function(x) attr(x, "label"))[-1])
| Dependent variable: | ||||||
| Birth weight | ||||||
| 0.1 | 0.25 | 0.5 | 0.75 | 0.9 | OLS | |
| (1) | (2) | (3) | (4) | (5) | (6) | |
| Black Ethnicity Indicator | -253.656*** | -216.794*** | -199.071*** | -192.405*** | -182.087*** | -220.271*** |
| (7.730) | (4.609) | (3.905) | (4.446) | (5.710) | (3.874) | |
| Marital Status (Married) | 73.323*** | 55.181*** | 50.563*** | 42.456*** | 38.560*** | 57.560*** |
| (6.272) | (4.188) | (3.583) | (4.209) | (5.487) | (3.466) | |
| Mother’s Age | 45.177*** | 36.722*** | 34.636*** | 31.909*** | 33.788*** | 37.584*** |
| (4.174) | (2.706) | (2.304) | (2.560) | (3.469) | (2.176) | |
| Mother’s Age Squared | -0.781*** | -0.591*** | -0.530*** | -0.440*** | -0.436*** | -0.579*** |
| (0.075) | (0.048) | (0.041) | (0.045) | (0.061) | (0.039) | |
| High School Graduate | 28.580*** | 25.078*** | 17.543*** | 15.024*** | 13.355** | 16.684*** |
| (6.924) | (4.682) | (3.884) | (4.456) | (6.016) | (3.732) | |
| Some College Education | 49.494*** | 40.549*** | 31.698*** | 26.975*** | 18.640*** | 30.372*** |
| (7.754) | (5.248) | (4.515) | (5.133) | (6.810) | (4.271) | |
| College Graduate | 82.645*** | 57.601*** | 36.839*** | 16.270*** | -4.663 | 36.718*** |
| (8.294) | (5.730) | (4.928) | (5.496) | (7.554) | (4.608) | |
| Weight Gain During Pregnancy | 11.776*** | 9.905*** | 9.134*** | 8.838*** | 8.576*** | 10.490*** |
| (0.180) | (0.124) | (0.111) | (0.117) | (0.150) | (0.109) | |
| Second Trimester Prenatal Care | 6.176 | 2.833 | -4.477 | -0.544 | 3.796 | 7.731** |
| (6.399) | (4.218) | (3.705) | (4.118) | (5.687) | (3.450) | |
| Third Trimester Prenatal Care | 39.434*** | 12.679 | 3.967 | -6.239 | -25.498** | 20.268*** |
| (13.223) | (9.284) | (7.746) | (8.793) | (11.445) | (7.095) | |
| No Prenatal Visits | -388.584*** | -196.452*** | -147.230*** | -126.642*** | -101.659*** | -193.737*** |
| (43.675) | (24.659) | (14.529) | (17.049) | (18.954) | (16.026) | |
| Non-Smoker During Pregnancy | 170.750*** | 171.840*** | 158.821*** | 153.227*** | 150.219*** | 161.929*** |
| (11.710) | (7.038) | (6.152) | (6.582) | (8.643) | (5.718) | |
| Cigarette Usage | -3.803*** | -3.519*** | -3.902*** | -4.461*** | -5.168*** | -4.043*** |
| (0.766) | (0.463) | (0.398) | (0.422) | (0.515) | (0.367) | |
| Male Child Indicator | 87.763*** | 109.568*** | 129.128*** | 142.402*** | 153.582*** | 122.568*** |
| (4.439) | (2.947) | (2.566) | (2.863) | (3.843) | (2.384) | |
| Constant | 1,556.905*** | 2,020.488*** | 2,377.883*** | 2,715.228*** | 2,967.683*** | 2,273.476*** |
| (56.734) | (37.365) | (31.759) | (34.950) | (47.531) | (29.967) | |
| Observations | 199,181 | 199,181 | 199,181 | 199,181 | 199,181 | 199,181 |
| R2 | 0.134 | |||||
| Adjusted R2 | 0.134 | |||||
| Residual Std. Error | 531.845 (df = 199166) | |||||
| F Statistic | 2,202.321*** (df = 14; 199166) | |||||
| Note: | p<0.1; p<0.05; p<0.01 | |||||
Wald tests
The two tables below respectively test,
For equality of the quantile coefficients along the conditional y-distribution \(\longrightarrow\) except for High School Graduate, Second Trimester Prenatal Care, Non-Smoker During Pregnancy and Cigarette Usage, all variables show coefficients that significantly differ across the quantiles;
For the interquantile range (are the \(\beta_{0.1}\) and \(\beta_{0.9}\) statistically different?) \(\longrightarrow\) same conclusion as above, with High School Graduate that shows significantly different coefficients across the 10-th and the 90-th quantiles at 10% confidence level now.
# Wald test
qr_all <- rq(formula_qr, q, data, method = 'fn')
qr_wald1 <- anova(qr_all, test = "Wald", joint = F)$table %>%
as.data.frame() %>%
round(., digits = 3)
qr_wald2 <- anova(qr_0.1, qr_0.9, test = "Wald", joint = F)$table %>%
as.data.frame() %>%
round(., digits = 3)
# Table 1.
colnames(qr_wald1) <- c("DF", "Res.DF", "F-value", "p-values")
row.names(qr_wald1) <- var_label(data)[-1]
qr_wald1 %>% kable(., booktabs = TRUE,
caption = "Wald test for QR") %>%
kable_styling(font_size = 12) %>%
scroll_box(width = "100%")
| DF | Res.DF | F-value | p-values | |
|---|---|---|---|---|
| Black Ethnicity Indicator | 4 | 995901 | 16.243 | 0.000 |
| Marital Status (Married) | 4 | 995901 | 5.342 | 0.000 |
| Mother’s Age | 4 | 995901 | 2.260 | 0.060 |
| Mother’s Age Squared | 4 | 995901 | 4.630 | 0.001 |
| High School Graduate | 4 | 995901 | 1.235 | 0.293 |
| Some College Education | 4 | 995901 | 2.681 | 0.030 |
| College Graduate | 4 | 995901 | 18.053 | 0.000 |
| Weight Gain During Pregnancy | 4 | 995901 | 72.269 | 0.000 |
| Second Trimester Prenatal Care | 4 | 995901 | 1.532 | 0.190 |
| Third Trimester Prenatal Care | 4 | 995901 | 3.574 | 0.006 |
| No Prenatal Visits | 4 | 995901 | 9.973 | 0.000 |
| Non-Smoker During Pregnancy | 4 | 995901 | 1.545 | 0.186 |
| Cigarette Usage | 4 | 995901 | 1.533 | 0.189 |
| Male Child Indicator | 4 | 995901 | 40.502 | 0.000 |
# Table 2.
colnames(qr_wald2) <- c("DF", "Res.DF", "F-value", "p-values")
row.names(qr_wald2) <- var_label(data)[-1]
qr_wald2 %>% kable(., booktabs = TRUE,
caption = "Wald test for QR") %>%
kable_styling(font_size = 12) %>%
scroll_box(width = "100%")
| DF | Res.DF | F-value | p-values | |
|---|---|---|---|---|
| Black Ethnicity Indicator | 1 | 398361 | 60.239 | 0.000 |
| Marital Status (Married) | 1 | 398361 | 18.732 | 0.000 |
| Mother’s Age | 1 | 398361 | 4.823 | 0.028 |
| Mother’s Age Squared | 1 | 398361 | 13.932 | 0.000 |
| High School Graduate | 1 | 398361 | 2.957 | 0.086 |
| Some College Education | 1 | 398361 | 9.522 | 0.002 |
| College Graduate | 1 | 398361 | 64.188 | 0.000 |
| Weight Gain During Pregnancy | 1 | 398361 | 219.641 | 0.000 |
| Second Trimester Prenatal Care | 1 | 398361 | 0.085 | 0.770 |
| Third Trimester Prenatal Care | 1 | 398361 | 13.700 | 0.000 |
| No Prenatal Visits | 1 | 398361 | 39.691 | 0.000 |
| Non-Smoker During Pregnancy | 1 | 398361 | 2.214 | 0.137 |
| Cigarette Usage | 1 | 398361 | 2.695 | 0.101 |
| Male Child Indicator | 1 | 398361 | 139.003 | 0.000 |
Plots
Here are some (more or less customed) plots that can be useful for visualization and comparison of coefficients.
# No customization. Good for first analysis and overview.
summary(qr_all) %>% plot()
# Customized plot for each regressor: see function 'ggplot.qr' in 00.dependencies.R
qr_plots <- list()
for(k in colnames(data)[-1]){
qr_plots[[k]] <- ggplot.qr(ols, qr_all, q, data, k, "ker")
qr_plots[[k]]
}
names(qr_plots) <- paste0("qr_plot_", names(qr_plots))
invisible(list2env(qr_plots, envir = .GlobalEnv))
# Examples
grid.arrange(qr_plot_married, qr_plot_boy, qr_plot_collgrad,
qr_plot_novisit, qr_plot_black, ncol = 2)
# Some customization w/ comparison between regressor.
# TYPE "help(plot_models)" FOR MORE INFO!
qr_comp_1to7 <- plot_models(ols, qr_0.5, qr_0.9,
spacing = 0.8,
show.values = T,
rm.terms = colnames(data)[9:15],
m.labels = c("OLS", "LAD", "QR 90%"),
legend.title = "Specification",
colors = c(my_purple, my_blue, my_red)) +
my_theme +
theme(legend.position = c(0.1, 0.1)) +
coord_flip()
qr_comp_1to7
qr_comp_8to15 <- plot_models(ols, qr_0.5, qr_0.9,
spacing = 0.8,
show.values = T,
rm.terms = colnames(data)[2:8],
m.labels = c("OLS", "LAD", "QR 90%"),
legend.title = "Specification",
colors = c(my_purple, my_blue, my_red)) +
my_theme +
theme(legend.position = c(0.1, 0.1)) +
coord_flip()
qr_comp_8to15