Objective: This notebook replicates the main findings of Abrevaya (2002) and provide code for quantile regressions in RStudio. In his paper, Jason Abrevaya studies the impact of demographics and maternal behavior on various quantiles of the birthweight distribution. The question is relevant because high costs and long-term effects -medical and economic- are associated with low-birthweight babies. Thus, the identification strategy allows to focus on the lower end of the birthweight distribution to quantify drivers of low-birthweight. Using data on births in the US in 1992 and 1996, he mainly highlights that many covariates have larger effects at the lower quantiles and lower effects at the higher quantiles. According to the author, the OLS can seriously under-estimate the effects at lower quantiles. Finally, results don’t seem to be driven by state effect.
Dependencies & Data
Abrevaya relies on the 1992 and 1996 editions of the Natality Data Set. It gathers information on the universe of US births, including babies -weight, sex, disorders, if any- and maternal outcomes -age, race, marital status, education, prenatal care, smoking behavior, among others.
For the purpose of this notebook, I only use the 1992 to run the analysis and show the main functions implemented in RStudio.
Code
# Clear the workplacerm(list =ls())# Working directoryif (rstudioapi::isAvailable()) {setwd(dirname(dirname(rstudioapi::getActiveDocumentContext()$path)))} else {# Replace with your path if it doesn't runsetwd("/Users/cepremap/Dropbox/Mac/Documents/PhD/Econometric Game/QR") }# Run a script with packages, functions and aesthetic settingssource("./code/00_dependencies.R")# Load datadata <-read_dta("./data/natality1992us_clean.dta")
Descriptives
Following the author’s procedure, I cleaned beforehand the 1992 data set to make sure I get the right results (see 01_cleaning.dta). Below, the descriptive statistics that replicates Table 1. of the paper.
# Quantile reg. analysisy <-"dbirwt"x <-paste(sd_vars[-1], collapse =" + ")formula_qr <-as.formula(paste(y, " ~ ", x))# Estimate in separate R objectsq <-c(0.1, 0.25, 0.5, 0.75, 0.9)qr <-list()# 1. Quantile estimatesfor (tau in q) { qr[[as.character(tau)]] <-# TYPE "help(rq)" FOR MORE INFORMATION!rq(formula_qr, # Put y ~ x type of formula tau, # Scalar or vector of quantiles in [0,1] data, # Data framemethod ='fn') # Algorithm used to compute the quantile estimates in # {'br':(default); 'fn':(recommended for large df); ...}}# 2. Bootstrap std.err. computations ('BOOT' INSTEAD OF 'KER': TIME-CONSUMING /!\)qr_rob <-list()for (tau in q) { qr_rob[[as.character(tau)]] <-summary(qr[[as.character(tau)]], # Model previously estimatedse ="ker", # Standard errors in {'iid':(default);# 'nid':(non-iid), 'ker':(kernel-smoothed),# 'boot':(bootstrapped)}R =1000)$coefficients[,2]}# Estimates names(qr) <-paste0("qr_", names(qr))invisible(list2env(qr, envir = .GlobalEnv))# Std. err. names(qr_rob) <-paste0("qr_rob_", names(qr_rob))invisible(list2env(qr_rob, envir = .GlobalEnv))# To get a tidy data frame from reg. output. Useful for further manipulation.df_qr_0.5<-tidy(qr_0.5)# Baseline comparison with OLSols <-lm(formula_qr, data)ols_rob <-coeftest(ols, vcov =vcovHC(ols, type ="HC1")) # Full model ols_rob <-sqrt(diag(vcovHC(ols, fit="HC1"))) # Only se. in a list
Table
Below the regression tables for the 10-th, 25-th, 50-th (LAD), 75-th and 90-th quantiles, together with the OLS estimates. It reproduces Table 2. of the paper. Several notes are worth mentioning: i) the point estimates are slightly different -most likely due to both a slightly different cleaning, I have some more observations here, and differences in statistical softwares computations-; ii) the standard errors also differ a bit because I didn’t go for bootstrapped computations (too time-consuming). It is possible to compute the bootstrapped errors if you change the se option by 'boot' in summary().
There are several threat to identification. As highlighted by the authors, there is likely an OVB issue since they can not control for all maternal behavior that are likely correlated with other regressors in the model, like alcohol use or prenatal diet for instance. Another issue is the likely failure of the rank invariance condition: it seems implausible that the babies would remain on the same position in the birthweight distribution, independently of regressors values. The interpretations are thus only descriptive here, but still informative. For instance,
If anything, the black-white differential in birthweight at the 10-th quantile is 253 grams. In other words, the 10-th quantile of birthweight for a baby born to a black mother is 253 grams below than the 10-th quantile of birthweight for a baby born to a white mother;
If anything, the differential in birthweight between college graduate and “dropouts” shrink completely from 82 grams at the 10-th quantile compared to the 90-th quantile. The effect of college on birthweight is sizeable for low-birthweight babies, while not being significant at the high-end of the distribution;
The intercept can be interpreted as the estimated conditional quantile function of the birthweight distribution of a girl born to an unmaried, white mother with less than a high school education, who is 27yo and had a weight gain of 30 pounds, didn’t smoke, and had her first prenatal visit in the first trimester of the pregnancy (the number reflects mean values). In this example, given the high number of covariates, this baseline comparison seems hard to grasp but it might be more informative in simpler settings.
For equality of the quantile coefficients along the conditional y-distribution \(\longrightarrow\) except for High School Graduate, Second Trimester Prenatal Care, Non-Smoker During Pregnancy and Cigarette Usage, all variables show coefficients that significantly differ across the quantiles;
For the interquantile range (are the \(\beta_{0.1}\) and \(\beta_{0.9}\) statistically different?) \(\longrightarrow\) same conclusion as above, with High School Graduate that shows significantly different coefficients across the 10-th and the 90-th quantiles at 10% confidence level now.
Code
# Wald test qr_all <-rq(formula_qr, q, data, method ='fn')qr_wald1 <-anova(qr_all, test ="Wald", joint = F)$table %>%as.data.frame() %>%round(., digits =3)qr_wald2 <-anova(qr_0.1, qr_0.9, test ="Wald", joint = F)$table %>%as.data.frame() %>%round(., digits =3)# Table 1.colnames(qr_wald1) <-c("DF", "Res.DF", "F-value", "p-values")row.names(qr_wald1) <-var_label(data)[-1] qr_wald1 %>%kable(., booktabs =TRUE,caption ="Wald test for QR") %>%kable_styling(font_size =12)
Wald test for QR
DF
Res.DF
F-value
p-values
Black Ethnicity Indicator
4
995901
16.243
0.000
Marital Status (Married)
4
995901
5.342
0.000
Mother's Age
4
995901
2.260
0.060
Mother's Age Squared
4
995901
4.630
0.001
High School Graduate
4
995901
1.235
0.293
Some College Education
4
995901
2.681
0.030
College Graduate
4
995901
18.053
0.000
Weight Gain During Pregnancy
4
995901
72.269
0.000
Second Trimester Prenatal Care
4
995901
1.532
0.190
Third Trimester Prenatal Care
4
995901
3.574
0.006
No Prenatal Visits
4
995901
9.973
0.000
Non-Smoker During Pregnancy
4
995901
1.545
0.186
Cigarette Usage
4
995901
1.533
0.189
Male Child Indicator
4
995901
40.502
0.000
Code
# Table 2.colnames(qr_wald2) <-c("DF", "Res.DF", "F-value", "p-values")row.names(qr_wald2) <-var_label(data)[-1] qr_wald2 %>%kable(., booktabs =TRUE,caption ="Wald test for QR") %>%kable_styling(font_size =12)
Wald test for QR
DF
Res.DF
F-value
p-values
Black Ethnicity Indicator
1
398361
60.239
0.000
Marital Status (Married)
1
398361
18.732
0.000
Mother's Age
1
398361
4.823
0.028
Mother's Age Squared
1
398361
13.932
0.000
High School Graduate
1
398361
2.957
0.086
Some College Education
1
398361
9.522
0.002
College Graduate
1
398361
64.188
0.000
Weight Gain During Pregnancy
1
398361
219.641
0.000
Second Trimester Prenatal Care
1
398361
0.085
0.770
Third Trimester Prenatal Care
1
398361
13.700
0.000
No Prenatal Visits
1
398361
39.691
0.000
Non-Smoker During Pregnancy
1
398361
2.214
0.137
Cigarette Usage
1
398361
2.695
0.101
Male Child Indicator
1
398361
139.003
0.000
Plots
Here are some (more or less customed) plots that can be useful for visualization and comparison of coefficients.
Code
# No customization. Good for first analysis and overview.summary(qr_all) %>%plot()
Code
# Customized plot for each regressor: see function 'ggplot.qr' in 00.dependencies.Rqr_plots <-list()for(k incolnames(data)[-1]){ qr_plots[[k]] <-ggplot.qr(ols, qr_all, q, data, k, "ker")}names(qr_plots) <-paste0("qr_plot_", names(qr_plots))invisible(list2env(qr_plots, envir = .GlobalEnv))# Examples grid.arrange(qr_plot_married, qr_plot_boy, qr_plot_collgrad, qr_plot_novisit, qr_plot_black, ncol =2)
Code
# Some customization w/ comparison between regressor. # TYPE "help(plot_models)" FOR MORE INFO!qr_comp_1to7 <-plot_models(ols, qr_0.5, qr_0.9,spacing =0.8,show.values = T,rm.terms =colnames(data)[9:15],m.labels =c("OLS", "LAD", "QR 90%"),legend.title ="Specification",colors =c(my_purple, my_blue, my_red)) + my_theme +theme(legend.position =c(0.1, 0.1)) +coord_flip()qr_comp_1to7
Abrevaya, Jason. 2002. “The Effects of Demographics and Maternal Behavior on the Distribution of Birth Outcomes.”Economic Applications of Quantile Regression, 247–57.
Source Code
---title: "Replication QR - Abrevaya (2002)"author: "Mickael Sixdenier"date: "Nov. 13, 2024"format: html: theme: default highlight-style: kate code-block-bg: true code-fold: true code-tools: true toc: true toc-depth: 5bibliography: references.bib ---```{=html}<style>body {text-align: justify}</style>``````{r setup, include=F}knitr::opts_chunk$set(echo = T, fig.width = 10, fig.height = 8, fig.align = 'center')```**Objective**: This notebook replicates the main findings of @abrevaya_2002 and provide code for quantile regressions in RStudio. In his paper, Jason Abrevaya studies the impact of demographics and maternal behavior on various quantiles of the birthweight distribution. The question is relevant because high costs and long-term effects -medical and economic- are associated with low-birthweight babies. Thus, the identification strategy allows to focus on the lower end of the birthweight distribution to quantify drivers of low-birthweight. Using data on births in the US in 1992 and 1996, he mainly highlights that **many covariates have larger effects at the lower quantiles and lower effects at the higher quantiles.** According to the author, the **OLS can seriously under-estimate the effects at lower quantiles.** Finally, **results don't seem to be driven by state effect.**\------------------------------------------------------------------------# Dependencies & DataAbrevaya relies on the 1992 and 1996 editions of the Natality Data Set. It gathers information on the universe of US births, including babies -weight, sex, disorders, if any- and maternal outcomes -age, race, marital status, education, prenatal care, smoking behavior, among others.For the purpose of this notebook, I only use the 1992 to run the analysis and show the main functions implemented in `RStudio`.```{r dep_data, message = F, warning = F}# Clear the workplacerm(list = ls())# Working directoryif (rstudioapi::isAvailable()) { setwd(dirname(dirname(rstudioapi::getActiveDocumentContext()$path)))} else { # Replace with your path if it doesn't run setwd("/Users/cepremap/Dropbox/Mac/Documents/PhD/Econometric Game/QR") }# Run a script with packages, functions and aesthetic settingssource("./code/00_dependencies.R")# Load datadata <- read_dta("./data/natality1992us_clean.dta") ```\------------------------------------------------------------------------# DescriptivesFollowing the author's procedure, I cleaned beforehand the 1992 data set to make sure I get the right results (see `01_cleaning.dta`). Below, the descriptive statistics that replicates **Table 1.** of the paper.```{r sd, message = F, warning = F}# Descriptives (mean & std.dev.)sd_vars <- colnames(data)sd_list <- list()for (i in sd_vars){ sd_list[[i]] <- sd_univ_mean(data, data[[i]], label(data[[i]])) %>% dplyr::select(Variables, `Mean`, `Std.dev.`)}sd <- do.call(rbind, sd_list) # Descriptives. HTML table.sd %>% kable(., booktabs = TRUE, caption = "Descriptive statistics numeric variables.") %>% kable_styling(font_size = 12)```\------------------------------------------------------------------------# Quantile regressions```{r qr, message = F, warning = F}# Quantile reg. analysisy <- "dbirwt"x <- paste(sd_vars[-1], collapse = " + ")formula_qr <- as.formula(paste(y, " ~ ", x))# Estimate in separate R objectsq <- c(0.1, 0.25, 0.5, 0.75, 0.9)qr <- list()# 1. Quantile estimatesfor (tau in q) { qr[[as.character(tau)]] <- # TYPE "help(rq)" FOR MORE INFORMATION! rq(formula_qr, # Put y ~ x type of formula tau, # Scalar or vector of quantiles in [0,1] data, # Data frame method = 'fn') # Algorithm used to compute the quantile estimates in # {'br':(default); 'fn':(recommended for large df); ...}}# 2. Bootstrap std.err. computations ('BOOT' INSTEAD OF 'KER': TIME-CONSUMING /!\)qr_rob <- list()for (tau in q) { qr_rob[[as.character(tau)]] <- summary(qr[[as.character(tau)]], # Model previously estimated se = "ker", # Standard errors in {'iid':(default); # 'nid':(non-iid), 'ker':(kernel-smoothed), # 'boot':(bootstrapped)} R = 1000)$coefficients[,2]}# Estimates names(qr) <- paste0("qr_", names(qr))invisible(list2env(qr, envir = .GlobalEnv))# Std. err. names(qr_rob) <- paste0("qr_rob_", names(qr_rob))invisible(list2env(qr_rob, envir = .GlobalEnv))# To get a tidy data frame from reg. output. Useful for further manipulation.df_qr_0.5 <- tidy(qr_0.5)# Baseline comparison with OLSols <- lm(formula_qr, data)ols_rob <- coeftest(ols, vcov = vcovHC(ols, type = "HC1")) # Full model ols_rob <- sqrt(diag(vcovHC(ols, fit="HC1"))) # Only se. in a list```\### TableBelow the regression tables for the 10-th, 25-th, 50-th (LAD), 75-th and 90-th quantiles, together with the OLS estimates. It reproduces **Table 2.** of the paper. Several notes are worth mentioning: i) the point estimates are slightly different -most likely due to both a slightly different cleaning, I have some more observations here, and differences in statistical softwares computations-; ii) the standard errors also differ a bit because I didn't go for bootstrapped computations (too time-consuming). It is possible to compute the bootstrapped errors if you change the `se` option by `'boot'` in `summary()`. There are several threat to identification. As highlighted by the authors, there is likely an **OVB issue** since they can not control for all maternal behavior that are likely correlated with other regressors in the model, like alcohol use or prenatal diet for instance. Another issue is the likely **failure of the rank invariance condition**: it seems implausible that the babies would remain on the same position in the birthweight distribution, independently of regressors values. The interpretations are thus only descriptive here, but still informative. For instance, 1. If anything, *the black-white differential in birthweight at the 10-th quantile is 253 grams*. In other words, the 10-th quantile of birthweight for a baby born to a black mother is 253 grams below than the 10-th quantile of birthweight for a baby born to a white mother; 2. If anything, *the differential in birthweight between college graduate and "dropouts" shrink completely from 82 grams at the 10-th quantile compared to the 90-th quantile*. The effect of college on birthweight is sizeable for low-birthweight babies, while not being significant at the high-end of the distribution;3. The intercept can be interpreted as *the estimated conditional quantile function of the birthweight distribution of <u>a girl born to an unmaried, white mother with less than a high school education, who is 27yo and had a weight gain of 30 pounds, didn't smoke, and had her first prenatal visit in the first trimester of the pregnancy</u>* (the number reflects mean values). In this example, given the high number of covariates, this baseline comparison seems hard to grasp but it might be more informative in simpler settings. ```{r output_table2, message=F, warning=F, results = 'asis'}# Regression table outputstargazer(qr_0.1, qr_0.25, qr_0.5, qr_0.75, qr_0.9, ols, se = c(list(qr_rob_0.1), list(qr_rob_0.25), list(qr_rob_0.5), list(qr_rob_0.75), list(qr_rob_0.9), list(ols_rob)), type = "html", dep.var.labels = "Birth weight", model.names = FALSE, column.labels = c("0.1","0.25","0.5","0.75","0.9", "OLS"), covariate.labels = sapply(data, function(x) attr(x, "label"))[-1], font.size = "footnotesize")```\ ### Wald testsThe two tables below respectively test, 1. For equality of the quantile coefficients along the conditional y-distribution $\longrightarrow$ except for `r label(data$hsgrad)`, `r label(data$natal2)`, `r label(data$nosmoke)` and `r label(data$cigar)`, all variables show coefficients that significantly differ across the quantiles;2. For the interquantile range (are the $\beta_{0.1}$ and $\beta_{0.9}$ statistically different?) $\longrightarrow$ same conclusion as above, with `r label(data$hsgrad)` that shows significantly different coefficients across the 10-th and the 90-th quantiles at 10% confidence level now.```{r wald_test, message = F, warning = F}# Wald test qr_all <- rq(formula_qr, q, data, method = 'fn')qr_wald1 <- anova(qr_all, test = "Wald", joint = F)$table %>% as.data.frame() %>% round(., digits = 3)qr_wald2 <- anova(qr_0.1, qr_0.9, test = "Wald", joint = F)$table %>% as.data.frame() %>% round(., digits = 3)# Table 1.colnames(qr_wald1) <- c("DF", "Res.DF", "F-value", "p-values")row.names(qr_wald1) <- var_label(data)[-1] qr_wald1 %>% kable(., booktabs = TRUE, caption = "Wald test for QR") %>% kable_styling(font_size = 12)# Table 2.colnames(qr_wald2) <- c("DF", "Res.DF", "F-value", "p-values")row.names(qr_wald2) <- var_label(data)[-1] qr_wald2 %>% kable(., booktabs = TRUE, caption = "Wald test for QR") %>% kable_styling(font_size = 12)```\### PlotsHere are some (more or less customed) plots that can be useful for visualization and comparison of coefficients. ```{r qr_visu_basic, message=F, warning=F}# No customization. Good for first analysis and overview.summary(qr_all) %>% plot()``````{r qr_visu_custom, message=F, warning=F}# Customized plot for each regressor: see function 'ggplot.qr' in 00.dependencies.Rqr_plots <- list()for(k in colnames(data)[-1]){ qr_plots[[k]] <- ggplot.qr(ols, qr_all, q, data, k, "ker")}names(qr_plots) <- paste0("qr_plot_", names(qr_plots))invisible(list2env(qr_plots, envir = .GlobalEnv))# Examples grid.arrange(qr_plot_married, qr_plot_boy, qr_plot_collgrad, qr_plot_novisit, qr_plot_black, ncol = 2)``````{r qr_visu_comp, message=F, warning=F}# Some customization w/ comparison between regressor. # TYPE "help(plot_models)" FOR MORE INFO!qr_comp_1to7 <- plot_models(ols, qr_0.5, qr_0.9, spacing = 0.8, show.values = T, rm.terms = colnames(data)[9:15], m.labels = c("OLS", "LAD", "QR 90%"), legend.title = "Specification", colors = c(my_purple, my_blue, my_red)) + my_theme + theme(legend.position = c(0.1, 0.1)) + coord_flip()qr_comp_1to7qr_comp_8to15 <- plot_models(ols, qr_0.5, qr_0.9, spacing = 0.8, show.values = T, rm.terms = colnames(data)[2:8], m.labels = c("OLS", "LAD", "QR 90%"), legend.title = "Specification", colors = c(my_purple, my_blue, my_red)) + my_theme + theme(legend.position = c(0.1, 0.1)) + coord_flip()qr_comp_8to15```\# References