Replication QR - Abrevaya (2002)

Mickael Sixdenier - Nov. 13, 2024

Objective: This notebook replicates the main findings of Abrevaya (2002) and provide code for quantile regressions in RStudio. In his paper, Jason Abrevaya studies the impact of demographics and maternal behavior on various quantiles of the birthweight distribution. The question is relevant because high costs and long-term effects -medical and economic- are associated with low-birthweight babies. Thus, the identification strategy allows to focus on the lower end of the birthweight distribution to quantify drivers of low-birthweight. Using data on births in the US in 1992 and 1996, he mainly highlights that many covariates have larger effects at the lower quantiles and lower effects at the higher quantiles. According to the author, the OLS can seriously under-estimate the effects at lower quantiles. Finally, results don’t seem to be driven by state effect.

Dependencies & Data

Abrevaya relies on the 1992 and 1996 editions of the Natality Data Set. It gathers information on the universe of US births, including babies -weight, sex, disorders, if any- and maternal outcomes -age, race, marital status, education, prenatal care, smoking behavior, among others.

For the purpose of this notebook, I only use the 1992 to run the analysis and show the main functions implemented in RStudio.

# Clear the workplace
rm(list = ls())

# Working directory
setwd(dirname(dirname(rstudioapi::getActiveDocumentContext()$path)))

# Run a script with packages, functions and aesthetic settings
source("./code/00_dependencies.R")

# Load data
data <- read_dta("./data/natality1992us_clean.dta")

Descriptives

Following the author’s procedure, I cleaned beforehand the 1992 data set to make sure I get the right results (see 01_cleaning.dta). Below, the descriptive statistics that replicates Table 1. of the paper.

# Descriptives (mean & std.dev.)
sd_vars <- colnames(data)
sd_list <- list()

for (i in sd_vars){
  sd_list[[i]] <- 
    sd_univ_mean(data, data[[i]], label(data[[i]])) %>%
    dplyr::select(Variables, `Mean`, `Std.dev.`)
}
sd <- do.call(rbind, sd_list) 

# Descriptives. HTML table.
sd %>% kable(., booktabs = TRUE,
             caption = "Descriptive statistics numeric variables.") %>%
  kable_styling(font_size = 12) %>% 
  scroll_box(width = "100%")

Descriptive statistics numeric variables.
Variables	Mean	Std.dev.
Birth Weight	3,388.68	(571.51)
Black Ethnicity Indicator	0.16	(0.37)
Marital Status (Married)	0.75	(0.43)
Mother’s Age	26.96	(5.43)
Mother’s Age Squared	756.39	(303.88)
High School Graduate	0.39	(0.49)
Some College Education	0.23	(0.42)
College Graduate	0.21	(0.41)
Weight Gain During Pregnancy	30.76	(12.25)
Second Trimester Prenatal Care	0.16	(0.36)
Third Trimester Prenatal Care	0.03	(0.16)
No Prenatal Visits	0.01	(0.1)
Non-Smoker During Pregnancy	0.83	(0.38)
Cigarette Usage	2.14	(5.74)
Male Child Indicator	0.51	(0.5)

Quantile regressions

# Quantile reg. analysis
y <- "dbirwt"
x <- paste(sd_vars[-1], collapse = " + ")
formula_qr <- as.formula(paste(y, " ~ ", x))

# Estimate in separate R objects
q <- c(0.1, 0.25, 0.5, 0.75, 0.9)
qr <- list()

# 1. Quantile estimates
for (tau in q) {
  qr[[as.character(tau)]] <- 
    
    # TYPE "help(rq)" FOR MORE INFORMATION!
    rq(formula_qr,     # Put y ~ x type of formula
       tau,            # Scalar or vector of quantiles in [0,1]
       data,           # Data frame
       method = 'fn')  # Algorithm used to compute the quantile estimates in 
                       # {'br':(default); 'fn':(recommended for large df); ...}
}

# 2. Bootstrap std.err. computations ('BOOT' INSTEAD OF 'KER': TIME-CONSUMING /!\)
qr_rob <- list()
for (tau in q) {
  qr_rob[[as.character(tau)]] <- 
    
    summary(qr[[as.character(tau)]], # Model previously estimated
            se = "ker",              # Standard errors in {'iid':(default);
                                     # 'nid':(non-iid), 'ker':(kernel-smoothed),
                                     # 'boot':(bootstrapped)}
            R = 1000)$coefficients[,2]
}

# Estimates 
names(qr) <- paste0("qr_", names(qr))
invisible(list2env(qr, envir = .GlobalEnv))

# Std. err. 
names(qr_rob) <- paste0("qr_rob_", names(qr_rob))
invisible(list2env(qr_rob, envir = .GlobalEnv))

# To get a tidy data frame from reg. output. Useful for further manipulation.
df_qr_0.5 <- tidy(qr_0.5)

# Baseline comparison with OLS
ols <- lm(formula_qr, data)
ols_rob <- coeftest(ols, vcov = vcovHC(ols, type = "HC1")) # Full model 
ols_rob <- sqrt(diag(vcovHC(ols, fit="HC1"))) # Only se. in a list

Table

Below the regression tables for the 10-th, 25-th, 50-th (LAD), 75-th and 90-th quantiles, together with the OLS estimates. It reproduces Table 2. of the paper. Several notes are worth mentioning: i) the point estimates are slightly different -most likely due to both a slightly different cleaning, I have some more observations here, and differences in statistical softwares computations-; ii) the standard errors also differ a bit because I didn’t go for bootstrapped computations (too time-consuming). It is possible to compute the bootstrapped errors if you change the se option by 'boot' in summary().

There are several threat to identification. As highlighted by the authors, there is likely an OVB issue since they can not control for all maternal behavior that are likely correlated with other regressors in the model, like alcohol use or prenatal diet for instance. Another issue is the likely failure of the rank invariance condition: it seems implausible that the babies would remain on the same position in the birthweight distribution, independently of regressors values. The interpretations are thus only descriptive here, but still informative. For instance,

If anything, the black-white differential in birthweight at the 10-th quantile is 253 grams. In other words, the 10-th quantile of birthweight for a baby born to a black mother is 253 grams below than the 10-th quantile of birthweight for a baby born to a white mother;
If anything, the differential in birthweight between college graduate and “dropouts” shrink completely from 82 grams at the 10-th quantile compared to the 90-th quantile. The effect of college on birthweight is sizeable for low-birthweight babies, while not being significant at the high-end of the distribution;
The intercept can be interpreted as the estimated conditional quantile function of the birthweight distribution of a girl born to an unmaried, white mother with less than a high school education, who is 27yo and had a weight gain of 30 pounds, didn’t smoke, and had her first prenatal visit in the first trimester of the pregnancy (the number reflects mean values). In this example, given the high number of covariates, this baseline comparison seems hard to grasp but it might be more informative in simpler settings.

# Regression table output
stargazer(qr_0.1, 
          qr_0.25, 
          qr_0.5,
          qr_0.75, 
          qr_0.9, 
          ols,
          se = c(list(qr_rob_0.1), 
                 list(qr_rob_0.25), 
                 list(qr_rob_0.5), 
                 list(qr_rob_0.75), 
                 list(qr_rob_0.9),
                 list(ols_rob)),
          type = "html",
          dep.var.labels = "Birth weight",
          model.names = FALSE,
          column.labels = c("0.1","0.25","0.5","0.75","0.9", "OLS"),
          covariate.labels = sapply(data, function(x) attr(x, "label"))[-1])


	Dependent variable:

	Birth weight
	0.1	0.25	0.5	0.75	0.9	OLS
	(1)	(2)	(3)	(4)	(5)	(6)

Black Ethnicity Indicator	-253.656^***	-216.794^***	-199.071^***	-192.405^***	-182.087^***	-220.271^***
	(7.730)	(4.609)	(3.905)	(4.446)	(5.710)	(3.874)

Marital Status (Married)	73.323^***	55.181^***	50.563^***	42.456^***	38.560^***	57.560^***
	(6.272)	(4.188)	(3.583)	(4.209)	(5.487)	(3.466)

Mother’s Age	45.177^***	36.722^***	34.636^***	31.909^***	33.788^***	37.584^***
	(4.174)	(2.706)	(2.304)	(2.560)	(3.469)	(2.176)

Mother’s Age Squared	-0.781^***	-0.591^***	-0.530^***	-0.440^***	-0.436^***	-0.579^***
	(0.075)	(0.048)	(0.041)	(0.045)	(0.061)	(0.039)

High School Graduate	28.580^***	25.078^***	17.543^***	15.024^***	13.355^**	16.684^***
	(6.924)	(4.682)	(3.884)	(4.456)	(6.016)	(3.732)

Some College Education	49.494^***	40.549^***	31.698^***	26.975^***	18.640^***	30.372^***
	(7.754)	(5.248)	(4.515)	(5.133)	(6.810)	(4.271)

College Graduate	82.645^***	57.601^***	36.839^***	16.270^***	-4.663	36.718^***
	(8.294)	(5.730)	(4.928)	(5.496)	(7.554)	(4.608)

Weight Gain During Pregnancy	11.776^***	9.905^***	9.134^***	8.838^***	8.576^***	10.490^***
	(0.180)	(0.124)	(0.111)	(0.117)	(0.150)	(0.109)

Second Trimester Prenatal Care	6.176	2.833	-4.477	-0.544	3.796	7.731^**
	(6.399)	(4.218)	(3.705)	(4.118)	(5.687)	(3.450)

Third Trimester Prenatal Care	39.434^***	12.679	3.967	-6.239	-25.498^**	20.268^***
	(13.223)	(9.284)	(7.746)	(8.793)	(11.445)	(7.095)

No Prenatal Visits	-388.584^***	-196.452^***	-147.230^***	-126.642^***	-101.659^***	-193.737^***
	(43.675)	(24.659)	(14.529)	(17.049)	(18.954)	(16.026)

Non-Smoker During Pregnancy	170.750^***	171.840^***	158.821^***	153.227^***	150.219^***	161.929^***
	(11.710)	(7.038)	(6.152)	(6.582)	(8.643)	(5.718)

Cigarette Usage	-3.803^***	-3.519^***	-3.902^***	-4.461^***	-5.168^***	-4.043^***
	(0.766)	(0.463)	(0.398)	(0.422)	(0.515)	(0.367)

Male Child Indicator	87.763^***	109.568^***	129.128^***	142.402^***	153.582^***	122.568^***
	(4.439)	(2.947)	(2.566)	(2.863)	(3.843)	(2.384)

Constant	1,556.905^***	2,020.488^***	2,377.883^***	2,715.228^***	2,967.683^***	2,273.476^***
	(56.734)	(37.365)	(31.759)	(34.950)	(47.531)	(29.967)


Observations	199,181	199,181	199,181	199,181	199,181	199,181
R²						0.134
Adjusted R²						0.134
Residual Std. Error						531.845 (df = 199166)
F Statistic						2,202.321^*** (df = 14; 199166)

Note:	p<0.1; p<0.05; p<0.01

Wald tests

The two tables below respectively test,

For equality of the quantile coefficients along the conditional y-distribution \(\longrightarrow\) except for High School Graduate, Second Trimester Prenatal Care, Non-Smoker During Pregnancy and Cigarette Usage, all variables show coefficients that significantly differ across the quantiles;
For the interquantile range (are the \(\beta_{0.1}\) and \(\beta_{0.9}\) statistically different?) \(\longrightarrow\) same conclusion as above, with High School Graduate that shows significantly different coefficients across the 10-th and the 90-th quantiles at 10% confidence level now.

# Wald test 
qr_all <- rq(formula_qr, q, data, method = 'fn')
qr_wald1 <- anova(qr_all, test = "Wald", joint = F)$table %>% 
  as.data.frame() %>%
  round(., digits = 3)

qr_wald2 <- anova(qr_0.1, qr_0.9, test = "Wald", joint = F)$table %>% 
  as.data.frame() %>%
  round(., digits = 3)

# Table 1.
colnames(qr_wald1) <- c("DF", "Res.DF", "F-value", "p-values")
row.names(qr_wald1) <- var_label(data)[-1] 
qr_wald1 %>% kable(., booktabs = TRUE,
             caption = "Wald test for QR") %>%
  kable_styling(font_size = 12) %>% 
  scroll_box(width = "100%")

Wald test for QR
	DF	Res.DF	F-value	p-values
Black Ethnicity Indicator	4	995901	16.243	0.000
Marital Status (Married)	4	995901	5.342	0.000
Mother’s Age	4	995901	2.260	0.060
Mother’s Age Squared	4	995901	4.630	0.001
High School Graduate	4	995901	1.235	0.293
Some College Education	4	995901	2.681	0.030
College Graduate	4	995901	18.053	0.000
Weight Gain During Pregnancy	4	995901	72.269	0.000
Second Trimester Prenatal Care	4	995901	1.532	0.190
Third Trimester Prenatal Care	4	995901	3.574	0.006
No Prenatal Visits	4	995901	9.973	0.000
Non-Smoker During Pregnancy	4	995901	1.545	0.186
Cigarette Usage	4	995901	1.533	0.189
Male Child Indicator	4	995901	40.502	0.000

# Table 2.
colnames(qr_wald2) <- c("DF", "Res.DF", "F-value", "p-values")
row.names(qr_wald2) <- var_label(data)[-1] 
qr_wald2 %>% kable(., booktabs = TRUE,
             caption = "Wald test for QR") %>%
  kable_styling(font_size = 12) %>% 
  scroll_box(width = "100%")

Wald test for QR
	DF	Res.DF	F-value	p-values
Black Ethnicity Indicator	1	398361	60.239	0.000
Marital Status (Married)	1	398361	18.732	0.000
Mother’s Age	1	398361	4.823	0.028
Mother’s Age Squared	1	398361	13.932	0.000
High School Graduate	1	398361	2.957	0.086
Some College Education	1	398361	9.522	0.002
College Graduate	1	398361	64.188	0.000
Weight Gain During Pregnancy	1	398361	219.641	0.000
Second Trimester Prenatal Care	1	398361	0.085	0.770
Third Trimester Prenatal Care	1	398361	13.700	0.000
No Prenatal Visits	1	398361	39.691	0.000
Non-Smoker During Pregnancy	1	398361	2.214	0.137
Cigarette Usage	1	398361	2.695	0.101
Male Child Indicator	1	398361	139.003	0.000

Plots

Here are some (more or less customed) plots that can be useful for visualization and comparison of coefficients.

# No customization. Good for first analysis and overview.
summary(qr_all) %>% plot()

# Customized plot for each regressor: see function 'ggplot.qr' in 00.dependencies.R
qr_plots <- list()
for(k in colnames(data)[-1]){
  qr_plots[[k]] <- ggplot.qr(ols, qr_all, q, data, k, "ker")
  qr_plots[[k]]
}
names(qr_plots) <- paste0("qr_plot_", names(qr_plots))
invisible(list2env(qr_plots, envir = .GlobalEnv))

# Examples 
grid.arrange(qr_plot_married, qr_plot_boy, qr_plot_collgrad,
             qr_plot_novisit, qr_plot_black, ncol = 2)

# Some customization w/ comparison between regressor. 
# TYPE "help(plot_models)" FOR MORE INFO!
qr_comp_1to7 <- plot_models(ols, qr_0.5, qr_0.9,
                            spacing = 0.8,
                            show.values = T,
                            rm.terms = colnames(data)[9:15],
                            m.labels = c("OLS", "LAD", "QR 90%"),
                            legend.title = "Specification",
                            colors = c(my_purple, my_blue, my_red)) +
  my_theme +
  theme(legend.position = c(0.1, 0.1)) +
  coord_flip()
qr_comp_1to7

qr_comp_8to15 <- plot_models(ols, qr_0.5, qr_0.9,
                             spacing = 0.8,
                             show.values = T,
                             rm.terms = colnames(data)[2:8],
                             m.labels = c("OLS", "LAD", "QR 90%"),
                             legend.title = "Specification",
                             colors = c(my_purple, my_blue, my_red)) +
  my_theme +
  theme(legend.position = c(0.1, 0.1)) +
  coord_flip()
qr_comp_8to15

References

Abrevaya, Jason. 2002. “The Effects of Demographics and Maternal Behavior on the Distribution of Birth Outcomes.” Economic Applications of Quantile Regression, 247–57.