Introduction

There is a popular and persistent perception in American society that women who pursue higher education have fewer children. This narrative arises again and again in popular culture in movies such as Idiocracy as well as conservative talk shows and podcasts. In this analysis we will explore whether the pursuit of higher education by women is a statistically and practically significant predictor of birth rates. Prior research by the CDC (Hamilton) and NIH (Chen) has reached conflicting conclusions on the extent of the correlation between the pursuit of higher education and birth rates. These prior studies focused on birth records from hospitals and performed some extrapolation on the expected lifetime fertility of mothers to reach their conclusions. Approaching the question from a slightly different angle, we will be using data from the American Community Survey on birth rates and the highest level of education attained by the female population in Public Use Microdata Areas (PUMAs). If the pursuit of higher education is a relevant negative predictor of birth rates we would expect birth rates in a PUMA for a given year to be lower when a higher proportion of the female population has pursued some level of higher education. In this study we will use multiple linear regression to determine if the proportion of the female population who have pursued higher education is a statistically and practically significant negative predictor of births per thousand women in a PUMA.

Chen S. (2022). The Positive Effect of Women’s Education on Fertility in Low-Fertility China. European journal of population = Revue europeenne de demographie, 38(1), 125–161. https://doi.org/10.1007/s10680-021-09603-2

Hamilton BE. Total fertility rates, by maternal educational attainment and race and Hispanic origin: United States, 2019. National Vital Statistics Reports; vol 70 no 5. Hyattsville, MD: National Center for Health Statistics. 2021. DOI: https://doi.org/10.15620/cdc:105234.

Data Overview

The American Community Survey is an ongoing annual survey conducted by the US Census Bureau. It differs from the decennial Census not only in the frequency of data collection but also the focus of the survey. It consists of questions designed to track the “changing social and economic characteristics of the US population”, with questions on topics such as housing, jobs, education, and birth rates. Additionally, the survey allows for data to be broken down into units called “Public Use Microdata Areas” or PUMAs which are “non-overlapping, statistical geographic areas that partition each state or equivalent entity into geographic areas containing no fewer than 100,000 people each”. This ensures that we can analyze our data without having to do complex weightings for population like we would if using county level data. We will be using the tidycensus library to call the US Census Bureau API to retrieve the records needed for this analysis.

United States Census Bureau (2024, February 18) Public Use Microdata Areas (PUMAs). https://www.census.gov/programs-surveys/geography/guidance/geo-areas/pumas.html#reference

Data Transformation and Exploratory Analysis

The data used for this analysis needs to be retrieved from three different Census Bureau datasets, the first containing mappings of label IDs to plain text labels, the second tracking different levels of education and the total population of the PUMA that have attained the given level, and the last tracking the total births in the PUMA.

geography <- "public use microdata area"

all_vars <- load_variables(2022, "acs5") # Plain text labels associated with variable codes

education_data <- get_acs( 
  geography = geography,
  table = "B15001",
  year = 2022,
  survey = "acs5",
  cache_table = TRUE
)

birth_data <- get_acs(
  geography = geography,
  table = "B13016",
  year = 2022,
  survey = "acs5",
  cache_table = TRUE
)

head(all_vars) |>
  kbl() |>
  kable_styling()
name label concept geography
B01001A_001 Estimate!!Total: Sex by Age (White Alone) tract
B01001A_002 Estimate!!Total:!!Male: Sex by Age (White Alone) tract
B01001A_003 Estimate!!Total:!!Male:!!Under 5 years Sex by Age (White Alone) tract
B01001A_004 Estimate!!Total:!!Male:!!5 to 9 years Sex by Age (White Alone) tract
B01001A_005 Estimate!!Total:!!Male:!!10 to 14 years Sex by Age (White Alone) tract
B01001A_006 Estimate!!Total:!!Male:!!15 to 17 years Sex by Age (White Alone) tract
head(education_data)|>
  kbl() |>
  kable_styling()
GEOID NAME variable estimate moe
0100100 Lauderdale, Colbert & Franklin Counties PUMA; Alabama B15001_001 145457 130
0100100 Lauderdale, Colbert & Franklin Counties PUMA; Alabama B15001_002 69715 230
0100100 Lauderdale, Colbert & Franklin Counties PUMA; Alabama B15001_003 9057 191
0100100 Lauderdale, Colbert & Franklin Counties PUMA; Alabama B15001_004 199 184
0100100 Lauderdale, Colbert & Franklin Counties PUMA; Alabama B15001_005 871 230
0100100 Lauderdale, Colbert & Franklin Counties PUMA; Alabama B15001_006 3689 508
head(birth_data)|>
  kbl() |>
  kable_styling()
GEOID NAME variable estimate moe
0100100 Lauderdale, Colbert & Franklin Counties PUMA; Alabama B13016_001 40987 400
0100100 Lauderdale, Colbert & Franklin Counties PUMA; Alabama B13016_002 1928 334
0100100 Lauderdale, Colbert & Franklin Counties PUMA; Alabama B13016_003 69 62
0100100 Lauderdale, Colbert & Franklin Counties PUMA; Alabama B13016_004 434 221
0100100 Lauderdale, Colbert & Franklin Counties PUMA; Alabama B13016_005 729 188
0100100 Lauderdale, Colbert & Franklin Counties PUMA; Alabama B13016_006 418 119

Replacing the label IDs with the plain text label provides a much better understanding of what the education and birth datasets are tracking. As our population of interest is strictly the female population we drop the male data from both the education and birth datasets. Both track total population in a given age bracket as well as breakdowns by age groups. In order to get the information we need to conduct our analysis we will need to aggregate these values.

# Function to make labels readable
convert_label_to_varname <- function(label) {
  label |>
    str_replace_all("^(Estimate!!|Total:!!)", "") |> # Remove "Estimate!!" and "Total:!!" prefixes
    str_replace_all("!!", " ") |>                    # Replace remaining "!!" with a single space
    str_to_lower() |>                                # Convert to lowercase
    str_replace_all("[^a-z0-9]+", "_") |>            # Replace special characters and spaces with underscores
    str_replace_all("^_|_$", "")                     # Remove leading or trailing underscores
}

# Fetch the labels for the datasets
education_labels <- all_vars |>
  filter(str_detect(name, "B15001")) |>
  dplyr::rename(variable = name) # Rename function was mapping to plyr by default

birth_labels <- all_vars |>
  filter(str_detect(name, "B13016")) |>
  dplyr::rename(variable = name)

# Label the datasets
labeled_education_data <- education_data |>
  left_join(education_labels, by = "variable") |> # Join to get labels into education data
  sapply(convert_label_to_varname) |>             # Convert labels to an easier to understand format
  as_tibble() |>
  filter(str_detect(label, "female")) |>          # Get rid of male data
  select(GEOID, NAME, estimate, moe, label)       # Remove unneeded columns
labeled_education_data %<>% mutate(estimate = as.integer(estimate), moe = as.integer(moe)) # Fix column typing

labeled_birth_data <- birth_labels |>
  left_join(birth_data, by = "variable") |> # Join to get labels into education data
  sapply(convert_label_to_varname) |>       # Convert labels to an easier to understand format
  as_tibble() |>
  arrange(GEOID) |>                         # Reordering for readability when developing
  select(GEOID, NAME, label, estimate, moe) # Keep the required fields
labeled_birth_data %<>% mutate(estimate = as.integer(estimate)) # Fix column typing

head(labeled_education_data)|>
  kbl() |>
  kable_styling()
GEOID NAME estimate moe label
0100100 lauderdale_colbert_franklin_counties_puma_alabama 75742 220 total_female
0100100 lauderdale_colbert_franklin_counties_puma_alabama 9650 192 total_female_18_to_24_years
0100100 lauderdale_colbert_franklin_counties_puma_alabama 8 16 total_female_18_to_24_years_less_than_9th_grade
0100100 lauderdale_colbert_franklin_counties_puma_alabama 562 181 total_female_18_to_24_years_9th_to_12th_grade_no_diploma
0100100 lauderdale_colbert_franklin_counties_puma_alabama 2686 471 total_female_18_to_24_years_high_school_graduate_includes_equivalency
0100100 lauderdale_colbert_franklin_counties_puma_alabama 5115 516 total_female_18_to_24_years_some_college_no_degree
head(labeled_birth_data)|>
  kbl() |>
  kable_styling()
GEOID NAME label estimate moe
0100100 lauderdale_colbert_franklin_counties_puma_alabama total 40987 400
0100100 lauderdale_colbert_franklin_counties_puma_alabama total_women_who_had_a_birth_in_the_past_12_months 1928 334
0100100 lauderdale_colbert_franklin_counties_puma_alabama total_women_who_had_a_birth_in_the_past_12_months_15_to_19_years_old 69 62
0100100 lauderdale_colbert_franklin_counties_puma_alabama total_women_who_had_a_birth_in_the_past_12_months_20_to_24_years_old 434 221
0100100 lauderdale_colbert_franklin_counties_puma_alabama total_women_who_had_a_birth_in_the_past_12_months_25_to_29_years_old 729 188
0100100 lauderdale_colbert_franklin_counties_puma_alabama total_women_who_had_a_birth_in_the_past_12_months_30_to_34_years_old 418 119

After aggregating the rows we need to pivot the education table and turn the different levels of educational attainment into columns so each row is an observation for an individual PUMA. For the birth data we need to calculate births per thousand women and do so by pivoting wider on the labels before performing our calculation and subsequently drop the columns that are no longer needed.

# Summarize education data, previously split by age groups
summarised_ed <- labeled_education_data |>
  mutate(prelabel = str_remove(label, r"(\w*_\w*_\d*_\w*_\d*_years_)")) |>   # Removes age range from labels
  mutate(relabel = str_remove(prelabel, r"(\w*_\w*_65_years_\w*_over_)")) |> # 65 and over has a slightly different pattern
  select(!c(label, prelabel)) |>
  group_by(GEOID, NAME, relabel) |>       # relabel is of the form "less_than_9th_grade"
  summarise(pop_estimate = sum(estimate)) # collapse age ranges to education aggregates

head(summarised_ed)|>
  kbl() |>
  kable_styling()
GEOID NAME relabel pop_estimate
0100100 lauderdale_colbert_franklin_counties_puma_alabama 9th_to_12th_grade_no_diploma 6293
0100100 lauderdale_colbert_franklin_counties_puma_alabama associate_s_degree 6778
0100100 lauderdale_colbert_franklin_counties_puma_alabama bachelor_s_degree 10256
0100100 lauderdale_colbert_franklin_counties_puma_alabama graduate_or_professional_degree 5790
0100100 lauderdale_colbert_franklin_counties_puma_alabama high_school_graduate_includes_equivalency 25708
0100100 lauderdale_colbert_franklin_counties_puma_alabama less_than_9th_grade 2377
# Pivot wider education data, now one row per PUMA
tidied_ed <- summarised_ed |>
  pivot_wider(names_from = relabel, values_from = pop_estimate) |>
  select(!starts_with("total_female_")) # drop the population totals by age

# Truncating birth data labels to remove overly long values
relabeled_birth_data <- labeled_birth_data |> # Label trimming is different for this data
  mutate(trimmed_label = case_when(
    str_detect(label, "total_women_who_had_a_birth_in_the_past_12_months_") ~ paste0("age_", str_extract(label, r"(\d*_to_\d*)"), "_birth"),
    str_detect(label, "total_women_who_did_not_have_a_birth_in_the_past_12_months_") ~ paste0("age_", str_extract(label, r"(\d*_to_\d*)"), "_none"),
    .default = label
  )) |>
  select(GEOID, NAME, trimmed_label, estimate)

head(relabeled_birth_data)|>
  kbl() |>
  kable_styling()
GEOID NAME trimmed_label estimate
0100100 lauderdale_colbert_franklin_counties_puma_alabama total 40987
0100100 lauderdale_colbert_franklin_counties_puma_alabama total_women_who_had_a_birth_in_the_past_12_months 1928
0100100 lauderdale_colbert_franklin_counties_puma_alabama age_15_to_19_birth 69
0100100 lauderdale_colbert_franklin_counties_puma_alabama age_20_to_24_birth 434
0100100 lauderdale_colbert_franklin_counties_puma_alabama age_25_to_29_birth 729
0100100 lauderdale_colbert_franklin_counties_puma_alabama age_30_to_34_birth 418
# Pivoting birth data, now one row per PUMA
detailed_birth_data <- relabeled_birth_data |>
  pivot_wider(names_from = trimmed_label, values_from = estimate)

# Removing granular columns and calculating birth rate
births_per_thousand <- detailed_birth_data |>
  mutate(births_per_thousand = total_women_who_had_a_birth_in_the_past_12_months / total * 1000) |>
  select(GEOID, NAME, births_per_thousand)

head(tidied_ed)|>
  kbl() |>
  kable_styling()
GEOID NAME 9th_to_12th_grade_no_diploma associate_s_degree bachelor_s_degree graduate_or_professional_degree high_school_graduate_includes_equivalency less_than_9th_grade some_college_no_degree total_female
0100100 lauderdale_colbert_franklin_counties_puma_alabama 6293 6778 10256 5790 25708 2377 18540 75742
0100200 limestone_county_puma_alabama 3298 4621 7326 4081 10228 1381 9475 40410
0100300 morgan_lawrence_counties_decatur_city_puma_alabama 5897 5661 8338 4434 21119 2780 13536 61765
0100401 madison_county_north_east_huntsville_city_east_puma_alabama 1883 3968 10810 5904 11506 1001 9913 44985
0100402 huntsville_north_far_west_madison_east_triana_cities_puma_alabama 2353 5342 16855 9957 12127 1031 15259 62924
0100403 huntsville_city_central_south_puma_alabama 2743 3418 11266 5980 8731 983 11027 44148
head(births_per_thousand)|>
  kbl() |>
  kable_styling()
GEOID NAME births_per_thousand
0100100 lauderdale_colbert_franklin_counties_puma_alabama 47.03931
0100200 limestone_county_puma_alabama 38.29303
0100300 morgan_lawrence_counties_decatur_city_puma_alabama 54.07314
0100401 madison_county_north_east_huntsville_city_east_puma_alabama 49.73894
0100402 huntsville_north_far_west_madison_east_triana_cities_puma_alabama 47.86090
0100403 huntsville_city_central_south_puma_alabama 52.11753

Joining our two datasets on the GEOID ensures that all PUMAs have the correct birth and education data for the female population. Our last step before beginning our analysis is to convert the education data to track proportion of the female population to scale the data as we want to prevent our analysis from being skewed by population variance.

# Join education and fertility data
fertility_and_education <- tidied_ed |>
  left_join(births_per_thousand, by = "GEOID")

# Converting population of education achievement levels to proportions. This is the final dataframe for analysis
prop_fertility_and_education <- fertility_and_education |>
  mutate(
    prop_less_ninth = less_than_9th_grade / total_female,
    prop_no_diploma = `9th_to_12th_grade_no_diploma` / total_female,
    prop_diploma_or_equivalent = high_school_graduate_includes_equivalency / total_female,
    prop_some_college = some_college_no_degree / total_female,
    prop_associate_degree = associate_s_degree / total_female,
    prop_bachelor_degree = bachelor_s_degree / total_female,
    prop_advanced_degree = graduate_or_professional_degree / total_female
   ) |>
  select(GEOID, NAME.x, prop_less_ninth, prop_no_diploma, prop_diploma_or_equivalent, prop_some_college, prop_associate_degree, prop_bachelor_degree, prop_advanced_degree, births_per_thousand) |>
  ungroup()

head(prop_fertility_and_education)|>
  kbl() |>
  kable_styling()
GEOID NAME.x prop_less_ninth prop_no_diploma prop_diploma_or_equivalent prop_some_college prop_associate_degree prop_bachelor_degree prop_advanced_degree births_per_thousand
0100100 lauderdale_colbert_franklin_counties_puma_alabama 0.0313829 0.0830847 0.3394154 0.2447783 0.0894880 0.1354070 0.0764437 47.03931
0100200 limestone_county_puma_alabama 0.0341747 0.0816135 0.2531057 0.2344717 0.1143529 0.1812918 0.1009899 38.29303
0100300 morgan_lawrence_counties_decatur_city_puma_alabama 0.0450093 0.0954748 0.3419250 0.2191532 0.0916538 0.1349955 0.0717882 54.07314
0100401 madison_county_north_east_huntsville_city_east_puma_alabama 0.0222519 0.0418584 0.2557741 0.2203623 0.0882072 0.2403023 0.1312437 49.73894
0100402 huntsville_north_far_west_madison_east_triana_cities_puma_alabama 0.0163848 0.0373943 0.1927246 0.2424989 0.0848961 0.2678628 0.1582385 47.86090
0100403 huntsville_city_central_south_puma_alabama 0.0222660 0.0621319 0.1977666 0.2497735 0.0774214 0.2551871 0.1354535 52.11753

Taking an initial look at our data we can see that both the female population and birth rates per thousand women are normally distributed. The female population has a bit of tail which is not unexpected given some areas have a much higher population, and this is addressed by tracking proportional education levels to prevent larger populations skewing the analysis.

pop_dist_plot <- labeled_education_data |>
  filter(label == "total_female") |>
  ggplot(aes(x = estimate)) +
  geom_histogram() +
  labs(title = "Distribution of Female Population in Census PUMAs", x = "Female Population Estimate") 
pop_dist_plot

birth_dist_plot <- births_per_thousand |>
  ggplot(aes(x = births_per_thousand)) +
  geom_histogram() +
  labs(title = "Distribution of Birth Rate in Census PUMAs", x = "Births per Thousand Women")
birth_dist_plot

The distribution of educational achievement is much more problematic for our analysis. There are a significant number of outliers in each level, in particular in the populations that have attained a graduate degree or higher and those who have attained less than a 9th grade education. Analysis will be done on the original distributions as well as transformed distributions to determine if the outliers significantly degrade the predictive power of education levels. For those intereseted the large outliers in graduate level education are the Washington, D.C. and Silicon Valley areas, and the large outlier in some college is the Provo, Utah area.

female_pop_totals <- summarised_ed |>
  filter(relabel == "total_female") |>
  pivot_wider(names_from = relabel, values_from = pop_estimate)

percent_education <- summarised_ed |>
  left_join(female_pop_totals, by = "GEOID") |>
  filter(!str_detect(relabel, "total_female")) |>
  mutate(pop_percent = pop_estimate / total_female * 100) |>
  select(GEOID, relabel, pop_percent)

education_order <- c("graduate_or_professional_degree", "bachelor_s_degree", "associate_s_degree", "some_college_no_degree", "high_school_graduate_includes_equivalency",  "9th_to_12th_grade_no_diploma", "less_than_9th_grade")

massaged_ed <- percent_education |>
  mutate(ed_factor = factor(relabel, levels = education_order)) |>
  mutate(renamed = fct_recode(ed_factor, "< 9th Grade" = "less_than_9th_grade", "9th-12th" = "9th_to_12th_grade_no_diploma", "HS diploma or equivalent" = "high_school_graduate_includes_equivalency", "Some College" = "some_college_no_degree", "Associate's Degree" = "associate_s_degree", "Bachelor's Degree" = "bachelor_s_degree", "Graduate Degree or Higher" = "graduate_or_professional_degree")) |>
  select(GEOID, renamed, pop_percent)

ed_dist_plot <- massaged_ed |>
  ggplot(aes(x = renamed, y = pop_percent)) + 
  geom_boxplot() +
  labs(title = "Education Distribution of the Female Population of PUMAs", x = "Highest Level of Education Attained",
       y = "Percent of Female Population") +
  coord_flip()
ed_dist_plot

Given the independent variables used in our analysis are all proportions of the same population and proportions are inherently zero sum we would expect a high level of correlation between variables. High levels of correlation between predictors can be problematic for linear models, and we will therefore have to address this issue. The most correlated predictors are the populations that have attained advanced degrees, those with a bachelor’s degree, and those with no diploma or equivalent. The correlation between predictors will be addressed later on in the analysis.

ggpairs(prop_fertility_and_education |> select(!c("GEOID", "NAME.x", "births_per_thousand")))

Data Analysis

We will begin by performing multiple linear regression on the untransformed data to provide a baseline for analysis and demonstrate why the independant variables need to be transformed. We can see the model completely disregards the population with advanced degrees as it is perfectly correlated with the other predictors. Performing the analysis required to answer our research question would be impossible using this model, so the deficiencies in our data must be addressed.

raw_model <- lm(births_per_thousand ~ prop_less_ninth + prop_no_diploma + prop_diploma_or_equivalent + prop_some_college + prop_associate_degree + prop_bachelor_degree + prop_advanced_degree, data = prop_fertility_and_education)

summary(raw_model)
## 
## Call:
## lm(formula = births_per_thousand ~ prop_less_ninth + prop_no_diploma + 
##     prop_diploma_or_equivalent + prop_some_college + prop_associate_degree + 
##     prop_bachelor_degree + prop_advanced_degree, data = prop_fertility_and_education)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.503  -7.020  -0.329   6.675 121.461 
## 
## Coefficients: (1 not defined because of singularities)
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  12.363      6.732   1.836  0.06640 .  
## prop_less_ninth              -1.546     10.019  -0.154  0.87741    
## prop_no_diploma             121.175     17.647   6.867 8.27e-12 ***
## prop_diploma_or_equivalent   47.166      8.131   5.801 7.45e-09 ***
## prop_some_college            33.458      7.869   4.252 2.20e-05 ***
## prop_associate_degree        62.862     11.934   5.267 1.50e-07 ***
## prop_bachelor_degree         33.262     12.274   2.710  0.00678 ** 
## prop_advanced_degree             NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.22 on 2479 degrees of freedom
## Multiple R-squared:  0.1606, Adjusted R-squared:  0.1586 
## F-statistic: 79.08 on 6 and 2479 DF,  p-value: < 2.2e-16

Before proceeding, it is necessary to look at the residuals for the model to determine if we can have confidence in any of the metrics produced. We can see that the residuals are normally distributed and we can therefore have confidence in the accuracy of any conclusions reached. The residuals for all subsequent models are included but will not be explicitly discussed as all follow the same distribution.

ggplot(data = raw_model, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  xlab("Fitted values") +
  ylab("Residuals")

ggplot(data = raw_model, aes(x = .resid)) +
  geom_histogram() +
  xlab("Residuals")

ggplot(data = raw_model, aes(sample = .resid)) +
  stat_qq()

We will transform our dependent variables using a Box-Cox transformation in order to provide the best predictive power possible to the model. As we can see there are significantly fewer outliers in the transformed data and a much more normal distribution in each variable. The variables are split into two plots due to differences in scaling and for no other reason.

trans_df <- prop_fertility_and_education |>
  mutate(
    across(
      .cols = where(is.numeric),
      .fns = ~ predict(BoxCoxTrans(.x), .x),
      .names = "{.col}_bc"
    ),
    .keep = "unused"
  )

education_order_trans <- c("prop_advanced_degree_bc", "prop_bachelor_degree_bc", "prop_associate_degree_bc", "prop_some_college_bc", "prop_diploma_or_equivalent_bc",  "prop_no_diploma_bc", "prop_less_ninth_bc")

trans_vis <- trans_df |>
  select(!c(NAME.x, births_per_thousand_bc)) |>
  pivot_longer(cols = c(prop_less_ninth_bc:prop_advanced_degree_bc), names_to = "ed_level", values_to = "adjusted_proportion") |>
  mutate(ed_factor = factor(ed_level, levels = education_order_trans)) |>
  mutate(renamed = fct_recode(ed_factor, "< 9th Grade" = "prop_less_ninth_bc", "9th-12th" = "prop_no_diploma_bc", "HS diploma or equivalent" = "prop_diploma_or_equivalent_bc", "Some College" = "prop_some_college_bc", "Associate's Degree" = "prop_associate_degree_bc", "Bachelor's Degree" = "prop_bachelor_degree_bc", "Graduate Degree or Higher" = "prop_advanced_degree_bc")) 

trans_vis |>
  filter(ed_level %in% c("prop_less_ninth_bc", "prop_no_diploma_bc", "prop_bachelor_degree_bc", "prop_advanced_degree_bc")) |>
  ggplot(aes(x = renamed, y = adjusted_proportion)) + 
  geom_boxplot() +
  labs(title = "Adjusted Education Distribution of the Female Population of PUMAs", x = "Highest Level of Education Attained",
       y = "Adjusted Measure of Female Population") +
  coord_flip()

trans_vis |>
  filter(ed_level %in% c("prop_diploma_or_equivalent_bc", "prop_some_college_bc", "prop_associate_degree_bc")) |>
  ggplot(aes(x = renamed, y = adjusted_proportion)) + 
  geom_boxplot() +
  labs(title = "Adjusted Education Distribution of the Female Population of PUMAs", x = "Highest Level of Education Attained",
       y = "Adjusted Measure of Female Population") +
  coord_flip()

Transforming the data actually leads to a larger degree of correlation between the predictors, which will still need to be addressed.

ggpairs(trans_df |> select(!c("GEOID", "NAME.x", "births_per_thousand_bc")))

Before doing so, we will perform another regression on the transformed data to get a true baseline for later comparison. Doing so reveals three things of note. Firstly, there are only two predictors with negative coefficients, the population with less than a 9th grade education and the population with an advanced degree with the latter predictor being the most significant predictor by far. Secondly, half of the predictors indicating some level of higher education (some college and bachelor’s degree holders) have no statistical significance at all. Third, the adjusted \(R^2\) indicates the model has an extremely low level of predictive power. This model provides an early indication that the proportional education level of the female population is likely not a strong predictor of birth rates and the pursuit of higher education does not initially appear to have much if any predictive power. As noted before, there are issues with predictor correlation that must be addressed before drawing any conclusions. Additionally, our research question asked whether the pursuit of higher education was a significant predictor and it could be argued our current division of the education levels is too granular to answer this question.

proportional_model <- lm(births_per_thousand_bc ~ prop_less_ninth_bc + prop_no_diploma_bc + prop_diploma_or_equivalent_bc + prop_some_college_bc + prop_associate_degree_bc + prop_bachelor_degree_bc + prop_advanced_degree_bc, data = trans_df)

summary(proportional_model)
## 
## Call:
## lm(formula = births_per_thousand_bc ~ prop_less_ninth_bc + prop_no_diploma_bc + 
##     prop_diploma_or_equivalent_bc + prop_some_college_bc + prop_associate_degree_bc + 
##     prop_bachelor_degree_bc + prop_advanced_degree_bc, data = trans_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.4356  -2.0491  -0.0147   2.1081  30.5056 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    29.0101     5.4580   5.315 1.16e-07 ***
## prop_less_ninth_bc             -0.3700     0.1726  -2.143 0.032206 *  
## prop_no_diploma_bc              3.1943     0.8332   3.834 0.000129 ***
## prop_diploma_or_equivalent_bc   7.4106     2.8733   2.579 0.009962 ** 
## prop_some_college_bc            2.5667     3.5184   0.730 0.465756    
## prop_associate_degree_bc       12.4800     3.4467   3.621 0.000300 ***
## prop_bachelor_degree_bc         0.4401     0.7920   0.556 0.578484    
## prop_advanced_degree_bc        -1.3661     0.3340  -4.091 4.44e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.425 on 2478 degrees of freedom
## Multiple R-squared:  0.1658, Adjusted R-squared:  0.1634 
## F-statistic: 70.35 on 7 and 2478 DF,  p-value: < 2.2e-16

Model residuals, for posterity.

ggplot(data = proportional_model, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  xlab("Fitted values") +
  ylab("Residuals")

ggplot(data = proportional_model, aes(x = .resid)) +
  geom_histogram() +
  xlab("Residuals")

ggplot(data = proportional_model, aes(sample = .resid)) +
  stat_qq()

In order to address the issues with correlation as well as the potentially too granular breakdown of the population we will aggregate the population into “milestone” groups focused on higher education. The first group will be the proportion of the population with no college, the next the proportion who have some college up to attaining a bachelor’s degree, and the last those who have attained an advanced degree. We can then look at the correlation between these aggregated populations and select our predictors.

prop_milestone <- prop_fertility_and_education |>
  mutate(prop_any_college = prop_some_college + prop_associate_degree + prop_bachelor_degree,
         prop_no_college = prop_less_ninth + prop_no_diploma + prop_diploma_or_equivalent) |>
  select(GEOID, NAME.x, prop_any_college, prop_no_college, prop_advanced_degree, births_per_thousand) |>
  mutate(
    across(
      .cols = where(is.numeric),
      .fns = ~ predict(BoxCoxTrans(.x), .x),
      .names = "{.col}_bc"
    ),
    .keep = "unused"
  )
milestone_model <- lm(births_per_thousand_bc ~ prop_no_college_bc + prop_any_college_bc + prop_advanced_degree_bc, data = prop_milestone)

# summary(milestone_model)

We can see that the two populations of interest, the “any college” and “advanced degree” populations, are also the least correlated populations. This means we can exclude the “no college” predictor from our linear regression as its value is implied by the other predictors.

ggpairs(prop_milestone |> select(!c("GEOID", "NAME.x", "births_per_thousand_bc")))

Performing linear regression on our predictors of interest generates perhaps the most interesting results in this analysis. We can see that while both populations have negative coefficients, the proportion of the population with any college education is nowhere near a statistically significant predictor. This, combined with the consistently low adjusted \(R^2\) values of the regression models provides strong evidence that simply pursuing higher education is not a good predictor of birth rates. It should be noted the proportion of the population with an advanced degree is still a statistically significant negative predictor of birth rates, and has been so in every model we have looked at.

no_cor_degree_model <- lm(births_per_thousand_bc ~ prop_any_college_bc + prop_advanced_degree_bc, data = prop_milestone)

summary(no_cor_degree_model)
## 
## Call:
## lm(formula = births_per_thousand_bc ~ prop_any_college_bc + prop_advanced_degree_bc, 
##     data = prop_milestone)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.7468  -2.0979   0.0398   2.1429  30.5016 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              14.7179     0.7492  19.645   <2e-16 ***
## prop_any_college_bc      -0.7614     2.2965  -0.332     0.74    
## prop_advanced_degree_bc  -2.7078     0.1551 -17.464   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.464 on 2483 degrees of freedom
## Multiple R-squared:  0.1448, Adjusted R-squared:  0.1441 
## F-statistic: 210.3 on 2 and 2483 DF,  p-value: < 2.2e-16

Once again, the residuals.

ggplot(data = no_cor_degree_model, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  xlab("Fitted values") +
  ylab("Residuals")

ggplot(data = no_cor_degree_model, aes(x = .resid)) +
  geom_histogram() +
  xlab("Residuals")

ggplot(data = no_cor_degree_model, aes(sample = .resid)) +
  stat_qq()

Performing our analysis at the broadest possible level, we can divide our population into two categories: those who went to any college, and those who did not. Doing so results in the proportion of the population seeking higher education becoming a statistically significant negative predictor of birth rates, however; our previous models indicate that this is largely due to the predictive power of the population holding an advanced degree. The further decrease in the already extremely low adjusted \(R^2\) of the model appears to support this conclusion.

prop_college <- prop_fertility_and_education |>
  mutate(prop_any_college = prop_associate_degree + prop_bachelor_degree + prop_advanced_degree + prop_some_college) |>
  select(GEOID, NAME.x, prop_any_college, births_per_thousand) |>
  mutate(
    across(
      .cols = where(is.numeric),
      .fns = ~ predict(BoxCoxTrans(.x), .x),
      .names = "{.col}_bc"
    ),
    .keep = "unused"
  )

simple_model <- lm(births_per_thousand_bc ~ prop_any_college_bc, data = prop_college)

summary(simple_model)
## 
## Call:
## lm(formula = births_per_thousand_bc ~ prop_any_college_bc, data = prop_college)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5390  -2.1762   0.0147   2.2732  28.8972 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          28.6156     0.4136   69.18   <2e-16 ***
## prop_any_college_bc -11.8570     0.6382  -18.58   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.51 on 2484 degrees of freedom
## Multiple R-squared:  0.122,  Adjusted R-squared:  0.1216 
## F-statistic: 345.1 on 1 and 2484 DF,  p-value: < 2.2e-16

For the last time, the residuals.

ggplot(data = simple_model, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  xlab("Fitted values") +
  ylab("Residuals")

ggplot(data = simple_model, aes(x = .resid)) +
  geom_histogram() +
  xlab("Residuals")

ggplot(data = simple_model, aes(sample = .resid)) +
  stat_qq()

Conclusion

While it is possible to coerce the population of women who seek higher education into being a statistically significant negative predictor of birth rates, it is a predictor that does not contain any meaningful level of predictive power. It is clear from our analysis that the proportion of the female population seeking or attaining an undergraduate education has no predictive power whatsoever as no model demonstrated statistical significance for the relevant populations until the population of advanced degree holders was included. We can conclude that the only remotely relevant predictor of birth rates among the population pursuing higher education is the proportion who attained an advanced degree. This was always the most statistically significant predictor, and appears to have the majority of the predictive power present in any of the observed models. Ultimately the breakdown of predictive power and statistical significance is not particularly relevant, as the proportion of women who pursue higher education for a given population has virtually no relevance when predicting birth rates.