Setup

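# Install only the packages that are missing, rather than reinstalling on every run
pkgs <- c("tidyverse", "here", "scales")
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))
if (length(missing_pkgs) > 0) install.packages(missing_pkgs)

library(tidyverse) # data import, wrangling, and ggplot2
library(here)      # project-relative file paths
library(scales)    # axis and label formatting

# Shared plot theme: minimal base, bold titles, muted subtitle and caption, no grid lines
theme_acs <- theme_minimal() +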
  theme(
    plot.title = element_text(face = "bold"),
    plot.subtitle = element_text(color = "gray40"),
    plot.caption = element_text(color = "gray50"),
    panel.grid = element_blank()
  )

theme_set(theme_acs)
# Palettes are drawn from the Okabe-Ito palette for color-blind accessibility
region_colors <- c(
  "Northeast" = "#0072B2",
  "Midwest" = "#E69F00",
  "South" = "#009E73",
  "West" = "#CC79A7",
  "Territory" = "#999999"
)

tier_colors <- c(
  "High Growth" = "#009E73",
  "Moderate Growth" = "#E69F00",
  "Stable" = "#0072B2",
  "Decline" = "#D55E00"
)

aging_colors <- c(
  "Aging" = "#D55E00",
  "Transitional" = "#E69F00",
  "Young" = "#0072B2"
)

diversity_colors <- c(
  "High" = "#009E73",
  "Moderate" = "#E69F00",
  "Low" = "#0072B2"
)

Data Acquisition

The five datasets used in this analysis come from the U.S. Census Bureau’s American Community Survey (ACS) 1-Year Demographic Profile table DP05, covering county-level estimates for 2019, 2021, 2022, 2023, and 2024. The year 2020 is absent because the Census Bureau did not release its standard ACS 1-Year estimates for that year after the COVID-19 pandemic disrupted data collection.

Each file is read with col_types = cols(.default = "c") to force all columns to character type on load. This is intentional because the ACS files contain suppression codes like "N", "(X)", and "-" in numeric columns, and those codes would be turned into missing values if the columns were parsed as numbers at read time. They are handled explicitly in the cleaning phase.
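
To see why this matters, here is a minimal sketch on a made-up four-row CSV. The data is illustrative only (the GEO_ID values and the single estimate column are toy values, not taken from the real files). It shows that the codes stay visible and countable as character data, but collapse into indistinguishable NA values once the column is coerced to numeric.

```r
library(readr)

# Toy CSV text (illustrative values only), passed as literal data via I()
csv_text <- I("GEO_ID,DP05_0018E\nA,38.5\nB,N\nC,-\nD,(X)\n")

# Read everything as character, as in the analysis
as_chr <- read_csv(csv_text, col_types = cols(.default = "c"))

# The suppression codes are still distinguishable from each other
suppression_codes <- c("N", "(X)", "-")
n_codes <- sum(as_chr$DP05_0018E %in% suppression_codes)

# Coercing to numeric turns every code into the same NA
as_num <- suppressWarnings(as.numeric(as_chr$DP05_0018E))
n_missing <- sum(is.na(as_num))

n_codes
n_missing
```

Both counts agree here, but only the character version still records which code caused each missing value.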

df_2019 <- read_csv(
  here("data", "raw", "2019", "ACSDP1Y2019.DP05-Data.CSV"),
  col_types = cols(.default = "c"),
  show_col_types = FALSE
)

df_2021 <- read_csv(
  here("data", "raw", "2021", "ACSDP1Y2021.DP05-Data.CSV"),
  col_types = cols(.default = "c"),
  show_col_types = FALSE
)

df_2022 <- read_csv(
  here("data", "raw", "2022", "ACSDP1Y2022.DP05-Data.CSV"),
  col_types = cols(.default = "c"),
  show_col_types = FALSE
)

df_2023 <- read_csv(
  here("data", "raw", "2023", "ACSDP1Y2023.DP05-Data.CSV"),
  col_types = cols(.default = "c"),
  show_col_types = FALSE
)

df_2024 <- read_csv(
  here("data", "raw", "2024", "ACSDP1Y2024.DP05-Data.CSV"),
  col_types = cols(.default = "c"),
  show_col_types = FALSE
)
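
The five calls above differ only in the year, so the same pattern could be wrapped in a small function and mapped over the years. The sketch below is one possible way to do that, not the approach used in this analysis. The helper read_dp05(), its root argument, and the throwaway demo files are all hypothetical and exist only so the example runs without the real downloads.

```r
library(readr)
library(purrr)
library(tibble)

# Hypothetical helper: build the path for one survey year under `root`
# and read every column as character.
read_dp05 <- function(year, root) {
  path <- file.path(
    root, as.character(year),
    paste0("ACSDP1Y", year, ".DP05-Data.CSV")
  )
  read_csv(path, col_types = cols(.default = "c"))
}

# Throwaway demo files (toy contents) in a temporary directory
demo_root <- file.path(tempdir(), "dp05_demo")
demo_years <- c(2019, 2021)
for (yr in demo_years) {
  year_dir <- file.path(demo_root, as.character(yr))
  dir.create(year_dir, recursive = TRUE, showWarnings = FALSE)
  write_csv(
    tibble(
      GEO_ID = c("Geography", "toy-county-1"),
      DP05_0001E = c("Estimate!!SEX AND AGE!!Total population", "N")
    ),
    file.path(year_dir, paste0("ACSDP1Y", yr, ".DP05-Data.CSV"))
  )
}

# One named list element per year instead of one object per year
dp05_raw <- map(
  setNames(demo_years, demo_years),
  function(yr) read_dp05(yr, demo_root)
)

names(dp05_raw)
map_int(dp05_raw, nrow)
```

With the real files, root would be here("data", "raw") and the years would be the five survey years. Keeping separate df_ objects, as this analysis does, is equally valid and arguably easier to follow.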

Initial Exploration

Before making any changes to the raw files, we take a first look at the structure and contents of each dataset. This step helps us understand the shape of the data, confirm that all files loaded correctly, and identify anything unexpected before we start cleaning.

Dimensions

tibble(
  year = c(2019, 2021, 2022, 2023, 2024),
  rows = c(
    nrow(df_2019),
    nrow(df_2021),
    nrow(df_2022),
    nrow(df_2023),
    nrow(df_2024)
  ),
  cols = c(
    ncol(df_2019),
    ncol(df_2021),
    ncol(df_2022),
    ncol(df_2023),
    ncol(df_2024)
  )
)

The row count increases each year from 841 to 862, reflecting the ACS 1-Year survey’s population threshold: only counties with 65,000 or more residents are included, and more counties cross that threshold over time. The column count grows more sharply, from 359 in 2019 to 435 in 2024, due to structural changes the Census Bureau made to its race and ethnicity question following the 2020 Census. This means we cannot simply select the same column codes across all five years and will need to map each variable to its correct code per year.
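
One way to build that mapping is to look a variable up by its description instead of its code. The sketch below assumes the first row of a dataframe holds the column descriptions and that the description text for a variable is identical across years, which would need to be checked against the real files. The helper find_code(), the toy dataframes, and the X_ column codes are hypothetical.

```r
library(tibble)

# Hypothetical helper: return the column name(s) whose first-row
# description matches `label` exactly.
find_code <- function(df, label) {
  labels <- unlist(df[1, ], use.names = TRUE)
  names(labels)[!is.na(labels) & labels == label]
}

# Toy data: the same description sits under different codes
toy_old <- tibble(
  GEO_ID  = c("Geography", "toy-county-1"),
  X_0001E = c("Estimate!!Total population", "100"),
  X_0002E = c("Estimate!!Median age (years)", "38.5")
)
toy_new <- tibble(
  GEO_ID  = c("Geography", "toy-county-1"),
  X_0001E = c("Estimate!!Total population", "110"),
  X_0002E = c("Estimate!!Some new category", "5"),
  X_0003E = c("Estimate!!Median age (years)", "39.1")
)

target <- "Estimate!!Median age (years)"
code_old <- find_code(toy_old, target)
code_new <- find_code(toy_new, target)

code_old
code_new
```

A lookup like this returns zero or several codes if the description was reworded or is repeated in another section of the table, so the result should be checked before it is used to select columns.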

Structure

glimpse(df_2019)
## Rows: 841
## Columns: 359
## $ GEO_ID      <chr> "Geography", "0500000US01003", "0500000US01015", "0500000U…
## $ NAME        <chr> "Geographic Area Name", "Baldwin County, Alabama", "Calhou…
## $ DP05_0001E  <chr> "Estimate!!SEX AND AGE!!Total population", "223234", "1136…
## $ DP05_0001M  <chr> "Margin of Error!!SEX AND AGE!!Total population", "*****",…
## $ DP05_0002E  <chr> "Estimate!!SEX AND AGE!!Total population!!Male", "109192",…
## $ DP05_0002M  <chr> "Margin of Error!!SEX AND AGE!!Total population!!Male", "1…
## $ DP05_0003E  <chr> "Estimate!!SEX AND AGE!!Total population!!Female", "114042…
## $ DP05_0003M  <chr> "Margin of Error!!SEX AND AGE!!Total population!!Female", …
## $ DP05_0004E  <chr> "Estimate!!SEX AND AGE!!Total population!!Sex ratio (males…
## $ DP05_0004M  <chr> "Margin of Error!!SEX AND AGE!!Total population!!Sex ratio…
## $ DP05_0005E  <chr> "Estimate!!SEX AND AGE!!Total population!!Under 5 years", …
## $ DP05_0005M  <chr> "Margin of Error!!SEX AND AGE!!Total population!!Under 5 y…
## $ DP05_0006E  <chr> "Estimate!!SEX AND AGE!!Total population!!5 to 9 years", "…
## $ DP05_0006M  <chr> "Margin of Error!!SEX AND AGE!!Total population!!5 to 9 ye…
## $ DP05_0007E  <chr> "Estimate!!SEX AND AGE!!Total population!!10 to 14 years",…
## $ DP05_0007M  <chr> "Margin of Error!!SEX AND AGE!!Total population!!10 to 14 …
## $ DP05_0008E  <chr> "Estimate!!SEX AND AGE!!Total population!!15 to 19 years",…
## $ DP05_0008M  <chr> "Margin of Error!!SEX AND AGE!!Total population!!15 to 19 …
## $ DP05_0009E  <chr> "Estimate!!SEX AND AGE!!Total population!!20 to 24 years",…
## $ DP05_0009M  <chr> "Margin of Error!!SEX AND AGE!!Total population!!20 to 24 …
## $ DP05_0010E  <chr> "Estimate!!SEX AND AGE!!Total population!!25 to 34 years",…
## $ DP05_0010M  <chr> "Margin of Error!!SEX AND AGE!!Total population!!25 to 34 …
## $ DP05_0011E  <chr> "Estimate!!SEX AND AGE!!Total population!!35 to 44 years",…
## $ DP05_0011M  <chr> "Margin of Error!!SEX AND AGE!!Total population!!35 to 44 …
## $ DP05_0012E  <chr> "Estimate!!SEX AND AGE!!Total population!!45 to 54 years",…
## $ DP05_0012M  <chr> "Margin of Error!!SEX AND AGE!!Total population!!45 to 54 …
## $ DP05_0013E  <chr> "Estimate!!SEX AND AGE!!Total population!!55 to 59 years",…
## $ DP05_0013M  <chr> "Margin of Error!!SEX AND AGE!!Total population!!55 to 59 …
## $ DP05_0014E  <chr> "Estimate!!SEX AND AGE!!Total population!!60 to 64 years",…
## $ DP05_0014M  <chr> "Margin of Error!!SEX AND AGE!!Total population!!60 to 64 …
## $ DP05_0015E  <chr> "Estimate!!SEX AND AGE!!Total population!!65 to 74 years",…
## $ DP05_0015M  <chr> "Margin of Error!!SEX AND AGE!!Total population!!65 to 74 …
## $ DP05_0016E  <chr> "Estimate!!SEX AND AGE!!Total population!!75 to 84 years",…
## $ DP05_0016M  <chr> "Margin of Error!!SEX AND AGE!!Total population!!75 to 84 …
## $ DP05_0017E  <chr> "Estimate!!SEX AND AGE!!Total population!!85 years and ove…
## $ DP05_0017M  <chr> "Margin of Error!!SEX AND AGE!!Total population!!85 years …
## $ DP05_0018E  <chr> "Estimate!!SEX AND AGE!!Total population!!Median age (year…
## $ DP05_0018M  <chr> "Margin of Error!!SEX AND AGE!!Total population!!Median ag…
## $ DP05_0019E  <chr> "Estimate!!SEX AND AGE!!Total population!!Under 18 years",…
## $ DP05_0019M  <chr> "Margin of Error!!SEX AND AGE!!Total population!!Under 18 …
## $ DP05_0020E  <chr> "Estimate!!SEX AND AGE!!Total population!!16 years and ove…
## $ DP05_0020M  <chr> "Margin of Error!!SEX AND AGE!!Total population!!16 years …
## $ DP05_0021E  <chr> "Estimate!!SEX AND AGE!!Total population!!18 years and ove…
## $ DP05_0021M  <chr> "Margin of Error!!SEX AND AGE!!Total population!!18 years …
## $ DP05_0022E  <chr> "Estimate!!SEX AND AGE!!Total population!!21 years and ove…
## $ DP05_0022M  <chr> "Margin of Error!!SEX AND AGE!!Total population!!21 years …
## $ DP05_0023E  <chr> "Estimate!!SEX AND AGE!!Total population!!62 years and ove…
## $ DP05_0023M  <chr> "Margin of Error!!SEX AND AGE!!Total population!!62 years …
## $ DP05_0024E  <chr> "Estimate!!SEX AND AGE!!Total population!!65 years and ove…
## $ DP05_0024M  <chr> "Margin of Error!!SEX AND AGE!!Total population!!65 years …
## $ DP05_0025E  <chr> "Estimate!!SEX AND AGE!!Total population!!18 years and ove…
## $ DP05_0025M  <chr> "Margin of Error!!SEX AND AGE!!Total population!!18 years …
## $ DP05_0026E  <chr> "Estimate!!SEX AND AGE!!Total population!!18 years and ove…
## $ DP05_0026M  <chr> "Margin of Error!!SEX AND AGE!!Total population!!18 years …
## $ DP05_0027E  <chr> "Estimate!!SEX AND AGE!!Total population!!18 years and ove…
## $ DP05_0027M  <chr> "Margin of Error!!SEX AND AGE!!Total population!!18 years …
## $ DP05_0028E  <chr> "Estimate!!SEX AND AGE!!Total population!!18 years and ove…
## $ DP05_0028M  <chr> "Margin of Error!!SEX AND AGE!!Total population!!18 years …
## $ DP05_0029E  <chr> "Estimate!!SEX AND AGE!!Total population!!65 years and ove…
## $ DP05_0029M  <chr> "Margin of Error!!SEX AND AGE!!Total population!!65 years …
## $ DP05_0030E  <chr> "Estimate!!SEX AND AGE!!Total population!!65 years and ove…
## $ DP05_0030M  <chr> "Margin of Error!!SEX AND AGE!!Total population!!65 years …
## $ DP05_0031E  <chr> "Estimate!!SEX AND AGE!!Total population!!65 years and ove…
## $ DP05_0031M  <chr> "Margin of Error!!SEX AND AGE!!Total population!!65 years …
## $ DP05_0032E  <chr> "Estimate!!SEX AND AGE!!Total population!!65 years and ove…
## $ DP05_0032M  <chr> "Margin of Error!!SEX AND AGE!!Total population!!65 years …
## $ DP05_0033E  <chr> "Estimate!!RACE!!Total population", "223234", "113605", "8…
## $ DP05_0033M  <chr> "Margin of Error!!RACE!!Total population", "*****", "*****…
## $ DP05_0034E  <chr> "Estimate!!RACE!!Total population!!One race", "218523", "1…
## $ DP05_0034M  <chr> "Margin of Error!!RACE!!Total population!!One race", "2205…
## $ DP05_0035E  <chr> "Estimate!!RACE!!Total population!!Two or more races", "47…
## $ DP05_0035M  <chr> "Margin of Error!!RACE!!Total population!!Two or more race…
## $ DP05_0036E  <chr> "Estimate!!RACE!!Total population!!One race", "218523", "1…
## $ DP05_0036M  <chr> "Margin of Error!!RACE!!Total population!!One race", "2205…
## $ DP05_0037E  <chr> "Estimate!!RACE!!Total population!!One race!!White", "1909…
## $ DP05_0037M  <chr> "Margin of Error!!RACE!!Total population!!One race!!White"…
## $ DP05_0038E  <chr> "Estimate!!RACE!!Total population!!One race!!Black or Afri…
## $ DP05_0038M  <chr> "Margin of Error!!RACE!!Total population!!One race!!Black …
## $ DP05_0039E  <chr> "Estimate!!RACE!!Total population!!One race!!American Indi…
## $ DP05_0039M  <chr> "Margin of Error!!RACE!!Total population!!One race!!Americ…
## $ DP05_0040E  <chr> "Estimate!!RACE!!Total population!!One race!!American Indi…
## $ DP05_0040M  <chr> "Margin of Error!!RACE!!Total population!!One race!!Americ…
## $ DP05_0041E  <chr> "Estimate!!RACE!!Total population!!One race!!American Indi…
## $ DP05_0041M  <chr> "Margin of Error!!RACE!!Total population!!One race!!Americ…
## $ DP05_0042E  <chr> "Estimate!!RACE!!Total population!!One race!!American Indi…
## $ DP05_0042M  <chr> "Margin of Error!!RACE!!Total population!!One race!!Americ…
## $ DP05_0043E  <chr> "Estimate!!RACE!!Total population!!One race!!American Indi…
## $ DP05_0043M  <chr> "Margin of Error!!RACE!!Total population!!One race!!Americ…
## $ DP05_0044E  <chr> "Estimate!!RACE!!Total population!!One race!!Asian", "2160…
## $ DP05_0044M  <chr> "Margin of Error!!RACE!!Total population!!One race!!Asian"…
## $ DP05_0045E  <chr> "Estimate!!RACE!!Total population!!One race!!Asian!!Asian …
## $ DP05_0045M  <chr> "Margin of Error!!RACE!!Total population!!One race!!Asian!…
## $ DP05_0046E  <chr> "Estimate!!RACE!!Total population!!One race!!Asian!!Chines…
## $ DP05_0046M  <chr> "Margin of Error!!RACE!!Total population!!One race!!Asian!…
## $ DP05_0047E  <chr> "Estimate!!RACE!!Total population!!One race!!Asian!!Filipi…
## $ DP05_0047M  <chr> "Margin of Error!!RACE!!Total population!!One race!!Asian!…
## $ DP05_0048E  <chr> "Estimate!!RACE!!Total population!!One race!!Asian!!Japane…
## $ DP05_0048M  <chr> "Margin of Error!!RACE!!Total population!!One race!!Asian!…
## $ DP05_0049E  <chr> "Estimate!!RACE!!Total population!!One race!!Asian!!Korean…
## $ DP05_0049M  <chr> "Margin of Error!!RACE!!Total population!!One race!!Asian!…
## $ DP05_0050E  <chr> "Estimate!!RACE!!Total population!!One race!!Asian!!Vietna…
## $ DP05_0050M  <chr> "Margin of Error!!RACE!!Total population!!One race!!Asian!…
## $ DP05_0051E  <chr> "Estimate!!RACE!!Total population!!One race!!Asian!!Other …
## $ DP05_0051M  <chr> "Margin of Error!!RACE!!Total population!!One race!!Asian!…
## $ DP05_0052E  <chr> "Estimate!!RACE!!Total population!!One race!!Native Hawaii…
## $ DP05_0052M  <chr> "Margin of Error!!RACE!!Total population!!One race!!Native…
## $ DP05_0053E  <chr> "Estimate!!RACE!!Total population!!One race!!Native Hawaii…
## $ DP05_0053M  <chr> "Margin of Error!!RACE!!Total population!!One race!!Native…
## $ DP05_0054E  <chr> "Estimate!!RACE!!Total population!!One race!!Native Hawaii…
## $ DP05_0054M  <chr> "Margin of Error!!RACE!!Total population!!One race!!Native…
## $ DP05_0055E  <chr> "Estimate!!RACE!!Total population!!One race!!Native Hawaii…
## $ DP05_0055M  <chr> "Margin of Error!!RACE!!Total population!!One race!!Native…
## $ DP05_0056E  <chr> "Estimate!!RACE!!Total population!!One race!!Native Hawaii…
## $ DP05_0056M  <chr> "Margin of Error!!RACE!!Total population!!One race!!Native…
## $ DP05_0057E  <chr> "Estimate!!RACE!!Total population!!One race!!Some other ra…
## $ DP05_0057M  <chr> "Margin of Error!!RACE!!Total population!!One race!!Some o…
## $ DP05_0058E  <chr> "Estimate!!RACE!!Total population!!Two or more races", "47…
## $ DP05_0058M  <chr> "Margin of Error!!RACE!!Total population!!Two or more race…
## $ DP05_0059E  <chr> "Estimate!!RACE!!Total population!!Two or more races!!Whit…
## $ DP05_0059M  <chr> "Margin of Error!!RACE!!Total population!!Two or more race…
## $ DP05_0060E  <chr> "Estimate!!RACE!!Total population!!Two or more races!!Whit…
## $ DP05_0060M  <chr> "Margin of Error!!RACE!!Total population!!Two or more race…
## $ DP05_0061E  <chr> "Estimate!!RACE!!Total population!!Two or more races!!Whit…
## $ DP05_0061M  <chr> "Margin of Error!!RACE!!Total population!!Two or more race…
## $ DP05_0062E  <chr> "Estimate!!RACE!!Total population!!Two or more races!!Blac…
## $ DP05_0062M  <chr> "Margin of Error!!RACE!!Total population!!Two or more race…
## $ DP05_0063E  <chr> "Estimate!!Race alone or in combination with one or more o…
## $ DP05_0063M  <chr> "Margin of Error!!Race alone or in combination with one or…
## $ DP05_0064E  <chr> "Estimate!!Race alone or in combination with one or more o…
## $ DP05_0064M  <chr> "Margin of Error!!Race alone or in combination with one or…
## $ DP05_0065E  <chr> "Estimate!!Race alone or in combination with one or more o…
## $ DP05_0065M  <chr> "Margin of Error!!Race alone or in combination with one or…
## $ DP05_0066E  <chr> "Estimate!!Race alone or in combination with one or more o…
## $ DP05_0066M  <chr> "Margin of Error!!Race alone or in combination with one or…
## $ DP05_0067E  <chr> "Estimate!!Race alone or in combination with one or more o…
## $ DP05_0067M  <chr> "Margin of Error!!Race alone or in combination with one or…
## $ DP05_0068E  <chr> "Estimate!!Race alone or in combination with one or more o…
## $ DP05_0068M  <chr> "Margin of Error!!Race alone or in combination with one or…
## $ DP05_0069E  <chr> "Estimate!!Race alone or in combination with one or more o…
## $ DP05_0069M  <chr> "Margin of Error!!Race alone or in combination with one or…
## $ DP05_0070E  <chr> "Estimate!!HISPANIC OR LATINO AND RACE!!Total population",…
## $ DP05_0070M  <chr> "Margin of Error!!HISPANIC OR LATINO AND RACE!!Total popul…
## $ DP05_0071E  <chr> "Estimate!!HISPANIC OR LATINO AND RACE!!Total population!!…
## $ DP05_0071M  <chr> "Margin of Error!!HISPANIC OR LATINO AND RACE!!Total popul…
## $ DP05_0072E  <chr> "Estimate!!HISPANIC OR LATINO AND RACE!!Total population!!…
## $ DP05_0072M  <chr> "Margin of Error!!HISPANIC OR LATINO AND RACE!!Total popul…
## $ DP05_0073E  <chr> "Estimate!!HISPANIC OR LATINO AND RACE!!Total population!!…
## $ DP05_0073M  <chr> "Margin of Error!!HISPANIC OR LATINO AND RACE!!Total popul…
## $ DP05_0074E  <chr> "Estimate!!HISPANIC OR LATINO AND RACE!!Total population!!…
## $ DP05_0074M  <chr> "Margin of Error!!HISPANIC OR LATINO AND RACE!!Total popul…
## $ DP05_0075E  <chr> "Estimate!!HISPANIC OR LATINO AND RACE!!Total population!!…
## $ DP05_0075M  <chr> "Margin of Error!!HISPANIC OR LATINO AND RACE!!Total popul…
## $ DP05_0076E  <chr> "Estimate!!HISPANIC OR LATINO AND RACE!!Total population!!…
## $ DP05_0076M  <chr> "Margin of Error!!HISPANIC OR LATINO AND RACE!!Total popul…
## $ DP05_0077E  <chr> "Estimate!!HISPANIC OR LATINO AND RACE!!Total population!!…
## $ DP05_0077M  <chr> "Margin of Error!!HISPANIC OR LATINO AND RACE!!Total popul…
## $ DP05_0078E  <chr> "Estimate!!HISPANIC OR LATINO AND RACE!!Total population!!…
## $ DP05_0078M  <chr> "Margin of Error!!HISPANIC OR LATINO AND RACE!!Total popul…
## $ DP05_0079E  <chr> "Estimate!!HISPANIC OR LATINO AND RACE!!Total population!!…
## $ DP05_0079M  <chr> "Margin of Error!!HISPANIC OR LATINO AND RACE!!Total popul…
## $ DP05_0080E  <chr> "Estimate!!HISPANIC OR LATINO AND RACE!!Total population!!…
## $ DP05_0080M  <chr> "Margin of Error!!HISPANIC OR LATINO AND RACE!!Total popul…
## $ DP05_0081E  <chr> "Estimate!!HISPANIC OR LATINO AND RACE!!Total population!!…
## $ DP05_0081M  <chr> "Margin of Error!!HISPANIC OR LATINO AND RACE!!Total popul…
## $ DP05_0082E  <chr> "Estimate!!HISPANIC OR LATINO AND RACE!!Total population!!…
## $ DP05_0082M  <chr> "Margin of Error!!HISPANIC OR LATINO AND RACE!!Total popul…
## $ DP05_0083E  <chr> "Estimate!!HISPANIC OR LATINO AND RACE!!Total population!!…
## $ DP05_0083M  <chr> "Margin of Error!!HISPANIC OR LATINO AND RACE!!Total popul…
## $ DP05_0084E  <chr> "Estimate!!HISPANIC OR LATINO AND RACE!!Total population!!…
## $ DP05_0084M  <chr> "Margin of Error!!HISPANIC OR LATINO AND RACE!!Total popul…
## $ DP05_0085E  <chr> "Estimate!!HISPANIC OR LATINO AND RACE!!Total population!!…
## $ DP05_0085M  <chr> "Margin of Error!!HISPANIC OR LATINO AND RACE!!Total popul…
## $ DP05_0086E  <chr> "Estimate!!Total housing units", "119425", "53809", "38256…
## $ DP05_0086M  <chr> "Margin of Error!!Total housing units", "483", "404", "332…
## $ DP05_0087E  <chr> "Estimate!!CITIZEN, VOTING AGE POPULATION!!Citizen, 18 and…
## $ DP05_0087M  <chr> "Margin of Error!!CITIZEN, VOTING AGE POPULATION!!Citizen,…
## $ DP05_0088E  <chr> "Estimate!!CITIZEN, VOTING AGE POPULATION!!Citizen, 18 and…
## $ DP05_0088M  <chr> "Margin of Error!!CITIZEN, VOTING AGE POPULATION!!Citizen,…
## $ DP05_0089E  <chr> "Estimate!!CITIZEN, VOTING AGE POPULATION!!Citizen, 18 and…
## $ DP05_0089M  <chr> "Margin of Error!!CITIZEN, VOTING AGE POPULATION!!Citizen,…
## $ DP05_0001PE <chr> "Percent!!SEX AND AGE!!Total population", "223234", "11360…
## $ DP05_0001PM <chr> "Percent Margin of Error!!SEX AND AGE!!Total population", …
## $ DP05_0002PE <chr> "Percent!!SEX AND AGE!!Total population!!Male", "48.9", "4…
## $ DP05_0002PM <chr> "Percent Margin of Error!!SEX AND AGE!!Total population!!M…
## $ DP05_0003PE <chr> "Percent!!SEX AND AGE!!Total population!!Female", "51.1", …
## $ DP05_0003PM <chr> "Percent Margin of Error!!SEX AND AGE!!Total population!!F…
## $ DP05_0004PE <chr> "Percent!!SEX AND AGE!!Total population!!Sex ratio (males …
## $ DP05_0004PM <chr> "Percent Margin of Error!!SEX AND AGE!!Total population!!S…
## $ DP05_0005PE <chr> "Percent!!SEX AND AGE!!Total population!!Under 5 years", "…
## $ DP05_0005PM <chr> "Percent Margin of Error!!SEX AND AGE!!Total population!!U…
## $ DP05_0006PE <chr> "Percent!!SEX AND AGE!!Total population!!5 to 9 years", "5…
## $ DP05_0006PM <chr> "Percent Margin of Error!!SEX AND AGE!!Total population!!5…
## $ DP05_0007PE <chr> "Percent!!SEX AND AGE!!Total population!!10 to 14 years", …
## $ DP05_0007PM <chr> "Percent Margin of Error!!SEX AND AGE!!Total population!!1…
## $ DP05_0008PE <chr> "Percent!!SEX AND AGE!!Total population!!15 to 19 years", …
## $ DP05_0008PM <chr> "Percent Margin of Error!!SEX AND AGE!!Total population!!1…
## $ DP05_0009PE <chr> "Percent!!SEX AND AGE!!Total population!!20 to 24 years", …
## $ DP05_0009PM <chr> "Percent Margin of Error!!SEX AND AGE!!Total population!!2…
## $ DP05_0010PE <chr> "Percent!!SEX AND AGE!!Total population!!25 to 34 years", …
## $ DP05_0010PM <chr> "Percent Margin of Error!!SEX AND AGE!!Total population!!2…
## $ DP05_0011PE <chr> "Percent!!SEX AND AGE!!Total population!!35 to 44 years", …
## $ DP05_0011PM <chr> "Percent Margin of Error!!SEX AND AGE!!Total population!!3…
## $ DP05_0012PE <chr> "Percent!!SEX AND AGE!!Total population!!45 to 54 years", …
## $ DP05_0012PM <chr> "Percent Margin of Error!!SEX AND AGE!!Total population!!4…
## $ DP05_0013PE <chr> "Percent!!SEX AND AGE!!Total population!!55 to 59 years", …
## $ DP05_0013PM <chr> "Percent Margin of Error!!SEX AND AGE!!Total population!!5…
## $ DP05_0014PE <chr> "Percent!!SEX AND AGE!!Total population!!60 to 64 years", …
## $ DP05_0014PM <chr> "Percent Margin of Error!!SEX AND AGE!!Total population!!6…
## $ DP05_0015PE <chr> "Percent!!SEX AND AGE!!Total population!!65 to 74 years", …
## $ DP05_0015PM <chr> "Percent Margin of Error!!SEX AND AGE!!Total population!!6…
## $ DP05_0016PE <chr> "Percent!!SEX AND AGE!!Total population!!75 to 84 years", …
## $ DP05_0016PM <chr> "Percent Margin of Error!!SEX AND AGE!!Total population!!7…
## $ DP05_0017PE <chr> "Percent!!SEX AND AGE!!Total population!!85 years and over…
## $ DP05_0017PM <chr> "Percent Margin of Error!!SEX AND AGE!!Total population!!8…
## $ DP05_0018PE <chr> "Percent!!SEX AND AGE!!Total population!!Median age (years…
## $ DP05_0018PM <chr> "Percent Margin of Error!!SEX AND AGE!!Total population!!M…
## $ DP05_0019PE <chr> "Percent!!SEX AND AGE!!Total population!!Under 18 years", …
## $ DP05_0019PM <chr> "Percent Margin of Error!!SEX AND AGE!!Total population!!U…
## $ DP05_0020PE <chr> "Percent!!SEX AND AGE!!Total population!!16 years and over…
## $ DP05_0020PM <chr> "Percent Margin of Error!!SEX AND AGE!!Total population!!1…
## $ DP05_0021PE <chr> "Percent!!SEX AND AGE!!Total population!!18 years and over…
## $ DP05_0021PM <chr> "Percent Margin of Error!!SEX AND AGE!!Total population!!1…
## $ DP05_0022PE <chr> "Percent!!SEX AND AGE!!Total population!!21 years and over…
## $ DP05_0022PM <chr> "Percent Margin of Error!!SEX AND AGE!!Total population!!2…
## $ DP05_0023PE <chr> "Percent!!SEX AND AGE!!Total population!!62 years and over…
## $ DP05_0023PM <chr> "Percent Margin of Error!!SEX AND AGE!!Total population!!6…
## $ DP05_0024PE <chr> "Percent!!SEX AND AGE!!Total population!!65 years and over…
## $ DP05_0024PM <chr> "Percent Margin of Error!!SEX AND AGE!!Total population!!6…
## $ DP05_0025PE <chr> "Percent!!SEX AND AGE!!Total population!!18 years and over…
## $ DP05_0025PM <chr> "Percent Margin of Error!!SEX AND AGE!!Total population!!1…
## $ DP05_0026PE <chr> "Percent!!SEX AND AGE!!Total population!!18 years and over…
## $ DP05_0026PM <chr> "Percent Margin of Error!!SEX AND AGE!!Total population!!1…
## $ DP05_0027PE <chr> "Percent!!SEX AND AGE!!Total population!!18 years and over…
## $ DP05_0027PM <chr> "Percent Margin of Error!!SEX AND AGE!!Total population!!1…
## $ DP05_0028PE <chr> "Percent!!SEX AND AGE!!Total population!!18 years and over…
## $ DP05_0028PM <chr> "Percent Margin of Error!!SEX AND AGE!!Total population!!1…
## $ DP05_0029PE <chr> "Percent!!SEX AND AGE!!Total population!!65 years and over…
## $ DP05_0029PM <chr> "Percent Margin of Error!!SEX AND AGE!!Total population!!6…
## $ DP05_0030PE <chr> "Percent!!SEX AND AGE!!Total population!!65 years and over…
## $ DP05_0030PM <chr> "Percent Margin of Error!!SEX AND AGE!!Total population!!6…
## $ DP05_0031PE <chr> "Percent!!SEX AND AGE!!Total population!!65 years and over…
## $ DP05_0031PM <chr> "Percent Margin of Error!!SEX AND AGE!!Total population!!6…
## $ DP05_0032PE <chr> "Percent!!SEX AND AGE!!Total population!!65 years and over…
## $ DP05_0032PM <chr> "Percent Margin of Error!!SEX AND AGE!!Total population!!6…
## $ DP05_0033PE <chr> "Percent!!RACE!!Total population", "223234", "113605", "83…
## $ DP05_0033PM <chr> "Percent Margin of Error!!RACE!!Total population", "(X)", …
## $ DP05_0034PE <chr> "Percent!!RACE!!Total population!!One race", "97.9", "97.9…
## $ DP05_0034PM <chr> "Percent Margin of Error!!RACE!!Total population!!One race…
## $ DP05_0035PE <chr> "Percent!!RACE!!Total population!!Two or more races", "2.1…
## $ DP05_0035PM <chr> "Percent Margin of Error!!RACE!!Total population!!Two or m…
## $ DP05_0036PE <chr> "Percent!!RACE!!Total population!!One race", "97.9", "97.9…
## $ DP05_0036PM <chr> "Percent Margin of Error!!RACE!!Total population!!One race…
## $ DP05_0037PE <chr> "Percent!!RACE!!Total population!!One race!!White", "85.5"…
## $ DP05_0037PM <chr> "Percent Margin of Error!!RACE!!Total population!!One race…
## $ DP05_0038PE <chr> "Percent!!RACE!!Total population!!One race!!Black or Afric…
## $ DP05_0038PM <chr> "Percent Margin of Error!!RACE!!Total population!!One race…
## $ DP05_0039PE <chr> "Percent!!RACE!!Total population!!One race!!American India…
## $ DP05_0039PM <chr> "Percent Margin of Error!!RACE!!Total population!!One race…
## $ DP05_0040PE <chr> "Percent!!RACE!!Total population!!One race!!American India…
## $ DP05_0040PM <chr> "Percent Margin of Error!!RACE!!Total population!!One race…
## $ DP05_0041PE <chr> "Percent!!RACE!!Total population!!One race!!American India…
## $ DP05_0041PM <chr> "Percent Margin of Error!!RACE!!Total population!!One race…
## $ DP05_0042PE <chr> "Percent!!RACE!!Total population!!One race!!American India…
## $ DP05_0042PM <chr> "Percent Margin of Error!!RACE!!Total population!!One race…
## $ DP05_0043PE <chr> "Percent!!RACE!!Total population!!One race!!American India…
## $ DP05_0043PM <chr> "Percent Margin of Error!!RACE!!Total population!!One race…
## $ DP05_0044PE <chr> "Percent!!RACE!!Total population!!One race!!Asian", "1.0",…
## $ DP05_0044PM <chr> "Percent Margin of Error!!RACE!!Total population!!One race…
## $ DP05_0045PE <chr> "Percent!!RACE!!Total population!!One race!!Asian!!Asian I…
## $ DP05_0045PM <chr> "Percent Margin of Error!!RACE!!Total population!!One race…
## $ DP05_0046PE <chr> "Percent!!RACE!!Total population!!One race!!Asian!!Chinese…
## $ DP05_0046PM <chr> "Percent Margin of Error!!RACE!!Total population!!One race…
## $ DP05_0047PE <chr> "Percent!!RACE!!Total population!!One race!!Asian!!Filipin…
## $ DP05_0047PM <chr> "Percent Margin of Error!!RACE!!Total population!!One race…
## $ DP05_0048PE <chr> "Percent!!RACE!!Total population!!One race!!Asian!!Japanes…
## $ DP05_0048PM <chr> "Percent Margin of Error!!RACE!!Total population!!One race…
## $ DP05_0049PE <chr> "Percent!!RACE!!Total population!!One race!!Asian!!Korean"…
## $ DP05_0049PM <chr> "Percent Margin of Error!!RACE!!Total population!!One race…
## $ DP05_0050PE <chr> "Percent!!RACE!!Total population!!One race!!Asian!!Vietnam…
## $ DP05_0050PM <chr> "Percent Margin of Error!!RACE!!Total population!!One race…
## $ DP05_0051PE <chr> "Percent!!RACE!!Total population!!One race!!Asian!!Other A…
## $ DP05_0051PM <chr> "Percent Margin of Error!!RACE!!Total population!!One race…
## $ DP05_0052PE <chr> "Percent!!RACE!!Total population!!One race!!Native Hawaiia…
## $ DP05_0052PM <chr> "Percent Margin of Error!!RACE!!Total population!!One race…
## $ DP05_0053PE <chr> "Percent!!RACE!!Total population!!One race!!Native Hawaiia…
## $ DP05_0053PM <chr> "Percent Margin of Error!!RACE!!Total population!!One race…
## $ DP05_0054PE <chr> "Percent!!RACE!!Total population!!One race!!Native Hawaiia…
## $ DP05_0054PM <chr> "Percent Margin of Error!!RACE!!Total population!!One race…
## $ DP05_0055PE <chr> "Percent!!RACE!!Total population!!One race!!Native Hawaiia…
## $ DP05_0055PM <chr> "Percent Margin of Error!!RACE!!Total population!!One race…
## $ DP05_0056PE <chr> "Percent!!RACE!!Total population!!One race!!Native Hawaiia…
## $ DP05_0056PM <chr> "Percent Margin of Error!!RACE!!Total population!!One race…
## $ DP05_0057PE <chr> "Percent!!RACE!!Total population!!One race!!Some other rac…
## $ DP05_0057PM <chr> "Percent Margin of Error!!RACE!!Total population!!One race…
## $ DP05_0058PE <chr> "Percent!!RACE!!Total population!!Two or more races", "2.1…
## $ DP05_0058PM <chr> "Percent Margin of Error!!RACE!!Total population!!Two or m…
## $ DP05_0059PE <chr> "Percent!!RACE!!Total population!!Two or more races!!White…
## $ DP05_0059PM <chr> "Percent Margin of Error!!RACE!!Total population!!Two or m…
## $ DP05_0060PE <chr> "Percent!!RACE!!Total population!!Two or more races!!White…
## $ DP05_0060PM <chr> "Percent Margin of Error!!RACE!!Total population!!Two or m…
## $ DP05_0061PE <chr> "Percent!!RACE!!Total population!!Two or more races!!White…
## $ DP05_0061PM <chr> "Percent Margin of Error!!RACE!!Total population!!Two or m…
## $ DP05_0062PE <chr> "Percent!!RACE!!Total population!!Two or more races!!Black…
## $ DP05_0062PM <chr> "Percent Margin of Error!!RACE!!Total population!!Two or m…
## $ DP05_0063PE <chr> "Percent!!Race alone or in combination with one or more ot…
## $ DP05_0063PM <chr> "Percent Margin of Error!!Race alone or in combination wit…
## $ DP05_0064PE <chr> "Percent!!Race alone or in combination with one or more ot…
## $ DP05_0064PM <chr> "Percent Margin of Error!!Race alone or in combination wit…
## $ DP05_0065PE <chr> "Percent!!Race alone or in combination with one or more ot…
## $ DP05_0065PM <chr> "Percent Margin of Error!!Race alone or in combination wit…
## $ DP05_0066PE <chr> "Percent!!Race alone or in combination with one or more ot…
## $ DP05_0066PM <chr> "Percent Margin of Error!!Race alone or in combination wit…
## $ DP05_0067PE <chr> "Percent!!Race alone or in combination with one or more ot…
## $ DP05_0067PM <chr> "Percent Margin of Error!!Race alone or in combination wit…
## $ DP05_0068PE <chr> "Percent!!Race alone or in combination with one or more ot…
## $ DP05_0068PM <chr> "Percent Margin of Error!!Race alone or in combination wit…
## $ DP05_0069PE <chr> "Percent!!Race alone or in combination with one or more ot…
## $ DP05_0069PM <chr> "Percent Margin of Error!!Race alone or in combination wit…
## $ DP05_0070PE <chr> "Percent!!HISPANIC OR LATINO AND RACE!!Total population", …
## $ DP05_0070PM <chr> "Percent Margin of Error!!HISPANIC OR LATINO AND RACE!!Tot…
## $ DP05_0071PE <chr> "Percent!!HISPANIC OR LATINO AND RACE!!Total population!!H…
## $ DP05_0071PM <chr> "Percent Margin of Error!!HISPANIC OR LATINO AND RACE!!Tot…
## $ DP05_0072PE <chr> "Percent!!HISPANIC OR LATINO AND RACE!!Total population!!H…
## $ DP05_0072PM <chr> "Percent Margin of Error!!HISPANIC OR LATINO AND RACE!!Tot…
## $ DP05_0073PE <chr> "Percent!!HISPANIC OR LATINO AND RACE!!Total population!!H…
## $ DP05_0073PM <chr> "Percent Margin of Error!!HISPANIC OR LATINO AND RACE!!Tot…
## $ DP05_0074PE <chr> "Percent!!HISPANIC OR LATINO AND RACE!!Total population!!H…
## $ DP05_0074PM <chr> "Percent Margin of Error!!HISPANIC OR LATINO AND RACE!!Tot…
## $ DP05_0075PE <chr> "Percent!!HISPANIC OR LATINO AND RACE!!Total population!!H…
## $ DP05_0075PM <chr> "Percent Margin of Error!!HISPANIC OR LATINO AND RACE!!Tot…
## $ DP05_0076PE <chr> "Percent!!HISPANIC OR LATINO AND RACE!!Total population!!N…
## $ DP05_0076PM <chr> "Percent Margin of Error!!HISPANIC OR LATINO AND RACE!!Tot…
## $ DP05_0077PE <chr> "Percent!!HISPANIC OR LATINO AND RACE!!Total population!!N…
## $ DP05_0077PM <chr> "Percent Margin of Error!!HISPANIC OR LATINO AND RACE!!Tot…
## $ DP05_0078PE <chr> "Percent!!HISPANIC OR LATINO AND RACE!!Total population!!N…
## $ DP05_0078PM <chr> "Percent Margin of Error!!HISPANIC OR LATINO AND RACE!!Tot…
## $ DP05_0079PE <chr> "Percent!!HISPANIC OR LATINO AND RACE!!Total population!!N…
## $ DP05_0079PM <chr> "Percent Margin of Error!!HISPANIC OR LATINO AND RACE!!Tot…
## $ DP05_0080PE <chr> "Percent!!HISPANIC OR LATINO AND RACE!!Total population!!N…
## $ DP05_0080PM <chr> "Percent Margin of Error!!HISPANIC OR LATINO AND RACE!!Tot…
## $ DP05_0081PE <chr> "Percent!!HISPANIC OR LATINO AND RACE!!Total population!!N…
## $ DP05_0081PM <chr> "Percent Margin of Error!!HISPANIC OR LATINO AND RACE!!Tot…
## $ DP05_0082PE <chr> "Percent!!HISPANIC OR LATINO AND RACE!!Total population!!N…
## $ DP05_0082PM <chr> "Percent Margin of Error!!HISPANIC OR LATINO AND RACE!!Tot…
## $ DP05_0083PE <chr> "Percent!!HISPANIC OR LATINO AND RACE!!Total population!!N…
## $ DP05_0083PM <chr> "Percent Margin of Error!!HISPANIC OR LATINO AND RACE!!Tot…
## $ DP05_0084PE <chr> "Percent!!HISPANIC OR LATINO AND RACE!!Total population!!N…
## $ DP05_0084PM <chr> "Percent Margin of Error!!HISPANIC OR LATINO AND RACE!!Tot…
## $ DP05_0085PE <chr> "Percent!!HISPANIC OR LATINO AND RACE!!Total population!!N…
## $ DP05_0085PM <chr> "Percent Margin of Error!!HISPANIC OR LATINO AND RACE!!Tot…
## $ DP05_0086PE <chr> "Percent!!Total housing units", "(X)", "(X)", "(X)", "(X)"…
## $ DP05_0086PM <chr> "Percent Margin of Error!!Total housing units", "(X)", "(X…
## $ DP05_0087PE <chr> "Percent!!CITIZEN, VOTING AGE POPULATION!!Citizen, 18 and …
## $ DP05_0087PM <chr> "Percent Margin of Error!!CITIZEN, VOTING AGE POPULATION!!…
## $ DP05_0088PE <chr> "Percent!!CITIZEN, VOTING AGE POPULATION!!Citizen, 18 and …
## $ DP05_0088PM <chr> "Percent Margin of Error!!CITIZEN, VOTING AGE POPULATION!!…
## $ DP05_0089PE <chr> "Percent!!CITIZEN, VOTING AGE POPULATION!!Citizen, 18 and …
## $ DP05_0089PM <chr> "Percent Margin of Error!!CITIZEN, VOTING AGE POPULATION!!…
## $ ...359      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…

The Label Row

df_2019[1, 1:5]

Every DP05 export carries a label row directly beneath the header. It holds verbose column descriptions such as "Estimate!!SEX AND AGE!!Total Population" rather than actual data values. Because read_csv() loads it as the first data row, it must be removed before any transformation or combination step; otherwise it would be treated as a data observation.
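Before the label row is dropped, it can be useful to keep it as a code-to-description lookup. The following is a minimal sketch on a toy two-column tibble; the object names toy and label_lookup are hypothetical and the toy values are for illustration only.

```r
library(tibble)

# Toy stand-in for a DP05 export: row 1 holds labels, row 2 holds data.
# (Illustrative values only; the real files have several hundred columns.)
toy <- tibble(
  GEO_ID     = c("Geography", "0500000US01003"),
  DP05_0001E = c("Estimate!!SEX AND AGE!!Total population", "223234")
)

# Keep the label row as a code -> description lookup before dropping it
label_lookup <- tibble(
  code  = names(toy),
  label = unlist(toy[1, ], use.names = FALSE)
)
label_lookup
```

A lookup like this makes it easier to confirm what a given code means in a given year, which matters once codes start shifting between survey years.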

Cleaning & Transformation

Preserving the Raw Data

Before making any changes we create a working copy of each dataframe. This ensures the original objects loaded from disk remain untouched and can be referenced at any point if needed.

raw_2019 <- df_2019
raw_2021 <- df_2021
raw_2022 <- df_2022
raw_2023 <- df_2023
raw_2024 <- df_2024

Dropping the Label Row and Adding the Survey Year

The label row is removed from each dataframe using slice(-1). Immediately after, we add a survey_year column and position it as the first column using relocate(). Both steps are chained together since there is no reason to separate them.

raw_2019 <- raw_2019 %>%
  slice(-1) %>%
  mutate(survey_year = 2019) %>%
  relocate(survey_year)
raw_2021 <- raw_2021 %>%
  slice(-1) %>%
  mutate(survey_year = 2021) %>%
  relocate(survey_year)
raw_2022 <- raw_2022 %>%
  slice(-1) %>%
  mutate(survey_year = 2022) %>%
  relocate(survey_year)
raw_2023 <- raw_2023 %>%
  slice(-1) %>%
  mutate(survey_year = 2023) %>%
  relocate(survey_year)
raw_2024 <- raw_2024 %>%
  slice(-1) %>%
  mutate(survey_year = 2024) %>%
  relocate(survey_year)
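The five near-identical chains above could also be expressed once as a small helper and applied to a named list. This is only a sketch under that assumption: prep_year() and toy_years are hypothetical names, and the toy tibbles stand in for the real dataframes.

```r
library(dplyr)
library(purrr)
library(tibble)

# Hypothetical helper: drop the label row, then add and relocate survey_year
prep_year <- function(df, year) {
  df %>%
    slice(-1) %>%
    mutate(survey_year = year) %>%
    relocate(survey_year)
}

# Toy inputs standing in for two of the yearly dataframes
toy_years <- list(
  `2019` = tibble(GEO_ID = c("Geography", "0500000US01003")),
  `2021` = tibble(GEO_ID = c("Geography", "0500000US01003"))
)

# imap() passes each element (.x) and its name (.y); the name is the year
prepped <- imap(toy_years, ~ prep_year(.x, as.numeric(.y)))
prepped[["2021"]]
```

Keeping the year in the list name means a sixth survey year would only need one new list entry rather than another copy of the chain.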

Verifying the Label Row is Gone

raw_2019[1, 1:5]

Column Selection and Renaming

The 20 DP05 variables retained for this analysis (plus GEO_ID and NAME) must be selected from each year’s file individually because Census Bureau variable codes shifted between 2019 and later years following updates to the race and ethnicity question. Where a code differs from the 2019 baseline, the shift is documented inline as a comment.

raw_2019 <- raw_2019 %>%
  select(
    survey_year,
    geo_id = GEO_ID,
    name = NAME,
    total_population = DP05_0001E,
    total_male_population = DP05_0002E,
    total_female_population = DP05_0003E,
    sex_ratio = DP05_0004E,
    population_under_5 = DP05_0005E,
    population_under_18 = DP05_0019E,
    population_16_and_over = DP05_0020E,
    population_18_and_over = DP05_0021E,
    population_65_and_over = DP05_0024E,
    median_age = DP05_0018E,
    white_alone = DP05_0037E,
    black_alone = DP05_0038E,
    american_indian_alaska_native_alone = DP05_0039E,
    asian_alone = DP05_0044E,
    native_hawaiian_pacific_islander_alone = DP05_0052E,
    two_or_more_races = DP05_0035E,
    hispanic_latino = DP05_0071E,
    not_hispanic_latino_white_alone = DP05_0077E,
    not_hispanic_latino_black_alone = DP05_0078E,
    total_housing_units = DP05_0086E,
  )
raw_2021 <- raw_2021 %>%
  select(
    survey_year,
    geo_id = GEO_ID,
    name = NAME,
    total_population = DP05_0001E,
    total_male_population = DP05_0002E,
    total_female_population = DP05_0003E,
    sex_ratio = DP05_0004E,
    population_under_5 = DP05_0005E,
    population_under_18 = DP05_0019E,
    population_16_and_over = DP05_0020E,
    population_18_and_over = DP05_0021E,
    population_65_and_over = DP05_0024E,
    median_age = DP05_0018E,
    white_alone = DP05_0037E,
    black_alone = DP05_0038E,
    american_indian_alaska_native_alone = DP05_0039E,
    asian_alone = DP05_0044E,
    native_hawaiian_pacific_islander_alone = DP05_0052E,
    two_or_more_races = DP05_0035E,
    hispanic_latino = DP05_0071E,
    not_hispanic_latino_white_alone = DP05_0077E,
    not_hispanic_latino_black_alone = DP05_0078E,
    total_housing_units = DP05_0086E,
  )
raw_2022 <- raw_2022 %>%
  select(
    survey_year,
    geo_id = GEO_ID,
    name = NAME,
    total_population = DP05_0001E,
    total_male_population = DP05_0002E,
    total_female_population = DP05_0003E,
    sex_ratio = DP05_0004E,
    population_under_5 = DP05_0005E,
    population_under_18 = DP05_0019E,
    population_16_and_over = DP05_0020E,
    population_18_and_over = DP05_0021E,
    population_65_and_over = DP05_0024E,
    median_age = DP05_0018E,
    white_alone = DP05_0037E,
    black_alone = DP05_0038E,
    american_indian_alaska_native_alone = DP05_0039E,
    asian_alone = DP05_0044E,
    native_hawaiian_pacific_islander_alone = DP05_0052E,
    two_or_more_races = DP05_0035E,
    hispanic_latino = DP05_0073E, # Shifted from `DP05_0071E`
    not_hispanic_latino_white_alone = DP05_0079E, # Shifted from `DP05_0077E`
    not_hispanic_latino_black_alone = DP05_0080E, # Shifted from `DP05_0078E`
    total_housing_units = DP05_0088E, # Shifted from `DP05_0086E`
  )
raw_2023 <- raw_2023 %>%
  select(
    survey_year,
    geo_id = GEO_ID,
    name = NAME,
    total_population = DP05_0001E,
    total_male_population = DP05_0002E,
    total_female_population = DP05_0003E,
    sex_ratio = DP05_0004E,
    population_under_5 = DP05_0005E,
    population_under_18 = DP05_0019E,
    population_16_and_over = DP05_0020E,
    population_18_and_over = DP05_0021E,
    population_65_and_over = DP05_0024E,
    median_age = DP05_0018E,
    white_alone = DP05_0037E,
    black_alone = DP05_0038E,
    american_indian_alaska_native_alone = DP05_0039E,
    asian_alone = DP05_0047E, # Shifted from `DP05_0044E`
    native_hawaiian_pacific_islander_alone = DP05_0055E, # Shifted from `DP05_0052E`
    two_or_more_races = DP05_0035E,
    hispanic_latino = DP05_0076E, # Shifted from `DP05_0071E`, then from `DP05_0073E`
    not_hispanic_latino_white_alone = DP05_0082E, # Shifted from `DP05_0077E`, then from `DP05_0079E`
    not_hispanic_latino_black_alone = DP05_0083E, # Shifted from `DP05_0078E`, then from `DP05_0080E`
    total_housing_units = DP05_0091E, # Shifted from `DP05_0086E`, then from `DP05_0088E`
  )
raw_2024 <- raw_2024 %>%
  select(
    survey_year,
    geo_id = GEO_ID,
    name = NAME,
    total_population = DP05_0001E,
    total_male_population = DP05_0002E,
    total_female_population = DP05_0003E,
    sex_ratio = DP05_0004E,
    population_under_5 = DP05_0005E,
    population_under_18 = DP05_0019E,
    population_16_and_over = DP05_0020E,
    population_18_and_over = DP05_0021E,
    population_65_and_over = DP05_0024E,
    median_age = DP05_0018E,
    white_alone = DP05_0037E,
    black_alone = DP05_0045E, # Shifted from `DP05_0038E`
    american_indian_alaska_native_alone = DP05_0053E, # Shifted from `DP05_0039E`
    asian_alone = DP05_0061E, # Shifted from `DP05_0044E`, then from `DP05_0047E`
    native_hawaiian_pacific_islander_alone = DP05_0069E, # Shifted from `DP05_0052E`, then from `DP05_0055E`
    two_or_more_races = DP05_0035E,
    hispanic_latino = DP05_0090E, # Shifted from `DP05_0071E`, then from `DP05_0073E`, then from `DP05_0076E`
    not_hispanic_latino_white_alone = DP05_0096E, # Shifted from `DP05_0077E`, then from `DP05_0079E`, then from `DP05_0082E`
    not_hispanic_latino_black_alone = DP05_0097E, # Shifted from `DP05_0078E`, then from `DP05_0080E`, then from `DP05_0083E`
    total_housing_units = DP05_0105E, # Shifted from `DP05_0086E`, then from `DP05_0088E`, then from `DP05_0091E`
  )
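An alternative to five hand-written select() calls is a per-year crosswalk of new names to codes. The sketch below covers only three variables and two years, with codes copied from the select() calls above; code_map and select_by_map() are hypothetical names, and the toy values are made up.

```r
library(dplyr)
library(tibble)

# Crosswalk for three variables; codes are taken from the select() calls
# above, while the list structure itself is an assumption of this sketch.
code_map <- list(
  `2019` = c(
    total_population    = "DP05_0001E",
    hispanic_latino     = "DP05_0071E",
    total_housing_units = "DP05_0086E"
  ),
  `2024` = c(
    total_population    = "DP05_0001E",
    hispanic_latino     = "DP05_0090E",
    total_housing_units = "DP05_0105E"
  )
)

# Hypothetical helper: keep only the mapped codes, then rename new = old
select_by_map <- function(df, year) {
  codes <- code_map[[as.character(year)]]
  df %>%
    select(all_of(unname(codes))) %>%
    rename(all_of(codes))
}

# Toy 2024-style input with made-up values and one unneeded column
toy_2024 <- tibble(
  DP05_0001E = "100",
  DP05_0002E = "48",
  DP05_0090E = "12",
  DP05_0105E = "40"
)
selected_2024 <- select_by_map(toy_2024, 2024)
names(selected_2024)
```

With this layout the code shifts live in one table that can be reviewed side by side, and all_of() raises an error if a code is missing from a year's file instead of silently skipping it.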

Verify Column Selection

We confirm that all five dataframes now share identical column names and the expected dimensions before combining them.

identical(names(raw_2019), names(raw_2021)) &&
  identical(names(raw_2021), names(raw_2022)) &&
  identical(names(raw_2022), names(raw_2023)) &&
  identical(names(raw_2023), names(raw_2024))
## [1] TRUE
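# Dimensions are taken from the working copies (raw_*), since those are the
# objects that were trimmed and will be combined
tibble(
  year = c(2019, 2021, 2022, 2023, 2024),
  rows = c(
    nrow(raw_2019),
    nrow(raw_2021),
    nrow(raw_2022),
    nrow(raw_2023),
    nrow(raw_2024)
  ),
  cols = c(
    ncol(raw_2019),
    ncol(raw_2021),
    ncol(raw_2022),
    ncol(raw_2023),
    ncol(raw_2024)
  )
)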

Combining into a Single Dataset

With column names aligned across all five years, we stack them into a single longitudinal dataframe using bind_rows().

census_raw <- bind_rows(raw_2019, raw_2021, raw_2022, raw_2023, raw_2024)

glimpse(census_raw)
## Rows: 4,244
## Columns: 23
## $ survey_year                            <dbl> 2019, 2019, 2019, 2019, 2019, 2…
## $ geo_id                                 <chr> "0500000US01003", "0500000US010…
## $ name                                   <chr> "Baldwin County, Alabama", "Cal…
## $ total_population                       <chr> "223234", "113605", "83768", "7…
## $ total_male_population                  <chr> "109192", "54285", "40579", "35…
## $ total_female_population                <chr> "114042", "59320", "43189", "35…
## $ sex_ratio                              <chr> "95.7", "91.5", "94.0", "99.6",…
## $ population_under_5                     <chr> "10616", "6699", "5310", "4578"…
## $ population_under_18                    <chr> "46903", "24474", "18813", "174…
## $ population_16_and_over                 <chr> "183875", "92308", "66939", "56…
## $ population_18_and_over                 <chr> "176331", "89131", "64955", "54…
## $ population_65_and_over                 <chr> "47688", "20556", "15423", "119…
## $ median_age                             <chr> "43.0", "39.6", "41.9", "37.7",…
## $ white_alone                            <chr> "190912", "82323", "N", "59305"…
## $ black_alone                            <chr> "18338", "25226", "N", "688", "…
## $ american_indian_alaska_native_alone    <chr> "2428", "201", "N", "792", "204…
## $ asian_alone                            <chr> "2160", "225", "N", "17", "884"…
## $ native_hawaiian_pacific_islander_alone <chr> "0", "85", "N", "339", "0", "12…
## $ two_or_more_races                      <chr> "4711", "2359", "N", "2045", "5…
## $ hispanic_latino                        <chr> "10534", "4614", "3752", "10775…
## $ not_hispanic_latino_white_alone        <chr> "185180", "N", "N", "N", "59236…
## $ not_hispanic_latino_black_alone        <chr> "18338", "N", "N", "N", "17768"…
## $ total_housing_units                    <chr> "119425", "53809", "38256", "31…

Special Value Replacement and Type Conversion

The ACS files use several placeholder values in numeric columns to indicate suppressed or unavailable estimates. Based on the Census Bureau’s official table notes, these include "N" (insufficient sample cases), "(X)" (not applicable), and "-" (estimate could not be computed), as well as empty strings and numeric placeholders like -888888888 and -666666666. We replace all of these with NA before converting numeric columns from character to double.

The geo_id and name columns are excluded from conversion since they are intentionally character.

non_numeric_cols <- c("geo_id", "name")

cols_to_convert <- census_raw %>%
  select(-all_of(non_numeric_cols)) %>%
  select(where(is.character)) %>%
  names()

census_clean <- census_raw %>%
  mutate(across(
    all_of(cols_to_convert),
    ~ ifelse(. %in% c("N", "(X)", "-", "", "NA", "-888888888", "-666666666"), NA, .)
  )) %>%
  mutate(across(
    all_of(cols_to_convert),
    as.numeric
  ))
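The order of the two steps matters. Letter codes such as "N" would become NA under as.numeric() anyway (with a coercion warning), but a numeric placeholder like "-888888888" would convert silently into a large negative number. A minimal base R sketch on a toy vector, with hypothetical object names:

```r
# Toy column mixing real estimates with the placeholder codes listed above
raw_values <- c("223234", "N", "(X)", "-", "", "-888888888", "43.0")
special_codes <- c("N", "(X)", "-", "", "NA", "-888888888", "-666666666")

# Replace placeholders first, then convert; nothing is left to coerce badly
cleaned <- raw_values
cleaned[cleaned %in% special_codes] <- NA
converted <- as.numeric(cleaned)
converted
```

Only the two genuine estimates survive as numbers; every placeholder ends up as NA rather than as a misleading value.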

Verify Type Conversion

census_clean %>%
  select(-geo_id, -name) %>%
  summarise(across(everything(), class)) %>%
  pivot_longer(everything(), names_to = "column", values_to = "type")
census_clean %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "column", values_to = "na_count") %>%
  filter(na_count > 0) %>%
  arrange(desc(na_count))

The missingness pattern is concentrated in racial composition columns. The 76 missing values shared across all racial breakdown columns represent county-year observations where the Census Bureau suppressed the full racial profile, likely due to small sample sizes. The not_hispanic_latino_white_alone and not_hispanic_latino_black_alone columns carry additional suppression (180 each), suggesting the Hispanic/non-Hispanic ethnicity breakdown was withheld from a broader set of counties. Core variables including total_population, median_age, sex_ratio, and total_housing_units are fully complete across all 4,244 observations.

Geographic Parsing

The name column contains values formatted as "County Name, State Name". We split this into two separate columns and immediately normalize county_name by stripping geographic suffixes (County, Parish, Borough, Municipality, and Census Area) that vary across records but carry no analytical value.

census_clean <- census_clean %>%
  separate(name, into = c("county_name", "state_name"), sep = ", ") %>%
  mutate(county_name = str_remove(county_name, " County| Parish| Borough| Municipality| Census Area"))
census_clean %>%
  select(county_name, state_name) %>%
  distinct() %>%
  arrange(state_name) %>%
  head(15)
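The split-and-strip step can be seen on a few toy rows. This sketch differs from the chunk above in one respect, which is an assumption rather than the notebook's method: the pattern is grouped and anchored with $ so that only a trailing suffix is removed.

```r
library(tibble)
library(tidyr)
library(dplyr)
library(stringr)

# Toy input in the "County Name, State Name" format
toy_names <- tibble(name = c(
  "Baldwin County, Alabama",
  "Caddo Parish, Louisiana",
  "Fairbanks North Star Borough, Alaska"
))

parsed <- toy_names %>%
  separate(name, into = c("county_name", "state_name"), sep = ", ") %>%
  # Anchored variant (assumption of this sketch): strip the suffix only
  # when it is the last word(s) of the county name
  mutate(county_name = str_remove(
    county_name,
    " (County|Parish|Borough|Municipality|Census Area)$"
  ))
parsed
```

Anchoring guards against a suffix word appearing in the middle of a name, at the cost of a slightly less readable pattern.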

Regional Classification

We assign each state to one of the four U.S. Census Bureau regions and nine divisions using case_when(). Puerto Rico is classified separately as a Territory since it falls outside of the standard regional framework.

census_clean <- census_clean %>%
  mutate(
    region = case_when(
      state_name %in% c("Connecticut", "Maine", "Massachusetts", "New Hampshire", "Rhode Island", "Vermont", "New Jersey", "New York", "Pennsylvania")
      ~ "Northeast",
      state_name %in% c("Illinois", "Indiana", "Michigan", "Ohio", "Wisconsin", "Iowa", "Kansas", "Minnesota", "Missouri", "Nebraska", "North Dakota", "South Dakota") ~ "Midwest",
      state_name %in% c("Delaware", "Florida", "Georgia", "Maryland", "North Carolina", "South Carolina", "Virginia", "District of Columbia", "West Virginia", "Alabama", "Kentucky", "Mississippi", "Tennessee", "Arkansas", "Louisiana", "Oklahoma", "Texas") ~ "South",
      state_name %in% c("Arizona", "Colorado", "Idaho", "Montana", "Nevada", "New Mexico", "Utah", "Wyoming", "Alaska", "California", "Hawaii", "Oregon", "Washington") ~ "West",
      state_name == "Puerto Rico" ~ "Territory",
      TRUE ~ NA_character_
    ),
    division = case_when(
      state_name %in% c("Connecticut", "Maine", "Massachusetts", "New Hampshire", "Rhode Island", "Vermont") ~ "New England",
      state_name %in% c("New Jersey", "New York", "Pennsylvania") ~ "Middle Atlantic",
      state_name %in% c("Illinois", "Indiana", "Michigan", "Ohio", "Wisconsin") ~ "East North Central",
      state_name %in% c("Iowa", "Kansas", "Minnesota", "Missouri", "Nebraska", "North Dakota", "South Dakota") ~ "West North Central",
      state_name %in% c("Delaware", "Florida", "Georgia", "Maryland", "North Carolina", "South Carolina", "Virginia", "District of Columbia", "West Virginia") ~ "South Atlantic",
      state_name %in% c("Alabama", "Kentucky", "Mississippi", "Tennessee") ~ "East South Central",
      state_name %in% c("Arkansas", "Louisiana", "Oklahoma", "Texas") ~ "West South Central",
      state_name %in% c("Arizona", "Colorado", "Idaho", "Montana", "Nevada", "New Mexico", "Utah", "Wyoming") ~ "Mountain",
      state_name %in% c("Alaska", "California", "Hawaii", "Oregon", "Washington") ~ "Pacific",
      state_name == "Puerto Rico" ~ "Territory",
      TRUE ~ NA_character_
    )
  )
census_clean %>%
  distinct(state_name, region, division) %>%
  arrange(region, division, state_name)
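The same classification could be built as a lookup table from the state.name, state.region, and state.division vectors that ship with base R. This is a sketch of an alternative, not what the notebook does: the built-in data covers only the 50 states and labels the Midwest as "North Central", so both gaps are patched by hand here, and state_lookup and extras are hypothetical names.

```r
# Base R ships state.name, state.region and state.division (50 states only)
state_lookup <- data.frame(
  state_name = state.name,
  region     = as.character(state.region),
  division   = as.character(state.division),
  stringsAsFactors = FALSE
)

# The built-in data uses the older label "North Central" for the Midwest
state_lookup$region[state_lookup$region == "North Central"] <- "Midwest"

# District of Columbia and Puerto Rico are not in the built-in vectors,
# so they are added with the same labels used in the case_when() above
extras <- data.frame(
  state_name = c("District of Columbia", "Puerto Rico"),
  region     = c("South", "Territory"),
  division   = c("South Atlantic", "Territory"),
  stringsAsFactors = FALSE
)
state_lookup <- rbind(state_lookup, extras)

state_lookup[state_lookup$state_name %in% c("Ohio", "Texas", "Puerto Rico"), ]
```

A table like this would be attached with a left join on state_name, which keeps the state lists out of the transformation code.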

Exploratory Data Analysis

Before calculating any derived variables, we explore the combined dataset to understand its structure, distributions, and patterns. Each exploration informs the decisions we make in the next phase.

Population Distribution

County populations in this dataset are highly skewed: a small number of very large counties coexist with a large number of small ones. We look at the overall distribution first, then break it down by region.

census_clean %>%
  filter(survey_year == 2024) %>%
  ggplot(aes(x = total_population))+
  geom_histogram(bins = 50, fill = "steelblue", color = "white")+
  scale_x_continuous(labels = comma)+
  labs(
    title = "Distribution of County Population (2024)",
    subtitle = "Each bin represents a range of county population sizes",
    x = "Total Population",
    y = "Number of Counties",
    caption = "Source: ACS DP05 1-Year Estimates"
  )

The raw distribution makes the skew immediately visible. The vast majority of counties in the dataset cluster below 500,000 residents, while a handful of very large counties (Los Angeles, Cook, Harris) stretch the x-axis out to nearly 10 million. This kind of right skew is typical of county population data in the United States and has an important implication for analysis: any regional average that treats counties as equal units will be dominated by the largest ones.

census_clean %>%
  filter(survey_year == 2024) %>%
  ggplot(aes(x = total_population))+
  geom_histogram(bins = 50, fill = "steelblue", color = "white")+
  scale_x_log10(labels = comma)+
  labs(
    title = "Distribution of County Population on Log Scale (2024)",
    subtitle = "Log scale reveals the full range of county sizes more clearly",
    x = "Total Population (Log Scale)",
    y = "Number of Counties",
    caption = "Source: ACS DP05 1-Year Estimates"
  )

On the log scale the distribution becomes much more informative. County populations are spread fairly continuously from roughly 65,000 (the ACS 1-Year threshold) up through the millions, with the bulk of counties falling between 100,000 and 500,000 residents. The log scale will be the appropriate choice whenever we plot population on an axis going forward.

census_clean %>%
  filter(survey_year == 2024) %>%
  summarise(
    min = min(total_population, na.rm = TRUE),
    q25 = quantile(total_population, 0.25, na.rm = TRUE),
    median = median(total_population, na.rm = TRUE),
    mean = mean(total_population, na.rm = TRUE),
    q75 = quantile(total_population, 0.75, na.rm = TRUE),
    max = max(total_population, na.rm = TRUE)
  )

The summary confirms the skew numerically. The median county population in 2024 is 163,839 while the mean is 340,954 (more than double). This gap between median and mean is the fingerprint of a right-skewed distribution and tells us that a small number of very large counties are pulling the average upward significantly.
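The size of the gap can be checked directly from the two figures quoted in the summary. A short sketch, with hypothetical variable names:

```r
# Figures quoted in the 2024 summary above
median_pop <- 163839
mean_pop   <- 340954

# Ratio of mean to median; values well above 1 indicate right skew
ratio <- mean_pop / median_pop
round(ratio, 2)
```

In a symmetric distribution this ratio would sit close to 1.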

Median Age Distribution

census_clean %>%
  filter(survey_year == 2024) %>%
  ggplot(aes(x = median_age))+
  geom_histogram(bins = 40, fill = "steelblue", color = "white")+
  labs(
    title = "Distribution of County Median Age (2024)",
    subtitle = "Each bin represents a range of median ages across counties",
    x = "Median Age",
    y = "Number of Counties",
    caption = "Source: ACS DP05 1-Year Estimates"
  )

The overall distribution of county median age is approximately normal and centered around 40 to 42 years, with a right tail extending toward counties with median ages above 55. These high-age outliers likely represent retirement communities or rural counties experiencing sustained outmigration of younger residents.

census_clean %>%
  filter(survey_year == 2024) %>%
  ggplot(aes(x = median_age, fill = region))+
  geom_density(alpha = 0.4)+
  labs(
    title = "Median Age Distribution by Region (2024)",
    subtitle = "Density curves show how age distributions differ across Census regions",
    x = "Median Age",
    y = "Density",
    fill = "Region",
    caption = "Source: ACS DP05 1-Year Estimates"
  )

The regional density plot reveals a pattern that raw averages would obscure. Puerto Rico stands apart with a narrow, sharply peaked distribution centered around 47 to 48 years (its counties cluster tightly around an older median age than any continental region and show far less internal variation). The continental regions overlap considerably in the 37 to 44 range, though the Northeast skews slightly older and the South and West show broader, flatter distributions suggesting more internal diversity in age structure.

census_clean %>%
  filter(survey_year == 2024) %>%
  group_by(region) %>%
  summarise(
    min = min(median_age, na.rm = TRUE),
    q25 = quantile(median_age, 0.25, na.rm = TRUE),
    median = median(median_age, na.rm = TRUE),
    mean = mean(median_age, na.rm = TRUE),
    q75 = quantile(median_age, 0.75, na.rm = TRUE),
    max = max(median_age, na.rm = TRUE)
  ) %>%
  arrange(desc(median))

The summary confirms Puerto Rico as the oldest region by median (44.9) despite being a single territory rather than a collection of diverse states. Among continental regions, the Northeast leads at 41.9, while the South and West both sit at 39.0. The maximum of 68.0 in the South is a notable outlier worth identifying.

census_clean %>%
  filter(survey_year == 2024, region == "South") %>%
  arrange(desc(median_age)) %>%
  select(county_name, state_name, median_age, total_population) %>%
  head(5)

Sumter County, Florida emerges as the clear outlier at a median age of 68.0 with a total population of 154,693. This is not a data quality issue: Sumter County is home to The Villages, one of the largest planned retirement communities in the United States, and its age structure reflects that reality. This county will appear repeatedly as an outlier in age-related analyses throughout this notebook and should be interpreted in that context rather than treated as an anomaly to be removed.

Sex Ratio Distributions

Sex ratio is expressed as the number of males per 100 females. A perfectly balanced county would show a value of 100. We first look at the overall distribution, then identify where the extremes are concentrated.

census_clean %>%
  filter(survey_year == 2024) %>%
  ggplot(aes(x = sex_ratio))+
  geom_histogram(bins = 40, fill = "steelblue", color = "white")+
  geom_vline(xintercept = 100, linetype = "dashed", color = "gray40")+
  labs(
    title = "Distribution of County Sex Ratios (2024)",
    subtitle = "Dashed line marks a balanced ratio of 100 males per 100 females",
    x = "Sex Ratio (Males per 100 Females)",
    y = "Number of Counties",
    caption = "Source: ACS DP05 1-Year Estimates"
  )

The distribution of county sex ratios is approximately normal and centered just below 100, meaning the typical county has slightly more females than males. The median of 97.8 and mean of 97.99 are nearly identical, confirming the distribution is symmetric for the majority of counties. The dashed line at 100 falls near the peak but slightly to the right, making the female-heavy skew of most counties visible.

census_clean %>%
  filter(survey_year == 2024) %>%
  summarise(
    min = min(sex_ratio, na.rm = TRUE),
    q25 = quantile(sex_ratio, 0.25, na.rm = TRUE),
    median = median(sex_ratio, na.rm = TRUE),
    mean = mean(sex_ratio, na.rm = TRUE),
    q75 = quantile(sex_ratio, 0.75, na.rm = TRUE),
    max = max(sex_ratio, na.rm = TRUE)
  )
census_clean %>%
  filter(survey_year == 2024) %>%
  mutate(
    q1 = quantile(sex_ratio, 0.25, na.rm = TRUE),
    q3 = quantile(sex_ratio, 0.75, na.rm = TRUE),
    iqr = q3 - q1,
    lower = q1 - 1.5 * iqr,
    upper = q3 + 1.5 * iqr
  ) %>%
  filter(sex_ratio < lower | sex_ratio > upper) %>%
  select(county_name, state_name, region, sex_ratio, total_population) %>%
  arrange(desc(sex_ratio))

The 34 outlier counties identified by the 1.5 * IQR rule fall into two distinct and interpretable groups. The high-ratio outliers, led by Walker County, Texas (134.3), Fairbanks North Star Borough, Alaska (126.5), Kings County, California (124.2), and Onslow County, North Carolina (121.8), are almost entirely explained by institutional populations. Walker and Kings counties host large state and federal prisons, while Fairbanks North Star and Onslow are home to major military installations. These are not demographic anomalies; they are accurate reflections of the population living in those counties at the time of the survey.

The low-ratio outliers tell a different story. Three Puerto Rico municipios appear at the bottom of the distribution (82.6-85.0), consistent with the broader pattern of male outmigration from the island that has accelerated since Hurricane Maria in 2017. Baltimore City (86.4) and several majority-Black counties in the South follow, a pattern that research has consistently linked to higher rates of male incarceration and premature mortality in those communities. Both the high and low extremes underscore that sex ratio outliers are rarely random; they reflect structural conditions worth investigating further.
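For readers unfamiliar with the 1.5 * IQR rule used to flag these counties, the mechanics can be traced on a toy vector. The values below are hypothetical and chosen only so that one of them falls outside the fences.

```r
# Toy sex ratios (hypothetical values) with one extreme, male-heavy county
sex_ratio <- c(95, 96, 97, 98, 99, 100, 101, 134)

# Quartiles and interquartile range
q1  <- quantile(sex_ratio, 0.25, names = FALSE)
q3  <- quantile(sex_ratio, 0.75, names = FALSE)
iqr <- q3 - q1

# Fences sit 1.5 IQRs below the first and above the third quartile
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

# Anything outside the fences is flagged
outliers <- sex_ratio[sex_ratio < lower | sex_ratio > upper]
outliers
```

Because the fences are built from quartiles rather than from the mean and standard deviation, the extreme value has little influence on where they fall.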

Missing Data by Region and Year

Earlier we identified that missingness is concentrated in racial composition columns. Here we explore whether that suppression is randomly distributed across regions and years or whether it follows a pattern.

census_clean %>%
  group_by(survey_year) %>%
  summarise(across(
    c(white_alone, black_alone, asian_alone, native_hawaiian_pacific_islander_alone, american_indian_alaska_native_alone, two_or_more_races), ~ sum(is.na(.))
  )) %>%
  pivot_longer(-survey_year, names_to = "column", values_to = "na_count") %>%
  ggplot(aes(x = survey_year, y = na_count, color = column))+
  geom_line(linewidth = 1)+
  geom_point(size = 2)+
  scale_x_continuous(breaks = c(2019, 2021, 2022, 2023, 2024))+
  labs(
    title = "Missing Values in Racial Composition Columns by Year",
    subtitle = "Tracking Census suppression patterns across five survey years",
    x = "Survey Year",
    y = "Number of Missing Values",
    color = "Column",
    caption = "Source: ACS DP05 1-Year Estimates"
  )

Two patterns emerge clearly from this exploration. First, suppression is heavily concentrated in 2019. The by-year chart shows roughly 60 missing values in white_alone and two_or_more_races in 2019, dropping sharply to single digits in every subsequent year. This is directly linked to the Census Bureau’s redesign of the race and ethnicity question following the 2020 Census: the updated question captured racial identity more completely, reducing the number of records where racial composition could not be reported.

census_clean %>%
  group_by(region) %>%
  summarise(across(
    c(white_alone, black_alone, asian_alone, native_hawaiian_pacific_islander_alone, american_indian_alaska_native_alone, two_or_more_races), ~ sum(is.na(.))
  )) %>%
  pivot_longer(-region, names_to = "column", values_to = "na_count") %>%
  ggplot(aes(x = region, y = na_count, fill = column))+
  geom_col(position = "dodge")+
  labs(
    title = "Missing Values in Racial Composition Columns by Region",
    subtitle = "Suppression patterns across Census regions",
    x = "Region",
    y = "Number of Missing Values",
    fill = "Column",
    caption = "Source: ACS DP05 1-Year Estimates"
  )

Second, the South accounts for the largest share of suppression by region, with approximately 48 missing values across racial columns, roughly half the total. This is proportional rather than structural: the South simply contains more counties than any other region (355 to 368 depending on the year), so it naturally accumulates more instances of suppression even when its suppression rate is lower than that of other regions.

census_clean %>%
  group_by(region, survey_year) %>%
  summarise(
    missing_any_race = sum(is.na(white_alone) | is.na(black_alone) | is.na(asian_alone) | is.na(two_or_more_races)),
    total_counties = n(),
    pct_missing = round((missing_any_race / total_counties) * 100, 1),
    .groups = "drop"
  ) %>%
  arrange(desc(pct_missing))

The percent table confirms that suppression rates are low across the board. The highest rate observed is 18.2% for the Territory in 2019, which represents just 2 of 11 Puerto Rico municipios, a small absolute count that looks large only because the Territory has so few counties. Among continental regions, the South in 2019 at 9.9% is the only case approaching a meaningful suppression rate, and it drops to under 1% in every subsequent year. Given this pattern, missingness in this dataset is not randomly distributed: it is concentrated in 2019 and, in that year, disproportionately affects the South. Its overall scale, however, is small enough that it will not materially affect regional or longitudinal analysis.

County Coverage by Year and Region

The ACS 1-Year survey only covers counties with populations of 65,000 or more. As the U.S. population grows and more counties cross that threshold, coverage expands over time. Here we look at how many counties are represented in each region per year to understand the shape of our dataset before moving into analysis.

census_clean %>%
  group_by(survey_year, region) %>%
  summarise(county_count = n(), .groups = "drop") %>%
  ggplot(aes(x = survey_year, y = county_count, color = region))+
  geom_line(linewidth = 1)+
  geom_point(size = 2)+
  scale_x_continuous(breaks = c(2019, 2021, 2022, 2023, 2024)) +
  labs(
    title = "County Coverage by Region and Year",
    subtitle = "Number of counties included in the ACS 1-year survey per region",
    x = "Survey Year",
    y = "Number of Counties",
    color = "Region",
    caption = "Source: ACS DP05 1-Year Estimates"
  )

census_clean %>%
  group_by(survey_year, region) %>%
  summarise(county_count = n(), .groups = "drop") %>%
  pivot_wider(names_from = survey_year, values_from = county_count) %>%
  arrange(region)

Coverage is stable across all five years with modest growth in most regions. The South is the largest region throughout, growing from 355 to 368 counties between 2019 and 2024 and accounting for roughly 43% of all county-year observations in the dataset. The Midwest and West hold relatively steady while the Northeast shows a slight dip in 2021, dropping from 137 to 135 counties, before recovering to 138 by 2023. This temporary contraction likely reflects a small number of counties falling below the 65,000 population threshold during the pandemic, when urban population loss was documented in several Northeastern metros.

The Territory remains flat at 11 municipios across all five years, as Puerto Rico’s administrative geography is fixed. This consistency is useful analytically: any changes we observe in Territory-level figures over time reflect genuine demographic shifts rather than changes in coverage.

This coverage table serves as an important reference for interpreting regional comparisons throughout the rest of the analysis. Regions with more counties will naturally produce more stable regional aggregates, while the Territory’s small count means its regional figures are more sensitive to individual municipio-level changes.

Derived Variables

With the dataset cleaned and explored, we now calculate the variables needed for deeper analysis. Each derived variable is validated immediately after creation before moving to the next.

Senior Population Share

Senior population share measures the percentage of a county’s total population that is 65 years or older. Counties are then classified into three tiers based on that share: Aging (20% or more), Transitional (15% to under 20%), and Young (under 15%).

census_clean <- census_clean %>%
  mutate(
    senior_pop_share = (population_65_and_over / total_population) * 100, 
    senior_tier = case_when(
      senior_pop_share >= 20 ~ "Aging",
      senior_pop_share >= 15 ~ "Transitional",
      senior_pop_share < 15 ~ "Young" # a missing share stays NA instead of defaulting to "Young"
    )
  )
census_clean %>%
  select(county_name, state_name, survey_year, total_population, population_65_and_over, senior_pop_share, senior_tier) %>%
  arrange(desc(senior_pop_share)) %>%
  head(10)
census_clean %>%
  count(senior_tier) %>%
  mutate(pct = round(n / sum(n) * 100, 1))

Sumter County, Florida appears in all five survey years at the top of the senior share ranking, with values consistently above 56%, meaning more than half of its population is 65 or older in every year of the study. Charlotte County, Florida follows at roughly 40% across all years, another established retirement destination on Florida’s Gulf Coast. The dominance of Florida counties at the top of this ranking is consistent with decades of retiree migration to the state and will appear as a recurring pattern throughout the age-related analyses.

The tier distribution splits into roughly half Transitional (50.8%), a quarter Aging (25.7%), and a quarter Young (23.6%). This means that as of the years covered in this study, one in four county-year observations already meets the Aging threshold of 20% or more seniors, a striking share that reflects the broader national trend of population aging driven by the Baby Boomer generation moving into retirement age.
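One detail of the tier logic is worth checking. case_when() sends any row that matches no earlier condition to the TRUE branch, and that includes rows where senior_pop_share is NA. The sketch below uses a small made-up tibble (toy_shares is hypothetical, not part of census_clean) to show the behavior next to a guarded variant that follows the is.na() pattern used for diversity_tier elsewhere in this document. Whether census_clean actually contains missing shares is not established here.

```r
library(dplyr)

# Hypothetical toy data: one share per tier plus a missing value
toy_shares <- tibble(
  county = c("A", "B", "C", "D"),
  senior_pop_share = c(24.1, 17.3, 12.8, NA)
)

toy_tiers <- toy_shares %>%
  mutate(
    # Same ordering as the tier logic above: NA matches nothing and lands in TRUE
    tier_unguarded = case_when(
      senior_pop_share >= 20 ~ "Aging",
      senior_pop_share >= 15 ~ "Transitional",
      TRUE ~ "Young"
    ),
    # Guarded variant: missing shares stay missing
    tier_guarded = case_when(
      is.na(senior_pop_share) ~ NA_character_,
      senior_pop_share >= 20 ~ "Aging",
      senior_pop_share >= 15 ~ "Transitional",
      TRUE ~ "Young"
    )
  )

toy_tiers
```

If the real data has no missing shares the two columns are identical, so the guard costs nothing and only matters when a denominator or numerator is NA.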

Working Age and Youth Population Share

Working age population share measures the percentage of the total population between 18 and 64 years, derived by subtracting the 65-and-over population from the 18-and-over population. Youth population share measures the percentage under 18.

census_clean <- census_clean %>%
  mutate(
    working_age_pop_share = ((population_18_and_over - population_65_and_over) / total_population) * 100,
    youth_pop_share = (population_under_18 / total_population) * 100
  )
census_clean %>%
  select(county_name, state_name, survey_year, total_population, senior_pop_share, working_age_pop_share, youth_pop_share) %>%
  mutate(check_sum = round(senior_pop_share + working_age_pop_share + youth_pop_share, 1)) %>%
  arrange(desc(check_sum)) %>%
  head(10)

The validation check confirms that senior, working age, and youth population shares sum to exactly 100 for every county-year observation, meaning the three variables together account for the full population with no gaps or overlaps. Working age share dominates across all counties as expected, typically ranging between 57% and 66%, while senior and youth shares compete for the remaining portions in ways that reflect each county’s demographic character.
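Because the check above sorts check_sum in descending order and prints ten rows, it shows the largest sums only. A summary of the full range is one way to cover every row at once. The sketch below runs on a made-up tibble (toy_pop is hypothetical) and the 0.05 tolerance is an assumption chosen to match rounding to one decimal place. Algebraically the three shares add up to (under 18 + 18 and over) / total, so the check really tests whether those two counts add up to total_population.

```r
library(dplyr)

# Hypothetical toy data standing in for census_clean
toy_pop <- tibble(
  total_population = c(100000, 250000, 80000),
  population_under_18 = c(22000, 60000, 15000),
  population_18_and_over = c(78000, 190000, 65000),
  population_65_and_over = c(18000, 35000, 20000)
)

share_check <- toy_pop %>%
  mutate(
    senior = population_65_and_over / total_population * 100,
    working = (population_18_and_over - population_65_and_over) / total_population * 100,
    youth = population_under_18 / total_population * 100,
    check_sum = senior + working + youth
  ) %>%
  summarise(
    min_sum = min(check_sum, na.rm = TRUE),   # smallest total across all rows
    max_sum = max(check_sum, na.rm = TRUE),   # largest total across all rows
    n_off = sum(abs(check_sum - 100) > 0.05, na.rm = TRUE),  # rows outside tolerance
    n_missing = sum(is.na(check_sum))         # rows that could not be checked
  )

share_check
```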

Population Change Tier

To classify counties by population change we need to compare each county’s 2019 and 2024 total population directly. We pivot to wide format so both years sit side by side, calculate the percentage change, classify into tiers, and then join the result back to census_clean.

pop_wide <- census_clean %>%
  filter(survey_year %in% c(2019, 2024)) %>%
  select(geo_id, survey_year, total_population) %>%
  pivot_wider(names_from = survey_year, values_from = total_population) %>%
  rename(pop_2019 = `2019`, pop_2024 = `2024`) %>%
  drop_na() %>%
  mutate(
    pct_change = ((pop_2024 - pop_2019) / pop_2019) * 100,
    pop_change_tier = case_when(
      pct_change > 10 ~ "High Growth",
      pct_change >= 2 ~ "Moderate Growth",
      pct_change >= -2 ~ "Stable",
      TRUE ~ "Decline"
    )
  )
census_clean <- census_clean %>%
  left_join(
    pop_wide %>% select(geo_id, pct_change, pop_change_tier),
    by = "geo_id"
  )
census_clean %>%
  filter(survey_year == 2024) %>%
  count(pop_change_tier) %>%
  mutate(pct = round(n / sum(n) * 100, 1)) %>%
  arrange(desc(n))
pop_wide %>%
  left_join(
    census_clean %>%
      filter(survey_year == 2024) %>%
      select(geo_id, county_name, state_name),
    by = "geo_id"
  ) %>%
  select(county_name, state_name, pop_2019, pop_2024, pct_change, pop_change_tier) %>%
  arrange(desc(pct_change)) %>%
  slice(c(1:5, (nrow(.)-4):nrow(.)))

The tier distribution shows that nearly half of all counties (48.8%) fall into Moderate Growth, with Stable counties accounting for another 25.9%. Only 8% of counties show outright Decline, though that still represents 69 counties where population contracted by more than 2% over the five-year window. The 35 NA values are counties that appear in the 2024 survey but have no matching 2019 estimate (typically smaller counties that crossed the 65,000 threshold after 2019), so they cannot be assigned a change tier.

The fastest growing counties are concentrated entirely in Texas, led by Kaufman County at 45.3% growth between 2019 and 2024. Rockwall, Liberty, Bastrop, and Comal counties round out the top five. All five are exurban counties on the edges of the Dallas-Fort Worth, Houston, Austin, and San Antonio metropolitan areas, absorbing population spillover from those rapidly expanding metros.

The steepest declines tell a different story. Apache County, Arizona (-9.9%) is a large rural county that includes part of the Navajo Nation and has faced sustained outmigration. Robeson County, North Carolina (-9.2%) and Hinds County, Mississippi (-8.6%) are majority-minority counties in the South experiencing long-term population loss. Orleans Parish, Louisiana (-7.0%) continues its slow recovery from the population displacement caused by Hurricane Katrina nearly two decades ago. Toa Alta Municipio in Puerto Rico (-8.2%) reflects the broader outmigration from the island that has accelerated since Hurricane Maria in 2017.
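To see which counties lose their change tier and why, it can help to inspect the wide table before drop_na() removes the incomplete rows. The sketch below is a minimal illustration on made-up data (toy_long and its geo_id values are hypothetical); it uses names_prefix in pivot_wider() instead of the rename() step above, which is a stylistic swap rather than the method used in this analysis.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: county "C2" is missing in 2024, "C3" is missing in 2019
toy_long <- tibble(
  geo_id = c("C1", "C1", "C2", "C3"),
  survey_year = c(2019, 2024, 2019, 2024),
  total_population = c(100000, 112000, 70000, 66000)
)

toy_wide <- toy_long %>%
  pivot_wider(
    names_from = survey_year,
    values_from = total_population,
    names_prefix = "pop_"
  )

# Counties that drop_na() would remove, and which year is missing
toy_unmatched <- toy_wide %>%
  filter(is.na(pop_2019) | is.na(pop_2024)) %>%
  mutate(missing_year = if_else(is.na(pop_2019), "2019", "2024"))

# Percent change for the counties observed in both years
toy_change <- toy_wide %>%
  drop_na() %>%
  mutate(pct_change = (pop_2024 - pop_2019) / pop_2019 * 100)

toy_unmatched
toy_change
```

In this toy case only C3 would show up as NA in a table filtered to 2024, because C2 has no 2024 row to begin with.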

Racial Diversity Index

The Racial Diversity Index measures how evenly a county’s population is distributed across racial groups. It is calculated using a Herfindahl-based formula: 1 minus the sum of each racial group’s squared proportion of the total population. A score of 0 indicates complete homogeneity (the entire population belongs to one racial group) while a score approaching 1 indicates perfect diversity across many equally sized groups.

We calculate proportions for the six racial categories available in the dataset. Hispanic or Latino is an ethnicity designation rather than a racial category in Census methodology and is therefore excluded from this calculation to avoid double-counting with the racial groups.

census_clean <- census_clean %>%
  mutate(
    prop_white = white_alone/total_population,
    prop_black = black_alone/total_population,
    prop_aian = american_indian_alaska_native_alone/total_population,
    prop_asian = asian_alone/total_population,
    prop_nhpi = native_hawaiian_pacific_islander_alone/total_population,
    prop_two_or_more = two_or_more_races/total_population
  )
census_clean <- census_clean %>%
  mutate(
    diversity_index = 1 - (prop_white^2 + prop_black^2 + prop_aian^2 + prop_asian^2 + prop_nhpi^2 + prop_two_or_more^2)
  )
census_clean %>%
  summarise(
    min = min(diversity_index, na.rm = TRUE),
    max = max(diversity_index, na.rm = TRUE),
    mean = mean(diversity_index, na.rm = TRUE),
    n_negative = sum(diversity_index < 0, na.rm = TRUE),
    n_over_1 = sum(diversity_index > 1, na.rm = TRUE)
  )

The diversity index ranges from 0.069 to 0.983 with a mean of 0.426, and crucially contains no negative values or values above 1. This is a cleaner result than our original analysis, where median imputation introduced inflated racial counts that pushed proportions above 1 and produced impossible negative index values for 72 county-year observations. Since we have not applied median imputation in this version of the analysis, the index is mathematically valid across all rows.
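A further sanity check follows from the formula itself. When six groups together account for everyone, the sum of squared proportions cannot fall below 1/6, so the index cannot exceed 1 - 1/6, or about 0.833. A value above that can only occur when the six proportions sum to less than 1. The sketch below demonstrates this with made-up proportions (diversity_index_from_props and partial_props are hypothetical names, not part of the analysis). Why the observed maximum of 0.983 exceeds the bound is not established here; one possibility, stated as an assumption, is that some residents are counted in a race category outside the six columns used.

```r
# Diversity index as defined above: 1 minus the sum of squared group proportions
diversity_index_from_props <- function(props) {
  1 - sum(props^2)
}

# One group holds everyone: index is 0
homogeneous <- diversity_index_from_props(c(1, 0, 0, 0, 0, 0))

# Six equally sized groups: the largest value six exhaustive groups can produce
six_equal <- diversity_index_from_props(rep(1 / 6, 6))

# Hypothetical county where the six groups cover only 30% of residents
partial_props <- c(0.15, 0.05, 0.01, 0.01, 0.00, 0.08)
partial_cover <- diversity_index_from_props(partial_props)

c(
  homogeneous = homogeneous,
  six_equal = six_equal,
  partial_cover = partial_cover,
  coverage = sum(partial_props)
)
```

On the real data, summing the six prop_ columns per row and looking at the minimum of that sum would show whether low coverage is what drives the highest index values.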

div_cutoffs <- quantile(census_clean$diversity_index, probs = c(0.33, 0.66), na.rm = TRUE)
div_cutoffs
##       33%       66% 
## 0.3218009 0.5161755
census_clean <- census_clean %>%
  mutate(diversity_tier = case_when(
    is.na(diversity_index) ~ NA_character_,
    diversity_index <= div_cutoffs[1] ~ "Low",
    diversity_index <= div_cutoffs[2] ~ "Moderate",
    TRUE ~ "High"
  ))
census_clean %>%
  count(diversity_tier) %>%
  mutate(pct = round(n / sum(n) * 100, 1)) %>%
  arrange(desc(n))
bind_rows(
  census_clean %>%
    filter(survey_year == 2024) %>%
    arrange(desc(diversity_index)) %>%
    head(5),
  census_clean %>%
    filter(survey_year == 2024) %>%
    arrange(diversity_index) %>%
    head(5)
) %>%
  select(county_name, state_name, diversity_index, diversity_tier, total_population)

The quantile cutoffs fall at 0.322 for the 33rd percentile and 0.516 for the 66th percentile, producing a well-balanced tier distribution of 33.4% High, 32.4% Moderate, and 32.4% Low diversity. The remaining 1.8% of observations (76 records) could not be assigned a tier because their diversity index is NA, reflecting county-year observations where racial composition data was fully suppressed by the Census Bureau.
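The same three-way split can also be written with cut(), which returns an ordered set of factor levels that plotting code can reuse for legend order. The sketch below compares the two approaches on made-up values (toy_div and toy_cutoffs are hypothetical); it is an alternative formulation, not the code used to produce the tiers reported above.

```r
library(dplyr)

# Hypothetical toy index values, including one missing value
toy_div <- tibble(
  diversity_index = c(0.10, 0.22, 0.31, 0.35, 0.48, 0.52, 0.60, 0.71, 0.86, NA)
)

toy_cutoffs <- quantile(toy_div$diversity_index, probs = c(0.33, 0.66), na.rm = TRUE)

toy_div <- toy_div %>%
  mutate(
    # Same comparisons as the case_when() logic above
    tier_case_when = case_when(
      is.na(diversity_index) ~ NA_character_,
      diversity_index <= toy_cutoffs[1] ~ "Low",
      diversity_index <= toy_cutoffs[2] ~ "Moderate",
      TRUE ~ "High"
    ),
    # cut() with right-closed intervals; a missing index stays missing
    tier_cut = cut(
      diversity_index,
      breaks = c(-Inf, toy_cutoffs, Inf),
      labels = c("Low", "Moderate", "High"),
      right = TRUE
    )
  )

toy_div
```

Both versions treat a value exactly equal to a cutoff as belonging to the lower tier, so they agree row for row.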

The most diverse counties in 2024 are led by three Puerto Rico municipios (Toa Baja, Mayagüez, and Bayamón) with index scores above 0.86. This result is initially surprising given Puerto Rico’s reputation as a culturally homogeneous island, but it reflects how the Census race question captures the population there: a large share of respondents identify across multiple racial categories simultaneously, producing high diversity scores under the Herfindahl formula. Among continental counties, Merced County, California and the Bronx in New York round out the top five, both long established as among the most racially mixed counties in the country.

The least diverse counties are uniformly small, predominantly white rural counties concentrated in the Northeast and Midwest. Armstrong County, Pennsylvania (0.106), Geauga County, Ohio (0.116), Crow Wing County, Minnesota (0.120), Franklin County, Missouri (0.124), and Belknap County, New Hampshire (0.124) all have populations where a single racial group accounts for the overwhelming majority of residents, leaving little distributional spread for the index to capture.

Housing Population Ratio

The housing to population ratio divides total housing units by total population to produce a county-level proxy for housing availability. Counties in the top 10% of this ratio have relatively more housing units per resident, while counties in the bottom 10% face potential housing scarcity relative to their population.

census_clean <- census_clean %>%
  mutate(
    housing_to_pop_ratio = total_housing_units/total_population,
    housing_ratio_flag = case_when(
      is.na(housing_to_pop_ratio) ~ NA_character_,
      housing_to_pop_ratio >= quantile(housing_to_pop_ratio, 0.90, na.rm = TRUE) ~ "Top 10%",
      housing_to_pop_ratio <= quantile(housing_to_pop_ratio, 0.10, na.rm = TRUE) ~ "Bottom 10%",
      TRUE ~ "Typical"
    )
  )
census_clean %>%
  summarise(
    min = min(housing_to_pop_ratio, na.rm = TRUE),
    q10 = quantile(housing_to_pop_ratio, 0.10, na.rm = TRUE),
    median = median(housing_to_pop_ratio, na.rm = TRUE),
    mean = mean(housing_to_pop_ratio, na.rm = TRUE),
    q90 = quantile(housing_to_pop_ratio, 0.90, na.rm = TRUE),
    max = max(housing_to_pop_ratio, na.rm = TRUE)
  )
census_clean %>%
  count(housing_ratio_flag) %>%
  mutate(pct = round(n/sum(n) * 100, 1))
bind_rows(
  census_clean %>%
    filter(survey_year == 2024) %>%
    arrange(desc(housing_to_pop_ratio)) %>%
    head(5),
  census_clean %>%
    filter(survey_year == 2024) %>%
    arrange(housing_to_pop_ratio) %>%
    head(5)
) %>%
  select(county_name, state_name, total_population, total_housing_units, housing_to_pop_ratio, housing_ratio_flag)

The housing to population ratio ranges from 0.287 to 1.042, with a median of 0.434, meaning the typical county has roughly one housing unit for every 2.3 residents. The flag distribution splits cleanly into 10% Top, 10% Bottom, and 80% Typical as intended.

The top 10% of counties by this ratio are dominated by coastal resort and vacation destinations. Cape May County, New Jersey leads at 1.04 (more housing units than residents), reflecting a large seasonal vacation housing stock that sits largely empty outside summer months. Monroe County, Florida (Florida Keys), Barnstable County, Massachusetts (Cape Cod), and Walton County, Florida follow the same pattern. These counties do not have a housing surplus in a meaningful sense; rather, their permanent resident population is small relative to the housing built to serve seasonal visitors.

The bottom 10% tells the opposite story. Kaufman County, Texas appears here despite being one of the fastest growing counties in the dataset: rapid population growth has outpaced housing construction, creating genuine scarcity. Utah County, Utah and Kings and Merced counties in California round out the bottom, all fast-growing areas where housing has struggled to keep pace with demand.
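Two small points about the ratio can be made concrete. First, its reciprocal gives residents per housing unit, which is where the 2.3 figure comes from (1 / 0.434). Second, the flag thresholds above are pooled across all survey years, so a county is compared against every county-year rather than against its own year. The sketch below, on made-up data (toy_housing is hypothetical), shows the pooled cutoff next to a per-year alternative. The per-year version is an assumption about what one might prefer, not what this analysis does.

```r
library(dplyr)

# Hypothetical toy data: two survey years with slightly different ratio levels
toy_housing <- tibble(
  survey_year = rep(c(2019, 2024), each = 5),
  total_population = rep(100000, 10),
  total_housing_units = c(38000, 40000, 42000, 44000, 60000,
                          36000, 39000, 41000, 43000, 58000)
)

toy_flags <- toy_housing %>%
  mutate(
    housing_to_pop_ratio = total_housing_units / total_population,
    # Reciprocal of the ratio: residents per housing unit
    residents_per_unit = 1 / housing_to_pop_ratio,
    # Pooled threshold, as in the code above: one cutoff across all years
    pooled_q10 = unname(quantile(housing_to_pop_ratio, 0.10, na.rm = TRUE))
  ) %>%
  group_by(survey_year) %>%
  # Alternative: a separate cutoff for each survey year
  mutate(yearly_q10 = unname(quantile(housing_to_pop_ratio, 0.10, na.rm = TRUE))) %>%
  ungroup()

toy_flags %>%
  distinct(survey_year, pooled_q10, yearly_q10)
```

Pooled cutoffs keep the flag comparable across years, while per-year cutoffs guarantee that each year contributes the same share of flagged counties; which is preferable depends on the question being asked.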

Visualizations

Senior Population Share Over Time by Region

census_clean %>%
  group_by(survey_year, region) %>%
  summarise(
    total_65_over = sum(population_65_and_over, na.rm = TRUE),
    total_pop = sum(total_population, na.rm = TRUE),
    senior_share = (total_65_over/total_pop) * 100,
    .groups = "drop"
  ) %>%
  ggplot(aes(x = survey_year, y = senior_share, color = region))+
  geom_line(linewidth = 1)+
  geom_point(size = 2)+
  scale_color_manual(values = region_colors)+
  scale_x_continuous(breaks = c(2019,2021,2022,2023,2024))+
  scale_y_continuous(labels = label_number(suffix = "%"))+
  labs(
    title = "Senior Population Share by Census Region (2019-2024)",
    subtitle = "Weighted share of regional population 65 and over",
    x = "Survey Year",
    y = "Senior Population Share",
    color = "Region",
    caption = "Source: ACS DP05 1-Year Estimates"
  )

Every Census region experienced an increase in senior population share between 2019 and 2024 with no exceptions, confirming that population aging is a universal trend across the United States rather than a regional one. The trajectory is consistent and uninterrupted across all five survey years.

Puerto Rico stands apart from the continental regions throughout the entire period. Starting at 22.4% in 2019 and reaching 25.6% by 2024, the Territory’s senior share is already well above the 20% Aging threshold and continues to widen its lead over continental regions. This accelerating gap reflects two compounding forces: an aging resident population and sustained outmigration of younger residents, particularly following Hurricane Maria in 2017, which has left behind a disproportionately older demographic profile.

Among continental regions the Northeast leads throughout, crossing 19% by 2024 and accelerating noticeably after 2021. The Midwest follows closely while the South and West remain the youngest regions, starting lowest in 2019 and remaining tightly clustered through 2024. The convergence of the South and West lines suggests that despite their different demographic compositions, both regions are aging at nearly identical rates driven by the nationwide Baby Boomer cohort moving into retirement age.
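The regional shares in this chart are population weighted: counts are summed across counties before dividing, so large counties dominate the result. The sketch below contrasts that with a simple average of county-level shares on made-up data (toy_region is hypothetical) to show how far apart the two can be.

```r
library(dplyr)

# Hypothetical toy region: one large younger county and two small older counties
toy_region <- tibble(
  county = c("Big", "Small 1", "Small 2"),
  total_population = c(1000000, 70000, 80000),
  population_65_and_over = c(130000, 21000, 20000)
)

toy_region_shares <- toy_region %>%
  summarise(
    # Population-weighted share: pool the counts, then divide
    weighted_share = sum(population_65_and_over) / sum(total_population) * 100,
    # Unweighted alternative: average the county-level shares
    unweighted_share = mean(population_65_and_over / total_population * 100)
  )

toy_region_shares
```

The weighted figure describes the typical resident of the region, while the unweighted figure describes the typical county. The diversity and housing ratio charts in this document average county-level values with mean(), so they are closer to the second interpretation.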

Working Age Population Share Over Time by Region

census_clean %>%
  group_by(survey_year, region) %>%
  summarise(
    total_working_age = sum(population_18_and_over - population_65_and_over, na.rm = TRUE),
    total_pop = sum(total_population, na.rm = TRUE),
    working_age_share = (total_working_age/total_pop) * 100,
    .groups = "drop"
  ) %>%
  ggplot(aes(x = survey_year, y = working_age_share, color = region))+
  geom_line(linewidth = 1)+
  geom_point(size = 2)+
  scale_color_manual(values = region_colors)+
  scale_x_continuous(breaks = c(2019,2021,2022,2023,2024))+
  scale_y_continuous(labels = label_number(suffix = "%"))+
  labs(
    title = "Working Age Population Share by Census Region (2019-2024)",
    subtitle = "Weighted share of regional population between ages 18 and 64",
    x = "Survey Year",
    y = "Working Age Population Share",
    color = "Region",
    caption = "Source: ACS DP05 1-Year Estimates"
  )

Working age population share declined in every Census region between 2019 and 2024, the direct counterpart to the universal increase in senior share seen in the previous visualization. As the Baby Boomer generation continues moving out of the 18-64 age band, every region is losing ground in the share of its population that is of working age.

The West maintains the highest working age share throughout the period, starting in the 62% range in 2019 and ending at 61.9% in 2024, a relatively modest decline compared to other regions. This resilience likely reflects continued in-migration of working age adults to Western metros, partially offsetting the aging of the existing population.

The Northeast experiences the steepest decline among continental regions, dropping sharply after 2022 and converging with the South and Midwest by 2024. This acceleration mirrors the Northeast’s rising senior share seen in the previous chart and suggests the region’s aging dynamic is intensifying rather than stabilizing.

The Territory remains the lowest throughout and continues its steady decline, bottoming out at 60.0% by 2024. Combined with its senior share already above 25%, Puerto Rico’s working population is being compressed from both ends (a shrinking share of residents in prime working years and a growing share of seniors) with significant implications for the island’s economic productivity and social services capacity.

Population Change Tier Distribution by Region

census_clean %>%
  filter(survey_year == 2024, !is.na(pop_change_tier)) %>%
  count(region, pop_change_tier) %>%
  group_by(region) %>%
  mutate(pct = n / sum(n) * 100) %>%
  ungroup() %>%
  mutate(pop_change_tier = factor(pop_change_tier, levels = c("High Growth", "Moderate Growth", "Stable", "Decline"))) %>%
  ggplot(aes(x = region, y = pct, fill = pop_change_tier))+
  geom_col()+
  scale_fill_manual(values = tier_colors)+
  scale_y_continuous(labels = label_number(suffix = "%"))+
  labs(
    title = "Population Change Tier Distribution by Region (2019-2024)",
    subtitle = "Share of counties in each growth tier per region",
    x = "Region",
    y = "Share of Counties",
    fill = "Population Change Tier",
    caption = "Source: ACS DP05 1-Year Estimates"
  )

The population change tier distribution reveals sharp regional contrasts in growth patterns between 2019 and 2024. The South leads all regions in dynamism: it carries the largest share of High Growth counties and the smallest share of counties in Decline, reflecting the sustained Sun Belt migration that has defined American demographic geography over the past decade. Texas counties in particular, as identified in the derived variables section, anchor much of this High Growth concentration.

The Territory presents the starkest contrast. Puerto Rico has the largest share of Decline counties of any region by a significant margin, with roughly 30% of its municipios experiencing population contraction of more than 2% over the five-year window. This is consistent with the sustained outmigration from the island documented in both the sex ratio and the senior share analyses (Puerto Rico is losing population across multiple dimensions simultaneously).

The Midwest shows the highest share of Stable counties, suggesting a demographic landscape that is neither growing rapidly nor contracting meaningfully. This stability masks internal variation (the region contains both growing exurban counties near major metros and quietly shrinking rural counties) but at the regional level the dominant pattern is stasis.

The Northeast is heavily concentrated in Moderate Growth with minimal representation at either extreme, while the West combines a healthy High Growth segment with low Decline, positioning it as the second most dynamic region after the South.

Diversity Index Distribution by Region

census_clean %>%
  filter(!is.na(diversity_index)) %>%
  ggplot(aes(x = region, y = diversity_index, fill = region))+
  geom_boxplot(alpha = 0.7, outlier.shape = 21, outlier.size = 1.5)+
  scale_fill_manual(values = region_colors)+
  scale_y_continuous(limits = c(0,1))+
  labs(
    title = "Racial Diversity Index Distribution by Region",
    subtitle = "Herfindahl-based diversity score; higher values indicate greater diversity",
    x = "Region",
    y = "Diversity Index",
    fill = "Region",
    caption = "Source: ACS DP05 1-Year Estimates"
  )+
  theme(legend.position = "none")

census_clean %>%
  filter(!is.na(diversity_index)) %>%
  group_by(region) %>%
  summarise(
    min = round(min(diversity_index, na.rm = TRUE), 3),
    q25 = round(quantile(diversity_index, 0.25, na.rm = TRUE), 3),
    median = round(median(diversity_index, na.rm = TRUE), 3),
    mean = round(mean(diversity_index, na.rm = TRUE), 3),
    q75 = round(quantile(diversity_index, 0.75, na.rm = TRUE), 3),
    max = round(max(diversity_index, na.rm = TRUE), 3)
  ) %>%
  arrange(desc(median))

The diversity index distribution reveals a clear regional hierarchy with the Territory at the top and the Midwest at the bottom. Puerto Rico’s municipios cluster tightly around a median of 0.751 with a relatively narrow interquartile range, meaning high diversity is consistent across the island rather than driven by a few exceptional municipios. As noted in the derived variables section, this reflects how the Census race question captures Puerto Rico’s population across multiple racial categories simultaneously.

The South and West occupy the middle ground with medians of 0.499 and 0.475 respectively, but their distributions tell different stories. The South’s box is more compact while the West’s interquartile range is wider, stretching from 0.327 to 0.666 (reflecting the contrast between highly diverse coastal California counties and more homogeneous rural Mountain West counties within the same region).

The Northeast sits at a median of 0.328 with a wide spread of 0.203 to 0.509, capturing the sharp contrast between densely diverse urban counties like the Bronx and Kings County and the predominantly white rural counties of northern New England and Appalachian Pennsylvania.

The Midwest is the least diverse region with a median of 0.296 and the most compressed distribution of any continental region, meaning low diversity is the norm across the region rather than the exception. A single high outlier near 0.75 stands apart from the rest (likely a county anchored by a large university or military installation that draws a more diverse population than its surrounding area).

census_clean %>%
  filter(region == "Midwest", diversity_index > 0.70) %>%
  select(county_name, state_name, survey_year, diversity_index, total_population) %>%
  arrange(desc(diversity_index))

The Midwest outlier above 0.70 is not a single county but two (Wyandotte County, Kansas and Cook County, Illinois), both appearing consistently across multiple years. Wyandotte County, anchored by Kansas City’s urban core, carries a diversity index of 0.75-0.76 across 2021 through 2024. Cook County, home to Chicago, follows closely at 0.74. Both are major metropolitan counties whose diversity reflects decades of immigration and internal migration into large Midwestern cities, standing in sharp contrast to the predominantly rural, low-diversity counties that define the region’s overall distribution.
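The 0.70 filter above was chosen by eye from the chart. The points a box plot draws individually follow a fixed rule: anything more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule to made-up values (toy_index is hypothetical) as one way to find outliers without picking a threshold manually.

```r
# Hypothetical toy index values for one region, with one high value
toy_index <- c(0.21, 0.25, 0.27, 0.29, 0.30, 0.31, 0.33, 0.36, 0.40, 0.75)

# Quartiles and the upper fence used by the 1.5 * IQR rule
quartiles <- quantile(toy_index, c(0.25, 0.75))
iqr_width <- unname(quartiles[2] - quartiles[1])
upper_fence <- unname(quartiles[2]) + 1.5 * iqr_width

# Values above the fence are the ones a box plot would draw as separate points
toy_outliers <- toy_index[toy_index > upper_fence]

upper_fence
toy_outliers
```

Applied per region to diversity_index, the same fence would return every county-year drawn as an outlier point, including any that sit below a manually chosen cutoff.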

Racial Diversity Over Time by Census Division

census_clean %>%
  filter(!is.na(diversity_index), !is.na(division)) %>%
  group_by(division, survey_year) %>%
  summarise(
    avg_diversity = mean(diversity_index, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  # survey_year becomes a factor (with scale_x_continuous commented out below)
  # so the missing 2020 survey does not leave a gap between 2019 and 2021
  mutate(division = reorder(division, avg_diversity),
         survey_year = factor(survey_year)) %>%
  ggplot(aes(x = survey_year, y = division, fill = avg_diversity))+
  geom_tile(color = "white", linewidth = 0.5)+
  scale_fill_gradient(
    low = "#0072B2",
    high = "#009E73",
    name = "Avg Diversity\nIndex"
  )+
  #scale_x_continuous(breaks = c(2019,2021,2022,2023,2024))+
  labs(
    title = "Average Racial Diversity Index by Census Division (2019-2024)",
    subtitle = "Divisions ordered by average diversity index across all years",
    x = "Survey Year",
    y = "Census Division",
    caption = "Source: ACS DP05 1-Year Estimates"
  )

census_clean %>%
  filter(!is.na(diversity_index), !is.na(division)) %>%
  group_by(division, survey_year) %>%
  summarise(
    avg_diversity = round(mean(diversity_index, na.rm = TRUE), 3),
    .groups = "drop"
  ) %>%
  pivot_wider(names_from = survey_year, values_from = avg_diversity) %>%
  mutate(change_2019_2024 = round(`2024` - `2019`, 3)) %>%
  arrange(desc(change_2019_2024))

Every Census division increased in average racial diversity between 2019 and 2024 with no exceptions, confirming that diversification is a nationwide trend operating across all geographic contexts. However, the pace and pattern vary considerably across divisions.

The Territory shows the largest absolute change at +0.189, but the trajectory is unusual (diversity jumps sharply from 0.543 in 2019 to 0.772 in 2021 before plateauing through 2024). This discontinuity almost certainly reflects the Census Bureau’s redesign of the race question following the 2020 Census rather than a genuine demographic shift of that magnitude occurring in a single year. The 2019 value for the Territory should be interpreted with caution when making longitudinal comparisons.

Among continental divisions, West South Central (+0.174) and Mountain (+0.166) lead in absolute change, reflecting rapid diversification in Texas, Arizona, Nevada, and Colorado (states experiencing both high population growth and significant shifts in racial composition driven by Hispanic and Asian population increases). The Pacific division follows at +0.159, consistent with California’s long-established trajectory toward greater diversity.

The East South Central division shows the smallest increase among continental divisions at +0.059, suggesting that Alabama, Kentucky, Mississippi, and Tennessee are diversifying more slowly than the rest of the country. This division occupies the lower middle of the heatmap throughout the study period, neither the least diverse nor showing meaningful acceleration toward greater diversity.

The heatmap makes the overall direction unmistakable (every row gets greener from left to right) while the ordering of divisions by average diversity reveals a persistent hierarchy that population change is shifting but not yet fundamentally reordering.

Housing to Population Ratio Distribution

census_clean %>%
  filter(survey_year == 2024, !is.na(housing_to_pop_ratio)) %>%
  ggplot(aes(x = housing_to_pop_ratio, fill = housing_ratio_flag))+
  geom_histogram(bins = 40, color = "white")+
  scale_fill_manual(
    values = c(
      "Top 10%" = "#009E73",
      "Typical" = "#999999",
      "Bottom 10%" = "#D55E00"
    )
  )+
  scale_x_continuous(labels = label_number(accuracy = 0.01))+
  labs(
    title = "Housing to Population Ratio Distribution (2024)",
    subtitle = "Counties flagged in the top and bottom 10% of the distribution",
    x = "Housing Units per Resident",
    y = "Number of Counties",
    fill = "Flag",
    caption = "Source: ACS DP05 1-Year Estimates"
  )

census_clean %>%
  filter(!is.na(housing_to_pop_ratio), !is.na(region)) %>%
  group_by(region, survey_year) %>%
  summarise(
    avg_ratio = mean(housing_to_pop_ratio, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  ggplot(aes(x = survey_year, y = avg_ratio, color = region)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2)+
  scale_color_manual(values = region_colors)+
  scale_x_continuous(breaks = c(2019, 2021, 2022, 2023, 2024))+
  scale_y_continuous(labels = label_number(accuracy = 0.01))+
  labs(
    title = "Average Housing to Population Ratio by Region (2019-2024)",
    subtitle = "Average number of housing units per resident across counties in each region",
    x = "Survey Year",
    y = "Avg Housing Units per Resident",
    color = "Region",
    caption = "Source: ACS DP05 1-Year Estimates"
  )

The housing to population distribution is approximately normal for the bulk of counties, centered around 0.43 to 0.45 housing units per resident, with a pronounced right tail driven by seasonal resort counties where vacation housing stock inflates the ratio well above the typical range. The color coding makes the two tails immediately identifiable (Bottom 10% counties cluster tightly below 0.40 while the Top 10% counties spread across a wider range extending past 0.60 and into the extreme outlier territory above 1.0).

The bottom of the distribution represents genuine housing scarcity relative to population. These are fast-growing counties where residential construction has not kept pace with population influx, or dense urban counties where land constraints limit new housing supply. The top of the distribution, as established in the derived variables section, is dominated by coastal vacation destinations where seasonal housing stock serves a transient population far larger than the permanent resident count.

The regional trend chart reveals two persistent patterns. The West maintains the lowest average ratio across all five survey years, confirming it as the most supply-constrained region relative to its population (a finding consistent with the well-documented housing affordability crisis in California, Oregon, and Washington). The Northeast holds the highest ratio among continental regions throughout, suggesting relatively greater housing availability per resident, though this average masks significant variation between dense urban cores and more spacious rural areas within the region.

The Territory shows a sharp dip in 2021 before recovering to above its 2019 level by 2023. This V-shaped pattern likely reflects disruptions in housing unit reporting during the pandemic year rather than a genuine contraction and rapid expansion of Puerto Rico’s housing stock in a two-year window.

Housing Supply vs Population Growth

census_clean %>%
  filter(survey_year == 2024, !is.na(housing_to_pop_ratio), !is.na(pop_change_tier), !is.na(pct_change)) %>%
  ggplot(aes(x = pct_change, y = housing_to_pop_ratio, color = pop_change_tier))+
  geom_point(alpha = 0.6, size = 2)+
  geom_vline(xintercept = 0, linetype = "dashed", color = "gray50")+
  geom_hline(yintercept = quantile(census_clean$housing_to_pop_ratio, 0.10, na.rm = TRUE), linetype = "dotted", color = "gray50")+
  scale_color_manual(values = tier_colors)+
  scale_x_continuous(labels = label_number(suffix = "%"))+
  scale_y_continuous(labels = label_number(accuracy = 0.01))+
  annotate(
    "text", x = 15, y = 0.30,
    label = "High Growth,\nlow housing supply",
    size = 3,
    color = "gray30"
  )+
  labs(
    title = "Housing Supply vs Population Growth by County (2019 - 2024)",
    subtitle = "Each point is a county; dotted line marks the bottom 10% housing ratio threshold",
    x = "Population Change 2019-2024 (%)",
    y = "Housing Units per Resident (2024)",
    color = "Population Change Tier",
    caption = "Source: ACS DP05 1-Year Estimates"
  )

The scatter plot reveals a clear negative relationship between population growth and housing supply: as population change increases moving right along the x-axis, housing units per resident tend to decrease. This pattern reflects a fundamental constraint of rapid growth: residential construction takes time, permitting, and capital, and in the fastest growing counties population is arriving faster than housing can be built to accommodate it.

High Growth counties cluster visibly in the lower right quadrant of the chart, combining strong population increases with some of the lowest housing ratios in the dataset. These counties face the most acute housing pressure (their populations are expanding rapidly while their per-resident housing stock is already among the thinnest in the country). Kaufman County, Texas, identified earlier as the fastest growing county in the dataset at 45.3%, appears as the rightmost teal point and sits well below the bottom 10% housing ratio threshold marked by the dotted line.

Decline counties occupy the opposite corner (upper left), where shrinking populations leave behind a housing stock that grows relatively larger per remaining resident. This is not a sign of housing abundance but of demographic contraction, where vacancy rates rise as people leave and the built environment outlasts the community it was designed to serve.

The vertical cluster of points near zero percent change represents Stable counties, which show the widest range of housing ratios (from well-supplied to constrained), suggesting that housing availability in stable counties is determined more by local geography and land use policy than by population pressure.

The Aging Growth Tension

census_clean %>%
  filter(survey_year == 2024, !is.na(senior_pop_share), !is.na(pct_change), !is.na(pop_change_tier)) %>%
  ggplot(aes(x = pct_change, y = senior_pop_share, color = pop_change_tier))+
  geom_point(alpha = 0.6, size = 2)+
  geom_vline(xintercept = 0, linetype = "dashed", color = "gray50")+
  geom_hline(yintercept = 20, linetype = "dotted", color = "gray50")+
  scale_color_manual(values = tier_colors)+
  scale_x_continuous(labels = label_number(suffix = "%"))+
  scale_y_continuous(labels = label_number(suffix = "%"))+
  annotate(
    "text", x = 30, y = 55,
    label = "Sumter County, FL",
    size = 3,
    color = "gray30"
  )+
  annotate(
    "text", x = -5, y = 22,
    label = "Growing older,\nshrinking population",
    size = 3,
    color = "gray10"
  )+
  annotate(
    "text", x = 30, y = 10,
    label = "Growing fast,\nstill young",
    size = 3,
    color = "gray30"
  )+
  labs(
    title = "Senior Population Share vs Population Growth by County (2019-2024)",
    subtitle = "Dotted line marks the 20% Aging threshold; dashed line marks zero population change",
    x = "Population Change 2019-2024 (%)",
    y = "Senior Population Share (2024)",
    color = "Population Change Tier",
    caption = "Source: ACS DP05 1-Year Estimates"
  )

The aging-growth scatter reveals a meaningful negative relationship between population growth and senior share: counties growing fastest tend to have younger populations, while counties losing residents or staying stable tend to skew older. This pattern reflects the demographic composition of internal migration in the United States. The households relocating to high-growth Sun Belt exurbs are disproportionately working-age families and younger adults, while older residents are more likely to age in place in slower-growing or declining communities.

Sumter County, Florida, sits entirely apart from the rest of the dataset (a High Growth county with a senior share above 55%), occupying a position that no other county approaches. It is the only county that simultaneously grows rapidly and maintains an extreme elderly majority, driven by the continued expansion of The Villages retirement community rather than the family-driven migration typical of other High Growth counties.

The upper left quadrant captures a demographically vulnerable cluster of counties (those simultaneously losing population and crossing the 20% Aging threshold). These communities face a compounding challenge: a shrinking tax base, rising demand for senior services, and a declining working-age population to provide and fund those services. This combination represents one of the more pressing policy implications to emerge from this analysis.
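The size of this vulnerable cluster can be made concrete by splitting counties along the two reference lines in the chart. The sketch below uses a hypothetical `toy_counties` frame with invented values, and it assumes the Aging threshold is inclusive (a senior share of 20% or more). The tier definition used earlier in the analysis may draw that boundary differently.

```r
library(dplyr)

# Hypothetical rows; column names mirror census_clean, values are invented.
toy_counties <- data.frame(
  pct_change = c(-8, -4, -2, -1, 1, 3, 12, 25, 30),
  senior_pop_share = c(24, 21, 18, 23, 19, 22, 14, 11, 56)
)

# Split on the dashed line (zero population change) and the dotted line
# (20% senior share), then count counties in each of the four quadrants.
quadrants <- toy_counties %>%
  mutate(
    growth_side = if_else(pct_change < 0, "Shrinking", "Growing or flat"),
    age_side = if_else(senior_pop_share >= 20, "At or above 20%", "Below 20%")
  ) %>%
  count(growth_side, age_side, name = "n_counties")

quadrants
```

The row with `growth_side == "Shrinking"` and `age_side == "At or above 20%"` corresponds to the upper left quadrant discussed above.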

The lower right quadrant (growing fast and still young) is dominated by High Growth teal counties, primarily Texas exurbs and Mountain West counties absorbing population from expensive coastal metros. These counties face the opposite challenge: rapid infrastructure demand driven by population growth rather than aging, with housing supply already under pressure as identified in the previous visualization.

Sex Ratio vs Racial Diversity Index

census_clean %>%
  filter(
    survey_year == 2024,
    !is.na(sex_ratio),
    !is.na(diversity_index),
    !is.na(region)
  ) %>%
  mutate(
    q1 = quantile(sex_ratio, 0.25, na.rm = TRUE),
    q3 = quantile(sex_ratio, 0.75, na.rm = TRUE),
    iqr = q3 - q1,
    lower = q1 - 1.5 * iqr,
    upper = q3 + 1.5 * iqr,
    outlier_status = if_else(
      sex_ratio < lower | sex_ratio > upper,
      "Outlier",
      "Typical"
    )
  ) %>%
  ggplot(aes(x = diversity_index, y = sex_ratio, color = region))+
  geom_hline(yintercept = 100, linetype = "dashed", color = "gray50")+
  geom_point(aes(size = outlier_status, alpha = outlier_status))+
  scale_color_manual(values = region_colors)+
  scale_size_manual(values = c("Outlier" = 3, "Typical" = 1.5))+
  scale_alpha_manual(values = c("Outlier" = 0.9, "Typical" = 0.4))+
  scale_x_continuous(limits = c(0, 1))+
  scale_y_continuous(labels = label_number(accuracy = 0.1))+
  annotate(
    "text", x = 0.15, y = 133,
    label = "Low diversity,\nhigh male ratio\n(prison/military)",
    size = 3,
    color = "gray30"
  )+
  annotate(
    "text", x = 0.15, y = 84,
    label = "Low diversity,\nlow male ratio",
    size = 3,
    color = "gray30"
  )+
  labs(
    title = "Sex Ratio vs Racial Diversity Index by County (2024)",
    subtitle = "Dashed line marks a balanced sex ratio of 100; outliers defined by 1.5 x IQR rule",
    x = "Racial Diversity Index",
    y = "Sex Ratio (Males per 100 Females)",
    color = "Region",
    size = "",
    alpha = "",
    caption = "Source: ACS DP05 1-Year Estimates"
  )+
  theme(legend.position = "bottom")+
  guides(size = "none", alpha = "none")

census_clean %>%
  filter(
    survey_year == 2024,
    sex_ratio > 130,
    diversity_index > 0.40
  ) %>%
  select(county_name, state_name, sex_ratio, diversity_index, total_population)

The relationship between sex ratio and racial diversity reveals a clear pattern: extreme sex ratio outliers are concentrated among low diversity counties on the left side of the chart, and become increasingly rare as diversity increases. This is consistent with the institutional population effects driving sex ratio extremes (prisons and military installations) being predominantly located in less diverse, more rural counties, where their demographic footprint is larger relative to the surrounding civilian population.
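The 1.5 x IQR rule behind the outlier flag, and the claim that outliers thin out as diversity rises, can both be illustrated on a small example. This is a sketch only: `toy_counties` holds invented values, and the diversity band cutoffs of 0.3 and 0.6 are hypothetical choices made for the illustration, not the cutoffs used elsewhere in this analysis.

```r
library(dplyr)

# Hypothetical rows; column names mirror census_clean, values are invented.
toy_counties <- data.frame(
  sex_ratio = c(97, 98, 99, 100, 101, 102, 103, 96, 134, 84, 99, 100),
  diversity_index = c(0.10, 0.55, 0.62, 0.70, 0.35, 0.48, 0.66, 0.58, 0.15, 0.12, 0.72, 0.44)
)

flagged <- toy_counties %>%
  mutate(
    # Quartiles and interquartile range of the sex ratio.
    q1 = quantile(sex_ratio, 0.25, na.rm = TRUE),
    q3 = quantile(sex_ratio, 0.75, na.rm = TRUE),
    iqr = q3 - q1,
    # A county is an outlier if it lies more than 1.5 IQRs outside the quartiles.
    is_outlier = sex_ratio < q1 - 1.5 * iqr | sex_ratio > q3 + 1.5 * iqr,
    # Hypothetical diversity bands, used only to group the counts below.
    diversity_band = cut(
      diversity_index,
      breaks = c(0, 0.3, 0.6, 1),
      labels = c("Low", "Moderate", "High"),
      include.lowest = TRUE
    )
  )

# Number of counties and number of flagged outliers in each band.
outliers_by_band <- flagged %>%
  group_by(diversity_band) %>%
  summarise(
    n_counties = n(),
    n_outliers = sum(is_outlier),
    .groups = "drop"
  )

outliers_by_band
```

On the real data, a pattern like the one described above would show up as `n_outliers` shrinking relative to `n_counties` when moving from the Low band to the High band.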

The high male ratio outliers in the upper left quadrant are almost entirely explained by correctional facilities and military bases in predominantly white rural counties across the Midwest, Northeast, and South. Walker County, Texas, stands as the notable exception: with a diversity index of 0.61 and a sex ratio of 134.3, it sits well to the right of the other high-ratio outliers. This reflects the racial composition of the Texas Department of Criminal Justice’s Huntsville prison complex, which incarcerates a diverse inmate population that simultaneously inflates both the male ratio and the diversity score. It is a striking example of how a single institution can distort multiple demographic indicators at once.

The low male ratio outliers in the lower left (predominantly Southern counties with low diversity indices) reflect a different structural reality. Research consistently links depressed male-to-female ratios in majority-Black rural communities to higher rates of male incarceration and premature mortality, both of which remove men from the counted residential population.

As diversity increases toward the right side of the chart, counties converge tightly around the balanced ratio of 100, with outliers becoming rare above a diversity index of 0.60. The Territory counties cluster at the far right with consistently sub-100 ratios, reinforcing the Puerto Rico outmigration pattern identified throughout this analysis. Taken together, this visualization suggests that demographic complexity (more racial groups, and larger and more mixed populations) acts as a stabilizing force on sex ratios, absorbing and diluting the distortions that institutional populations introduce in more homogeneous communities.