In this story I analyse the Infrastructure investment and Jobs Act funding allocation and see if the its equitable based on state population or if it favors the political interests of the ruling administration. With this first project I try to focus on 2 data visualization quality features namely Fidelity and simplicity
To enforce fidelity I imported the data from official US government websites to maintain data accuracy.
For 2020 election results
https://catalog.data.gov/dataset/2020-presidential-general-election-results
Census data for population numbers : https://www.census.gov/data/tables/time-series/demo/popest/2020s-state-total.html :
# Load IIJA funding data
funding <- read_excel("IIJA FUNDING AS OF MARCH 2023.xlsx") %>%
rename(
state = 1,
funding_billions = 2
) %>%
mutate(state = toupper(trimws(state)))
# Load population data
population <- read_excel(
"NST-EST2025-POP.xlsx",
range = "A10:H60",
col_names = FALSE
) %>%
transmute(
state = toupper(trimws(gsub("^\\.", "", ...1))),
population_2025 = ...8
)
## New names:
## • `` -> `...1`
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
# Combine funding and population
combined_data <- funding %>%
left_join(population, by = "state") %>%
filter(!is.na(population_2025)) %>% # drop territories
mutate(
funding_millions = funding_billions * 1000,
population_millions = population_2025 / 1e6,
funding_per_capita = funding_millions / population_millions
) %>%
select(
state,
funding_millions,
population_millions,
funding_per_capita
)
# Load 2020 Presidential Election results
election_data <- read_excel("2020_presidential_results_by_state.xlsx") %>%
mutate(
state = toupper(trimws(state))
)
# Combine election results with combined data
combined_data <- combined_data %>%
left_join(election_data, by = "state")
head(combined_data)
## # A tibble: 6 × 6
## state funding_millions population_millions funding_per_capita winner percent
## <chr> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 ALABAMA 3000 5.19 578. REP 62.0
## 2 ALASKA 3700 0.737 5019. REP 52.8
## 3 ARIZONA 3500 7.62 459. DEM 49.4
## 4 ARKANS… 2800 3.11 899. REP 62.4
## 5 CALIFO… 18400 39.4 468. DEM 63.5
## 6 COLORA… 3200 6.01 532. DEM 55.4
summary(combined_data$funding_per_capita)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 349.5 503.4 566.1 890.8 829.6 5018.5
We start with a simple scatterplot to look at the relationship between population and Funding.
# Scatterplot: Funding vs Population with a light regression line
ggplot(combined_data, aes(x = population_millions, y = funding_millions)) +
geom_point(size = 3, color = "steelblue") + # simple points
geom_smooth(method = "lm", se = FALSE, color = "grey70") + # light regression line
labs(
title = "Funding vs State Population",
x = "Population (Millions)",
y = "Funding (Millions USD)"
) +
theme_minimal(base_size = 14) +
theme(
axis.title.x = element_text(hjust = 0), # left-align x label
axis.title.y = element_text(vjust = 1) # top-align y label
)
## `geom_smooth()` using formula = 'y ~ x'
Most states on the line showing considerable equity between the
population and the funding allocated. We can observe the outliers with
the help of a residual plot.
# Fit linear model
lm_model <- lm(funding_millions ~ population_millions, data = combined_data)
# Compute residuals
combined_data <- combined_data %>%
mutate(
predicted = predict(lm_model, newdata = .),
residual = funding_millions - predicted,
abs_residual = abs(residual)
)
# Top 10 states farthest from regression line
top10_residuals <- combined_data %>%
arrange(desc(abs_residual)) %>%
slice(1:15) %>%
mutate(
state_order = 1:15 # assign evenly spaced positions on x-axis based on residual amount
)
# Plot
ggplot(top10_residuals, aes(x = state_order, y = residual)) +
geom_hline(yintercept = 0, color = "grey70") + # baseline
geom_segment(aes(x = state_order, xend = state_order, y = 0, yend = residual),
color = "grey60", linetype = "dashed") + # vertical lines
geom_point(aes(size = population_millions), color = "steelblue", alpha = 0.7, show.legend = FALSE) +
geom_text_repel(aes(label = state), color = "black", size = 3.3, nudge_y = 400, segment.color = NA) + # black labels
labs(
title = "Top 10 States Farthest from Expected Funding",
x = NULL,
y = "Deviation (Residuals)"
) +
scale_size_continuous(range = c(4,10)) +
theme_minimal(base_size = 14) +
theme(
axis.text.x = element_blank(),
axis.ticks.x = element_blank()
)
These are the top 15 states affecting the equity conversation. There
seems to be no correlation between the population size of the states and
its deviation from the model with high and low population states
occupying both sides of the graph
Now lets see how the political interests reflect on these outliers.
ggplot(top10_residuals, aes(x = state_order, y = residual, color = winner)) +
geom_hline(yintercept = 0, color = "grey70") + # baseline
geom_segment(aes(x = state_order, xend = state_order, y = 0, yend = residual),
color = "grey60", linetype = "dashed") + # vertical lines
geom_point(aes(size = population_millions), alpha = 0.7, show.legend = FALSE) +
geom_text_repel(aes(label = state), color = "black", size = 3.3, nudge_y = 400, segment.color = NA) + # black labels
scale_color_manual(values = c("DEM" = "steelblue", "REP" = "red")) + # party colors
labs(
title = "Electoral funding bias",
x = NULL,
y = "Deviation (Residuals)"
) +
scale_size_continuous(range = c(4,10)) +
theme_minimal(base_size = 14) +
theme(
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
legend.position = "none" # remove legend
)
No clear consistent pattern favoring Democrats. Red states dominate both
extremes, could indicate that funding deviations are not strictly along
party lines, but the largest outliers happen to be Republican.Overall,
the allocation seems mixed politically
We can also use a simple boxplot to now compare the distribution across all of the states and not just the outliers to see how the values closer to the model perform when viewed in light of the political leaning.
# Filter out states with NA winner
combined_data <- combined_data %>%
filter(!is.na(winner))
# Calculate medians
medians <- combined_data %>%
group_by(winner) %>%
summarise(median_funding = median(funding_per_capita))
# Horizontal boxplot with labeled medians, axes visible
ggplot(combined_data, aes(y = winner, x = funding_per_capita, fill = winner)) +
geom_boxplot(alpha = 0.7, show.legend = FALSE) +
geom_text(data = medians,
aes(y = winner, x = median_funding,
label = paste0("Median = ", round(median_funding, 0))),
color = "black",
hjust = -0.8, # shift right
vjust = -0.9, # shift up
size = 3.5) +
scale_fill_manual(values = c("DEM" = "steelblue", "REP" = "red")) +
labs(
title = "Funding per capita grouped by winning party ",
y = "",
x = "Funding per Capita (USD)"
) +
coord_cartesian(xlim = c(0, 3000)) + # zoom in on main chunk
theme_minimal(base_size = 14) +
theme(
axis.line = element_line(color = "black") # show axes lines
) +
theme(
axis.title.x = element_text(hjust = 0), # left-align x label
)
Here we see that the red states have a much more spread out distribution
with a slighly higher median value. To further understand where this
split happens we can go back to our original graph of funding vs
population.
ggplot(combined_data, aes(x = population_millions, y = funding_millions, color = winner)) +
geom_point(size = 3) + # simple points
geom_smooth(method = "lm", se = FALSE, color = "grey70") + # light regression line
scale_color_manual(values = c("DEM" = "steelblue", "REP" = "red")) +
labs(
title = "Funding vs State Population",
x = "Population (Millions)",
y = "Funding (Millions USD)"
) +
theme_minimal(base_size = 14) +
theme(legend.position = "none") +
theme(
axis.title.x = element_text(hjust = 0), # left-align x label
axis.title.y = element_text(vjust = 1) # top-align y label
)
## `geom_smooth()` using formula = 'y ~ x'
The analysis shows that the allocation of IIJA funding per capita is largely equitable across states. The scatterplots, and residual plots show us that while there is no systemic bias favoring either party, red states tend to dominate the outliers in the positives and the negatives. The boxplot reveals that the median funding for Democratic and Republican states is very similar with the red states pulling ahead thanks to overfunding in states with population below 10 million.