Story1_DATA608

Introduction

In this story I analyse the Infrastructure investment and Jobs Act funding allocation and see if the its equitable based on state population or if it favors the political interests of the ruling administration. With this first project I try to focus on 2 data visualization quality features namely Fidelity and simplicity

Data imports

To enforce fidelity I imported the data from official US government websites to maintain data accuracy.

For 2020 election results
https://catalog.data.gov/dataset/2020-presidential-general-election-results

Census data for population numbers : https://www.census.gov/data/tables/time-series/demo/popest/2020s-state-total.html :

# Load IIJA funding data

funding <- read_excel("IIJA FUNDING AS OF MARCH 2023.xlsx") %>%
  rename(
    state = 1,
    funding_billions = 2
  ) %>%
  mutate(state = toupper(trimws(state)))

# Load population data

population <- read_excel(
  "NST-EST2025-POP.xlsx",
  range = "A10:H60",
  col_names = FALSE
) %>%
  transmute(
    state = toupper(trimws(gsub("^\\.", "", ...1))),
    population_2025 = ...8
  )

## New names:
## • `` -> `...1`
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`

# Combine funding and population

combined_data <- funding %>%
  left_join(population, by = "state") %>%
  filter(!is.na(population_2025)) %>%  # drop territories
  mutate(
    funding_millions = funding_billions * 1000,
    population_millions = population_2025 / 1e6,
    funding_per_capita = funding_millions / population_millions
  ) %>%
  select(
    state,
    funding_millions,
    population_millions,
    funding_per_capita
  )

# Load 2020 Presidential Election results

election_data <- read_excel("2020_presidential_results_by_state.xlsx") %>%
  mutate(
    state = toupper(trimws(state))
  )


# Combine election results with combined data

combined_data <- combined_data %>%
  left_join(election_data, by = "state")


head(combined_data)

## # A tibble: 6 × 6
##   state   funding_millions population_millions funding_per_capita winner percent
##   <chr>              <dbl>               <dbl>              <dbl> <chr>    <dbl>
## 1 ALABAMA             3000               5.19                578. REP       62.0
## 2 ALASKA              3700               0.737              5019. REP       52.8
## 3 ARIZONA             3500               7.62                459. DEM       49.4
## 4 ARKANS…             2800               3.11                899. REP       62.4
## 5 CALIFO…            18400              39.4                 468. DEM       63.5
## 6 COLORA…             3200               6.01                532. DEM       55.4

summary(combined_data$funding_per_capita)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   349.5   503.4   566.1   890.8   829.6  5018.5

Q1) Is the allocation equitable based on the population of each of the States and Territories, or is bias apparent?

We start with a simple scatterplot to look at the relationship between population and Funding.

# Scatterplot: Funding vs Population with a light regression line

ggplot(combined_data, aes(x = population_millions, y = funding_millions)) +
  geom_point(size = 3, color = "steelblue") +                     # simple points
  geom_smooth(method = "lm", se = FALSE, color = "grey70") +      # light regression line
  labs(
    title = "Funding vs State Population",
    x = "Population (Millions)",
    y = "Funding (Millions USD)"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    axis.title.x = element_text(hjust = 0),  # left-align x label
    axis.title.y = element_text(vjust = 1)   # top-align y label
  )

## `geom_smooth()` using formula = 'y ~ x'

Most states on the line showing considerable equity between the population and the funding allocated. We can observe the outliers with the help of a residual plot.

#  Fit linear model
lm_model <- lm(funding_millions ~ population_millions, data = combined_data)

#  Compute residuals
combined_data <- combined_data %>%
  mutate(
    predicted = predict(lm_model, newdata = .),
    residual = funding_millions - predicted,
    abs_residual = abs(residual)
  )

#  Top 10 states farthest from regression line
top10_residuals <- combined_data %>%
  arrange(desc(abs_residual)) %>%
  slice(1:15) %>%
  mutate(
    state_order = 1:15  # assign evenly spaced positions on x-axis based on residual amount 
  )

#  Plot
ggplot(top10_residuals, aes(x = state_order, y = residual)) +
  geom_hline(yintercept = 0, color = "grey70") +                        # baseline
  geom_segment(aes(x = state_order, xend = state_order, y = 0, yend = residual),
               color = "grey60", linetype = "dashed") +                  # vertical lines
  geom_point(aes(size = population_millions), color = "steelblue", alpha = 0.7, show.legend = FALSE) +
  geom_text_repel(aes(label = state), color = "black", size = 3.3, nudge_y = 400, segment.color = NA) +  # black labels
  labs(
    title = "Top 10 States Farthest from Expected Funding",
    x = NULL,
    y = "Deviation (Residuals)"
  ) +
  scale_size_continuous(range = c(4,10)) +
  theme_minimal(base_size = 14) +
  theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank()
  )

These are the top 15 states affecting the equity conversation. There seems to be no correlation between the population size of the states and its deviation from the model with high and low population states occupying both sides of the graph

Q2 ) Does the allocation favor the political interests of the Biden administration?

Now lets see how the political interests reflect on these outliers.

ggplot(top10_residuals, aes(x = state_order, y = residual, color = winner)) +
  geom_hline(yintercept = 0, color = "grey70") +                        # baseline
  geom_segment(aes(x = state_order, xend = state_order, y = 0, yend = residual),
               color = "grey60", linetype = "dashed") +                  # vertical lines
  geom_point(aes(size = population_millions), alpha = 0.7, show.legend = FALSE) +
  geom_text_repel(aes(label = state), color = "black", size = 3.3, nudge_y = 400, segment.color = NA) +  # black labels
  scale_color_manual(values = c("DEM" = "steelblue", "REP" = "red")) +  # party colors
  labs(
    title = "Electoral funding bias",
    x = NULL,
    y = "Deviation (Residuals)"
  ) +
  scale_size_continuous(range = c(4,10)) +
  theme_minimal(base_size = 14) +
  theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    legend.position = "none"  # remove legend
  )

No clear consistent pattern favoring Democrats. Red states dominate both extremes, could indicate that funding deviations are not strictly along party lines, but the largest outliers happen to be Republican.Overall, the allocation seems mixed politically

We can also use a simple boxplot to now compare the distribution across all of the states and not just the outliers to see how the values closer to the model perform when viewed in light of the political leaning.

# Filter out states with NA winner
combined_data <- combined_data %>%
  filter(!is.na(winner))

# Calculate medians
medians <- combined_data %>%
  group_by(winner) %>%
  summarise(median_funding = median(funding_per_capita))


# Horizontal boxplot with labeled medians, axes visible
ggplot(combined_data, aes(y = winner, x = funding_per_capita, fill = winner)) +
  geom_boxplot(alpha = 0.7, show.legend = FALSE) +  
  geom_text(data = medians, 
            aes(y = winner, x = median_funding, 
                label = paste0("Median = ", round(median_funding, 0))), 
            color = "black", 
            hjust = -0.8,   # shift right
            vjust = -0.9,   # shift up
            size = 3.5) +
  scale_fill_manual(values = c("DEM" = "steelblue", "REP" = "red")) +
  labs(
    title = "Funding per capita grouped by winning party ",
    y = "",
    x = "Funding per Capita (USD)"
  ) +
  coord_cartesian(xlim = c(0, 3000)) +  # zoom in on main chunk
  theme_minimal(base_size = 14) +
  theme(
    axis.line = element_line(color = "black")  # show axes lines
  )  +
  theme(
    axis.title.x = element_text(hjust = 0),  # left-align x label

  )

Here we see that the red states have a much more spread out distribution with a slighly higher median value. To further understand where this split happens we can go back to our original graph of funding vs population.

ggplot(combined_data, aes(x = population_millions, y = funding_millions, color = winner)) +
  geom_point(size = 3) +                     # simple points
  geom_smooth(method = "lm", se = FALSE, color = "grey70") +      # light regression line
    scale_color_manual(values = c("DEM" = "steelblue", "REP" = "red")) +
  labs(
    title = "Funding vs State Population",
    x = "Population (Millions)",
    y = "Funding (Millions USD)"
  ) +
  theme_minimal(base_size = 14) + 
    theme(legend.position = "none")  +
  theme(
    axis.title.x = element_text(hjust = 0),  # left-align x label
    axis.title.y = element_text(vjust = 1)   # top-align y label
  )

## `geom_smooth()` using formula = 'y ~ x'

Conclusion

The analysis shows that the allocation of IIJA funding per capita is largely equitable across states. The scatterplots, and residual plots show us that while there is no systemic bias favoring either party, red states tend to dominate the outliers in the positives and the negatives. The boxplot reveals that the median funding for Democratic and Republican states is very similar with the red states pulling ahead thanks to overfunding in states with population below 10 million.