Question 1: Is the allocation equitable based on the population of each of the States and Territories, or is bias apparent?

Loading the libraries

library(tidyverse)
library(rvest)
library(plotly)
library(gridExtra)

Preparing and cleaning the data

#Importing the excel file my github
funding <- read.csv("https://raw.githubusercontent.com/LeJQC/MSDS/main/DATA%20608/IIJA%20FUNDING%20AS%20OF%20MARCH%202023(1).csv")

glimpse(funding)
## Rows: 57
## Columns: 2
## $ State..Teritory.or.Tribal.Nation <chr> "ALABAMA", "ALASKA", "AMERICAN SAMOA"…
## $ Total..Billions.                 <dbl> 3.0000, 3.7000, 0.0686, 3.5000, 2.800…

When I did a google search of population by state, I found that many websites only had the population from the 50 states and not the territories. Luckily, there was a table on wikipedia that had 56 of the 57 territories in the funding excel file. I extracted the table from wikipedia, which contained data gathered from the 2020 US Census.

#Extracting the table from wikipedia
wiki_url <- "https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population#cite_note-Census2020-8"

webpage <- read_html(wiki_url)

table <- html_nodes(webpage, "table")

pop_df <- html_table(table[[3]])[,c(1,3)]

#Renaming the columns to perform so I can merge them
pop_df <- pop_df %>% 
  rename(`State..Teritory.or.Tribal.Nation` = `State/federal district/territory/division/region`) %>% 
  rename('Population' = `2020 pop.`)

pop_df$`State..Teritory.or.Tribal.Nation` <- toupper(pop_df[["State..Teritory.or.Tribal.Nation"]])

#Performing an inner join with the funding dataframe
funding_pop_df <- left_join(funding, pop_df, by= "State..Teritory.or.Tribal.Nation")

missing_value <- funding_pop_df %>% 
  filter(is.na(`Population`)) %>% 
  print()
##   State..Teritory.or.Tribal.Nation Total..Billions. Population
## 1                         DELEWARE           0.7920       <NA>
## 2               TRIBAL COMMUNITIES           3.0000       <NA>
## 3                US VIRGIN ISLANDS           0.1483       <NA>

There were a couple of missing values. These values were in the wikipedia table but it did not merge correctly due to issues. Delaware was spelled incorrectly, US Virgin islands vs U.S Virgin islands, and tribal communities is not in the population dataframe. I added in these population values manually. The tribal communities population is 3.7 million according to: https://www.census.gov/newsroom/facts-for-features/2022/aian-month.html.

#Filling in the missing populations 
funding_pop_df <- funding_pop_df %>%
  mutate(`Population` = ifelse(`State..Teritory.or.Tribal.Nation` == "DELEWARE","989,948",`Population`)) %>%
  mutate(`Population` = ifelse(`State..Teritory.or.Tribal.Nation` == "TRIBAL COMMUNITIES",  "3,700,000",`Population`)) %>% 
  mutate(`Population` = ifelse(`State..Teritory.or.Tribal.Nation` == "US VIRGIN ISLANDS",   "87,146",`Population`)) 
#Changing the columns to numeric
funding_pop_df <- funding_pop_df %>%
  mutate(`Population` = as.numeric(str_replace_all(Population, ",", ""))) %>% 
  mutate(`Population` = (`Population`/1000000)) %>% 
  rename("Total ($ Billions)" = `Total..Billions.`) %>% 
  rename("Population (Millions)" = `Population`)

Plotting the data

top_states <- funding_pop_df %>%
  arrange(desc(`Population (Millions)`)) %>%
  head(4)

p <- funding_pop_df %>% 
  ggplot(aes(x = `Population (Millions)`, y = `Total ($ Billions)`)) +
  geom_jitter(aes(color = `State..Teritory.or.Tribal.Nation`), size = 3, alpha = 0.6) +
  theme_bw() +
  labs(title= "Population vs Funding", x="Population (Millions)", y= "Funding ($ Billions)",
       subtitle = "Analyzing the relationship between the population of State and funding allocation") +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  theme(plot.title = element_text(hjust = 0.5,size= 20, color = "darkblue",face = "bold"), 
        axis.title.x = element_text(face = "bold", color = "darkblue"), 
        axis.title.y = element_text(face = "bold", color = "darkblue"),
        legend.position = "none")

#Hovering over the data point will show the state, population, and funding
ggplotly(p)
#p + geom_text(data = top_states, aes(label = `State..Teritory.or.Tribal.Nation`), vjust = 1, hjust= 1)

Question 1: Conclusion

Based on this visual, there is a positive correlation between the population of State, Territory or Tribal Nation and funding allocation. The states with larger populations received more in funding than those that were smaller in population. California, Texas, Florida, and New York have the largest population among the states. Thus, they received more funding than the other states. There was no bias apparent.

Question 2: Does the allocation favor the political interests of the Biden administration?

Gathering the data

To get the official election results of the 2020 Presidential election, I went to the national archives website (https://www.archives.gov/electoral-college/2020) which has the electoral votes by state in a table. I scrapped the table using rvest and merged the table with the funding excel table.

#Extracting the table from the archives website
vote_url <- "https://www.archives.gov/electoral-college/2020"

webpage <- read_html(vote_url)

table <- html_nodes(webpage, "table")

vote_df <- html_table(table[[2]])[,c(1,3,4)]

vote_df$`State` <- toupper(vote_df[["State"]])

#Cleaning up the data
column_names <- c("State", "Biden", "Trump")

colnames(vote_df) <- column_names

vote_df$State <- gsub("\\*", "", vote_df$State)

#Fixing the Delaware spelling on the funding data frame
funding <- funding %>%
  mutate(`State..Teritory.or.Tribal.Nation` = ifelse(`State..Teritory.or.Tribal.Nation` == "DELEWARE",
          "DELAWARE",`State..Teritory.or.Tribal.Nation`)) %>% 
  rename( "State" =`State..Teritory.or.Tribal.Nation`)
  
#Merging the votes and funding data frames
funding_vote_df <- left_join(vote_df, funding, by= "State")

funding_vote_df <- na.omit(funding_vote_df)

#Creating a column for total votes
funding_vote_df$Biden <- as.numeric(gsub("-", "0", funding_vote_df$Biden))
funding_vote_df$Trump <- as.numeric(gsub("-", "0", funding_vote_df$Trump))
#funding_vote_df$Total_votes <- funding_vote_df$Biden + funding_vote_df$Trump

glimpse(funding_vote_df)
## Rows: 51
## Columns: 4
## $ State            <chr> "ALABAMA", "ALASKA", "ARIZONA", "ARKANSAS", "CALIFORN…
## $ Biden            <dbl> 0, 0, 11, 0, 55, 9, 7, 3, 3, 0, 16, 4, 0, 20, 0, 0, 0…
## $ Trump            <dbl> 9, 3, 0, 6, 0, 0, 0, 0, 0, 29, 0, 0, 4, 0, 11, 6, 6, …
## $ Total..Billions. <dbl> 3.000, 3.700, 3.500, 2.800, 18.400, 3.200, 2.500, 0.7…

Summarizing the data

#Adding a new column for the winner of each state
funding_vote_df <- funding_vote_df %>%
  mutate(Winner = ifelse(Biden > Trump, "Biden", "Trump"))

#Calculating the total votes and total funding
total_votes <- sum(funding_vote_df$Biden, funding_vote_df$Trump)
total_funding <- sum(funding_vote_df$`Total..Billions.`)

#Creating a summary table
summary_table <- funding_vote_df %>%
  group_by(Winner) %>%
  summarise(
    Total_Votes = sum(ifelse(Winner == "Biden", Biden, Trump)),
    Total_Funding = sum(`Total..Billions.`),
    Percentage_of_Votes = Total_Votes / total_votes * 100,
    Percentage_of_Funding = Total_Funding / total_funding * 100)

knitr::kable(summary_table)
Winner Total_Votes Total_Funding Percentage_of_Votes Percentage_of_Funding
Biden 305 103.0959 56.69145 53.80903
Trump 231 88.5000 42.93680 46.19097

Note: The total vote for Biden should be 306 and 232 for Trump. Maine and Nebraska are not winner take all states so both candidates received a vote in these states even though the other candidate had the majority of the votes.

Plotting the data

#Bar graph for Percentage_of_Votes
plot1 <- ggplot(summary_table, aes(x = Winner, y = Percentage_of_Votes, fill = Winner)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(aes(label= round(Percentage_of_Votes)), vjust = 1, size = 3.5) + 
  labs(title = "Percentage of Votes", x = "Winner", y = "Percentage of Votes",
       caption = "Total votes for Biden: 306 \nTotal votes for Trump: 232") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5,size= 12),
    legend.position = "none") 

#Bar graph for Percentage_of_Funding
plot2 <- ggplot(summary_table, aes(x = Winner, y = Percentage_of_Funding, fill = Winner)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(aes(label= round(Percentage_of_Funding)), vjust = 1, size = 3.5) + 
  labs(title = "Percentage of Funding", x = "Winner", y = "Percentage of Funding",
       caption = "Funding for Biden States: 103 Billion \nFunding for Trump States: 88.5 Billion") +
  theme_bw() + 
  theme(plot.title = element_text(hjust = 0.5, size= 12),
    legend.position = "none") 
  
grid.arrange(top = "Allocation of Funding to States by Electoral Votes ",plot1, plot2, ncol=2)

Question 2: Conclusion

Based on the results from the 2020 election, Biden received 56.9%(306/538) of the electoral votes while trump received 43.1% (232/538) of the votes. In addition, states that voted for Biden seem to have received more funding than states that voted for Trump, 103 vs 88.5 billion. It is not definite that the allocation favored the political interests of Biden. Since the proportion of states that voted for Biden and proportion of funding allocated to these states are similar, it is difficult to conclude that there was bias in funding allocation.

Sources: