Story # 1

Importing and Cleaning The Data

First, I have sourced population of each US states and territory from the World Population Review website, and I web scraped the 2020 election results from The American Presidency Project that shows how many people in the United States voted in the 2020 election, seperated by states.

## Import the relevant data and then clean the following data..
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(readxl)
library(rvest)

## 
## Attaching package: 'rvest'
## 
## The following object is masked from 'package:readr':
## 
##     guess_encoding

library(janitor)

## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(ggrepel)
Data <-  read_excel("C:\\Users\\Al Haque\\OneDrive\\Desktop\\Data 608\\IIJA FUNDING AS OF MARCH 2023(1).xlsx")
Data$`State, Teritory or Tribal Nation` <- tolower(Data$`State, Teritory or Tribal Nation`)
Data2 <- read_csv("C:\\Users\\Al Haque\\OneDrive\\Desktop\\Data 608\\state-population-table.csv")

## Rows: 50 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): state
## dbl (13): fips, densityMi, pop2023, pop2022, pop2020, pop2019, pop2010, grow...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data2$state <- tolower(Data2$state)
## select the relevant column
Data2 <- Data2 %>%
  select('state','pop2023')

Web_link <- 'https://www.presidency.ucsb.edu/statistics/elections/2020'

Web_pg <- read_html(Web_link) %>%
  html_nodes("table") %>% html_table() %>%
  .[[1]]
## Clean and format the table..
Data3 <- Web_pg %>%
  select(c(X1:X10)) %>%
  slice(13:90)
  
Data3 <- Data3 %>%
  select(c(X1,X2,X3,X6,X9)) %>%
  slice(1:59)

Data3 <- Data3 %>%
  filter(X1 != "CD-1" & X1 != "CD-2" & X1 != "CD-3")

colnames(Data3) <- c('State','Total_Votes','DemoCratic_Vote','Republican_Vote','Other_Vote')

Data3$State <- tolower(Data3$State)

Attempt A Left-Join:

After cleaning up the relevant data and removing what I deemed as irrelevant columns to the questions, I then merged all the data sets into one central data set that contains the current population, and the votes of each state that are separated by political party.

Dataset <- left_join(Data,Data2, by = c(`State, Teritory or Tribal Nation` = 'state'))
head(Dataset)

## # A tibble: 6 × 3
##   `State, Teritory or Tribal Nation` `Total (Billions)`  pop2023
##   <chr>                                           <dbl>    <dbl>
## 1 alabama                                        3       5098746
## 2 alaska                                         3.7      732984
## 3 american samoa                                 0.0686       NA
## 4 arizona                                        3.5     7453517
## 5 arkansas                                       2.8     3063152
## 6 california                                    18.4    38915693

## Now left-join with this data3 and clean the column names
Dataset <- left_join(Dataset,Data3, by = c(`State, Teritory or Tribal Nation` = 'State'))
Dataset <- Dataset %>%
  clean_names()
head(Dataset)

## # A tibble: 6 × 7
##   state_teritory_or_tribal…¹ total_billions pop2023 total_votes demo_cratic_vote
##   <chr>                               <dbl>   <dbl> <chr>       <chr>           
## 1 alabama                            3       5.10e6 2,323,282   849,624         
## 2 alaska                             3.7     7.33e5 359,530     153,778         
## 3 american samoa                     0.0686 NA      <NA>        <NA>            
## 4 arizona                            3.5     7.45e6 3,387,326   1,672,143       
## 5 arkansas                           2.8     3.06e6 1,219,069   423,932         
## 6 california                        18.4     3.89e7 17,500,881  11,110,250      
## # ℹ abbreviated name: ¹state_teritory_or_tribal_nation
## # ℹ 2 more variables: republican_vote <chr>, other_vote <chr>

Cleaning The Data Further

In order to perform some data wrangling we had to convert some of the columns into numeric in order to perform some analysis, this included imputing NA values with 0 and removing commas from the numbers and then converting it into a numeric type.

## Remove the commas and convert it into numeric

Dataset$total_votes <- as.numeric(gsub(",","",Dataset$total_votes))
Dataset$other_vote <- as.numeric(gsub(",","",Dataset$other_vote))
Dataset$demo_cratic_vote <- as.numeric(gsub(",","",Dataset$demo_cratic_vote))
Dataset$republican_vote <- as.numeric(gsub(",","",Dataset$republican_vote))

Manually Inputing values

Here I will manually impute the population value for each territory and replace all NA’s value with 0 Unfortunately American territories can not vote in US presidential elections.. so I will impute their NA values with 0.

Dataset <- Dataset %>%
  mutate(across(total_votes:other_vote, ~ replace_na(.,0)))

## Manually impute the US territory population with values I found with google search
Dataset[3,3] <- 43914
Dataset[9,3] <- 1031985
Dataset[10,3] <- 643301
Dataset[13,3] <- 172952
Dataset[38,3] <- 49796
Dataset[43,3] <- 3260314
Dataset[49,3] <- 6790000
Dataset[50,3] <- 98750

In this step I created a new column that compares the number of votes per political party and then deem the state as a Democratic or Republican state if one party vote is greater than the other, “A winner take all state” as deemed during the elections. Since The U.S Territory are not counted during the presidential election I merely placed their party affilation as Not-Affilated to make it understandable that they were not allowed to vote.

## Remove the commas and convert it into numeric
Dataset <- Dataset %>%
  mutate(Party = ifelse(demo_cratic_vote > republican_vote, "Democratic","Republican")) 

Dataset$Party[Dataset$Party == "Republican" & Dataset$total_votes == 0] <- "Not-Affilated"

Dataset$Party <- as.factor(Dataset$Party)

head(Dataset)

## # A tibble: 6 × 8
##   state_teritory_or_tribal…¹ total_billions pop2023 total_votes demo_cratic_vote
##   <chr>                               <dbl>   <dbl>       <dbl>            <dbl>
## 1 alabama                            3       5.10e6     2323282           849624
## 2 alaska                             3.7     7.33e5      359530           153778
## 3 american samoa                     0.0686  4.39e4           0                0
## 4 arizona                            3.5     7.45e6     3387326          1672143
## 5 arkansas                           2.8     3.06e6     1219069           423932
## 6 california                        18.4     3.89e7    17500881         11110250
## # ℹ abbreviated name: ¹state_teritory_or_tribal_nation
## # ℹ 3 more variables: republican_vote <dbl>, other_vote <dbl>, Party <fct>

Data Visualization

I’ve created a bunch of exploratory plots that I thought would be relevant to the following questions

Q: Is the allocation equitable based on the population of each of the state and territories or is bias apparent?

A: Looking at the scatterplot the allocation is equitable based on the population of each state if we look at the scatterplot, we see that there is a positive correlation between the population of the state and the funds allocated to each state, in other words, the bigger the population the state has the more funding they will recieve. The scatter plot shows that states like California,Texas, New York, Pennsylvania and Florida all receive higher funding than most of the other state since their state’s population is much higher than those of the other state. Since Texas, Florida and Ohio are red state with a robust population I would say that there is no bias apparent in the allocation of funds

Q: Does the allocation favor the politcal interest of the Biden Adminstration?

The allocation does favor the politcal interest of the Biden Adminstration, since the majority of these states that have big population are mostly blue-states. Hence, the funds are mostly allocated to major blue states which would bide well for the Biden Adminstration.

library(ggplot2)
ggplot(Dataset,aes(x = pop2023, y= total_billions,label = state_teritory_or_tribal_nation,color = Party)) + geom_point() +
  xlab("Current Population of 2023") +
  ylab('Total_Allocated\n(In Billions)') +
  ggtitle('Relationship between Funds Allocated and Population')+ geom_text_repel(data = Dataset %>% filter(total_billions >= 5.0),max.overlaps = Inf,size = 3,
        nudge_y = 3,
        nudge_x = 3,
        hjust = -2, 
        direction = "y") + theme_bw() +
  scale_color_manual(values = c("Democratic" = "blue", "Republican" = "red","Not-Affilated" = "black")) + labs(subtitle = "A postitive correlation between Funds Allocated and Population",caption = "States are labeled with fund allocation greater than 5 billion")

We can see despite the plot facetted by party we can see a positive correlation between each funds and allocated despite party affiliation..

ggplot(Dataset,aes(x = pop2023,y = total_billions,label = state_teritory_or_tribal_nation,color = Party)) + geom_point() +
  xlab("Current Population of 2023") +
  ylab('Total_Allocated\n(In Billions)') +
  ggtitle('Relationship between Funds Allocated and Population') +
  facet_wrap(~Party, nrow  = 3) + theme_bw() +
  scale_color_manual(values = c("Democratic" = "blue", "Republican" = "red","Not-Affilated" = "black"))

Miscellanous Plots

Here are some other plots I constructed since I was interested in other aspects of the data I was curious about the distribution of the funds allocated so I’ve constructed a histogram and a bar plot colored by Party affilation.

ggplot(Dataset,aes(reorder(state_teritory_or_tribal_nation,total_billions),y = total_billions,fill = Party)) + 
  geom_col(width = 0.7) + coord_flip() + scale_fill_manual(values = c("Democratic" = "blue", "Republican" = "red","Not-Affilated" = "black")) + labs(x = "States/Territory",y = 'Funds Allocated\n(In Billions)')

ggplot(Dataset,aes(x = total_billions)) + geom_histogram(bins = 15,fill = "skyblue",color = "black") +
  labs(title = "Distribution of Resources Allocated\n(In Billions)",x = "Values", y = "Frequency") + theme_classic()

Work-Cited

Here are the websites in which I’ve sourced my data from.

“2020: The American Presidency Project.” 2020 | The American Presidency Project, www.presidency.ucsb.edu/statistics/elections/2020. Accessed 29 Aug. 2023.

This is where I web-scraped the 2020 election results for each state

US States - Ranked by Population 2023, worldpopulationreview.com/states. Accessed 29 Aug. 2023.

This is where I downloaded the current 2023 population for each US state and territory