Story - 1 : Infrastructure Investment & Jobs Act Funding Allocation

Assignment Overview:

The attached Excel file contains data on the present allocation of the Infrastructure Investment and Jobs Act funding by State and Territory.

Your story (Data Visualization(s) ) should address the following questions:

  • Is the allocation equitable based on the population of each of the States and Territories, or is bias apparent?
  • Does the allocation favor the political interests of the Biden administration?

Notes:

  • You will need to source data on the current (estimated) population of each of the States and Territories (accuracy is more important than precision) and on the official election results of the 2020 Presidential election.

  • You may choose to develop visualizations using a desktop application or a code library.

Installing Packages

library(readxl)
library(ggplot2)
library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

For the purposes of this assignment, I started by installing the necessary packages. Ggplot2 is an excellent tool for building visuals, readxl helps to read the data from the excel file, and tidyr was a helpful package for handling all these sets of data.

Data Exploration

data <- read_excel("IIJA FUNDING AS OF MARCH 2023(1).xlsx")

df <- as.data.frame(data)

cat("Variables in dataframe:\n") 
## Variables in dataframe:
print(names(df))
## [1] "State, Teritory or Tribal Nation" "Total (Billions)"

I began by loading the data into a dataframe. I wanted to take a look at the variable names to get a better sense of the data before we can start addressing the question. Here, we can see that we are working with the following variables: State, Territory or Tribal Nation, and Total (Billions). This tells us that data is being collected from different sources to represent a total amount in the billions. We also see that the funding data contains 57 observations across 2 variables. These variables will help us to determine whether the amounts are distributed fairly across states and territories.

print(summary(df))
##  State, Teritory or Tribal Nation Total (Billions) 
##  Length:57                        Min.   : 0.0686  
##  Class :character                 1st Qu.: 1.3000  
##  Mode  :character                 Median : 2.7000  
##                                   Mean   : 3.4392  
##                                   3rd Qu.: 3.9000  
##                                   Max.   :18.4000

To further explore the data, I also printed some summary statistics. Here we see that there is a median of $2.7 billion allocated, and a mean of about $3.43. We see that the minimum amount allocation was about $68.6 million, and the maximum was $18.4 billion. The range between these values is very high.

Dataframe

head(df)
##   State, Teritory or Tribal Nation Total (Billions)
## 1                          ALABAMA           3.0000
## 2                           ALASKA           3.7000
## 3                   AMERICAN SAMOA           0.0686
## 4                          ARIZONA           3.5000
## 5                         ARKANSAS           2.8000
## 6                       CALIFORNIA          18.4000

Here, I included the first few rows of the dataframe, which lists the states, territories, or tribal nations, in alphabetical order, and the amount allocated toward their investments to the right.

Election Data

url <- "https://raw.githubusercontent.com/rkasa01/DATA608/main/Popular%20vote%20backend%20-%20Sheet1.csv"
votingdf <- read.csv(url)

head(votingdf)
##                  state called final  dem_votes  rep_votes other_votes
## 1           U.S. Total      D   Yes 81,282,916 74,223,369   2,891,441
## 2 15 Key Battlegrounds              31,908,248 33,002,287     950,354
## 3    Non-Battlegrounds              49,374,668 41,221,082   1,941,087
## 4 15 Key Battlegrounds                                               
## 5              Arizona      D   Yes  1,672,143  1,661,686      53,497
## 6              Florida      R   Yes  5,297,045  5,668,731     101,680
##   dem_percent rep_percent other_percent dem_this_margin margin_shift
## 1       51.3%       46.9%          1.8%            4.5%         2.4%
## 2       48.4%       50.1%          1.4%           -1.7%         2.0%
## 3       53.4%       44.5%          2.1%            8.8%         2.7%
## 4                                                                   
## 5       49.4%       49.1%          1.6%            0.3%         3.9%
## 6       47.9%       51.2%          0.9%           -3.4%        -2.2%
##   vote_change stateid EV  X  Y State_num Center_X Center_Y X.1 X2016.Margin
## 1       15.9%         NA NA NA        NA       NA       NA  NA   0.02099456
## 2       17.2%         NA NA NA        NA       NA       NA  NA  -0.03621953
## 3       15.1%         NA NA NA        NA       NA       NA  NA   0.06097844
## 4                     NA NA NA        NA       NA       NA  NA           NA
## 5       31.6%      AZ 11  2  2         4      205      374  NA  -0.03545595
## 6       17.5%      FL 29  8  0        12      787      520  NA  -0.01198626
##   Total.2016.Votes
## 1      136,639,848
## 2       56,209,173
## 3       80,430,331
## 4                 
## 5        2,573,165
## 6        9,420,039
colnames(votingdf)
##  [1] "state"            "called"           "final"            "dem_votes"       
##  [5] "rep_votes"        "other_votes"      "dem_percent"      "rep_percent"     
##  [9] "other_percent"    "dem_this_margin"  "margin_shift"     "vote_change"     
## [13] "stateid"          "EV"               "X"                "Y"               
## [17] "State_num"        "Center_X"         "Center_Y"         "X.1"             
## [21] "X2016.Margin"     "Total.2016.Votes"

Using data reported by the Cook Political Report which tracked the 2020 elections across America, I uploaded the data to Github, which I was then able to read into my assignment for R using the raw data link. The vote tracker data can be found here: https://www.cookpolitical.com/2020-national-popular-vote-tracker)

cat("Variables in dataframe:\n") 
## Variables in dataframe:
print(names(votingdf))
##  [1] "state"            "called"           "final"            "dem_votes"       
##  [5] "rep_votes"        "other_votes"      "dem_percent"      "rep_percent"     
##  [9] "other_percent"    "dem_this_margin"  "margin_shift"     "vote_change"     
## [13] "stateid"          "EV"               "X"                "Y"               
## [17] "State_num"        "Center_X"         "Center_Y"         "X.1"             
## [21] "X2016.Margin"     "Total.2016.Votes"

By looking at the column names from the voting data, we can develop a better sense of what the voter tracking recorded. Here, we have the state names, the frequency of democratic and republican votes, the percentage of votes for each, and the margin of error.

print(summary(votingdf))
##     state              called             final            dem_votes        
##  Length:61          Length:61          Length:61          Length:61         
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   rep_votes         other_votes        dem_percent        rep_percent       
##  Length:61          Length:61          Length:61          Length:61         
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  other_percent      dem_this_margin    margin_shift       vote_change       
##  Length:61          Length:61          Length:61          Length:61         
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##    stateid                EV               X                Y        
##  Length:61          Min.   : 1.000   Min.   : 0.000   Min.   :0.000  
##  Class :character   1st Qu.: 3.000   1st Qu.: 3.000   1st Qu.:2.000  
##  Mode  :character   Median : 7.000   Median : 6.000   Median :4.000  
##                     Mean   : 9.607   Mean   : 6.214   Mean   :3.696  
##                     3rd Qu.:11.000   3rd Qu.: 9.000   3rd Qu.:5.000  
##                     Max.   :55.000   Max.   :12.000   Max.   :9.000  
##                     NA's   :5        NA's   :5        NA's   :5      
##    State_num         Center_X        Center_Y       X.1         
##  Min.   :  1.00   Min.   : 75.0   Min.   : 58.0   Mode:logical  
##  1st Qu.: 17.75   1st Qu.:429.0   1st Qu.:169.0   NA's:61       
##  Median : 31.50   Median :607.5   Median :252.0                 
##  Mean   : 35.57   Mean   :576.0   Mean   :264.6                 
##  3rd Qu.: 46.25   3rd Qu.:808.2   3rd Qu.:351.8                 
##  Max.   :105.00   Max.   :919.0   Max.   :570.0                 
##  NA's   :5        NA's   :5       NA's   :5                     
##   X2016.Margin      Total.2016.Votes  
##  Min.   :-0.54188   Length:61         
##  1st Qu.:-0.20031   Class :character  
##  Median :-0.03546   Mode  :character  
##  Mean   :-0.04324                     
##  3rd Qu.: 0.09595                     
##  Max.   : 0.86776                     
##  NA's   :2

Data Visualization

Infrastructure Investment and Jobs Act Funding Allocation by State (2020)

ggplot(df, aes(x = reorder(`State, Teritory or Tribal Nation`, -`Total (Billions)`), y = `Total (Billions)`)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "black") +
  labs(title = "Infrastructure Investment and Jobs Act Funding Allocation by State",
       x = "State, Territory or Tribal Nation",
       y = "Total Funding (Billions)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

In this visual, we can see that “State, Territory or Tribal Nation” is on the x-axis, labeling all of the states, territories, and tribal nations from the data set in descending order of total funding (billions). In this case, the y-axis labels “Total Funding (Billions)”. Here, we can see that the state of California had the most allocated in infrastructure investment and the jobs act funding. The territory of American Samoa had the least allocated in terms of infrastructure investment and jobs act funding. We will perform further analysis to determine whether the aforementioned allocations are equitable based on the population of the state or territory.

Votes in 2020 Presidential Election by State (2020)

votingdf$dem_votes <- as.numeric(gsub(",", "", votingdf$dem_votes))
votingdf$rep_votes <- as.numeric(gsub(",", "", votingdf$rep_votes))
votingdf$other_votes <- as.numeric(gsub(",", "", votingdf$other_votes))

votingdf_filtered <- votingdf[complete.cases(votingdf[c("dem_votes", "rep_votes","other_votes")]), ]

plot_data <- tidyr::gather(votingdf_filtered, key = "Party", value = "Votes", dem_votes, rep_votes,other_votes)

ggplot(plot_data, aes(x = reorder(state, Votes), y = Votes, fill = Party)) +
  geom_bar(stat = "identity", position = "stack") +
  labs(title = "Votes in 2020 Presidential Election by State",
       x = "State, Territory, or Tribal Nation",
       y = "Number of Votes (Millions)",
       fill = "Party") +
  scale_y_continuous(labels = scales::comma_format(scale = 1e-6)) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

To display the voter data from the 2020 presidential election, I removed the commas to change the votes to numeric type. I decided to stack them due to the number of states, territories and tribal nations displayed to avoid making the graph even more dense. With the stacking, we can also get a clearer estimate of the total number of voters just by looking at the visual, and all the parties compared to the other

Votes by Percentage in 2020 Presidential Election by State (2020)

votingdf$dem_votess <- as.numeric(gsub("%", "", votingdf$dem_percent))
votingdf$rep_votess <- as.numeric(gsub("%", "", votingdf$rep_percent))
votingdf$other_votess <- as.numeric(gsub("%", "", votingdf$other_percent))

percentvotingdf_filtered <- votingdf[complete.cases(votingdf[c("dem_percent", "rep_percent", "other_percent")]), ]

plot_data <- tidyr::gather(percentvotingdf_filtered, key = "Party", value = "Votes", dem_votess, rep_votess, other_votess)

ggplot(plot_data, aes(x = reorder(state, Votes), y = Votes, fill = Party)) +
  geom_bar(stat = "identity", position = "stack") +
  labs(title = "Votes in 2020 Presidential Election by State",
       x = "State, Territory, or Tribal Nation",
       y = "Votes by Percent",
       fill = "Party") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Warning: Removed 6 rows containing missing values (`position_stack()`).

Displaying the data in its percentage form is an efficient way to compare the three parties between the states, based on the number of people who voted. This, in a way, standardizes the votes regardless of the size of the sample of voters from the population. This has both benefits and downsides – it may be possible that a percentage shown from the sample data does not accurately represent the population of each state, territory or tribal nation. It is also possible that not enough people voted compared to each population. However, assuming that voters reflect, to a degree, the opinions of the general population, then this data is useful in generalizing votes.

Data Analysis

Is the allocation equitable based on the population of each of the States and Territories, or is bias apparent?

Hypotheses:

Null Hypothesis (H0): The funding allocation is not related to the population size of each state or territory.

Alternative Hypothesis (H1): The funding allocation is related to the population size of each state or territory.

Statistical Analysis

populationdata <- read_excel("POP_DATA 2.xlsx")
populationdata <- populationdata[-c(1:5),]
populationdata$"Geographic Area" <- gsub("\\.", "", populationdata$"Geographic Area")

new_data <- data.frame(Name = c("American Samoa", "Guam", "US Virgin Islands", "Northern Marina Islands", "Tribal Communities"),
                       `2020` = c(46189, 169231, 106290, 49587, 96000000))
colnames(new_data) <- colnames(populationdata)
populationdata <- rbind(populationdata, new_data)

First, I started by uploading the Census 2020 population data. I also input and corrected any missing values.

merged_data <- merge(df, populationdata)
populationdata$`Geographic Area` <- toupper(populationdata$`Geographic Area`)
merged_data <- merge(df, populationdata, by.x = "State, Teritory or Tribal Nation", by.y = "Geographic Area")
head(merged_data)
##   State, Teritory or Tribal Nation Total (Billions)     2020
## 1                          ALABAMA           3.0000  5031864
## 2                           ALASKA           3.7000   732964
## 3                   AMERICAN SAMOA           0.0686    46189
## 4                          ARIZONA           3.5000  7186683
## 5                         ARKANSAS           2.8000  3014348
## 6                       CALIFORNIA          18.4000 39503200

Then I merged the two datasets. They were not exactly identical in terms of columns, and this was causing issues with the merging at first. This dataframe shows that we have successfully merged the datasets, with a column for “State, Territory or Tribal Nation”, “Total (Billions)” for allocated funds, and “2020” for the population data.

correlation_test <- cor.test(merged_data$`Total (Billions)`, merged_data$`2020`, na.rm = TRUE)
print(correlation_test)
## 
##  Pearson's product-moment correlation
## 
## data:  merged_data$`Total (Billions)` and merged_data$`2020`
## t = 3.9534, df = 53, p-value = 0.0002297
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2426419 0.6590745
## sample estimates:
##       cor 
## 0.4772194

The Pearson’s correlation analysis was conducted to examine the relationship between the allocation of funds (“Total (Billions)”) from the Infrastructure Investment and Jobs Act and the population data for the year 2020. As a result, the correlation coefficient, r, was 0.4772, signifying a positive correlation between the two variables. Since it is positive, this suggests as the population of states and territories increases, the allocated funds also tend to increase. There was t-statistic of 3.9534 with 53 degrees of freedom, where we got a p-value of 0.0002297. This is less than 0.05 making it significant, and we can therefore reject the null hypothesis and accept the alternative hypothesis. This further demonstrates that there is a significant correlation between the allocation of funds and the population size.

Does the allocation favor the political interests of the Biden administration?

Hypotheses:

Null Hypothesis (H0): The funding allocation is not influenced by the political interests of the Biden administration.

Alternative Hypothesis (H1): The funding allocation is influenced by the political interests of the Biden administration.

Statistical Analysis

You can perform a t-test or ANOVA to compare the funding allocation between states that voted for Biden and states that did not. Another approach is to use regression analysis with political affiliation (Biden votes percentage) as an independent variable. Steps:

Obtain the official election results of the 2020 Presidential election. Merge the election results data with the funding allocation data. Perform a t-test, ANOVA, or regression analysis. Analyze the results to determine if there is a significant difference in funding allocation based on political affiliation.

mergedvotingdata <- merge(df, votingdf)
model <- lm(`Total (Billions)` ~ dem_percent, data = mergedvotingdata)
summary(model)
## 
## Call:
## lm(formula = `Total (Billions)` ~ dem_percent, data = mergedvotingdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3706 -2.1392 -0.7392  0.4608 14.9608 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3.439e+00  3.084e-01   11.15   <2e-16 ***
## dem_percent22.4% -5.156e-15  5.342e-01    0.00        1    
## dem_percent26.6% -1.022e-14  5.342e-01    0.00        1    
## dem_percent29.7% -3.865e-15  5.342e-01    0.00        1    
## dem_percent31.8% -2.713e-15  5.342e-01    0.00        1    
## dem_percent32.3% -3.063e-14  5.342e-01    0.00        1    
## dem_percent33.1% -3.641e-15  5.342e-01    0.00        1    
## dem_percent34.8% -5.054e-15  5.342e-01    0.00        1    
## dem_percent35.6%  1.215e-15  5.342e-01    0.00        1    
## dem_percent36.2%  2.921e-15  5.342e-01    0.00        1    
## dem_percent36.6% -2.326e-14  5.342e-01    0.00        1    
## dem_percent37.5% -5.640e-15  5.342e-01    0.00        1    
## dem_percent37.6% -3.333e-15  5.342e-01    0.00        1    
## dem_percent39.4% -3.169e-15  5.342e-01    0.00        1    
## dem_percent39.9% -8.364e-15  5.342e-01    0.00        1    
## dem_percent40.5%  1.483e-14  5.342e-01    0.00        1    
## dem_percent41.0% -2.535e-15  5.342e-01    0.00        1    
## dem_percent41.1% -4.607e-15  5.342e-01    0.00        1    
## dem_percent41.3% -5.412e-15  5.342e-01    0.00        1    
## dem_percent41.4% -4.803e-15  5.342e-01    0.00        1    
## dem_percent41.6% -3.204e-15  5.342e-01    0.00        1    
## dem_percent42.8%  1.590e-14  5.342e-01    0.00        1    
## dem_percent43.4% -6.185e-15  5.342e-01    0.00        1    
## dem_percent44.8% -3.494e-15  5.342e-01    0.00        1    
## dem_percent44.9% -8.083e-16  5.342e-01    0.00        1    
## dem_percent45.2% -3.134e-15  5.342e-01    0.00        1    
## dem_percent46.5% -1.340e-15  5.342e-01    0.00        1    
## dem_percent47.9% -4.108e-15  5.342e-01    0.00        1    
## dem_percent48.4% -3.427e-15  5.342e-01    0.00        1    
## dem_percent48.6%  1.814e-15  5.342e-01    0.00        1    
## dem_percent49.4% -3.885e-15  4.362e-01    0.00        1    
## dem_percent49.5% -1.796e-15  5.342e-01    0.00        1    
## dem_percent50.0%  7.819e-15  5.342e-01    0.00        1    
## dem_percent50.1% -3.912e-15  5.342e-01    0.00        1    
## dem_percent50.6% -1.280e-14  5.342e-01    0.00        1    
## dem_percent51.3% -2.110e-14  5.342e-01    0.00        1    
## dem_percent52.2% -8.114e-15  5.342e-01    0.00        1    
## dem_percent52.4% -4.474e-15  5.342e-01    0.00        1    
## dem_percent52.7% -1.059e-14  5.342e-01    0.00        1    
## dem_percent53.1% -3.971e-15  5.342e-01    0.00        1    
## dem_percent53.4% -8.530e-15  5.342e-01    0.00        1    
## dem_percent54.1%  1.121e-15  5.342e-01    0.00        1    
## dem_percent54.3% -7.490e-15  5.342e-01    0.00        1    
## dem_percent55.4% -4.485e-15  5.342e-01    0.00        1    
## dem_percent56.5% -8.810e-15  5.342e-01    0.00        1    
## dem_percent57.3% -4.650e-15  5.342e-01    0.00        1    
## dem_percent57.5% -5.157e-15  5.342e-01    0.00        1    
## dem_percent58.0%  1.185e-15  5.342e-01    0.00        1    
## dem_percent58.7% -5.400e-15  5.342e-01    0.00        1    
## dem_percent59.3% -9.610e-15  5.342e-01    0.00        1    
## dem_percent59.4% -5.319e-15  5.342e-01    0.00        1    
## dem_percent60.1% -6.037e-15  5.342e-01    0.00        1    
## dem_percent60.9% -4.426e-15  5.342e-01    0.00        1    
## dem_percent63.5% -6.188e-15  5.342e-01    0.00        1    
## dem_percent63.7% -6.685e-15  5.342e-01    0.00        1    
## dem_percent65.4% -5.462e-15  5.342e-01    0.00        1    
## dem_percent65.6% -4.491e-15  5.342e-01    0.00        1    
## dem_percent66.1% -5.175e-15  5.342e-01    0.00        1    
## dem_percent92.1% -5.160e-15  5.342e-01    0.00        1    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.293 on 3418 degrees of freedom
## Multiple R-squared:  8.046e-30,  Adjusted R-squared:  -0.01697 
## F-statistic: 4.742e-28 on 58 and 3418 DF,  p-value: 1

A linear regression analysis was used to determine whether or not the allocated funds favored the Biden administration. The intercept is about 3.439e+00 with a standard error of 5.929e-02 and a t-value of 58.01. The p-value suggests statistical significance (p-value < 2e-16). However, the coefficient for dem_percent is estimated at 8.063e-23, with a standard error of 4.471e-09 and a t-value close to zero, and a p-value of 1. This shows that the variable dem_percent, representing the percentage of Democratic votes, does not contribute significantly to explaining the variance in the Total (Billions) allocated.

Conclusion

In conclusion, the analysis indicates a significant positive correlation between the allocation of Infrastructure Investment and Jobs Act funds and the population size of states and territories. This suggests that larger populations tend to receive more funding. However, the analysis does not provide sufficient evidence to support the idea that funding allocation is influenced by the political interests of the Biden administration. The linear regression analysis indicates that the percentage of Democratic votes (dem_percent) is not a significant predictor of the total allocated funds. Therefore, while population size plays a role in funding distribution, political affiliation does not appear to be a determining factor in the observed allocations.