The attached Excel file contains data on the present allocation of the Infrastructure Investment and Jobs Act funding by State and Territory.
Your story (Data Visualization(s) ) should address the following questions:
You will need to source data on the current (estimated) population of each of the States and Territories (accuracy is more important than precision) and on the official election results of the 2020 Presidential election.
You may choose to develop visualizations using a desktop application or a code library.
library(readxl)
library(ggplot2)
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
For the purposes of this assignment, I started by installing the necessary packages. Ggplot2 is an excellent tool for building visuals, readxl helps to read the data from the excel file, and tidyr was a helpful package for handling all these sets of data.
data <- read_excel("IIJA FUNDING AS OF MARCH 2023(1).xlsx")
df <- as.data.frame(data)
cat("Variables in dataframe:\n")
## Variables in dataframe:
print(names(df))
## [1] "State, Teritory or Tribal Nation" "Total (Billions)"
I began by loading the data into a dataframe. I wanted to take a look at the variable names to get a better sense of the data before we can start addressing the question. Here, we can see that we are working with the following variables: State, Territory or Tribal Nation, and Total (Billions). This tells us that data is being collected from different sources to represent a total amount in the billions. We also see that the funding data contains 57 observations across 2 variables. These variables will help us to determine whether the amounts are distributed fairly across states and territories.
print(summary(df))
## State, Teritory or Tribal Nation Total (Billions)
## Length:57 Min. : 0.0686
## Class :character 1st Qu.: 1.3000
## Mode :character Median : 2.7000
## Mean : 3.4392
## 3rd Qu.: 3.9000
## Max. :18.4000
To further explore the data, I also printed some summary statistics. Here we see that there is a median of $2.7 billion allocated, and a mean of about $3.43. We see that the minimum amount allocation was about $68.6 million, and the maximum was $18.4 billion. The range between these values is very high.
head(df)
## State, Teritory or Tribal Nation Total (Billions)
## 1 ALABAMA 3.0000
## 2 ALASKA 3.7000
## 3 AMERICAN SAMOA 0.0686
## 4 ARIZONA 3.5000
## 5 ARKANSAS 2.8000
## 6 CALIFORNIA 18.4000
Here, I included the first few rows of the dataframe, which lists the states, territories, or tribal nations, in alphabetical order, and the amount allocated toward their investments to the right.
url <- "https://raw.githubusercontent.com/rkasa01/DATA608/main/Popular%20vote%20backend%20-%20Sheet1.csv"
votingdf <- read.csv(url)
head(votingdf)
## state called final dem_votes rep_votes other_votes
## 1 U.S. Total D Yes 81,282,916 74,223,369 2,891,441
## 2 15 Key Battlegrounds 31,908,248 33,002,287 950,354
## 3 Non-Battlegrounds 49,374,668 41,221,082 1,941,087
## 4 15 Key Battlegrounds
## 5 Arizona D Yes 1,672,143 1,661,686 53,497
## 6 Florida R Yes 5,297,045 5,668,731 101,680
## dem_percent rep_percent other_percent dem_this_margin margin_shift
## 1 51.3% 46.9% 1.8% 4.5% 2.4%
## 2 48.4% 50.1% 1.4% -1.7% 2.0%
## 3 53.4% 44.5% 2.1% 8.8% 2.7%
## 4
## 5 49.4% 49.1% 1.6% 0.3% 3.9%
## 6 47.9% 51.2% 0.9% -3.4% -2.2%
## vote_change stateid EV X Y State_num Center_X Center_Y X.1 X2016.Margin
## 1 15.9% NA NA NA NA NA NA NA 0.02099456
## 2 17.2% NA NA NA NA NA NA NA -0.03621953
## 3 15.1% NA NA NA NA NA NA NA 0.06097844
## 4 NA NA NA NA NA NA NA NA
## 5 31.6% AZ 11 2 2 4 205 374 NA -0.03545595
## 6 17.5% FL 29 8 0 12 787 520 NA -0.01198626
## Total.2016.Votes
## 1 136,639,848
## 2 56,209,173
## 3 80,430,331
## 4
## 5 2,573,165
## 6 9,420,039
colnames(votingdf)
## [1] "state" "called" "final" "dem_votes"
## [5] "rep_votes" "other_votes" "dem_percent" "rep_percent"
## [9] "other_percent" "dem_this_margin" "margin_shift" "vote_change"
## [13] "stateid" "EV" "X" "Y"
## [17] "State_num" "Center_X" "Center_Y" "X.1"
## [21] "X2016.Margin" "Total.2016.Votes"
Using data reported by the Cook Political Report which tracked the 2020 elections across America, I uploaded the data to Github, which I was then able to read into my assignment for R using the raw data link. The vote tracker data can be found here: https://www.cookpolitical.com/2020-national-popular-vote-tracker)
cat("Variables in dataframe:\n")
## Variables in dataframe:
print(names(votingdf))
## [1] "state" "called" "final" "dem_votes"
## [5] "rep_votes" "other_votes" "dem_percent" "rep_percent"
## [9] "other_percent" "dem_this_margin" "margin_shift" "vote_change"
## [13] "stateid" "EV" "X" "Y"
## [17] "State_num" "Center_X" "Center_Y" "X.1"
## [21] "X2016.Margin" "Total.2016.Votes"
By looking at the column names from the voting data, we can develop a better sense of what the voter tracking recorded. Here, we have the state names, the frequency of democratic and republican votes, the percentage of votes for each, and the margin of error.
print(summary(votingdf))
## state called final dem_votes
## Length:61 Length:61 Length:61 Length:61
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## rep_votes other_votes dem_percent rep_percent
## Length:61 Length:61 Length:61 Length:61
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## other_percent dem_this_margin margin_shift vote_change
## Length:61 Length:61 Length:61 Length:61
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## stateid EV X Y
## Length:61 Min. : 1.000 Min. : 0.000 Min. :0.000
## Class :character 1st Qu.: 3.000 1st Qu.: 3.000 1st Qu.:2.000
## Mode :character Median : 7.000 Median : 6.000 Median :4.000
## Mean : 9.607 Mean : 6.214 Mean :3.696
## 3rd Qu.:11.000 3rd Qu.: 9.000 3rd Qu.:5.000
## Max. :55.000 Max. :12.000 Max. :9.000
## NA's :5 NA's :5 NA's :5
## State_num Center_X Center_Y X.1
## Min. : 1.00 Min. : 75.0 Min. : 58.0 Mode:logical
## 1st Qu.: 17.75 1st Qu.:429.0 1st Qu.:169.0 NA's:61
## Median : 31.50 Median :607.5 Median :252.0
## Mean : 35.57 Mean :576.0 Mean :264.6
## 3rd Qu.: 46.25 3rd Qu.:808.2 3rd Qu.:351.8
## Max. :105.00 Max. :919.0 Max. :570.0
## NA's :5 NA's :5 NA's :5
## X2016.Margin Total.2016.Votes
## Min. :-0.54188 Length:61
## 1st Qu.:-0.20031 Class :character
## Median :-0.03546 Mode :character
## Mean :-0.04324
## 3rd Qu.: 0.09595
## Max. : 0.86776
## NA's :2
ggplot(df, aes(x = reorder(`State, Teritory or Tribal Nation`, -`Total (Billions)`), y = `Total (Billions)`)) +
geom_bar(stat = "identity", fill = "skyblue", color = "black") +
labs(title = "Infrastructure Investment and Jobs Act Funding Allocation by State",
x = "State, Territory or Tribal Nation",
y = "Total Funding (Billions)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
In this visual, we can see that “State, Territory or Tribal Nation” is
on the x-axis, labeling all of the states, territories, and tribal
nations from the data set in descending order of total funding
(billions). In this case, the y-axis labels “Total Funding (Billions)”.
Here, we can see that the state of California had the most allocated in
infrastructure investment and the jobs act funding. The territory of
American Samoa had the least allocated in terms of infrastructure
investment and jobs act funding. We will perform further analysis to
determine whether the aforementioned allocations are equitable based on
the population of the state or territory.
votingdf$dem_votes <- as.numeric(gsub(",", "", votingdf$dem_votes))
votingdf$rep_votes <- as.numeric(gsub(",", "", votingdf$rep_votes))
votingdf$other_votes <- as.numeric(gsub(",", "", votingdf$other_votes))
votingdf_filtered <- votingdf[complete.cases(votingdf[c("dem_votes", "rep_votes","other_votes")]), ]
plot_data <- tidyr::gather(votingdf_filtered, key = "Party", value = "Votes", dem_votes, rep_votes,other_votes)
ggplot(plot_data, aes(x = reorder(state, Votes), y = Votes, fill = Party)) +
geom_bar(stat = "identity", position = "stack") +
labs(title = "Votes in 2020 Presidential Election by State",
x = "State, Territory, or Tribal Nation",
y = "Number of Votes (Millions)",
fill = "Party") +
scale_y_continuous(labels = scales::comma_format(scale = 1e-6)) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
To display the voter data from the 2020 presidential election, I removed
the commas to change the votes to numeric type. I decided to stack them
due to the number of states, territories and tribal nations displayed to
avoid making the graph even more dense. With the stacking, we can also
get a clearer estimate of the total number of voters just by looking at
the visual, and all the parties compared to the other
votingdf$dem_votess <- as.numeric(gsub("%", "", votingdf$dem_percent))
votingdf$rep_votess <- as.numeric(gsub("%", "", votingdf$rep_percent))
votingdf$other_votess <- as.numeric(gsub("%", "", votingdf$other_percent))
percentvotingdf_filtered <- votingdf[complete.cases(votingdf[c("dem_percent", "rep_percent", "other_percent")]), ]
plot_data <- tidyr::gather(percentvotingdf_filtered, key = "Party", value = "Votes", dem_votess, rep_votess, other_votess)
ggplot(plot_data, aes(x = reorder(state, Votes), y = Votes, fill = Party)) +
geom_bar(stat = "identity", position = "stack") +
labs(title = "Votes in 2020 Presidential Election by State",
x = "State, Territory, or Tribal Nation",
y = "Votes by Percent",
fill = "Party") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Warning: Removed 6 rows containing missing values (`position_stack()`).
Displaying the data in its percentage form is an efficient way to compare the three parties between the states, based on the number of people who voted. This, in a way, standardizes the votes regardless of the size of the sample of voters from the population. This has both benefits and downsides – it may be possible that a percentage shown from the sample data does not accurately represent the population of each state, territory or tribal nation. It is also possible that not enough people voted compared to each population. However, assuming that voters reflect, to a degree, the opinions of the general population, then this data is useful in generalizing votes.
Null Hypothesis (H0): The funding allocation is not related to the population size of each state or territory.
Alternative Hypothesis (H1): The funding allocation is related to the population size of each state or territory.
populationdata <- read_excel("POP_DATA 2.xlsx")
populationdata <- populationdata[-c(1:5),]
populationdata$"Geographic Area" <- gsub("\\.", "", populationdata$"Geographic Area")
new_data <- data.frame(Name = c("American Samoa", "Guam", "US Virgin Islands", "Northern Marina Islands", "Tribal Communities"),
`2020` = c(46189, 169231, 106290, 49587, 96000000))
colnames(new_data) <- colnames(populationdata)
populationdata <- rbind(populationdata, new_data)
First, I started by uploading the Census 2020 population data. I also input and corrected any missing values.
merged_data <- merge(df, populationdata)
populationdata$`Geographic Area` <- toupper(populationdata$`Geographic Area`)
merged_data <- merge(df, populationdata, by.x = "State, Teritory or Tribal Nation", by.y = "Geographic Area")
head(merged_data)
## State, Teritory or Tribal Nation Total (Billions) 2020
## 1 ALABAMA 3.0000 5031864
## 2 ALASKA 3.7000 732964
## 3 AMERICAN SAMOA 0.0686 46189
## 4 ARIZONA 3.5000 7186683
## 5 ARKANSAS 2.8000 3014348
## 6 CALIFORNIA 18.4000 39503200
Then I merged the two datasets. They were not exactly identical in terms of columns, and this was causing issues with the merging at first. This dataframe shows that we have successfully merged the datasets, with a column for “State, Territory or Tribal Nation”, “Total (Billions)” for allocated funds, and “2020” for the population data.
correlation_test <- cor.test(merged_data$`Total (Billions)`, merged_data$`2020`, na.rm = TRUE)
print(correlation_test)
##
## Pearson's product-moment correlation
##
## data: merged_data$`Total (Billions)` and merged_data$`2020`
## t = 3.9534, df = 53, p-value = 0.0002297
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2426419 0.6590745
## sample estimates:
## cor
## 0.4772194
The Pearson’s correlation analysis was conducted to examine the relationship between the allocation of funds (“Total (Billions)”) from the Infrastructure Investment and Jobs Act and the population data for the year 2020. As a result, the correlation coefficient, r, was 0.4772, signifying a positive correlation between the two variables. Since it is positive, this suggests as the population of states and territories increases, the allocated funds also tend to increase. There was t-statistic of 3.9534 with 53 degrees of freedom, where we got a p-value of 0.0002297. This is less than 0.05 making it significant, and we can therefore reject the null hypothesis and accept the alternative hypothesis. This further demonstrates that there is a significant correlation between the allocation of funds and the population size.
Null Hypothesis (H0): The funding allocation is not influenced by the political interests of the Biden administration.
Alternative Hypothesis (H1): The funding allocation is influenced by the political interests of the Biden administration.
You can perform a t-test or ANOVA to compare the funding allocation between states that voted for Biden and states that did not. Another approach is to use regression analysis with political affiliation (Biden votes percentage) as an independent variable. Steps:
Obtain the official election results of the 2020 Presidential election. Merge the election results data with the funding allocation data. Perform a t-test, ANOVA, or regression analysis. Analyze the results to determine if there is a significant difference in funding allocation based on political affiliation.
mergedvotingdata <- merge(df, votingdf)
model <- lm(`Total (Billions)` ~ dem_percent, data = mergedvotingdata)
summary(model)
##
## Call:
## lm(formula = `Total (Billions)` ~ dem_percent, data = mergedvotingdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3706 -2.1392 -0.7392 0.4608 14.9608
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.439e+00 3.084e-01 11.15 <2e-16 ***
## dem_percent22.4% -5.156e-15 5.342e-01 0.00 1
## dem_percent26.6% -1.022e-14 5.342e-01 0.00 1
## dem_percent29.7% -3.865e-15 5.342e-01 0.00 1
## dem_percent31.8% -2.713e-15 5.342e-01 0.00 1
## dem_percent32.3% -3.063e-14 5.342e-01 0.00 1
## dem_percent33.1% -3.641e-15 5.342e-01 0.00 1
## dem_percent34.8% -5.054e-15 5.342e-01 0.00 1
## dem_percent35.6% 1.215e-15 5.342e-01 0.00 1
## dem_percent36.2% 2.921e-15 5.342e-01 0.00 1
## dem_percent36.6% -2.326e-14 5.342e-01 0.00 1
## dem_percent37.5% -5.640e-15 5.342e-01 0.00 1
## dem_percent37.6% -3.333e-15 5.342e-01 0.00 1
## dem_percent39.4% -3.169e-15 5.342e-01 0.00 1
## dem_percent39.9% -8.364e-15 5.342e-01 0.00 1
## dem_percent40.5% 1.483e-14 5.342e-01 0.00 1
## dem_percent41.0% -2.535e-15 5.342e-01 0.00 1
## dem_percent41.1% -4.607e-15 5.342e-01 0.00 1
## dem_percent41.3% -5.412e-15 5.342e-01 0.00 1
## dem_percent41.4% -4.803e-15 5.342e-01 0.00 1
## dem_percent41.6% -3.204e-15 5.342e-01 0.00 1
## dem_percent42.8% 1.590e-14 5.342e-01 0.00 1
## dem_percent43.4% -6.185e-15 5.342e-01 0.00 1
## dem_percent44.8% -3.494e-15 5.342e-01 0.00 1
## dem_percent44.9% -8.083e-16 5.342e-01 0.00 1
## dem_percent45.2% -3.134e-15 5.342e-01 0.00 1
## dem_percent46.5% -1.340e-15 5.342e-01 0.00 1
## dem_percent47.9% -4.108e-15 5.342e-01 0.00 1
## dem_percent48.4% -3.427e-15 5.342e-01 0.00 1
## dem_percent48.6% 1.814e-15 5.342e-01 0.00 1
## dem_percent49.4% -3.885e-15 4.362e-01 0.00 1
## dem_percent49.5% -1.796e-15 5.342e-01 0.00 1
## dem_percent50.0% 7.819e-15 5.342e-01 0.00 1
## dem_percent50.1% -3.912e-15 5.342e-01 0.00 1
## dem_percent50.6% -1.280e-14 5.342e-01 0.00 1
## dem_percent51.3% -2.110e-14 5.342e-01 0.00 1
## dem_percent52.2% -8.114e-15 5.342e-01 0.00 1
## dem_percent52.4% -4.474e-15 5.342e-01 0.00 1
## dem_percent52.7% -1.059e-14 5.342e-01 0.00 1
## dem_percent53.1% -3.971e-15 5.342e-01 0.00 1
## dem_percent53.4% -8.530e-15 5.342e-01 0.00 1
## dem_percent54.1% 1.121e-15 5.342e-01 0.00 1
## dem_percent54.3% -7.490e-15 5.342e-01 0.00 1
## dem_percent55.4% -4.485e-15 5.342e-01 0.00 1
## dem_percent56.5% -8.810e-15 5.342e-01 0.00 1
## dem_percent57.3% -4.650e-15 5.342e-01 0.00 1
## dem_percent57.5% -5.157e-15 5.342e-01 0.00 1
## dem_percent58.0% 1.185e-15 5.342e-01 0.00 1
## dem_percent58.7% -5.400e-15 5.342e-01 0.00 1
## dem_percent59.3% -9.610e-15 5.342e-01 0.00 1
## dem_percent59.4% -5.319e-15 5.342e-01 0.00 1
## dem_percent60.1% -6.037e-15 5.342e-01 0.00 1
## dem_percent60.9% -4.426e-15 5.342e-01 0.00 1
## dem_percent63.5% -6.188e-15 5.342e-01 0.00 1
## dem_percent63.7% -6.685e-15 5.342e-01 0.00 1
## dem_percent65.4% -5.462e-15 5.342e-01 0.00 1
## dem_percent65.6% -4.491e-15 5.342e-01 0.00 1
## dem_percent66.1% -5.175e-15 5.342e-01 0.00 1
## dem_percent92.1% -5.160e-15 5.342e-01 0.00 1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.293 on 3418 degrees of freedom
## Multiple R-squared: 8.046e-30, Adjusted R-squared: -0.01697
## F-statistic: 4.742e-28 on 58 and 3418 DF, p-value: 1
A linear regression analysis was used to determine whether or not the
allocated funds favored the Biden administration. The intercept is about
3.439e+00 with a standard error of 5.929e-02 and a t-value of 58.01. The
p-value suggests statistical significance (p-value < 2e-16). However,
the coefficient for dem_percent is estimated at 8.063e-23, with a
standard error of 4.471e-09 and a t-value close to zero, and a p-value
of 1. This shows that the variable dem_percent, representing the
percentage of Democratic votes, does not contribute significantly to
explaining the variance in the Total (Billions)
allocated.
In conclusion, the analysis indicates a significant positive
correlation between the allocation of Infrastructure Investment and Jobs
Act funds and the population size of states and territories. This
suggests that larger populations tend to receive more funding. However,
the analysis does not provide sufficient evidence to support the idea
that funding allocation is influenced by the political interests of the
Biden administration. The linear regression analysis indicates that the
percentage of Democratic votes (dem_percent
) is not a
significant predictor of the total allocated funds. Therefore, while
population size plays a role in funding distribution, political
affiliation does not appear to be a determining factor in the observed
allocations.