Problem Statement:
The Provided Excel file contains data on the present allocation of the Infrastructure Investment and Jobs Act funding by State and Territory. Your story (Data Visualization(s) ) should address the following questions:
Is the allocation equitable based on the population of each of the States and Territories, or is bias apparent?
Does the allocation favor the political interests of the Biden administration?
Loading the data:
The dataset provided was in .xlsx format. In order to
make the work reproducible and easy I have changed the format to
.csv and uploaded it to my github profile. From where the
dataset has been loaded into the markdown using the code below:
github_link <- "https://raw.githubusercontent.com/Umerfarooq122/Data_sets/main/Untitled%20spreadsheet%20-%20Sheet1.csv"
#library(httr)
df <- read.csv(github_link)
Let’s display the head of the data to see if everything is loaded the way it was suppose to be:
knitr::kable(head(df))
| State..Teritory.or.Tribal.Nation | Total..Billions. |
|---|---|
| ALABAMA | 3.0000 |
| ALASKA | 3.7000 |
| AMERICAN SAMOA | 0.0686 |
| ARIZONA | 3.5000 |
| ARKANSAS | 2.8000 |
| CALIFORNIA | 18.4000 |
As we can see that the data loaded properly into the markdown but we still have to fix the column names as they did not come out the way they were in the original file so let’s fix that:
colnames(df) <- c("State_Territory", "Total_billions")
knitr::kable(head(df))
| State_Territory | Total_billions |
|---|---|
| ALABAMA | 3.0000 |
| ALASKA | 3.7000 |
| AMERICAN SAMOA | 0.0686 |
| ARIZONA | 3.5000 |
| ARKANSAS | 2.8000 |
| CALIFORNIA | 18.4000 |
Now let’s check out if there is any missing values in the given dataset:
sum(is.na(df))
## [1] 0
It seems like there are no missing values but we have a problem and that is that we have source data for population from outside in order to answer the first question. Similarly, for second question we have to source the data about the results of election.
Loading and wrangling Population data:
In order to acquire the population data I will use Wikipedia as a source and will get the latest 2022 census data from Wikipedia.
# Reading in the table from Wikipedia
page = read_html("https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population")
# Obtain the piece of the web page that corresponds to the "wikitable" node
my.table = html_node(page, ".wikitable")
# Convert the html table element into a data frame
my.table = html_table(my.table, fill = TRUE)
# Extracting and tidying a single column from the table and adding row names
New_data <- data.frame(my.table)
final <- data.frame(toupper(New_data$State.or.territory),New_data$Census.population.8..a., New_data$House.of.Reps.from.the.2022.elections)
Now we have got the data for population along side house of representative for each state but some of the values are not in correct form plus the column names needs to be consistent if we are performing a join with our original data so let’s fix the values and column names
final[final=="GUAM[10]"] <- "GUAM"
final[final == "U.S. VIRGIN ISLANDS[11]"] <- "US VIRGIN ISLANDS"
final[final == "AMERICAN SAMOA[12]"] <- "AMERICAN SAMOA"
final[final == "NORTHERN MARIANA ISLANDS[13]"] <- "NORTHERN MARIANA ISLANDS"
colnames(final) <- c("State_Territory", "Population", "Seats")
Since the column names and values are fixed now we can perfom a left join to our original data:
new <- left_join(df, final, by = "State_Territory")
Some rows has character
"N/A" instead of null values so let's fix that too:
new[new=="N/A"] <- NA
Now our dataset contain information about population but before going ahead let’s check out the data type of each column.
str(new)
## 'data.frame': 57 obs. of 4 variables:
## $ State_Territory: chr "ALABAMA" "ALASKA" "AMERICAN SAMOA" "ARIZONA" ...
## $ Total_billions : num 3 3.7 0.0686 3.5 2.8 18.4 3.2 2.5 0.792 1.1 ...
## $ Population : chr "5,074,296" "733,583" NA "7,359,197" ...
## $ Seats : chr "7" "1" "1*" "9" ...
As we can see that Population and Seats columns does not have the numeric type so let’s fix that too:
new$Population <- as.numeric(gsub(",","",new$Population))
new$Seats <- as.numeric(gsub('[^[:alnum:] ]', '', new$Seats))
Now our dataset is ready and let’s just reorder the data base on descending allocations:
new <- new %>% arrange(desc(Total_billions))
Loading the wrangling elections results data:
For the election results data I have used www.Kaggle.com
as a source. I have downloaded the .csv from Kaggle and
uploaded into my github from where I have loaded the data using the code
below:
winner <- read.csv("https://raw.githubusercontent.com/Umerfarooq122/Data_sets/main/Untitled%20spreadsheet%20-%20voting.csv.csv")
This file contain a lot of columns we don’t need so I will use the one that only contains the name of State or Territory and election results:
bwin <- data.frame(toupper(winner$state),winner$biden_win)
Let’s fix the column names before joining it our previous dataset:
colnames(bwin) <- c("State_Territory", "Biden_win")
We can perform join now:
new_biden <- left_join(new, bwin, by = "State_Territory")
Let;s make sure that our new column has the right type of data:
new_biden$Biden_win <- as.factor(new_biden$Biden_win)
knitr::kable(head(new_biden))
| State_Territory | Total_billions | Population | Seats | Biden_win |
|---|---|---|---|---|
| CALIFORNIA | 18.4 | 39029342 | 52 | 1 |
| TEXAS | 14.2 | 30029572 | 38 | 0 |
| NEW YORK | 10.1 | 19677151 | 26 | 1 |
| ILLINOIS | 8.4 | 12582032 | 17 | 1 |
| FLORIDA | 8.2 | 22244823 | 28 | 0 |
| PENNSYLVANIA | 8.1 | 12972008 | 17 | 1 |
str(new_biden)
## 'data.frame': 57 obs. of 5 variables:
## $ State_Territory: chr "CALIFORNIA" "TEXAS" "NEW YORK" "ILLINOIS" ...
## $ Total_billions : num 18.4 14.2 10.1 8.4 8.2 8.1 6.6 5.2 5.1 5 ...
## $ Population : num 39029342 30029572 19677151 12582032 22244823 ...
## $ Seats : num 52 38 26 17 28 17 15 13 12 14 ...
## $ Biden_win : Factor w/ 2 levels "0","1": 2 1 2 2 1 2 1 2 2 2 ...
Loading and Wrangling Revenue Data:
To get a different perspective let’s add the revenue generated by
each state for year 2022 and see if the funds were allocated based on
the revenue generated by each state. I have downloaded the data for
revenue generated by each state using www.statista.com and
has been loaded into the markdown using the code chunk below:
reven <- read.csv('https://raw.githubusercontent.com/Umerfarooq122/Data_sets/main/Revenue%20Data.csv')
head(reven)
## X
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
## X.1
## 1
## 2 U.S. state government tax revenue FY 2022, by state
## 3 State government tax revenue in the United States in the fiscal year of 2022, by state (in billion U.S. dollars)
## 4
## 5 California
## 6 New York
## X.2
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 280.83
## 6 117.98
Let’s use the columns that we need:
reven <- data.frame(toupper(reven$X.1), reven$X.2)
Removing un-necessary row from the dataset:
reven <- reven[-(1:4),]
Fixing column names for left Join
colnames(reven) <- c("State_Territory", "Revenue_billions")
Performing left Join and Creating a final data set:
final_data <- left_join(new_biden, reven, by = "State_Territory")
Changing population row into millions:
final_data$Population <- as.numeric((final_data$Population)/1000000)
Displaying the first few row of our final data:
knitr::kable(head(final_data))
| State_Territory | Total_billions | Population | Seats | Biden_win | Revenue_billions |
|---|---|---|---|---|---|
| CALIFORNIA | 18.4 | 39.02934 | 52 | 1 | 280.83 |
| TEXAS | 14.2 | 30.02957 | 38 | 0 | 82.26 |
| NEW YORK | 10.1 | 19.67715 | 26 | 1 | 117.98 |
| ILLINOIS | 8.4 | 12.58203 | 17 | 1 | 62.57 |
| FLORIDA | 8.2 | 22.24482 | 28 | 0 | 59.24 |
| PENNSYLVANIA | 8.1 | 12.97201 | 17 | 1 | 53.68 |
Data Analysis:
In order to analyze the data and try to answer the questions below:
Is the allocation equitable based on the population of each of the States and Territories, or is bias apparent?
Does the allocation favor the political interests of the Biden administration?
We will plot the allocation for each state or territory and compare it with the population of the respective state. We will order the data represented in the graph by descending the amount of billions allocated. To answer the second we will use the election results for each territory or state. Since we have the results available in the form of win or lose i.e. 1, 0. Which is a factorized data so we can use that as a fill for our bar chart to show which territories and states were won by president Biden.
# Graph for total allocation
p1 <- ggplot(data = final_data, aes(x = reorder(State_Territory,Total_billions),y = Total_billions))+
geom_bar(stat = "identity")+labs(x ="", y = "Total (Billions)")+coord_flip()+theme_bw()+theme(axis.text.y = element_text(face="bold",size=10))
# Graph for population
p2 <- ggplot(data = final_data, aes(x = reorder(State_Territory, Total_billions) , y = Population, fill = Biden_win))+
geom_bar(stat = "identity")+labs(x ="", y = "Population (millions)")+coord_flip()+ scale_fill_discrete(name = "Biden Win")+theme_bw()+theme(axis.text.y = element_blank())
p1 + p2
The above horizontal bar chart is ordered in the descending order of funds allocated and by comparing both of the graphs above we can clearly see that Florida has way more population than Illinois and New York but still ended up below these two states. Similarly North Carolina has more population than Michigan, Virginia and New Jersey but again when it comes to fund allocation it is below these states in the list which advocates for un-equatable allocation of funds when it comes to population. Another similarity in the cases of above two states i.e. Florida and North Carolina is that both of these states were lost by Biden in the previous elections so the fund allocations could very well be a political move by Biden administration rather than looking at the population of these states. Having said that we still cannot conclude that those allocation had a political motive since there in another aspect to look at.
Most of the times governments allocate funds in anticipation of generating revenues so it would be unfair to conclude without looking at the revenue generated by each state. Another reason why I am referring back to more data is that even though Florida (not won by Biden) did not get enough fund allocations even though it has more population that New York and Illinois (won by Biden) but the same graph also shows that states like Massachusetts and Arizona (won by Biden) with more population are getting less funds than states like Kentucky, Louisiana, Alaska, Missouri and Tennessee (not won by Biden). Which hints about the fund allocation not being political.
On would argue that maybe those states have more electoral seats and that is why Biden administration is pouring more money into. In order to confirm that lets plot the seats too.
p3 <- ggplot(data = final_data, aes(x = reorder(State_Territory, Total_billions) , y = Seats, fill = Biden_win))+
geom_bar(stat = "identity")+ theme_bw()+scale_fill_discrete(name = "Biden Win")+theme(axis.text.y = element_blank())+labs(x ="", y = "Seats")+coord_flip()
p1+p3
We can clearly see that the number of seats does not affect the over all fund allocation or in other words the above graph is not conclusive enough to say that Biden administration allocated more funds to the states that has more Seats.
Let’s Use the revenue generated by each state and see if we find any clues:
p4 <- ggplot(data = final_data, aes(x = reorder(State_Territory, Total_billions), y = Revenue_billions, fill = Biden_win ))+
geom_bar(stat = "identity")+ scale_fill_discrete(name = "Biden Win")+labs(x ="", y = "Revenue (Billions)")+coord_flip()+theme_bw()+theme(axis.text.y = element_blank())
p1+ p2+theme(legend.position = 'none') +p4
The above graph advocates for the fact that even though Florida has way more population than Illinois and New York but it could not generate revenue as much as New York and Illinois did previous year and that could be a factor why Florida got less fund allocation. Similarly New Jersey has lower population than North Carolina but it generated more revenue. Another Noticeable point over here is that even though Texas could not generate as much revenue as New York but it got more fund than New York even though previous results shows that Biden Lost in Texas which again strengthens the view point of fund allocation not being political.
Conclusion:
The graphs have clearly shown that the fund allocation was not equatable to the population of the state or Territory. Florida and North Carolina have been the highlight of that while there was no conclusive answer to whether the fund allocated was purely based on political motive of Biden administration because even though both Florida and North Carolina were lost by Biden in previous elections irrespective of having more population they got less fund compared to New York and New Jersey respectively. But after looking at the revenue generated by New York and New Jersey compared to Florida and North Carolina respectively, one would argue that more funds were allocated based on revenue generated by these states.