DATA 608 Assignment 1

Overview

The attached Excel file contains data on the present allocation of the Infrastructure Investment and Jobs Act funding by State and Territory.

Your story (Data Visualization(s) ) should address the following questions:

Is the allocation equitable based on the population of each of the States and Territories, or is bias apparent?

Does the allocation favor the political interests of the Biden administration?

Notes:

You will need to source data on the current (estimated) population of each of the States and Territories (accuracy is more important than precision) and on the official election results of the 2020 Presidential election.
You may choose to develop you visualizations using a desktop application or a code library. Your submittal should be in the form of a report (document) or a presentation.
This assignment is due by the end of week two of the semester.

Data Sources

Current (estimated) population of each of the States and Territories:

This data is from the US Census https://www.census.gov/data/tables/time-series/demo/popest/2020s-state-total.html

Official election results of the 2020 Presidential election:

This data comes from the Associated Press https://www.kaggle.com/datasets/callummacpherson14/2020-us-presidential-election-results-by-state?resource=download

Load Libraries

library(statebins)
library(hrbrthemes)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(httr)
library(dplyr)
library(stringr)
library(rvest)

## 
## Attaching package: 'rvest'
## 
## The following object is masked from 'package:readr':
## 
##     guess_encoding

library(janitor)

## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(stringr)
library(tidytext)
library(tibble)
library(textdata)

## 
## Attaching package: 'textdata'
## 
## The following object is masked from 'package:httr':
## 
##     cache_info

library(tidyr)
library(readr)
library(purrr)
library(forcats)
library(ggplot2)
library(colorspace)

Read in Data

# Read in funding data from 2023
fund_data <- read_csv('https://raw.githubusercontent.com/gillianmcgovern0/cuny-data-608/refs/heads/main/IIJA%20FUNDING%20AS%20OF%20MARCH%202023.csv', show_col_types = FALSE)
head(fund_data)

## # A tibble: 6 × 2
##   `State, Teritory or Tribal Nation` `Total (Billions)`
##   <chr>                                           <dbl>
## 1 ALABAMA                                        3     
## 2 ALASKA                                         3.7   
## 3 AMERICAN SAMOA                                 0.0686
## 4 ARIZONA                                        3.5   
## 5 ARKANSAS                                       2.8   
## 6 CALIFORNIA                                    18.4

# Read in population data from 2023
pop_data <- read_csv('https://raw.githubusercontent.com/gillianmcgovern0/cuny-data-608/refs/heads/main/NST-EST2024-POP.csv', show_col_types = FALSE)

## New names:
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
## • `` -> `...9`
## • `` -> `...10`
## • `` -> `...11`
## • `` -> `...12`
## • `` -> `...13`
## • `` -> `...14`
## • `` -> `...15`
## • `` -> `...16`
## • `` -> `...17`
## • `` -> `...18`
## • `` -> `...19`
## • `` -> `...20`
## • `` -> `...21`
## • `` -> `...22`
## • `` -> `...23`
## • `` -> `...24`
## • `` -> `...25`
## • `` -> `...26`
## • `` -> `...27`
## • `` -> `...28`
## • `` -> `...29`
## • `` -> `...30`
## • `` -> `...31`
## • `` -> `...32`
## • `` -> `...33`
## • `` -> `...34`
## • `` -> `...35`
## • `` -> `...36`

head(pop_data)

## # A tibble: 6 × 36
##   table with row heade…¹ ...2  ...3     ...4    ...5    ...6    ...7 ...8  ...9 
##   <chr>                  <chr> <chr>   <dbl>   <dbl>   <dbl>   <dbl> <lgl> <lgl>
## 1 Annual Estimates of t… <NA>  <NA>  NA      NA      NA      NA      NA    NA   
## 2 Geographic Area        Apri… Popu… NA      NA      NA      NA      NA    NA   
## 3 <NA>                   <NA>  2020   2.02e3  2.02e3  2.02e3  2.02e3 NA    NA   
## 4 United States          331,… 331,…  3.32e8  3.34e8  3.37e8  3.40e8 NA    NA   
## 5 Northeast              57,6… 57,4…  5.73e7  5.72e7  5.74e7  5.78e7 NA    NA   
## 6 Midwest                68,9… 68,9…  6.89e7  6.89e7  6.92e7  6.96e7 NA    NA   
## # ℹ abbreviated name:
## #   ¹`table with row headers in column A and column headers in rows 3 through 4. (leading dots indicate sub-parts)`
## # ℹ 27 more variables: ...10 <lgl>, ...11 <lgl>, ...12 <lgl>, ...13 <lgl>,
## #   ...14 <lgl>, ...15 <lgl>, ...16 <lgl>, ...17 <lgl>, ...18 <lgl>,
## #   ...19 <lgl>, ...20 <lgl>, ...21 <lgl>, ...22 <lgl>, ...23 <lgl>,
## #   ...24 <lgl>, ...25 <lgl>, ...26 <lgl>, ...27 <lgl>, ...28 <lgl>,
## #   ...29 <lgl>, ...30 <lgl>, ...31 <lgl>, ...32 <lgl>, ...33 <lgl>, …

# Read in 2020 presidential election results data
election_data <- read_csv('https://raw.githubusercontent.com/gillianmcgovern0/cuny-data-608/refs/heads/main/voting.csv', show_col_types = FALSE)
head(election_data)

## # A tibble: 6 × 8
##   state  state_abr trump_pct biden_pct trump_vote biden_vote trump_win biden_win
##   <chr>  <chr>         <dbl>     <dbl>      <dbl>      <dbl>     <dbl>     <dbl>
## 1 Alaska AK             53.1      43       189543     153502         1         0
## 2 Hawaii HI             34.3      63.7     196864     366130         0         1
## 3 Washi… WA             39        58.4    1584651    2369612         0         1
## 4 Oregon OR             40.7      56.9     958448    1340383         0         1
## 5 Calif… CA             34.3      63.5    5982194   11082293         0         1
## 6 Idaho  ID             63.9      33.1     554128     287031         1         0

Clean and Prepare Data

Cleaning up the data:

# Cleaning

# Rename columns
colnames(fund_data)[colnames(fund_data) == 'State, Teritory or Tribal Nation'] <- 'state_territory_tribal_nation'
colnames(fund_data)[colnames(fund_data) == 'Total (Billions)'] <- 'total_funding_billions'

# Make states lowercase
fund_data$state_territory_tribal_nation <- tolower(str_to_title(fund_data$state_territory_tribal_nation))
head(fund_data, 20)

## # A tibble: 20 × 2
##    state_territory_tribal_nation total_funding_billions
##    <chr>                                          <dbl>
##  1 alabama                                       3     
##  2 alaska                                        3.7   
##  3 american samoa                                0.0686
##  4 arizona                                       3.5   
##  5 arkansas                                      2.8   
##  6 california                                   18.4   
##  7 colorado                                      3.2   
##  8 connecticut                                   2.5   
##  9 deleware                                      0.792 
## 10 district of columbia                          1.1   
## 11 florida                                       8.2   
## 12 georgia                                       5     
## 13 guam                                          0.143 
## 14 hawaii                                        1     
## 15 idaho                                         1.2   
## 16 illinois                                      8.4   
## 17 indiana                                       3.4   
## 18 iowa                                          2.4   
## 19 kansas                                        1.5   
## 20 kentucky                                      3.9

# Get rid of useless rows and columns
colnames(pop_data) <- pop_data[3, ]
names(pop_data)[1:2] <- c("state_territory_tribal_nation", "baseline")
new_pop_data <- pop_data %>%
  select("state_territory_tribal_nation", "2023")
new_pop_data <- new_pop_data[-(1:8), ]
new_pop_data <- new_pop_data[-(52:59), ]
# Get rid of characters in file
new_pop_data$state_territory_tribal_nation <- gsub("\\.", "", new_pop_data$state_territory_tribal_nation)
# Rename columns
colnames(new_pop_data) <- c("state_territory_tribal_nation", "total_population_2023")
# Make states lowercase (to merge later)
new_pop_data$state_territory_tribal_nation <- tolower(new_pop_data$state_territory_tribal_nation)
head(new_pop_data)

## # A tibble: 6 × 2
##   state_territory_tribal_nation total_population_2023
##   <chr>                                         <dbl>
## 1 alabama                                     5117673
## 2 alaska                                       736510
## 3 arizona                                     7473027
## 4 arkansas                                    3069463
## 5 california                                 39198693
## 6 colorado                                    5901339

Merge the data frames:

# Merge the data on state column
# Fix Delaware spelling in funding data
fund_data$state_territory_tribal_nation[fund_data$state_territory_tribal_nation == "deleware"] <- "delaware"

# Use states/territories from the US Census dataset (trusted source)
df_merged <- merge(new_pop_data, fund_data, by = "state_territory_tribal_nation", all.x = TRUE) 
df_merged

##    state_territory_tribal_nation total_population_2023 total_funding_billions
## 1                        alabama               5117673                 3.0000
## 2                         alaska                736510                 3.7000
## 3                        arizona               7473027                 3.5000
## 4                       arkansas               3069463                 2.8000
## 5                     california              39198693                18.4000
## 6                       colorado               5901339                 3.2000
## 7                    connecticut               3643023                 2.5000
## 8                       delaware               1036423                 0.7920
## 9           district of columbia                687324                 1.1000
## 10                       florida              22904868                 8.2000
## 11                       georgia              11064432                 5.0000
## 12                        hawaii               1441387                 1.0000
## 13                         idaho               1971122                 1.2000
## 14                      illinois              12642259                 8.4000
## 15                       indiana               6880131                 3.4000
## 16                          iowa               3218414                 2.4000
## 17                        kansas               2951500                 1.5000
## 18                      kentucky               4550595                 3.9000
## 19                     louisiana               4588071                 4.3000
## 20                         maine               1399646                 1.1000
## 21                      maryland               6217062                 2.7000
## 22                 massachusetts               7066568                 3.6000
## 23                      michigan              10083356                 5.2000
## 24                     minnesota               5753048                 2.7000
## 25                   mississippi               2943172                 2.3000
## 26                      missouri               6208038                 3.8000
## 27                       montana               1131302                 3.3000
## 28                      nebraska               1987864                 1.3000
## 29                        nevada               3214363                 1.7000
## 30                 new hampshire               1402199                 0.7518
## 31                    new jersey               9379642                 5.1000
## 32                    new mexico               2121164                 2.6000
## 33                      new york              19737367                10.1000
## 34                north carolina              10881189                 4.5000
## 35                  north dakota                789047                 1.8000
## 36                          ohio              11824034                 6.6000
## 37                      oklahoma               4063882                 2.9000
## 38                        oregon               4253653                 2.3000
## 39                  pennsylvania              13017721                 8.1000
## 40                  rhode island               1103429                 1.1000
## 41                south carolina               5387830                 2.3000
## 42                  south dakota                918305                 1.3000
## 43                     tennessee               7148304                 3.7000
## 44                         texas              30727890                14.2000
## 45                          utah               3443222                 1.8000
## 46                       vermont                648708                 0.8521
## 47                      virginia               8734685                 4.5000
## 48                    washington               7857320                 4.0000
## 49                 west virginia               1770495                 2.0000
## 50                     wisconsin               5930405                 2.8000
## 51                       wyoming                585067                 2.3000

We now have the data for the first question in a tidy format where each row represent a unique state/territory. Population size plays a big role in the amount of funds though, so it should be taken into consideration. Therefore let’s normalize the data and create a per capita feature.

# Normalize the data

# Add a per capita column - total funds / total population
df_merged <- df_merged |>
  mutate(
    funding_per_capita = (total_funding_billions * 1000000000) / total_population_2023,
  )
head(df_merged)

##   state_territory_tribal_nation total_population_2023 total_funding_billions
## 1                       alabama               5117673                    3.0
## 2                        alaska                736510                    3.7
## 3                       arizona               7473027                    3.5
## 4                      arkansas               3069463                    2.8
## 5                    california              39198693                   18.4
## 6                      colorado               5901339                    3.2
##   funding_per_capita
## 1           586.2039
## 2          5023.6928
## 3           468.3510
## 4           912.2117
## 5           469.4034
## 6           542.2498

We now have data that considers the funds given the population. Yet, for the story of this data, seeing the exact funds numbers won’t be needed. We just need a general idea of the allocation of funds. So let’s set up some bins to simplify the visualization:

# Create bins for simplification

df_merged_bins <- mutate(
  df_merged,
  funding_per_capita_bins = cut(
      funding_per_capita,
      breaks = c(0, 400, 600, 800, 1000, 6000),
      labels = c("< $400", "$400 to $600", "$600 to $800", "$800 to $1000", "> $1000"),
      right = FALSE
    )
  )
head(df_merged_bins)

##   state_territory_tribal_nation total_population_2023 total_funding_billions
## 1                       alabama               5117673                    3.0
## 2                        alaska                736510                    3.7
## 3                       arizona               7473027                    3.5
## 4                      arkansas               3069463                    2.8
## 5                    california              39198693                   18.4
## 6                      colorado               5901339                    3.2
##   funding_per_capita funding_per_capita_bins
## 1           586.2039            $400 to $600
## 2          5023.6928                 > $1000
## 3           468.3510            $400 to $600
## 4           912.2117           $800 to $1000
## 5           469.4034            $400 to $600
## 6           542.2498            $400 to $600

Now let’s look at the election data for the second part:

# Clean data

# Keep states/territory from the Associated Press
colnames(election_data)[colnames(election_data) == 'state'] <- 'state_territory_tribal_nation'
election_data$state_territory_tribal_nation <- tolower(election_data$state_territory_tribal_nation)
election_data_merged <- merge(election_data, df_merged_bins, by = "state_territory_tribal_nation", all.x = TRUE) 
election_data_merged

##    state_territory_tribal_nation state_abr trump_pct biden_pct trump_vote
## 1                        alabama        AL      62.2      36.7    1441170
## 2                         alaska        AK      53.1      43.0     189543
## 3                        arizona        AZ      49.1      49.4    1661686
## 4                       arkansas        AR      62.4      34.8     760647
## 5                     california        CA      34.3      63.5    5982194
## 6                       colorado        CO      41.9      55.4    1364471
## 7                    connecticut        CT      39.2      59.3     715291
## 8                       delaware        DE      39.8      58.8     200603
## 9           district of columbia        DC       5.4      93.0      18586
## 10                       florida        FL      51.2      47.9    5668731
## 11                       georgia        GA      49.3      49.5    2461837
## 12                        hawaii        HI      34.3      63.7     196864
## 13                         idaho        ID      63.9      33.1     554128
## 14                      illinois        IL      40.5      57.6    2438943
## 15                       indiana        IN      57.1      41.0    1729516
## 16                          iowa        IA      53.2      45.0     897672
## 17                        kansas        KS      56.5      41.3     752933
## 18                      kentucky        KY      62.1      36.2    1326646
## 19                     louisiana        LA      58.5      39.9    1255776
## 20                         maine        ME      44.0      53.1     360480
## 21                      maryland        MD      32.4      65.8     976414
## 22                 massachusetts        MA      32.3      65.9    1167202
## 23                      michigan        MI      47.8      50.6    2649842
## 24                     minnesota        MN      45.4      52.6    1484056
## 25                   mississippi        MS      57.6      41.1     756789
## 26                      missouri        MO      56.8      41.4    1718282
## 27                       montana        MT      56.9      40.6     343647
## 28                      nebraska        NE      58.5      39.4     556846
## 29                        nevada        NV      47.7      50.1     669890
## 30                 new hampshire        NH      45.5      52.9     365660
## 31                    new jersey        NJ      41.4      57.3    1883260
## 32                    new mexico        NM      43.5      54.3     401894
## 33                      new york        NY      41.5      57.1    3005636
## 34                north carolina        NC      50.1      48.7    2758775
## 35                  north dakota        ND      65.5      31.9     235595
## 36                          ohio        OH      53.3      45.3    3154834
## 37                      oklahoma        OK      65.4      32.3    1020280
## 38                        oregon        OR      40.7      56.9     958448
## 39                  pennsylvania        PA      48.8      50.0    3378263
## 40                  rhode island        RI      38.8      59.7     199922
## 41                south carolina        SC      55.1      43.4    1385103
## 42                  south dakota        SD      61.8      35.6     261043
## 43                     tennessee        TN      60.7      37.4    1849556
## 44                         texas        TX      52.1      46.5    5890347
## 45                          utah        UT      58.2      37.7     865140
## 46                       vermont        VT      30.8      66.4     112704
## 47                      virginia        VA      44.2      54.4    1962430
## 48                    washington        WA      39.0      58.4    1584651
## 49                 west virginia        WV      68.6      29.7     545382
## 50                     wisconsin        WI      48.9      49.6    1610065
## 51                       wyoming        WY      70.4      26.7     193559
##    biden_vote trump_win biden_win total_population_2023 total_funding_billions
## 1      849624         1         0               5117673                 3.0000
## 2      153502         1         0                736510                 3.7000
## 3     1672143         0         1               7473027                 3.5000
## 4      423932         1         0               3069463                 2.8000
## 5    11082293         0         1              39198693                18.4000
## 6     1804196         0         1               5901339                 3.2000
## 7     1080680         0         1               3643023                 2.5000
## 8      296268         0         1               1036423                 0.7920
## 9      317323         0         1                687324                 1.1000
## 10    5297045         1         0              22904868                 8.2000
## 11    2474507         0         1              11064432                 5.0000
## 12     366130         0         1               1441387                 1.0000
## 13     287031         1         0               1971122                 1.2000
## 14    3463260         0         1              12642259                 8.4000
## 15    1242413         1         0               6880131                 3.4000
## 16     759061         1         0               3218414                 2.4000
## 17     551144         1         0               2951500                 1.5000
## 18     772474         1         0               4550595                 3.9000
## 19     856034         1         0               4588071                 4.3000
## 20     434966         0         1               1399646                 1.1000
## 21    1985023         0         1               6217062                 2.7000
## 22    2382202         0         1               7066568                 3.6000
## 23    2804040         0         1              10083356                 5.2000
## 24    1717077         0         1               5753048                 2.7000
## 25     539508         1         0               2943172                 2.3000
## 26    1252902         1         0               6208038                 3.8000
## 27     244836         1         0               1131302                 3.3000
## 28     374583         1         0               1987864                 1.3000
## 29     703486         0         1               3214363                 1.7000
## 30     424935         0         1               1402199                 0.7518
## 31    2608327         0         1               9379642                 5.1000
## 32     501614         0         1               2121164                 2.6000
## 33    4137641         0         1              19737367                10.1000
## 34    2684292         1         0              10881189                 4.5000
## 35     114902         1         0                789047                 1.8000
## 36    2679165         1         0              11824034                 6.6000
## 37     503890         1         0               4063882                 2.9000
## 38    1340383         0         1               4253653                 2.3000
## 39    3459923         0         1              13017721                 8.1000
## 40     307486         0         1               1103429                 1.1000
## 41    1091541         1         0               5387830                 2.3000
## 42     150471         1         0                918305                 1.3000
## 43    1139332         1         0               7148304                 3.7000
## 44    5259126         1         0              30727890                14.2000
## 45     560282         1         0               3443222                 1.8000
## 46     242820         0         1                648708                 0.8521
## 47    2413568         0         1               8734685                 4.5000
## 48    2369612         0         1               7857320                 4.0000
## 49     235984         1         0               1770495                 2.0000
## 50    1630673         0         1               5930405                 2.8000
## 51      73491         1         0                585067                 2.3000
##    funding_per_capita funding_per_capita_bins
## 1            586.2039            $400 to $600
## 2           5023.6928                 > $1000
## 3            468.3510            $400 to $600
## 4            912.2117           $800 to $1000
## 5            469.4034            $400 to $600
## 6            542.2498            $400 to $600
## 7            686.2433            $600 to $800
## 8            764.1668            $600 to $800
## 9           1600.4097                 > $1000
## 10           358.0025                  < $400
## 11           451.8985            $400 to $600
## 12           693.7762            $600 to $800
## 13           608.7903            $600 to $800
## 14           664.4382            $600 to $800
## 15           494.1766            $400 to $600
## 16           745.7089            $600 to $800
## 17           508.2162            $400 to $600
## 18           857.0308           $800 to $1000
## 19           937.2130           $800 to $1000
## 20           785.9130            $600 to $800
## 21           434.2887            $400 to $600
## 22           509.4411            $400 to $600
## 23           515.7013            $400 to $600
## 24           469.3164            $400 to $600
## 25           781.4698            $600 to $800
## 26           612.1097            $600 to $800
## 27          2916.9930                 > $1000
## 28           653.9683            $600 to $800
## 29           528.8762            $400 to $600
## 30           536.1578            $400 to $600
## 31           543.7308            $400 to $600
## 32          1225.7421                 > $1000
## 33           511.7197            $400 to $600
## 34           413.5577            $400 to $600
## 35          2281.2329                 > $1000
## 36           558.1851            $400 to $600
## 37           713.6034            $600 to $800
## 38           540.7117            $400 to $600
## 39           622.2287            $600 to $800
## 40           996.8924           $800 to $1000
## 41           426.8880            $400 to $600
## 42          1415.6517                 > $1000
## 43           517.6053            $400 to $600
## 44           462.1209            $400 to $600
## 45           522.7662            $400 to $600
## 46          1313.5340                 > $1000
## 47           515.1874            $400 to $600
## 48           509.0794            $400 to $600
## 49          1129.6276                 > $1000
## 50           472.1431            $400 to $600
## 51          3931.1737                 > $1000

# Merge in the data
election_data_merged <- election_data_merged |>
  select(state_territory_tribal_nation, biden_win, total_funding_billions, funding_per_capita)

election_data_merged <- election_data_merged |>
  mutate(
    results = ifelse(biden_win == 1, "Biden", "Trump")
  )
# Final dataset
head(election_data_merged)

##   state_territory_tribal_nation biden_win total_funding_billions
## 1                       alabama         0                    3.0
## 2                        alaska         0                    3.7
## 3                       arizona         1                    3.5
## 4                      arkansas         0                    2.8
## 5                    california         1                   18.4
## 6                      colorado         1                    3.2
##   funding_per_capita results
## 1           586.2039   Trump
## 2          5023.6928   Trump
## 3           468.3510   Biden
## 4           912.2117   Trump
## 5           469.4034   Biden
## 6           542.2498   Biden

EDA

For this let’s just do a quick EDA (given the main point of this assignment are the visualizations).

summary(election_data_merged)

##  state_territory_tribal_nation   biden_win      total_funding_billions
##  Length:51                     Min.   :0.0000   Min.   : 0.7518       
##  Class :character              1st Qu.:0.0000   1st Qu.: 1.8000       
##  Mode  :character              Median :1.0000   Median : 2.8000       
##                                Mean   :0.5098   Mean   : 3.7568       
##                                3rd Qu.:1.0000   3rd Qu.: 4.1500       
##                                Max.   :1.0000   Max.   :18.4000       
##  funding_per_capita   results         
##  Min.   : 358.0     Length:51         
##  1st Qu.: 509.3     Class :character  
##  Median : 586.2     Mode  :character  
##  Mean   : 896.9                       
##  3rd Qu.: 821.5                       
##  Max.   :5023.7

summary(df_merged_bins)

##  state_territory_tribal_nation total_population_2023 total_funding_billions
##  Length:51                     Min.   :  585067      Min.   : 0.7518       
##  Class :character              1st Qu.: 1870808      1st Qu.: 1.8000       
##  Mode  :character              Median : 4550595      Median : 2.8000       
##                                Mean   : 6604044      Mean   : 3.7568       
##                                3rd Qu.: 7665174      3rd Qu.: 4.1500       
##                                Max.   :39198693      Max.   :18.4000       
##  funding_per_capita  funding_per_capita_bins
##  Min.   : 358.0     < $400       : 1        
##  1st Qu.: 509.3     $400 to $600 :25        
##  Median : 586.2     $600 to $800 :12        
##  Mean   : 896.9     $800 to $1000: 4        
##  3rd Qu.: 821.5     > $1000      : 9        
##  Max.   :5023.7

Analysis

Is the allocation equitable based on the population of each of the States and Territories, or is bias apparent?

Since this is comparing US states, territories and tribal nations, my first data visualization will be a bar graph. For this graph, I want to emphasize the population. A cartogram would be cool to show this feature, but I do not have geometry data (which I believe I would need). Instead I am going to show population by the size of the bars themselves. To do this, I chose the estimated population on the y axis, with the x axis being the states. I then chose to sort by descending population so one can instantly see the largest and smallest populated states. I will also flip the axes, to make it easier to read the states.

To show the funding, I chose to use my bins. The reason I did this is because for this story, it doesn’t seem as relevant to know the exact numbers for funding. We just want to know the general trend in allocation. Additionally, I would like to use a gradient, and using a color gradient for continuous data can make the visualization hard to read. I chose different gradients of blue, and desaturated/lightened them to make the colors easier to look at. Since we really want to know which states are getting the most money, I chose to darken the highest bin of funding. That way the eye is drawn to the highest, and the lowest bin has the lightest color. This makes the coloring more intuitive.

I chose to use the funding per capita since population is important for this data.

# This graph is inspired by the Fundamentals of Data Visualization textbook

df_merged_bins <- df_merged_bins |>
  arrange(desc(total_population_2023))

# Lighten the colors so they're not too harsh to look at
# Darken the highest bin so the eye focuses on it
funding_colors_bars <- c(desaturate(lighten(c("#ADD8E6", "#87CEEB", "#6495ED", "#4169E1"), .4), .8), darken("#56B4E9", .3))
funding_colors_axis <- c(rep("gray30", 4), darken("#56B4E9", .4))
funding_fontface <- c(rep("plain", 4), "bold")

ggplot(df_merged_bins, aes(x = reorder(state_territory_tribal_nation, total_population_2023), y = total_population_2023, fill = funding_per_capita_bins)) + 
  geom_col() + 
  scale_y_continuous(
    labels = c(0, 1, 2, 3, 4),
    name = "Estimated Population in 2023 in Millions",
  ) +
  scale_fill_manual(
    values = funding_colors_bars,
    breaks = c("< $400", "$400 to $600", "$600 to $800", "$800 to $1000", "> $1000"),
    name = "IIJA Funding Per Capita"
  ) +
  coord_flip() + 
  theme(
    text = element_text(color = "gray30"),
    axis.text.x = element_text(color = "gray30"),
    axis.title.y = element_blank(),
    axis.line.y = element_blank(),
    axis.ticks.length = unit(0, "pt"),
    axis.text.y = element_text(
      size = 8,
    ),
    legend.position.inside = c(.56, .68),
    legend.background = element_rect(fill = "#ffffffb0")
  ) +
  labs(title = "Allocation of IIJA Funding (Per Capita) for the United States")

Story visualization breakdown:

I can visually see how large each state/territory is by the bar size without using a cartogram
With the color gradient, I can easily pinpoint the states that have the largest funding per capita, which is all clustered towards the bottom of the graph (draws the eye)
- Wyoming, Vermont, DC, Alaska, North Dakota, and South Dakota, the six least populated states in the US, have the highest funding per capita with all of them having at least $1,000 per person or more
- The 4 largest states by population, California, Texas, Florida, and New York, have either the least or second to least smallest amount of funding. The contrast of the largest bars with the faintest colors really highlights how the largest states are getting the least amount of money.

The story of this data is that there is a bias towards the least populated states in the US in terms of allocation of IIJA funds. Why might this be? Possibly the more rural states actually need more funds for their infrastructure compared to the states with larger cities.

Drawbacks/possible future improvements (with more time):

The spacing size of the graph could use improvement - the text is small and very squished together making it hard to read
Play around with font color/boldness for the states - perhaps calling out certain states might be helpful
The 2 colors indicating the smallest funds bins are a bit similar to each other - could possibly differentiate better
Play around with reversing the order in the graph so instead of the eye drawing to the corner of the graph, it goes to the top portion which might be easier to read

Visualization #2 (Bonus Graph)

The reason why I chose to use R for this assignment was to try my own cartogram heatmap which I learned from the Fundamentals of Data Visualization textbook. I personally really like this style of visualization as it easily shows a visual of the US. The downside of this graph is that it doesn’t visually show population size, but for this type of graph, I’m assuming the audience is somewhat familiar with the US. Additionally, the data I’m using still incorporates the population size.

To make the cartogram heatmap, I used geom_statebins which automatically places the states in the correct position which is very convenient. I don’t need to add on geometry data. For this graph I used the same bins for funding to easily show the difference in funds.

# Bonus graph
# This graph is inspired by the Fundamentals of Data Visualization textbook

# geom_statebins only supports the 50 states, to my knowledge
state_names <- c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut", "Delaware", "Florida", "Georgia", "Hawaii", "Idaho", "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana", "Maine", "Maryland", "Massachusetts", "Michigan", "Minnesota", "Mississippi", "Missouri", "Montana", "Nebraska", "Nevada", "New Hampshire", "New Jersey", "New Mexico", "New York", "North Carolina", "North Dakota", "Ohio", "Oklahoma", "Oregon", "Pennsylvania", "Rhode Island", "South Carolina", "South Dakota", "Tennessee", "Texas", "Utah", "Vermont", "Virginia", "Washington", "West Virginia", "Wisconsin", "Wyoming")

df_merged_bins_updated <- df_merged_bins

colnames(df_merged_bins_updated)[colnames(df_merged_bins_updated) == 'state_territory_tribal_nation'] <- 'state'

filtered_df_merged_bins_updated <- df_merged_bins_updated %>%
  filter(state %in% tolower(state_names))
head(filtered_df_merged_bins_updated)

##          state total_population_2023 total_funding_billions funding_per_capita
## 1   california              39198693                   18.4           469.4034
## 2        texas              30727890                   14.2           462.1209
## 3      florida              22904868                    8.2           358.0025
## 4     new york              19737367                   10.1           511.7197
## 5 pennsylvania              13017721                    8.1           622.2287
## 6     illinois              12642259                    8.4           664.4382
##   funding_per_capita_bins
## 1            $400 to $600
## 2            $400 to $600
## 3                  < $400
## 4            $400 to $600
## 5            $600 to $800
## 6            $600 to $800

filtered_df_merged_bins_updated$state <- str_to_title(filtered_df_merged_bins_updated$state)

filter(filtered_df_merged_bins_updated) %>%
  ggplot(aes(state = state, fill = funding_per_capita_bins)) +
  geom_statebins(lbl_size = 14/.pt) +
  expand_limits(x = -1.3) + # make space for legend
  coord_equal(expand = FALSE) +
   scale_fill_discrete_sequential(
    h1 = -83, h2 = 20, c1 = 30, cmax = 40, c2 = 0, l1 = 20, l2 = 100, p1 = 1, p2 = 1.2, rev = TRUE, 
    name = "IIJA Funding Per Capita",
    nmax = 7,
    order = 2:6,
    guide = guide_legend(
      override.aes = list(colour = "white", size = 1),
      reverse = TRUE
    )
  ) +
  theme(
    panel.grid.major = element_blank(), # Removes grid lines
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    panel.background = element_blank(),
    axis.title.x = element_blank(), # Remove x-axis title
    axis.text.x = element_blank(),  # Remove x-axis text (labels)
    axis.title.y = element_blank(), # Remove y-axis title
    axis.text.y = element_blank(),   # Remove y-axis text (labels)
    axis.ticks.x = element_blank(), # Remove x-axis tick marks
    axis.ticks.y = element_blank(),  # Remove y-axis tick marks
    legend.background = element_blank(),
    legend.position = c(0, 1),
    legend.justification = c(0, 1),
    legend.spacing.x = grid::unit(2, "pt"),
    legend.spacing.y = grid::unit(2, "pt"),
    legend.title = element_text(hjust = 0.5),
    legend.key.width = grid::unit(18, "pt"),
    legend.key.height = grid::unit(15, "pt")
  )

## Warning: A numeric `legend.position` argument in `theme()` was deprecated in ggplot2
## 3.5.0.
## ℹ Please use the `legend.position.inside` argument of `theme()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Story visualization breakdown:

I can visually see the layout of the US - the Midwest and Alaska are the darkest colored states indicating the states with the highest funds per capita
This graph assumes that the audience is familiar with the Midwest and Alaska, and how they are generally known to be less populated states in comparison with the states containing the largest cities like LA or NYC

Story of this data is that there is a bias towards the least populated states in the US in terms of allocation of IIJA funds.

Drawbacks of this graph:

Does not show the population size
To my knowledge, it can only show the 50 states, so you lose some data
I needed to remove a lot of unwanted features (such as a background grid) of geom_statebins which was a bit tedious

Overall this graph is very pleasing and clear to me, but not might be to all audiences.

Does the allocation favor the political interests of the Biden administration?

Since this part of the analysis also looks at comparing states, I will choose to use another bar graph. For this question, I want to focus on the allocation of funds, so I will make the total funds my y axis, with the fill for my bars representing the 2020 presidential election results. For the election results, I will choose blue for Biden which is the usual color for the democratic party and red for Trump for the republican party. I will also lighten the colors for this graph as well to make the colors less harsh for the eyes to look at.

For the first graph, let’s just look at total funding (not taking into account population size).

# Election bar graph

colors_bars <- c(lighten("blue", amount = 0.2), lighten("red", amount = 0.2))

ggplot(election_data_merged, aes(x = reorder(state_territory_tribal_nation, total_funding_billions), y = total_funding_billions, fill = results)) + 
  geom_col() + 
  scale_y_continuous(
    name = "IIJA Funding (in Billions)",
  ) +
  scale_fill_manual(
    values = colors_bars,
    name = "Election Results"
  ) +
  coord_flip() + 
  theme(
    text = element_text(color = "gray30"),
    axis.text.x = element_text(color = "gray30"),
    axis.title.y = element_blank(),
    axis.line.y = element_blank(),
    axis.ticks.length = unit(0, "pt"),
    axis.text.y = element_text(
      size = 8,
    ),
    legend.position.inside = c(.56, .68),
    legend.background = element_rect(fill = "#ffffffb0")
  ) +
  labs(title = "Allocation of IIJA Funding (in Billions) for the United States")

Story visualization breakdown:

I can visually see that 7 out of the top 10 states are blue indicating a Biden state
Smallest funded states are also Biden
Most Trump states land in the middle

Story of this data is that the highest funded states are Biden supporters. Yet, this data does not take into account population size which does matter.

Drawbacks/future improvements:

Does not include the population size
Similar to the other bar graph, the text and bars are quite small and hard to read

Now let’s take into consideration population by looking at per capita:

# Per capita election bar graph
# Final visualization

ggplot(election_data_merged, aes(x = reorder(state_territory_tribal_nation, funding_per_capita), y = funding_per_capita, fill = results)) + 
  geom_col() + 
  scale_y_continuous(
    name = "IIJA Funding (US Dollars Per Capita)",
  ) +
  scale_fill_manual(
    values = colors_bars,
    name = "Election Results"
  ) +
  coord_flip() + 
  theme(
    text = element_text(color = "gray30"),
    axis.text.x = element_text(color = "gray30"),
    axis.title.y = element_blank(),
    axis.line.y = element_blank(),
    axis.ticks.length = unit(0, "pt"),
    axis.text.y = element_text(
      size = 8,
    ),
    legend.position.inside = c(.56, .68),
    legend.background = element_rect(fill = "#ffffffb0")
  ) +
  labs(title = "Allocation of IIJA Funding (US Dollars Per Capita) for the United States")

Story visualization breakdown:

I can visually see that only 4 out of the top 10 states are blue indicating a Biden state
Smallest funded states are also Trump
Most Biden states land in the middle/towards the bottom of the graph

Story of this data is that the highest funded states (per capita) are Trump supporter states. This tells a different story than the previous graph. This tells us Biden is giving more money to states that did not support him back in 2020. This could indicate that Biden was trying to win these states by giving them more resources, despite their lower population size.

Drawbacks/future improvements:

Similar to the other bar graph, the text and bars are quite small and hard to read

Since this takes population into consideration, this would be the final visualization I would use.

Conclusions

Overall this assignment gave very good practice in creating detailed graphs. What these graphs showed us is that the least populated states are getting higher funds per capita. Although it may seem the states with the highest populations are getting the most funds, if you take into account the amount of people in each state, the funds are a lot less for the states with bigger cities. One reason for this could be due to infrastructure issues in rural areas that need the funds. Another reason could be the story from our 2nd set of graphs related to the election results. Possibly the Biden administration wanted to win over states that did not vote for him in 2020. The data comes from 2023, so it’s possible this allocation was heavily related to the 2024 election year.

Future Directions:

Although R created very nice graphs, I would have liked to use Python. I did not see a tool/library to help with the cartogram heatmp for Python, which is why I went with the R instead giving that this data is geospatial.
- Possibly next time, I could add in my own latitude/longitude data to map out the US to create my own graph
If I had more time, I would look into datasets that have more of the territories from the original dataset. Due to time, I just went with what the US Census and Associated Press had available.
Find a better way to include state data, and make the graph not so squished

DATA 608 Assignment 1

Gillian McGovern

2025-09-07

Overview

Data Sources

Load Libraries

Read in Data

Clean and Prepare Data

EDA

Analysis

Is the allocation equitable based on the population of each of the States and Territories, or is bias apparent?

Visualization #2 (Bonus Graph)

Does the allocation favor the political interests of the Biden administration?

Conclusions