Project 2 - An Exploration of Voting and Census Data

Author

Daniel B

1. Introduction

Topic
The topic I have decided to explore is voting results over time and trying to pull demographic data from the census. Since that was an tall order, I pulled the following variables:

  • Urban-rural codes as a grouping

  • Median income totals

  • Year used over time on the x axis

  • Party candidate totals and total vote

Background
I’m interested in how circumstances drive voting behavior and how districts and areas change over time. Most of us know by now that politics are very tribal, but I’m more interested in how it changes over time. The factors that drive their motivation or lack thereof. This potentially gives us insight in how, why, or where certain candidates are ideologically captured, and how voters respond to that because it has the potential to affect their standard of living or their lifestyle. It’s very easy for people to say certain groups vote against their self-interest, which is pretty presumptuous, because it’s just as easy to say certain groups do so for the exact opposite reason, namely it’s the exact case that motivates them when they don’t stand to benefit for the opposition candidate because of the conditions they face. It’s that constant tug of war that I like to find trends in time and space when it comes to political leanings.

I think this is especially pressing because of the urban-rural divide where after the great recession they may not have recovered as much as the rest of the Country. Of course the ‘gold standard’ trade deals didn’t help, which set the conditions for their attitudes and voting behavior. So I decided to explore some of the data to see what I could come up with.

Rural-Urban Codes for Reference

Data-sources
https://www.census.gov/programs-surveys/saipe/data/datasets.html
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/VOQCHQ
https://www.ers.usda.gov/data-products/county-level-data-sets/county-level-data-sets-download-data
Code Description
Metro counties:
1 Counties in metro areas of 1 million population or more
2 Counties in metro areas of 250,000 to 1 million population
3 Counties in metro areas of fewer than 250,000 population
Non-metro counties:
4 Urban population of 20,000 or more, adjacent to a metro area
5 Urban population of 20,000 or more, not adjacent to a metro area
6 Urban population of 2,500 to 19,999, adjacent to a metro area
7 Urban population of 2,500 to 19,999, not adjacent to a metro area
8 Completely rural or less than 2,500 urban population, adjacent to a metro area
9 Completely rural or less than 2,500 urban population, not adjacent to a metro area

2. Some Cited Articles

https://www.washingtonpost.com/politics/2022/03/11/what-political-shift-rural-counties-looks-like-since-2000
https://www.marylandmatters.org/2022/02/04/opinion-democrats-lose-90-of-rural-counties-in-america-here-is-why

3. Cleaning

A few things on cleaning. The usual cleaning was performed to clean headers. The main two data-sets I used was the income dataset and voter data-sets, but luckily I had another dataset that I didn’t even use except for the rural-urban codes that was a fun variable to use considering there are many different regional characteristics. I got stuck a few times on the voter dataset because of the way the data was associated. Doing some back and fourth and doing some group by, it was only a problem in the last couple election cycles because the way it’s reported. Since there are different voting methods, the voting method itemized and separated ‘early voting’ and ‘mail in’ but the ‘total votes’ per county for each candidate was the same. I think if it was data I was unfamiliar with I wouldn’t have noticed and just thought it interesting, but looking at the results was clearly wrong.

Load the libraries and set the working directory

I’m loading in a few databases to explore and for later.

library(tidyverse)
library(tidyr)

setwd("C:/Users/Administrator/Documents/Data110/Datasets/voting")
voting_data <- read_csv("countypres_2000-2020.csv")
education_data <- read_csv("Education.csv")
income_data <- read_csv("County_Income_.csv")

clean up column names and view headers

names(voting_data) <- tolower(names(voting_data))
names(voting_data) <- gsub("[ ,()]", "_", names(voting_data))
head(voting_data)
# A tibble: 6 × 12
   year state   state_po county_name county_fips office       candidate    party
  <dbl> <chr>   <chr>    <chr>       <chr>       <chr>        <chr>        <chr>
1  2000 ALABAMA AL       AUTAUGA     01001       US PRESIDENT AL GORE      DEMO…
2  2000 ALABAMA AL       AUTAUGA     01001       US PRESIDENT GEORGE W. B… REPU…
3  2000 ALABAMA AL       AUTAUGA     01001       US PRESIDENT RALPH NADER  GREEN
4  2000 ALABAMA AL       AUTAUGA     01001       US PRESIDENT OTHER        OTHER
5  2000 ALABAMA AL       BALDWIN     01003       US PRESIDENT AL GORE      DEMO…
6  2000 ALABAMA AL       BALDWIN     01003       US PRESIDENT GEORGE W. B… REPU…
# ℹ 4 more variables: candidatevotes <dbl>, totalvotes <dbl>, version <dbl>,
#   mode <chr>
names(education_data) <- tolower(names(education_data))
names(education_data) <- gsub("[ ,()]", "_", names(education_data))
head(education_data)
# A tibble: 6 × 55
  federal_information_processing_standa…¹ state area_name 2003_rural-urban_con…²
  <chr>                                   <chr> <chr>                      <dbl>
1 00000                                   US    United S…                     NA
2 01000                                   AL    Alabama                       NA
3 01001                                   AL    Autauga …                      2
4 01003                                   AL    Baldwin …                      4
5 01005                                   AL    Barbour …                      6
6 01007                                   AL    Bibb Cou…                      1
# ℹ abbreviated names: ¹​federal_information_processing_standard__fips__code,
#   ²​`2003_rural-urban_continuum_code`
# ℹ 51 more variables: `2003_urban_influence_code` <dbl>,
#   `2013_rural-urban_continuum_code` <dbl>, `2013_urban_influence_code` <dbl>,
#   less_than_a_high_school_diploma__1970 <dbl>,
#   high_school_diploma_only__1970 <dbl>,
#   `some_college__1-3_years___1970` <dbl>, …
names(income_data) <- tolower(names(income_data))
names(income_data) <- gsub("[ ,()]", "_", names(income_data))
head(income_data)
# A tibble: 6 × 6
  county_fips postal name           year poverty_estimate median_household_inc…¹
  <chr>       <chr>  <chr>         <dbl>            <dbl>                  <dbl>
1 01001       AL     Autauga Coun…  2008             5366                  51622
2 01003       AL     Baldwin Coun…  2008            17110                  51957
3 01005       AL     Barbour Coun…  2008             6399                  30896
4 01007       AL     Bibb County    2008             3753                  41076
5 01009       AL     Blount County  2008             7461                  46086
6 01011       AL     Bullock Coun…  2008             3074                  26980
# ℹ abbreviated name: ¹​median_household_income

Make dataset for fips code and urban-rural codes

Since the urban-rural codes are a subset of one of my data-sets I’m going to make it it’s own dataset and rejoin when necessary.

# New dataset for urban-rural codes
urban_rural_codes <- education_data |>
  select("federal_information_processing_standard__fips__code",
         `2003_rural-urban_continuum_code`,
         `2003_urban_influence_code`,
         `2013_rural-urban_continuum_code`,
         `2013_urban_influence_code`)

Clean voting data with a group by

After some exploration I realized my data was coming out wrong. There are a few states that have different modes and when doing calculations on them it’s duplicating the total votes and making it difficult. I grouped by year, county_fips, and party and summed up the votes for each party, and then took the max of total votes, so it wouldn’t duplicate. The lazier version of doing a nestled query.

# Group by county_fips and party, and calculate the total_candidate_votes and total_votes
voting_cleaned <- voting_data |>
  group_by(year,county_fips,party) |>
  summarize(
    total_candidate_votes = sum(candidatevotes),
    total_votes = max(totalvotes),
    county_name = max(county_name),
    state = max(state)
    
  )

Join fips code with voter data and Income data

voting_results <- voting_cleaned |>
  left_join(urban_rural_codes,
            by = c("county_fips" =  "federal_information_processing_standard__fips__code")) |>

# Add to correct year
mutate(rural_urban_continuum_code = ifelse(year < 2013,
                                    `2003_rural-urban_continuum_code`,
                                    `2013_rural-urban_continuum_code`),
       urban_influence_code = ifelse(year < 2013, `2003_urban_influence_code`,
                                                  `2013_urban_influence_code`))

# Show summary to view NA
summary(voting_results)
      year      county_fips           party           total_candidate_votes
 Min.   :2000   Length:64144       Length:64144       Min.   :      0      
 1st Qu.:2004   Class :character   Class :character   1st Qu.:    196      
 Median :2012   Mode  :character   Mode  :character   Median :   1800      
 Mean   :2010                                         Mean   :  12206      
 3rd Qu.:2016                                         3rd Qu.:   6919      
 Max.   :2020                                         Max.   :3028885      
                                                                           
  total_votes      county_name           state          
 Min.   :      0   Length:64144       Length:64144      
 1st Qu.:   4994   Class :character   Class :character  
 Median :  10788   Mode  :character   Mode  :character  
 Mean   :  42207                                        
 3rd Qu.:  28485                                        
 Max.   :4264365                                        
                                                        
 2003_rural-urban_continuum_code 2003_urban_influence_code
 Min.   :1.000                   Min.   : 1.000           
 1st Qu.:3.000                   1st Qu.: 2.000           
 Median :6.000                   Median : 5.000           
 Mean   :5.097                   Mean   : 5.403           
 3rd Qu.:7.000                   3rd Qu.: 8.000           
 Max.   :9.000                   Max.   :12.000           
 NA's   :853                     NA's   :853              
 2013_rural-urban_continuum_code 2013_urban_influence_code
 Min.   :1.000                   Min.   : 1.000           
 1st Qu.:2.000                   1st Qu.: 2.000           
 Median :6.000                   Median : 5.000           
 Mean   :4.974                   Mean   : 5.211           
 3rd Qu.:7.000                   3rd Qu.: 8.000           
 Max.   :9.000                   Max.   :12.000           
 NA's   :853                     NA's   :853              
 rural_urban_continuum_code urban_influence_code
 Min.   :1.000              Min.   : 1.000      
 1st Qu.:3.000              1st Qu.: 2.000      
 Median :6.000              Median : 5.000      
 Mean   :5.053              Mean   : 5.334      
 3rd Qu.:7.000              3rd Qu.: 8.000      
 Max.   :9.000              Max.   :12.000      
 NA's   :853                NA's   :853         

There’s some counts with no rural-urban codes because there’s no fips code. They are from Maine, Rhode island, and DC. They look like state-wide write-ins.

income_cleaned <- income_data |>
  left_join(urban_rural_codes,
            by = c("county_fips" =  "federal_information_processing_standard__fips__code")) |>

# Add to correct year
mutate(rural_urban_continuum_code = ifelse(year < 2013,
                                    `2003_rural-urban_continuum_code`,
                                    `2013_rural-urban_continuum_code`),
       urban_influence_code = ifelse(year < 2013, `2003_urban_influence_code`,
                                                  `2013_urban_influence_code`))

# Show summary to view NA
head(income_cleaned)
# A tibble: 6 × 12
  county_fips postal name           year poverty_estimate median_household_inc…¹
  <chr>       <chr>  <chr>         <dbl>            <dbl>                  <dbl>
1 01001       AL     Autauga Coun…  2008             5366                  51622
2 01003       AL     Baldwin Coun…  2008            17110                  51957
3 01005       AL     Barbour Coun…  2008             6399                  30896
4 01007       AL     Bibb County    2008             3753                  41076
5 01009       AL     Blount County  2008             7461                  46086
6 01011       AL     Bullock Coun…  2008             3074                  26980
# ℹ abbreviated name: ¹​median_household_income
# ℹ 6 more variables: `2003_rural-urban_continuum_code` <dbl>,
#   `2003_urban_influence_code` <dbl>, `2013_rural-urban_continuum_code` <dbl>,
#   `2013_urban_influence_code` <dbl>, rural_urban_continuum_code <dbl>,
#   urban_influence_code <dbl>

4. Visualization of Income by Urban-Rural Codes

# Filter the dataset for the year 2020
filtered_data_2020 <- income_cleaned |>
  filter(year == 2020)

# Boxplot for each rural-urban code
boxplot_2020 <- ggplot(filtered_data_2020, aes(x = factor(rural_urban_continuum_code),
                                               y = median_household_income)) +
  geom_boxplot() +
  labs(x = "Rural-Urban Code", y = "Median Household Income") +
  ggtitle("Boxplots of Median Household Income by Rural-Urban Code in 2020") +
  theme_minimal()
print(boxplot_2020)

This is to satisfy the statistical analysis requirement. I tried a couple different variations of box-plots, but I found just focusing on one year, the most recent being the most pertinent the most interest. Not to mention, I love the simplicity of the visualization. Glaringly 1 is the most urban and there is going to be a huge spread between the min and max with the median higher. But you see a lot of interesting things happen. Like 8 and 9 have the same population, but 9 is closer to an urban area, and 8 is not. So if you are in a small rural area near a city, you have both the lower and higher end of the spectrum.

5. Three Visualizations using Highcharts

Highcharts Vote Percentage Visualization

This is largely to check to make sure the data is good by comparing it to wiki percentage totals for the candidates.

library(highcharter)

# Party vote percentages
party_votes <- voting_results |>
  group_by(year, party) |>
  summarize(Percentage = round(sum(total_candidate_votes) / sum(total_votes) * 100, 2))

# Highcharter line graph
line_chart <- party_votes |>
  hchart(type = "line", hcaes(x = year, y = Percentage, group = party)) |>
  hc_title(text = "Party Vote Percentage Over Time") |>
  hc_xAxis(title = list(text = "Year")) |>
  hc_yAxis(title = list(text = "Percentage"))

# Display
line_chart

Democratic percentage over time using urban-rural codes

# Percentage of Democratic votes by year and urban-rural code
percentage_by_year_and_code <- voting_results |>
  filter(party == "DEMOCRAT" & !is.na(rural_urban_continuum_code)) |>
  group_by(year, rural_urban_continuum_code) |>
  summarize(Democratic_percentage = round(sum(total_candidate_votes) / sum(total_votes) * 100, 2))

# Highchart line graph with legend
line_chart <- percentage_by_year_and_code |>
  hchart(type = "line", hcaes(x = year,
                              y = Democratic_percentage,
                              group = rural_urban_continuum_code)) |>
  
  # Legend
  hc_title(text = "Democratic Percentage Over Time by RUCC Code") |>
  hc_xAxis(title = list(text = "Year")) |>
  hc_yAxis(title = list(text = "Percentage of Democratic Votes")) |>
  hc_legend(enabled = TRUE)

# Display line chart
line_chart

The visualization above is interesting because you see metro areas largely flat, but democrats took a major dive with rural areas. However, after 2016 every region type trended up. I didn’t expect that. I’d almost expect it to go in opposite directions as certain areas double down or polarity increased. Another thing I found interesting is that in the early years the grouping between 1 and 9 is a lot tighter compared to now.

Third-party percentage by RUCC Code

# Third-party percentage by year and rural-urban code
percentage_by_year_and_thirdparty <- voting_results |>
  filter(!(party %in% c("DEMOCRAT", "REPUBLICAN"))
        & !is.na(rural_urban_continuum_code)) |>
  group_by(year, rural_urban_continuum_code) |>
  summarize(Democratic_percentage = round(sum(total_candidate_votes) / sum(total_votes) * 100, 2))

# Highchart line graph with a legend
line_chart <- percentage_by_year_and_thirdparty |>
  hchart(type = "line", hcaes(x = year,
                              y = Democratic_percentage,
                              group = rural_urban_continuum_code)) |>
  
  # Legend
  hc_title(text = "Third-Party Percentage Over Time by RUCC Code") |>
  hc_xAxis(title = list(text = "Year")) |>
  hc_yAxis(title = list(text = "Percentage of Votes")) |>
  hc_legend(enabled = TRUE)

# Display line chart
line_chart

This one is fun as well. There is a somewhat flat line to slightly increasing in 2004 of people doing protest votes. But in 2016, with both candidates being severely unpopular, there was a major spike of people voting for candidates other than the two main parties. But as soon as 2020 approached, no matter what region, both groups got on board. Of course there will always be analysis on to what extent a spoiler actually spoils an election, especially when you have two third party candidates that essentially cancel each other out, but I thought the sharp rise in 2016 to the highest level in quite some time, to the largest drop in quite some time does say something about our political climate. And I’ll drop one last background article for that demonstrating the unpopularity of the candidates, to the largest turnout in 2020 (on both sides).

2016 - two of the most unpopular candidates
https://fivethirtyeight.com/features/americans-distaste-for-both-trump-and-clinton-is-record-breaking/

https://fivethirtyeight.com/features/historic-turnout-in-2020-not-so-far/