Project 3

Author

Daniel B

1. Introduction

The dataset I’m using is the the same dataset I used as the previous one. The previous charts explored the urban-rural codes and the associated results of the Presidential votes. For this particular project, I’m going to use the same dataset, but I’m going to explore further into the geographical affiliation of voters by County and the resulting change in the vote for the Presidential election over time. The following variables are:

  • Lat/long associated with County

  • FIPS County code

  • Multiple years of Presidential results

  • Party candidate totals for multiple years

Background
Like previously mentioned, I’m interested in how circumstances drive voting behavior and how districts and areas change over time. Though previously I looked a urban-rural codes, in this project I’m going to dive deeper into using GIS and the FIPS code to map the voter percentages to their exact geographical area. By looking at their geographical area, we may find trends in at least trying to understand the drastic change in voter resentment for certain areas and/or how preference for a party changes over time.

I find this especially pressing because of the urban-rural divide and trying to understand some common ground and reducing polarity. In trying to understand voting decisions, we may try to understand each other. So I decided to explore some of the data to see what I could come up with. Usually when it comes to understanding our familiarity of the dataset, it usually comes from simply what party urban/rural voters choose exemplified by the map below:

Data-sources

I gathered three datasets for elections results, income, education background, and background. These datasets were associated by FIPS county code. Not all variables are used, but it allowed a plethora of data for exploration.

Voting Results by County - 2000-2020 (County FIPS):
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/VOQCHQ

Income - 2000-2020 (County FIPS):
https://www.census.gov/programs-surveys/saipe/data/datasets.html

Education Background & Unemployment - 2000-2020 (County FIPS):
https://www.ers.usda.gov/data-products/county-level-data-sets/county-level-data-sets-download-data

2. Cleaning

A few datasets were cleaned and left open for exploration from the previous project. The usual cleaning was performed to clean headers initially. The main two data-sets I used was the income dataset and voter data-sets. In the previous project, I got stuck a few times on the voter dataset because of the way the data was associated. Doing some back and fourth and doing some group by in the last project, it was only a problem in the last couple election cycles because the way it was reported. Since there are different voting methods, the voting method itemized and separated ‘early voting’ and ‘mail in’ but the ‘total votes’ per county even though each candidate was the same.

A lat/long dataset was also pulled in to associate each FIPS code with a geographical location, so it can be mapped.

Load the libraries and set the working directory

library(tidyverse)
library(tidyr)
library(ggplot2)

setwd("C:/Users/dbland/Downloads/Project/Datasets/voting")
voting_data <- read_csv("countypres_2000-2020.csv")
education_data <- read_csv("Education.csv")
income_data <- read_csv("County_Income_.csv")
unemployment_data <- read_csv("Unemployment.csv")

Clean up column names and view headers

names(voting_data) <- tolower(names(voting_data))
names(voting_data) <- gsub("[ ,()]", "_", names(voting_data))
head(voting_data)
# A tibble: 6 × 12
   year state   state_po county_name county_fips office       candidate    party
  <dbl> <chr>   <chr>    <chr>       <chr>       <chr>        <chr>        <chr>
1  2000 ALABAMA AL       AUTAUGA     01001       US PRESIDENT AL GORE      DEMO…
2  2000 ALABAMA AL       AUTAUGA     01001       US PRESIDENT GEORGE W. B… REPU…
3  2000 ALABAMA AL       AUTAUGA     01001       US PRESIDENT RALPH NADER  GREEN
4  2000 ALABAMA AL       AUTAUGA     01001       US PRESIDENT OTHER        OTHER
5  2000 ALABAMA AL       BALDWIN     01003       US PRESIDENT AL GORE      DEMO…
6  2000 ALABAMA AL       BALDWIN     01003       US PRESIDENT GEORGE W. B… REPU…
# ℹ 4 more variables: candidatevotes <dbl>, totalvotes <dbl>, version <dbl>,
#   mode <chr>
names(education_data) <- tolower(names(education_data))
names(education_data) <- gsub("[ ,()]", "_", names(education_data))
head(education_data)
# A tibble: 6 × 55
  federal_information_processing_standa…¹ state area_name 2003_rural-urban_con…²
  <chr>                                   <chr> <chr>                      <dbl>
1 00000                                   US    United S…                     NA
2 01000                                   AL    Alabama                       NA
3 01001                                   AL    Autauga …                      2
4 01003                                   AL    Baldwin …                      4
5 01005                                   AL    Barbour …                      6
6 01007                                   AL    Bibb Cou…                      1
# ℹ abbreviated names: ¹​federal_information_processing_standard__fips__code,
#   ²​`2003_rural-urban_continuum_code`
# ℹ 51 more variables: `2003_urban_influence_code` <dbl>,
#   `2013_rural-urban_continuum_code` <dbl>, `2013_urban_influence_code` <dbl>,
#   less_than_a_high_school_diploma__1970 <dbl>,
#   high_school_diploma_only__1970 <dbl>,
#   `some_college__1-3_years___1970` <dbl>, …
names(income_data) <- tolower(names(income_data))
names(income_data) <- gsub("[ ,()]", "_", names(income_data))
head(income_data)
# A tibble: 6 × 6
  county_fips postal name           year poverty_estimate median_household_inc…¹
  <chr>       <chr>  <chr>         <dbl>            <dbl>                  <dbl>
1 01001       AL     Autauga Coun…  2008             5366                  51622
2 01003       AL     Baldwin Coun…  2008            17110                  51957
3 01005       AL     Barbour Coun…  2008             6399                  30896
4 01007       AL     Bibb County    2008             3753                  41076
5 01009       AL     Blount County  2008             7461                  46086
6 01011       AL     Bullock Coun…  2008             3074                  26980
# ℹ abbreviated name: ¹​median_household_income
names(unemployment_data) <- tolower(names(unemployment_data))
names(unemployment_data) <- gsub("[ ,()]", "_", names(unemployment_data))
head(unemployment_data)
# A tibble: 6 × 8
  fips_code state area_name        unemployment_rate_2004 unemployment_rate_2008
      <dbl> <chr> <chr>                             <dbl>                  <dbl>
1         0 US    United States                       5.5                    5.8
2      1000 AL    Alabama                             5.6                    5.8
3      1001 AL    Autauga County,…                    4.7                    5.3
4      1003 AL    Baldwin County,…                    5.1                    4.8
5      1005 AL    Barbour County,…                    7.1                    9.1
6      1007 AL    Bibb County, AL                     5.4                    6  
# ℹ 3 more variables: unemployment_rate_2012 <dbl>,
#   unemployment_rate_2016 <dbl>, unemployment_rate_2020 <dbl>

Make dataset for fips code and urban-rural codes

I kept the urban-rural codes in-tact in case I needed to use them.

# New dataset for urban-rural codes
urban_rural_codes <- education_data |>
  select("federal_information_processing_standard__fips__code",
         `2003_rural-urban_continuum_code`,
         `2003_urban_influence_code`,
         `2013_rural-urban_continuum_code`,
         `2013_urban_influence_code`)

Clean voting data with a group by

This is to fix the fact that there are a few states that have different modes and when doing calculations it would duplicate like mentioned in the last project.

# Group by county_fips and party, and calculate the total_candidate_votes and total_votes
voting_cleaned <- voting_data |>
  group_by(year,county_fips,party) |>
  summarize(
    total_candidate_votes = sum(candidatevotes),
    total_votes = max(totalvotes),
    county_name = max(county_name),
    state = max(state)
  )
`summarise()` has grouped output by 'year', 'county_fips'. You can override
using the `.groups` argument.

Join FIPS code with voter data and Income data

voting_results <- voting_cleaned |>
  left_join(urban_rural_codes,
            by = c("county_fips" =  "federal_information_processing_standard__fips__code")) |>

# Add to correct year
mutate(rural_urban_continuum_code = ifelse(year < 2013,
                                    `2003_rural-urban_continuum_code`,
                                    `2013_rural-urban_continuum_code`),
       urban_influence_code = ifelse(year < 2013, `2003_urban_influence_code`,
                                                  `2013_urban_influence_code`))

# Show summary to view NA
head(voting_results)
# A tibble: 6 × 13
# Groups:   year, county_fips [2]
   year county_fips party    total_candidate_votes total_votes county_name state
  <dbl> <chr>       <chr>                    <dbl>       <dbl> <chr>       <chr>
1  2000 01001       DEMOCRAT                  4942       17208 AUTAUGA     ALAB…
2  2000 01001       GREEN                      160       17208 AUTAUGA     ALAB…
3  2000 01001       OTHER                      113       17208 AUTAUGA     ALAB…
4  2000 01001       REPUBLI…                 11993       17208 AUTAUGA     ALAB…
5  2000 01003       DEMOCRAT                 13997       56480 BALDWIN     ALAB…
6  2000 01003       GREEN                     1033       56480 BALDWIN     ALAB…
# ℹ 6 more variables: `2003_rural-urban_continuum_code` <dbl>,
#   `2003_urban_influence_code` <dbl>, `2013_rural-urban_continuum_code` <dbl>,
#   `2013_urban_influence_code` <dbl>, rural_urban_continuum_code <dbl>,
#   urban_influence_code <dbl>

Year, FIPS code, Change In vote

The four election years were filtered with the dem vote percentage to measure over time.

# Filter Dem party and calculate the vote percentage
democrat_data <- voting_results |>
  filter(party == "DEMOCRAT") |>
  mutate(vote_percentage = total_candidate_votes / total_votes * 100)

# Pivot the data to have years as columns
pivot_data <- democrat_data |>
  select(county_fips, county_name, state, rural_urban_continuum_code, year, vote_percentage) |>
  spread(key = year, value = vote_percentage)

# Calculate the differences between years
pivot_data <- pivot_data |>
  mutate(
    diff_2004_2008 = `2008` - `2004`,
    diff_2008_2012 = `2012` - `2008`,
    diff_2012_2016 = `2016` - `2012`,
    diff_2016_2020 = `2020` - `2016`
  )

# View the final data
pivot_data
# A tibble: 3,654 × 14
# Groups:   county_fips [3,157]
   county_fips county_name state   rural_urban_continuum_…¹ `2000` `2004` `2008`
   <chr>       <chr>       <chr>                      <dbl>  <dbl>  <dbl>  <dbl>
 1 01001       AUTAUGA     ALABAMA                        2   28.7   23.7   25.8
 2 01003       BALDWIN     ALABAMA                        4   24.8   22.5   23.8
 3 01005       BARBOUR     ALABAMA                        6   49.9   44.8   49.0
 4 01007       BIBB        ALABAMA                        1   38.2   27.5   26.6
 5 01009       BLOUNT      ALABAMA                        1   27.7   18.3   14.5
 6 01011       BULLOCK     ALABAMA                        6   69.2   68.1   74.1
 7 01013       BUTLER      ALABAMA                        6   46.2   40.6   43.1
 8 01015       CALHOUN     ALABAMA                        3   40.6   33.3   33.2
 9 01017       CHAMBERS    ALABAMA                        6   47.5   41.0   45.5
10 01019       CHEROKEE    ALABAMA                        8   44.7   33.6   23.7
# ℹ 3,644 more rows
# ℹ abbreviated name: ¹​rural_urban_continuum_code
# ℹ 7 more variables: `2012` <dbl>, `2016` <dbl>, `2020` <dbl>,
#   diff_2004_2008 <dbl>, diff_2008_2012 <dbl>, diff_2012_2016 <dbl>,
#   diff_2016_2020 <dbl>
# Convert the FIPS code in unemployment_data to character
unemployment_data <- unemployment_data |>
  mutate(fips_code = as.character(fips_code))

# Select 2020 vote percentage
selected_data <- pivot_data |>
  select(county_fips, county_name, state, rural_urban_continuum_code, `2020`)

# Join unemployment data
joined_data <- selected_data |>
  left_join(unemployment_data, by = c("county_fips" = "fips_code")) |>
  select(county_fips, `2020`, county_name, rural_urban_continuum_code, unemployment_rate_2020)

# Join income data
final_data <- joined_data |>
  left_join(income_data, by = "county_fips") |>
  select(county_fips, county_name, rural_urban_continuum_code, `2020`, unemployment_rate_2020, median_household_income)
Warning in left_join(joined_data, income_data, by = "county_fips"): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1 of `x` matches multiple rows in `y`.
ℹ Row 2 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
# View the final dataset
head(final_data)
# A tibble: 6 × 6
# Groups:   county_fips [2]
  county_fips county_name rural_urban_continuum_…¹ `2020` unemployment_rate_2020
  <chr>       <chr>                          <dbl>  <dbl>                  <dbl>
1 01001       AUTAUGA                            2   27.0                     NA
2 01001       AUTAUGA                            2   27.0                     NA
3 01001       AUTAUGA                            2   27.0                     NA
4 01001       AUTAUGA                            2   27.0                     NA
5 01003       BALDWIN                            4   NA                       NA
6 01003       BALDWIN                            4   NA                       NA
# ℹ abbreviated name: ¹​rural_urban_continuum_code
# ℹ 1 more variable: median_household_income <dbl>

Highcharter scatter plot with tooltip: Median household income and democratic vote percentage (2020)

library(highcharter)
Warning: package 'highcharter' was built under R version 4.3.2
Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo 
library(dplyr)

# Assuming final_data is already defined and loaded

# Categorize rural_urban_continuum_code into 'Urban', 'Suburban', 'Rural'
final_data <- final_data %>%
  mutate(group = case_when(
    rural_urban_continuum_code == 1 ~ "Urban",
    rural_urban_continuum_code == 2 ~ "Suburban",
    TRUE ~ "Rural"
  ))

# Extra information for tooltips
final_data$tooltip_info <- paste(
  "County FIPS: ", final_data$county_fips, 
  "<br>County: ", final_data$county_name, 
  "<br>Median Household Income: ", final_data$median_household_income,
  "<br>Democratic Vote (%): ", formatC(final_data$`2020`, format = 'f', digits = 2),
  "<br>Rural Urban Code: ", final_data$rural_urban_continuum_code,
  "<br>Group: ", final_data$group,
  sep = ""
)

# Define a custom color palette for the groups
group_colors <- c(
  "Urban" = "#0072B2",     # Blue
  "Suburban" = "#009E73",  # Green
  "Rural" = "#D55E00"      # Orange
)

# Scatter plot using highcharter with custom colors and legend labels
highchart() %>%
  hc_title(text = "Median Household Income and Democrat Vote Percentage in 2020") %>%
  hc_xAxis(title = list(text = "Median Household Income"), labels = list(format = '{value}')) %>%
  hc_yAxis(title = list(text = "Democrat Vote Percentage"), labels = list(format = '{value}%')) %>%
  hc_add_series(data = final_data %>%
                  filter(group == "Urban"),
                  type = 'scatter',
                  hcaes(x = median_household_income, y = `2020`, extra = tooltip_info),
                  name = 'Urban',
                  marker = list(radius = 3, opacity = 0.6)) %>%
  hc_add_series(data = final_data %>%
                  filter(group == "Suburban"),
                  type = 'scatter',
                  hcaes(x = median_household_income, y = `2020`, extra = tooltip_info),
                  name = 'Suburban', marker = list(radius = 3, opacity = 0.6)) %>%
  hc_add_series(data = final_data %>%
                  filter(group == "Rural"),
                  type = 'scatter',
                  hcaes(x = median_household_income, y = `2020`, extra = tooltip_info),
                  name = 'Rural',
                  marker = list(radius = 3, opacity = 0.6)) %>%
  hc_colors(group_colors) %>%
  hc_tooltip(useHTML = TRUE, headerFormat = '<small>{point.key}</small><br/>',
              pointFormat = '{point.extra}') %>%
  hc_legend(enabled = TRUE) %>%
  hc_caption(text = "Source: Voting Results - Harvard Dataverse / Median Household Income - USDA/Census")
Input to asJSON(keep_vec_names=TRUE) is a named vector. In a future version of jsonlite, this option will not be supported, and named vectors will be translated into arrays instead of objects. If you want JSON object output, please use a named list instead. See ?toJSON.