Project 3 - Further Exploration of Voting and Census Data

Author

Daniel B

1. Introduction

The dataset I’m using is the the same dataset I used as the previous one. The previous charts explored the urban-rural codes and the associated results of the Presidential votes. For this particular project, I’m going to use the same dataset, but I’m going to explore further into the geographical affiliation of voters by County and the resulting change in the vote for the Presidential election over time. The following variables are:

Lat/long associated with County
FIPS County code
Multiple years of Presidential results
Party candidate totals for multiple years

Background
Like previously mentioned, I’m interested in how circumstances drive voting behavior and how districts and areas change over time. Though previously I looked a urban-rural codes, in this project I’m going to dive deeper into using GIS and the FIPS code to map the voter percentages to their exact geographical area. By looking at their geographical area, we may find trends in at least trying to understand the drastic change in voter resentment for certain areas and/or how preference for a party changes over time.

I find this especially pressing because of the urban-rural divide and trying to understand some common ground and reducing polarity. In trying to understand voting decisions, we may try to understand each other. So I decided to explore some of the data to see what I could come up with. Usually when it comes to understanding our familiarity of the dataset, it usually comes from simply what party urban/rural voters choose exemplified by the map below:

Data-sources

I gathered three datasets for elections results, income, education background, and background. These datasets were associated by FIPS county code. Not all variables are used, but it allowed a plethora of data for exploration.

Voting Results by County - 2000-2020 (County FIPS):
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/VOQCHQ

Income - 2000-2020 (County FIPS):
https://www.census.gov/programs-surveys/saipe/data/datasets.html

Education Background & Unemployment - 2000-2020 (County FIPS):
https://www.ers.usda.gov/data-products/county-level-data-sets/county-level-data-sets-download-data

2. Cleaning

A few datasets were cleaned and left open for exploration from the previous project. The usual cleaning was performed to clean headers initially. The main two data-sets I used was the income dataset and voter data-sets. In the previous project, I got stuck a few times on the voter dataset because of the way the data was associated. Doing some back and fourth and doing some group by in the last project, it was only a problem in the last couple election cycles because the way it was reported. Since there are different voting methods, the voting method itemized and separated ‘early voting’ and ‘mail in’ but the ‘total votes’ per county even though each candidate was the same.

A lat/long dataset was also pulled in to associate each FIPS code with a geographical location, so it can be mapped.

Load the libraries and set the working directory

library(tidyverse)
library(tidyr)
library(ggplot2)

setwd("C:/Users/Administrator/Documents/Data110/Datasets/voting")
voting_data <- read_csv("countypres_2000-2020.csv")
education_data <- read_csv("Education.csv")
income_data <- read_csv("County_Income_.csv")
unemployment_data <- read_csv("Unemployment.csv")

Clean up column names and view headers

names(voting_data) <- tolower(names(voting_data))
names(voting_data) <- gsub("[ ,()]", "_", names(voting_data))
head(voting_data)

# A tibble: 6 × 12
   year state   state_po county_name county_fips office       candidate    party
  <dbl> <chr>   <chr>    <chr>       <chr>       <chr>        <chr>        <chr>
1  2000 ALABAMA AL       AUTAUGA     01001       US PRESIDENT AL GORE      DEMO…
2  2000 ALABAMA AL       AUTAUGA     01001       US PRESIDENT GEORGE W. B… REPU…
3  2000 ALABAMA AL       AUTAUGA     01001       US PRESIDENT RALPH NADER  GREEN
4  2000 ALABAMA AL       AUTAUGA     01001       US PRESIDENT OTHER        OTHER
5  2000 ALABAMA AL       BALDWIN     01003       US PRESIDENT AL GORE      DEMO…
6  2000 ALABAMA AL       BALDWIN     01003       US PRESIDENT GEORGE W. B… REPU…
# ℹ 4 more variables: candidatevotes <dbl>, totalvotes <dbl>, version <dbl>,
#   mode <chr>

names(education_data) <- tolower(names(education_data))
names(education_data) <- gsub("[ ,()]", "_", names(education_data))
head(education_data)

# A tibble: 6 × 55
  federal_information_processing_standa…¹ state area_name 2003_rural-urban_con…²
  <chr>                                   <chr> <chr>                      <dbl>
1 00000                                   US    United S…                     NA
2 01000                                   AL    Alabama                       NA
3 01001                                   AL    Autauga …                      2
4 01003                                   AL    Baldwin …                      4
5 01005                                   AL    Barbour …                      6
6 01007                                   AL    Bibb Cou…                      1
# ℹ abbreviated names: ¹federal_information_processing_standard__fips__code,
#   ²`2003_rural-urban_continuum_code`
# ℹ 51 more variables: `2003_urban_influence_code` <dbl>,
#   `2013_rural-urban_continuum_code` <dbl>, `2013_urban_influence_code` <dbl>,
#   less_than_a_high_school_diploma__1970 <dbl>,
#   high_school_diploma_only__1970 <dbl>,
#   `some_college__1-3_years___1970` <dbl>, …

names(income_data) <- tolower(names(income_data))
names(income_data) <- gsub("[ ,()]", "_", names(income_data))
head(income_data)

# A tibble: 6 × 6
  county_fips postal name           year poverty_estimate median_household_inc…¹
  <chr>       <chr>  <chr>         <dbl>            <dbl>                  <dbl>
1 01001       AL     Autauga Coun…  2008             5366                  51622
2 01003       AL     Baldwin Coun…  2008            17110                  51957
3 01005       AL     Barbour Coun…  2008             6399                  30896
4 01007       AL     Bibb County    2008             3753                  41076
5 01009       AL     Blount County  2008             7461                  46086
6 01011       AL     Bullock Coun…  2008             3074                  26980
# ℹ abbreviated name: ¹median_household_income

names(unemployment_data) <- tolower(names(unemployment_data))
names(unemployment_data) <- gsub("[ ,()]", "_", names(unemployment_data))
head(unemployment_data)

# A tibble: 6 × 8
  fips_code state area_name        unemployment_rate_2004 unemployment_rate_2008
      <dbl> <chr> <chr>                             <dbl>                  <dbl>
1         0 US    United States                       5.5                    5.8
2      1000 AL    Alabama                             5.6                    5.8
3      1001 AL    Autauga County,…                    4.7                    5.3
4      1003 AL    Baldwin County,…                    5.1                    4.8
5      1005 AL    Barbour County,…                    7.1                    9.1
6      1007 AL    Bibb County, AL                     5.4                    6  
# ℹ 3 more variables: unemployment_rate_2012 <dbl>,
#   unemployment_rate_2016 <dbl>, unemployment_rate_2020 <dbl>

Make dataset for fips code and urban-rural codes

I kept the urban-rural codes in-tact in case I needed to use them.

# New dataset for urban-rural codes
urban_rural_codes <- education_data |>
  select("federal_information_processing_standard__fips__code",
         `2003_rural-urban_continuum_code`,
         `2003_urban_influence_code`,
         `2013_rural-urban_continuum_code`,
         `2013_urban_influence_code`)

Clean voting data with a group by

This is to fix the fact that there are a few states that have different modes and when doing calculations it would duplicate like mentioned in the last project.

# Group by county_fips and party, and calculate the total_candidate_votes and total_votes
voting_cleaned <- voting_data |>
  group_by(year,county_fips,party) |>
  summarize(
    total_candidate_votes = sum(candidatevotes),
    total_votes = max(totalvotes),
    county_name = max(county_name),
    state = max(state)
    
  )

Join FIPS code with voter data and Income data

voting_results <- voting_cleaned |>
  left_join(urban_rural_codes,
            by = c("county_fips" =  "federal_information_processing_standard__fips__code")) |>

# Add to correct year
mutate(rural_urban_continuum_code = ifelse(year < 2013,
                                    `2003_rural-urban_continuum_code`,
                                    `2013_rural-urban_continuum_code`),
       urban_influence_code = ifelse(year < 2013, `2003_urban_influence_code`,
                                                  `2013_urban_influence_code`))

# Show summary to view NA
head(voting_results)

# A tibble: 6 × 13
# Groups:   year, county_fips [2]
   year county_fips party    total_candidate_votes total_votes county_name state
  <dbl> <chr>       <chr>                    <dbl>       <dbl> <chr>       <chr>
1  2000 01001       DEMOCRAT                  4942       17208 AUTAUGA     ALAB…
2  2000 01001       GREEN                      160       17208 AUTAUGA     ALAB…
3  2000 01001       OTHER                      113       17208 AUTAUGA     ALAB…
4  2000 01001       REPUBLI…                 11993       17208 AUTAUGA     ALAB…
5  2000 01003       DEMOCRAT                 13997       56480 BALDWIN     ALAB…
6  2000 01003       GREEN                     1033       56480 BALDWIN     ALAB…
# ℹ 6 more variables: `2003_rural-urban_continuum_code` <dbl>,
#   `2003_urban_influence_code` <dbl>, `2013_rural-urban_continuum_code` <dbl>,
#   `2013_urban_influence_code` <dbl>, rural_urban_continuum_code <dbl>,
#   urban_influence_code <dbl>

Year, FIPS code, Change In vote

The four election years were filtered with the dem vote percentage to measure over time.

# Filter Dem party and calculate the vote percentage
democrat_data <- voting_results |>
  filter(party == "DEMOCRAT") |>
  mutate(vote_percentage = total_candidate_votes / total_votes * 100)

# Pivot the data to have years as columns
pivot_data <- democrat_data |>
  select(county_fips, county_name, state, rural_urban_continuum_code, year, vote_percentage) |>
  spread(key = year, value = vote_percentage)

# Calculate the differences between years
pivot_data <- pivot_data |>
  mutate(
    diff_2004_2008 = `2008` - `2004`,
    diff_2008_2012 = `2012` - `2008`,
    diff_2012_2016 = `2016` - `2012`,
    diff_2016_2020 = `2020` - `2016`
  )

# View the final data
pivot_data

# A tibble: 3,654 × 14
# Groups:   county_fips [3,157]
   county_fips county_name state   rural_urban_continuum_…¹ `2000` `2004` `2008`
   <chr>       <chr>       <chr>                      <dbl>  <dbl>  <dbl>  <dbl>
 1 01001       AUTAUGA     ALABAMA                        2   28.7   23.7   25.8
 2 01003       BALDWIN     ALABAMA                        4   24.8   22.5   23.8
 3 01005       BARBOUR     ALABAMA                        6   49.9   44.8   49.0
 4 01007       BIBB        ALABAMA                        1   38.2   27.5   26.6
 5 01009       BLOUNT      ALABAMA                        1   27.7   18.3   14.5
 6 01011       BULLOCK     ALABAMA                        6   69.2   68.1   74.1
 7 01013       BUTLER      ALABAMA                        6   46.2   40.6   43.1
 8 01015       CALHOUN     ALABAMA                        3   40.6   33.3   33.2
 9 01017       CHAMBERS    ALABAMA                        6   47.5   41.0   45.5
10 01019       CHEROKEE    ALABAMA                        8   44.7   33.6   23.7
# ℹ 3,644 more rows
# ℹ abbreviated name: ¹rural_urban_continuum_code
# ℹ 7 more variables: `2012` <dbl>, `2016` <dbl>, `2020` <dbl>,
#   diff_2004_2008 <dbl>, diff_2008_2012 <dbl>, diff_2012_2016 <dbl>,
#   diff_2016_2020 <dbl>

Export for Tableau

The information was exported to import into Tableau. Since it didn’t map correctly, I did pull a FIPS lat/long files and associate it with the data in excel before I imported it in.

library("writexl")
write_xlsx(pivot_data,"C:/Users/Administrator/Documents/Data110/Datasets/voting/pivot10.xlsx")

3. Visualation 1: Democratic Party Percentage of Vote Over Time (GEOMAPS)

First Chart: To start the exploration of voting results, I put side-by-side geomaps on a page dashboard to view the change over time.

Link: https://public.tableau.com/app/profile/daniel.bland4240/viz/voting_by_county/Dashboard1

Analysis: A few key points:

2008: Obama won across the board as he built a solid coalition, but not Southern Appalachia, the Ozark Region, or Northern Texas area. Important to note those regions are demographically very white, so even though he won large parts of the south, the southern white voting block leaned toward McCain, especially after he chose Sarah Palin as his VP, which could explain it., as she was more of a social conservative.
2012: Even though Obama won a second term, he lost votes across the board, and continued to lose more support in Southern Appalachia furthering that trend.
2016: When Trump won, the Democratic party lost votes almost everywhere, but it’s more pronounced in the Rustbelt. It may because Clinton’s history of supporting trade deals that shipped rural jobs overseas. Trump really criticised this aspect of Clinton. People also speculate it’s because a lot of areas didn’t recover from the Great Recession and chalk it up as economic-based. It’s also important to note that, more or less, that a lot of Democrats stayed home and weren’t that enthusiastic about voting for Clinton. The opposite was happening on the Republican side.
2020: Democrats won back voters with Biden, but the Texas Counties on the Southern border show drastic losses and abandoned the Democratic party. Those Counties show upwards of a 30% drop in voter percentages for the Democratic party. For me it’s sort of a chicken or the egg problem. It’s difficult to tell if this trend was from Republicans constant hammering of the border crisis, or if they saw this data, and picked it up as an issue. Probably a little bit of both.

Backup articles:

Obama’s win in 2008 and losing Appalachia:
https://www.nytimes.com/2008/11/11/us/politics/11south.html

Trump’s winning of the Rustbelt in 2016:
https://www.reuters.com/article/us-usa-election-coronavirus-insight/in-2016-trump-won-these-rust-belt-counties-on-the-economy-in-2020-he-might-lose-them-over-coronavirus-idUSKBN27416C/

Trump holding onto Texas Border Counties in 2020:
https://www.texastribune.org/2020/11/04/joe-biden-texas-border-democrats/

4. Scatter plots comparing Income & Unemployment with Democrat Percentage of Vote (2020)

Data combined into one data-frame

# Convert the FIPS code in unemployment_data to character
unemployment_data <- unemployment_data |>
  mutate(fips_code = as.character(fips_code))

# Select 2020 vote percentage
selected_data <- pivot_data |>
  select(county_fips, county_name, state, rural_urban_continuum_code, `2020`)

# Join unemployment data
joined_data <- selected_data |>
  left_join(unemployment_data, by = c("county_fips" = "fips_code")) |>
  select(county_fips, `2020`, county_name, rural_urban_continuum_code, unemployment_rate_2020)

# Join income data
final_data <- joined_data |>
  left_join(income_data, by = "county_fips") |>
  select(county_fips, county_name, rural_urban_continuum_code, `2020`, unemployment_rate_2020, median_household_income)

# View the final dataset
head(final_data)

# A tibble: 6 × 6
# Groups:   county_fips [2]
  county_fips county_name rural_urban_continuum_…¹ `2020` unemployment_rate_2020
  <chr>       <chr>                          <dbl>  <dbl>                  <dbl>
1 01001       AUTAUGA                            2   27.0                     NA
2 01001       AUTAUGA                            2   27.0                     NA
3 01001       AUTAUGA                            2   27.0                     NA
4 01001       AUTAUGA                            2   27.0                     NA
5 01003       BALDWIN                            4   NA                       NA
6 01003       BALDWIN                            4   NA                       NA
# ℹ abbreviated name: ¹rural_urban_continuum_code
# ℹ 1 more variable: median_household_income <dbl>

Simple scatterplot: Median household income and democratic vote percentage (2020)

# Scatterplot for Democratic Vote Percentage vs Median Household Income
ggplot(final_data, aes(x = median_household_income, y = `2020`)) +
  geom_point() +
  geom_smooth(method = "auto", se = FALSE, color = "red") +  # Add a red trendline
  labs(
    x = "Median Household Income",
    y = "Democratic Vote Percentage (2020)",
    title = "Scatterplot of Democratic Vote Percentage vs. Median Household Income (2020)"
  )

Analysis

The trend shows a high democratic vote percentage for low income counties, but they lose more of the vote as it proceed to $40,000. Then it turns around and slowly increases the more money you make. Almost in the form of a U-Shape.

Highcharter scatter plot with tooltip: Median household income and democratic vote percentage (2020)

library(highcharter)

# Categorize rural_urban_continuum_code into 'Urban', 'Suburban', 'Rural'
final_data <- final_data |>
  mutate(group = case_when(
    rural_urban_continuum_code == 1 ~ "Urban",
    rural_urban_continuum_code == 2 ~ "Suburban",
    TRUE ~ "Rural"
  ))

# Extra information for tooltips
final_data$tooltip_info <- paste(
  "County FIPS: ", final_data$county_fips, 
  "<br>County: ", final_data$county_name, 
  "<br>Median Household Income: ", final_data$median_household_income,
  "<br>Democratic Vote (%): ", formatC(final_data$`2020`, format = 'f', digits = 2),
  "<br>Rural Urban Code: ", final_data$rural_urban_continuum_code,
  "<br>Group: ", final_data$group,
  sep = ""
)

# Define a custom color palette for the groups
group_colors <- c(
  "Urban" = "#0072B2",     # Blue
  "Suburban" = "#009E73",  # Green
  "Rural" = "#D55E00"      # Orange
)

# Scatter plot using highcharter with custom colors and legend labels
highchart() |>
  hc_title(text = "Median Household Income and Democrat Vote Percentage in 2020") |>
  hc_xAxis(title = list(text = "Median Household Income"), labels = list(format = '{value}')) |>
  hc_yAxis(title = list(text = "Democrat Vote Percentage"), labels = list(format = '{value}%')) |>
  hc_add_series(data = final_data |>
                filter(group == "Urban"),
                type = 'scatter',
                hcaes(x = median_household_income,
                      y = `2020`,
                      extra = tooltip_info),
                name = 'Urban',
                marker = list(radius = 3, opacity = 0.6)) |>
  hc_add_series(data = final_data |>
                filter(group == "Suburban"),
                type = 'scatter',
                hcaes(x = median_household_income,
                      y = `2020`,
                      extra = tooltip_info),
                name = 'Suburban', marker = list(radius = 3, opacity = 0.6)) |>
  hc_add_series(data = final_data |>
                  filter(group == "Rural"),
                type = 'scatter',
                hcaes(x = median_household_income,
                      y = `2020`, extra = tooltip_info),
                name = 'Rural',
                marker = list(radius = 3, opacity = 0.6)) |>
  hc_colors(group_colors) |>
  hc_tooltip(useHTML = TRUE, headerFormat = '<small>{point.key}</small><br/>',
              pointFormat = '{point.extra}') |>
  hc_legend(enabled = TRUE) |>
  hc_chart(zoomType = 'xy') |>
  hc_exporting(enabled = TRUE) |>
  hc_caption(text = "Source: Voting Results - Harvard Dataverse / Median Household Income - USDA/Census")

Analysis

This chart shows that household income is higher in urban areas, lower in surburban areas, and even lower in rural areas.
The urban areas have a huge range and not as dense. Surburban areas are more uniform and a little less dense. And rural areas have a higher range, but has so many points pretty dense, almost a blob.

Simple scatterplot: Unemployment rate and democrat vote percentage (2020)

# Scatterplot for Democratic Vote Percentage vs Median Household Income
ggplot(final_data, aes(x = unemployment_rate_2020, y = `2020`)) +
  geom_point() +
  geom_smooth(method = "auto", se = FALSE, color = "red") +  # Add a red trendline
  labs(
    x = "Unemployment Rate",
    y = "Democratic Vote Percentage (2020)",
    title = "Scatterplot of Democratic Vote Percentage vs. Unemployment Rate (2020)"
  )

Analysis

The trend follows the higher the unemployment rate, the more likely the democrat vote is higher in those counties. The trend tends to be flat from 5% to 10% unemployment rate, but has a sharp increase from 0%-5% and beyond 10%.
The trends seems to suggest in areas that have low unemployment where policies are working with proper funding of education and infrastructure,

Highcharter scatterplot with tooltip: Unemployment rate and democrat vote percentage (2020)

library(highcharter)

# Categorize rural_urban_continuum_code into 'Urban', 'Suburban', 'Rural'
final_data <- final_data |>
  mutate(group = case_when(
    rural_urban_continuum_code == 1 ~ "Urban",
    rural_urban_continuum_code == 2 ~ "Suburban",
    TRUE ~ "Rural"
  ))

# Extra information for tooltips
final_data$tooltip_info <- paste(
  "County FIPS: ", final_data$county_fips, 
  "<br>County: ", final_data$county_name, 
  "<br>Unemployment Rate: ", final_data$unemployment_rate_2020,
  "<br>Democratic Vote (%): ", formatC(final_data$`2020`, format = 'f', digits = 2),
  "<br>Rural Urban Code: ", final_data$rural_urban_continuum_code,
  "<br>Group: ", final_data$group,
  sep = ""
)

# Define a custom color palette for the groups
group_colors <- c(
  "Urban" = "#0072B2",     # Blue
  "Suburban" = "#009E73",  # Green
  "Rural" = "#D55E00"      # Orange
)

# Scatter plot using highcharter with custom colors and legend labels
highchart() |>
  hc_title(text = "Unemployment Rate and Democrat Vote Percentage in 2020") |>
  hc_xAxis(title = list(text = "Unemployment Rate"), labels = list(format = '{value}%')) |>
  hc_yAxis(title = list(text = "Democrat Vote Percentage"), labels = list(format = '{value}%')) |>
  hc_add_series(data = final_data |>
                filter(group == "Urban"),
                type = 'scatter',
                hcaes(x = unemployment_rate_2020,
                      y = `2020`,
                      extra = tooltip_info),
                name = 'Urban',
                marker = list(radius = 3, opacity = 0.6)) |>
  hc_add_series(data = final_data |>
                filter(group == "Suburban"),
                type = 'scatter',
                hcaes(x = unemployment_rate_2020,
                      y = `2020`,
                      extra = tooltip_info),
                name = 'Suburban', marker = list(radius = 3, opacity = 0.6)) |>
  hc_add_series(data = final_data |>
                  filter(group == "Rural"),
                type = 'scatter',
                hcaes(x = unemployment_rate_2020,
                      y = `2020`, extra = tooltip_info),
                name = 'Rural',
                marker = list(radius = 3, opacity = 0.6)) |>
  hc_colors(group_colors) |>
  hc_tooltip(useHTML = TRUE, headerFormat = '<small>{point.key}</small><br/>',
              pointFormat = '{point.extra}') |>
  hc_legend(enabled = TRUE) |>
  hc_chart(zoomType = 'xy') |>
  hc_exporting(enabled = TRUE) |>
  hc_caption(text = "Source: Voting Results - Harvard Dataverse / Unemployment Rate - USDA/Census")

Analysis

In this chart, the rural area have a huge range for unemployment rate across the whole plot. Suburban areas against, pretty uniform and centered on the plot. Urban areas also have a huge range.
It’s important to note, this was in 2020, where the Pandemic affected the economy and unemployment rate.