library(tidyverse)
library(tidyr)
library(ggplot2)
setwd("C:/Users/Administrator/Documents/Data110/Datasets/voting")
voting_data <- read_csv("countypres_2000-2020.csv")
education_data <- read_csv("Education.csv")
income_data <- read_csv("County_Income_.csv")
unemployment_data <- read_csv("Unemployment.csv")Project 3 - Further Exploration of Voting and Census Data
1. Introduction
The dataset I’m using is the the same dataset I used as the previous one. The previous charts explored the urban-rural codes and the associated results of the Presidential votes. For this particular project, I’m going to use the same dataset, but I’m going to explore further into the geographical affiliation of voters by County and the resulting change in the vote for the Presidential election over time. The following variables are:
Lat/long associated with County
FIPS County code
Multiple years of Presidential results
Party candidate totals for multiple years
Background
Like previously mentioned, I’m interested in how circumstances drive voting behavior and how districts and areas change over time. Though previously I looked a urban-rural codes, in this project I’m going to dive deeper into using GIS and the FIPS code to map the voter percentages to their exact geographical area. By looking at their geographical area, we may find trends in at least trying to understand the drastic change in voter resentment for certain areas and/or how preference for a party changes over time.
I find this especially pressing because of the urban-rural divide and trying to understand some common ground and reducing polarity. In trying to understand voting decisions, we may try to understand each other. So I decided to explore some of the data to see what I could come up with. Usually when it comes to understanding our familiarity of the dataset, it usually comes from simply what party urban/rural voters choose exemplified by the map below:

Data-sources
I gathered three datasets for elections results, income, education background, and background. These datasets were associated by FIPS county code. Not all variables are used, but it allowed a plethora of data for exploration.
Voting Results by County - 2000-2020 (County FIPS):
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/VOQCHQ
Income - 2000-2020 (County FIPS):
https://www.census.gov/programs-surveys/saipe/data/datasets.html
Education Background & Unemployment - 2000-2020 (County FIPS):
https://www.ers.usda.gov/data-products/county-level-data-sets/county-level-data-sets-download-data
2. Cleaning
A few datasets were cleaned and left open for exploration from the previous project. The usual cleaning was performed to clean headers initially. The main two data-sets I used was the income dataset and voter data-sets. In the previous project, I got stuck a few times on the voter dataset because of the way the data was associated. Doing some back and fourth and doing some group by in the last project, it was only a problem in the last couple election cycles because the way it was reported. Since there are different voting methods, the voting method itemized and separated ‘early voting’ and ‘mail in’ but the ‘total votes’ per county even though each candidate was the same.
A lat/long dataset was also pulled in to associate each FIPS code with a geographical location, so it can be mapped.
Load the libraries and set the working directory
Clean up column names and view headers
names(voting_data) <- tolower(names(voting_data))
names(voting_data) <- gsub("[ ,()]", "_", names(voting_data))
head(voting_data)# A tibble: 6 × 12
year state state_po county_name county_fips office candidate party
<dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 2000 ALABAMA AL AUTAUGA 01001 US PRESIDENT AL GORE DEMO…
2 2000 ALABAMA AL AUTAUGA 01001 US PRESIDENT GEORGE W. B… REPU…
3 2000 ALABAMA AL AUTAUGA 01001 US PRESIDENT RALPH NADER GREEN
4 2000 ALABAMA AL AUTAUGA 01001 US PRESIDENT OTHER OTHER
5 2000 ALABAMA AL BALDWIN 01003 US PRESIDENT AL GORE DEMO…
6 2000 ALABAMA AL BALDWIN 01003 US PRESIDENT GEORGE W. B… REPU…
# ℹ 4 more variables: candidatevotes <dbl>, totalvotes <dbl>, version <dbl>,
# mode <chr>
names(education_data) <- tolower(names(education_data))
names(education_data) <- gsub("[ ,()]", "_", names(education_data))
head(education_data)# A tibble: 6 × 55
federal_information_processing_standa…¹ state area_name 2003_rural-urban_con…²
<chr> <chr> <chr> <dbl>
1 00000 US United S… NA
2 01000 AL Alabama NA
3 01001 AL Autauga … 2
4 01003 AL Baldwin … 4
5 01005 AL Barbour … 6
6 01007 AL Bibb Cou… 1
# ℹ abbreviated names: ¹federal_information_processing_standard__fips__code,
# ²`2003_rural-urban_continuum_code`
# ℹ 51 more variables: `2003_urban_influence_code` <dbl>,
# `2013_rural-urban_continuum_code` <dbl>, `2013_urban_influence_code` <dbl>,
# less_than_a_high_school_diploma__1970 <dbl>,
# high_school_diploma_only__1970 <dbl>,
# `some_college__1-3_years___1970` <dbl>, …
names(income_data) <- tolower(names(income_data))
names(income_data) <- gsub("[ ,()]", "_", names(income_data))
head(income_data)# A tibble: 6 × 6
county_fips postal name year poverty_estimate median_household_inc…¹
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 01001 AL Autauga Coun… 2008 5366 51622
2 01003 AL Baldwin Coun… 2008 17110 51957
3 01005 AL Barbour Coun… 2008 6399 30896
4 01007 AL Bibb County 2008 3753 41076
5 01009 AL Blount County 2008 7461 46086
6 01011 AL Bullock Coun… 2008 3074 26980
# ℹ abbreviated name: ¹median_household_income
names(unemployment_data) <- tolower(names(unemployment_data))
names(unemployment_data) <- gsub("[ ,()]", "_", names(unemployment_data))
head(unemployment_data)# A tibble: 6 × 8
fips_code state area_name unemployment_rate_2004 unemployment_rate_2008
<dbl> <chr> <chr> <dbl> <dbl>
1 0 US United States 5.5 5.8
2 1000 AL Alabama 5.6 5.8
3 1001 AL Autauga County,… 4.7 5.3
4 1003 AL Baldwin County,… 5.1 4.8
5 1005 AL Barbour County,… 7.1 9.1
6 1007 AL Bibb County, AL 5.4 6
# ℹ 3 more variables: unemployment_rate_2012 <dbl>,
# unemployment_rate_2016 <dbl>, unemployment_rate_2020 <dbl>
Make dataset for fips code and urban-rural codes
I kept the urban-rural codes in-tact in case I needed to use them.
# New dataset for urban-rural codes
urban_rural_codes <- education_data |>
select("federal_information_processing_standard__fips__code",
`2003_rural-urban_continuum_code`,
`2003_urban_influence_code`,
`2013_rural-urban_continuum_code`,
`2013_urban_influence_code`)Clean voting data with a group by
This is to fix the fact that there are a few states that have different modes and when doing calculations it would duplicate like mentioned in the last project.
# Group by county_fips and party, and calculate the total_candidate_votes and total_votes
voting_cleaned <- voting_data |>
group_by(year,county_fips,party) |>
summarize(
total_candidate_votes = sum(candidatevotes),
total_votes = max(totalvotes),
county_name = max(county_name),
state = max(state)
)Join FIPS code with voter data and Income data
voting_results <- voting_cleaned |>
left_join(urban_rural_codes,
by = c("county_fips" = "federal_information_processing_standard__fips__code")) |>
# Add to correct year
mutate(rural_urban_continuum_code = ifelse(year < 2013,
`2003_rural-urban_continuum_code`,
`2013_rural-urban_continuum_code`),
urban_influence_code = ifelse(year < 2013, `2003_urban_influence_code`,
`2013_urban_influence_code`))
# Show summary to view NA
head(voting_results)# A tibble: 6 × 13
# Groups: year, county_fips [2]
year county_fips party total_candidate_votes total_votes county_name state
<dbl> <chr> <chr> <dbl> <dbl> <chr> <chr>
1 2000 01001 DEMOCRAT 4942 17208 AUTAUGA ALAB…
2 2000 01001 GREEN 160 17208 AUTAUGA ALAB…
3 2000 01001 OTHER 113 17208 AUTAUGA ALAB…
4 2000 01001 REPUBLI… 11993 17208 AUTAUGA ALAB…
5 2000 01003 DEMOCRAT 13997 56480 BALDWIN ALAB…
6 2000 01003 GREEN 1033 56480 BALDWIN ALAB…
# ℹ 6 more variables: `2003_rural-urban_continuum_code` <dbl>,
# `2003_urban_influence_code` <dbl>, `2013_rural-urban_continuum_code` <dbl>,
# `2013_urban_influence_code` <dbl>, rural_urban_continuum_code <dbl>,
# urban_influence_code <dbl>
Year, FIPS code, Change In vote
The four election years were filtered with the dem vote percentage to measure over time.
# Filter Dem party and calculate the vote percentage
democrat_data <- voting_results |>
filter(party == "DEMOCRAT") |>
mutate(vote_percentage = total_candidate_votes / total_votes * 100)
# Pivot the data to have years as columns
pivot_data <- democrat_data |>
select(county_fips, county_name, state, rural_urban_continuum_code, year, vote_percentage) |>
spread(key = year, value = vote_percentage)
# Calculate the differences between years
pivot_data <- pivot_data |>
mutate(
diff_2004_2008 = `2008` - `2004`,
diff_2008_2012 = `2012` - `2008`,
diff_2012_2016 = `2016` - `2012`,
diff_2016_2020 = `2020` - `2016`
)
# View the final data
pivot_data# A tibble: 3,654 × 14
# Groups: county_fips [3,157]
county_fips county_name state rural_urban_continuum_…¹ `2000` `2004` `2008`
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 01001 AUTAUGA ALABAMA 2 28.7 23.7 25.8
2 01003 BALDWIN ALABAMA 4 24.8 22.5 23.8
3 01005 BARBOUR ALABAMA 6 49.9 44.8 49.0
4 01007 BIBB ALABAMA 1 38.2 27.5 26.6
5 01009 BLOUNT ALABAMA 1 27.7 18.3 14.5
6 01011 BULLOCK ALABAMA 6 69.2 68.1 74.1
7 01013 BUTLER ALABAMA 6 46.2 40.6 43.1
8 01015 CALHOUN ALABAMA 3 40.6 33.3 33.2
9 01017 CHAMBERS ALABAMA 6 47.5 41.0 45.5
10 01019 CHEROKEE ALABAMA 8 44.7 33.6 23.7
# ℹ 3,644 more rows
# ℹ abbreviated name: ¹rural_urban_continuum_code
# ℹ 7 more variables: `2012` <dbl>, `2016` <dbl>, `2020` <dbl>,
# diff_2004_2008 <dbl>, diff_2008_2012 <dbl>, diff_2012_2016 <dbl>,
# diff_2016_2020 <dbl>
Export for Tableau
The information was exported to import into Tableau. Since it didn’t map correctly, I did pull a FIPS lat/long files and associate it with the data in excel before I imported it in.
library("writexl")
write_xlsx(pivot_data,"C:/Users/Administrator/Documents/Data110/Datasets/voting/pivot10.xlsx")3. Visualation 1: Democratic Party Percentage of Vote Over Time (GEOMAPS)
First Chart: To start the exploration of voting results, I put side-by-side geomaps on a page dashboard to view the change over time.
Link: https://public.tableau.com/app/profile/daniel.bland4240/viz/voting_by_county/Dashboard1

Analysis: A few key points:
2008: Obama won across the board as he built a solid coalition, but not Southern Appalachia, the Ozark Region, or Northern Texas area. Important to note those regions are demographically very white, so even though he won large parts of the south, the southern white voting block leaned toward McCain, especially after he chose Sarah Palin as his VP, which could explain it., as she was more of a social conservative.
2012: Even though Obama won a second term, he lost votes across the board, and continued to lose more support in Southern Appalachia furthering that trend.
2016: When Trump won, the Democratic party lost votes almost everywhere, but it’s more pronounced in the Rustbelt. It may because Clinton’s history of supporting trade deals that shipped rural jobs overseas. Trump really criticised this aspect of Clinton. People also speculate it’s because a lot of areas didn’t recover from the Great Recession and chalk it up as economic-based. It’s also important to note that, more or less, that a lot of Democrats stayed home and weren’t that enthusiastic about voting for Clinton. The opposite was happening on the Republican side.
2020: Democrats won back voters with Biden, but the Texas Counties on the Southern border show drastic losses and abandoned the Democratic party. Those Counties show upwards of a 30% drop in voter percentages for the Democratic party. For me it’s sort of a chicken or the egg problem. It’s difficult to tell if this trend was from Republicans constant hammering of the border crisis, or if they saw this data, and picked it up as an issue. Probably a little bit of both.
Backup articles:
Obama’s win in 2008 and losing Appalachia:
https://www.nytimes.com/2008/11/11/us/politics/11south.html
Trump’s winning of the Rustbelt in 2016:
https://www.reuters.com/article/us-usa-election-coronavirus-insight/in-2016-trump-won-these-rust-belt-counties-on-the-economy-in-2020-he-might-lose-them-over-coronavirus-idUSKBN27416C/
Trump holding onto Texas Border Counties in 2020:
https://www.texastribune.org/2020/11/04/joe-biden-texas-border-democrats/
4. Scatter plots comparing Income & Unemployment with Democrat Percentage of Vote (2020)
Data combined into one data-frame
# Convert the FIPS code in unemployment_data to character
unemployment_data <- unemployment_data |>
mutate(fips_code = as.character(fips_code))
# Select 2020 vote percentage
selected_data <- pivot_data |>
select(county_fips, county_name, state, rural_urban_continuum_code, `2020`)
# Join unemployment data
joined_data <- selected_data |>
left_join(unemployment_data, by = c("county_fips" = "fips_code")) |>
select(county_fips, `2020`, county_name, rural_urban_continuum_code, unemployment_rate_2020)
# Join income data
final_data <- joined_data |>
left_join(income_data, by = "county_fips") |>
select(county_fips, county_name, rural_urban_continuum_code, `2020`, unemployment_rate_2020, median_household_income)
# View the final dataset
head(final_data)# A tibble: 6 × 6
# Groups: county_fips [2]
county_fips county_name rural_urban_continuum_…¹ `2020` unemployment_rate_2020
<chr> <chr> <dbl> <dbl> <dbl>
1 01001 AUTAUGA 2 27.0 NA
2 01001 AUTAUGA 2 27.0 NA
3 01001 AUTAUGA 2 27.0 NA
4 01001 AUTAUGA 2 27.0 NA
5 01003 BALDWIN 4 NA NA
6 01003 BALDWIN 4 NA NA
# ℹ abbreviated name: ¹rural_urban_continuum_code
# ℹ 1 more variable: median_household_income <dbl>
Simple scatterplot: Median household income and democratic vote percentage (2020)
# Scatterplot for Democratic Vote Percentage vs Median Household Income
ggplot(final_data, aes(x = median_household_income, y = `2020`)) +
geom_point() +
geom_smooth(method = "auto", se = FALSE, color = "red") + # Add a red trendline
labs(
x = "Median Household Income",
y = "Democratic Vote Percentage (2020)",
title = "Scatterplot of Democratic Vote Percentage vs. Median Household Income (2020)"
)
Analysis
- The trend shows a high democratic vote percentage for low income counties, but they lose more of the vote as it proceed to $40,000. Then it turns around and slowly increases the more money you make. Almost in the form of a U-Shape.
Highcharter scatter plot with tooltip: Median household income and democratic vote percentage (2020)
library(highcharter)
# Categorize rural_urban_continuum_code into 'Urban', 'Suburban', 'Rural'
final_data <- final_data |>
mutate(group = case_when(
rural_urban_continuum_code == 1 ~ "Urban",
rural_urban_continuum_code == 2 ~ "Suburban",
TRUE ~ "Rural"
))
# Extra information for tooltips
final_data$tooltip_info <- paste(
"County FIPS: ", final_data$county_fips,
"<br>County: ", final_data$county_name,
"<br>Median Household Income: ", final_data$median_household_income,
"<br>Democratic Vote (%): ", formatC(final_data$`2020`, format = 'f', digits = 2),
"<br>Rural Urban Code: ", final_data$rural_urban_continuum_code,
"<br>Group: ", final_data$group,
sep = ""
)
# Define a custom color palette for the groups
group_colors <- c(
"Urban" = "#0072B2", # Blue
"Suburban" = "#009E73", # Green
"Rural" = "#D55E00" # Orange
)
# Scatter plot using highcharter with custom colors and legend labels
highchart() |>
hc_title(text = "Median Household Income and Democrat Vote Percentage in 2020") |>
hc_xAxis(title = list(text = "Median Household Income"), labels = list(format = '{value}')) |>
hc_yAxis(title = list(text = "Democrat Vote Percentage"), labels = list(format = '{value}%')) |>
hc_add_series(data = final_data |>
filter(group == "Urban"),
type = 'scatter',
hcaes(x = median_household_income,
y = `2020`,
extra = tooltip_info),
name = 'Urban',
marker = list(radius = 3, opacity = 0.6)) |>
hc_add_series(data = final_data |>
filter(group == "Suburban"),
type = 'scatter',
hcaes(x = median_household_income,
y = `2020`,
extra = tooltip_info),
name = 'Suburban', marker = list(radius = 3, opacity = 0.6)) |>
hc_add_series(data = final_data |>
filter(group == "Rural"),
type = 'scatter',
hcaes(x = median_household_income,
y = `2020`, extra = tooltip_info),
name = 'Rural',
marker = list(radius = 3, opacity = 0.6)) |>
hc_colors(group_colors) |>
hc_tooltip(useHTML = TRUE, headerFormat = '<small>{point.key}</small><br/>',
pointFormat = '{point.extra}') |>
hc_legend(enabled = TRUE) |>
hc_chart(zoomType = 'xy') |>
hc_exporting(enabled = TRUE) |>
hc_caption(text = "Source: Voting Results - Harvard Dataverse / Median Household Income - USDA/Census")Analysis
This chart shows that household income is higher in urban areas, lower in surburban areas, and even lower in rural areas.
The urban areas have a huge range and not as dense. Surburban areas are more uniform and a little less dense. And rural areas have a higher range, but has so many points pretty dense, almost a blob.
Simple scatterplot: Unemployment rate and democrat vote percentage (2020)
# Scatterplot for Democratic Vote Percentage vs Median Household Income
ggplot(final_data, aes(x = unemployment_rate_2020, y = `2020`)) +
geom_point() +
geom_smooth(method = "auto", se = FALSE, color = "red") + # Add a red trendline
labs(
x = "Unemployment Rate",
y = "Democratic Vote Percentage (2020)",
title = "Scatterplot of Democratic Vote Percentage vs. Unemployment Rate (2020)"
)
Analysis
The trend follows the higher the unemployment rate, the more likely the democrat vote is higher in those counties. The trend tends to be flat from 5% to 10% unemployment rate, but has a sharp increase from 0%-5% and beyond 10%.
The trends seems to suggest in areas that have low unemployment where policies are working with proper funding of education and infrastructure,
Highcharter scatterplot with tooltip: Unemployment rate and democrat vote percentage (2020)
library(highcharter)
# Categorize rural_urban_continuum_code into 'Urban', 'Suburban', 'Rural'
final_data <- final_data |>
mutate(group = case_when(
rural_urban_continuum_code == 1 ~ "Urban",
rural_urban_continuum_code == 2 ~ "Suburban",
TRUE ~ "Rural"
))
# Extra information for tooltips
final_data$tooltip_info <- paste(
"County FIPS: ", final_data$county_fips,
"<br>County: ", final_data$county_name,
"<br>Unemployment Rate: ", final_data$unemployment_rate_2020,
"<br>Democratic Vote (%): ", formatC(final_data$`2020`, format = 'f', digits = 2),
"<br>Rural Urban Code: ", final_data$rural_urban_continuum_code,
"<br>Group: ", final_data$group,
sep = ""
)
# Define a custom color palette for the groups
group_colors <- c(
"Urban" = "#0072B2", # Blue
"Suburban" = "#009E73", # Green
"Rural" = "#D55E00" # Orange
)
# Scatter plot using highcharter with custom colors and legend labels
highchart() |>
hc_title(text = "Unemployment Rate and Democrat Vote Percentage in 2020") |>
hc_xAxis(title = list(text = "Unemployment Rate"), labels = list(format = '{value}%')) |>
hc_yAxis(title = list(text = "Democrat Vote Percentage"), labels = list(format = '{value}%')) |>
hc_add_series(data = final_data |>
filter(group == "Urban"),
type = 'scatter',
hcaes(x = unemployment_rate_2020,
y = `2020`,
extra = tooltip_info),
name = 'Urban',
marker = list(radius = 3, opacity = 0.6)) |>
hc_add_series(data = final_data |>
filter(group == "Suburban"),
type = 'scatter',
hcaes(x = unemployment_rate_2020,
y = `2020`,
extra = tooltip_info),
name = 'Suburban', marker = list(radius = 3, opacity = 0.6)) |>
hc_add_series(data = final_data |>
filter(group == "Rural"),
type = 'scatter',
hcaes(x = unemployment_rate_2020,
y = `2020`, extra = tooltip_info),
name = 'Rural',
marker = list(radius = 3, opacity = 0.6)) |>
hc_colors(group_colors) |>
hc_tooltip(useHTML = TRUE, headerFormat = '<small>{point.key}</small><br/>',
pointFormat = '{point.extra}') |>
hc_legend(enabled = TRUE) |>
hc_chart(zoomType = 'xy') |>
hc_exporting(enabled = TRUE) |>
hc_caption(text = "Source: Voting Results - Harvard Dataverse / Unemployment Rate - USDA/Census")Analysis
In this chart, the rural area have a huge range for unemployment rate across the whole plot. Suburban areas against, pretty uniform and centered on the plot. Urban areas also have a huge range.
It’s important to note, this was in 2020, where the Pandemic affected the economy and unemployment rate.