library(tidyverse)
library(tidyr)
library(ggplot2)
setwd("C:/Users/dbland/Downloads/Project/Datasets/voting")
voting_data <- read_csv("countypres_2000-2020.csv")
education_data <- read_csv("Education.csv")
income_data <- read_csv("County_Income_.csv")
unemployment_data <- read_csv("Unemployment.csv")Project 3
1. Introduction
The dataset I’m using is the the same dataset I used as the previous one. The previous charts explored the urban-rural codes and the associated results of the Presidential votes. For this particular project, I’m going to use the same dataset, but I’m going to explore further into the geographical affiliation of voters by County and the resulting change in the vote for the Presidential election over time. The following variables are:
Lat/long associated with County
FIPS County code
Multiple years of Presidential results
Party candidate totals for multiple years
Background
Like previously mentioned, I’m interested in how circumstances drive voting behavior and how districts and areas change over time. Though previously I looked a urban-rural codes, in this project I’m going to dive deeper into using GIS and the FIPS code to map the voter percentages to their exact geographical area. By looking at their geographical area, we may find trends in at least trying to understand the drastic change in voter resentment for certain areas and/or how preference for a party changes over time.
I find this especially pressing because of the urban-rural divide and trying to understand some common ground and reducing polarity. In trying to understand voting decisions, we may try to understand each other. So I decided to explore some of the data to see what I could come up with. Usually when it comes to understanding our familiarity of the dataset, it usually comes from simply what party urban/rural voters choose exemplified by the map below:
Data-sources
I gathered three datasets for elections results, income, education background, and background. These datasets were associated by FIPS county code. Not all variables are used, but it allowed a plethora of data for exploration.
Voting Results by County - 2000-2020 (County FIPS):
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/VOQCHQ
Income - 2000-2020 (County FIPS):
https://www.census.gov/programs-surveys/saipe/data/datasets.html
Education Background & Unemployment - 2000-2020 (County FIPS):
https://www.ers.usda.gov/data-products/county-level-data-sets/county-level-data-sets-download-data
2. Cleaning
A few datasets were cleaned and left open for exploration from the previous project. The usual cleaning was performed to clean headers initially. The main two data-sets I used was the income dataset and voter data-sets. In the previous project, I got stuck a few times on the voter dataset because of the way the data was associated. Doing some back and fourth and doing some group by in the last project, it was only a problem in the last couple election cycles because the way it was reported. Since there are different voting methods, the voting method itemized and separated ‘early voting’ and ‘mail in’ but the ‘total votes’ per county even though each candidate was the same.
A lat/long dataset was also pulled in to associate each FIPS code with a geographical location, so it can be mapped.
Load the libraries and set the working directory
Clean up column names and view headers
names(voting_data) <- tolower(names(voting_data))
names(voting_data) <- gsub("[ ,()]", "_", names(voting_data))
head(voting_data)# A tibble: 6 × 12
year state state_po county_name county_fips office candidate party
<dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 2000 ALABAMA AL AUTAUGA 01001 US PRESIDENT AL GORE DEMO…
2 2000 ALABAMA AL AUTAUGA 01001 US PRESIDENT GEORGE W. B… REPU…
3 2000 ALABAMA AL AUTAUGA 01001 US PRESIDENT RALPH NADER GREEN
4 2000 ALABAMA AL AUTAUGA 01001 US PRESIDENT OTHER OTHER
5 2000 ALABAMA AL BALDWIN 01003 US PRESIDENT AL GORE DEMO…
6 2000 ALABAMA AL BALDWIN 01003 US PRESIDENT GEORGE W. B… REPU…
# ℹ 4 more variables: candidatevotes <dbl>, totalvotes <dbl>, version <dbl>,
# mode <chr>
names(education_data) <- tolower(names(education_data))
names(education_data) <- gsub("[ ,()]", "_", names(education_data))
head(education_data)# A tibble: 6 × 55
federal_information_processing_standa…¹ state area_name 2003_rural-urban_con…²
<chr> <chr> <chr> <dbl>
1 00000 US United S… NA
2 01000 AL Alabama NA
3 01001 AL Autauga … 2
4 01003 AL Baldwin … 4
5 01005 AL Barbour … 6
6 01007 AL Bibb Cou… 1
# ℹ abbreviated names: ¹federal_information_processing_standard__fips__code,
# ²`2003_rural-urban_continuum_code`
# ℹ 51 more variables: `2003_urban_influence_code` <dbl>,
# `2013_rural-urban_continuum_code` <dbl>, `2013_urban_influence_code` <dbl>,
# less_than_a_high_school_diploma__1970 <dbl>,
# high_school_diploma_only__1970 <dbl>,
# `some_college__1-3_years___1970` <dbl>, …
names(income_data) <- tolower(names(income_data))
names(income_data) <- gsub("[ ,()]", "_", names(income_data))
head(income_data)# A tibble: 6 × 6
county_fips postal name year poverty_estimate median_household_inc…¹
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 01001 AL Autauga Coun… 2008 5366 51622
2 01003 AL Baldwin Coun… 2008 17110 51957
3 01005 AL Barbour Coun… 2008 6399 30896
4 01007 AL Bibb County 2008 3753 41076
5 01009 AL Blount County 2008 7461 46086
6 01011 AL Bullock Coun… 2008 3074 26980
# ℹ abbreviated name: ¹median_household_income
names(unemployment_data) <- tolower(names(unemployment_data))
names(unemployment_data) <- gsub("[ ,()]", "_", names(unemployment_data))
head(unemployment_data)# A tibble: 6 × 8
fips_code state area_name unemployment_rate_2004 unemployment_rate_2008
<dbl> <chr> <chr> <dbl> <dbl>
1 0 US United States 5.5 5.8
2 1000 AL Alabama 5.6 5.8
3 1001 AL Autauga County,… 4.7 5.3
4 1003 AL Baldwin County,… 5.1 4.8
5 1005 AL Barbour County,… 7.1 9.1
6 1007 AL Bibb County, AL 5.4 6
# ℹ 3 more variables: unemployment_rate_2012 <dbl>,
# unemployment_rate_2016 <dbl>, unemployment_rate_2020 <dbl>
Make dataset for fips code and urban-rural codes
I kept the urban-rural codes in-tact in case I needed to use them.
# New dataset for urban-rural codes
urban_rural_codes <- education_data |>
select("federal_information_processing_standard__fips__code",
`2003_rural-urban_continuum_code`,
`2003_urban_influence_code`,
`2013_rural-urban_continuum_code`,
`2013_urban_influence_code`)Clean voting data with a group by
This is to fix the fact that there are a few states that have different modes and when doing calculations it would duplicate like mentioned in the last project.
# Group by county_fips and party, and calculate the total_candidate_votes and total_votes
voting_cleaned <- voting_data |>
group_by(year,county_fips,party) |>
summarize(
total_candidate_votes = sum(candidatevotes),
total_votes = max(totalvotes),
county_name = max(county_name),
state = max(state)
)`summarise()` has grouped output by 'year', 'county_fips'. You can override
using the `.groups` argument.
Join FIPS code with voter data and Income data
voting_results <- voting_cleaned |>
left_join(urban_rural_codes,
by = c("county_fips" = "federal_information_processing_standard__fips__code")) |>
# Add to correct year
mutate(rural_urban_continuum_code = ifelse(year < 2013,
`2003_rural-urban_continuum_code`,
`2013_rural-urban_continuum_code`),
urban_influence_code = ifelse(year < 2013, `2003_urban_influence_code`,
`2013_urban_influence_code`))
# Show summary to view NA
head(voting_results)# A tibble: 6 × 13
# Groups: year, county_fips [2]
year county_fips party total_candidate_votes total_votes county_name state
<dbl> <chr> <chr> <dbl> <dbl> <chr> <chr>
1 2000 01001 DEMOCRAT 4942 17208 AUTAUGA ALAB…
2 2000 01001 GREEN 160 17208 AUTAUGA ALAB…
3 2000 01001 OTHER 113 17208 AUTAUGA ALAB…
4 2000 01001 REPUBLI… 11993 17208 AUTAUGA ALAB…
5 2000 01003 DEMOCRAT 13997 56480 BALDWIN ALAB…
6 2000 01003 GREEN 1033 56480 BALDWIN ALAB…
# ℹ 6 more variables: `2003_rural-urban_continuum_code` <dbl>,
# `2003_urban_influence_code` <dbl>, `2013_rural-urban_continuum_code` <dbl>,
# `2013_urban_influence_code` <dbl>, rural_urban_continuum_code <dbl>,
# urban_influence_code <dbl>
Year, FIPS code, Change In vote
The four election years were filtered with the dem vote percentage to measure over time.
# Filter Dem party and calculate the vote percentage
democrat_data <- voting_results |>
filter(party == "DEMOCRAT") |>
mutate(vote_percentage = total_candidate_votes / total_votes * 100)
# Pivot the data to have years as columns
pivot_data <- democrat_data |>
select(county_fips, county_name, state, rural_urban_continuum_code, year, vote_percentage) |>
spread(key = year, value = vote_percentage)
# Calculate the differences between years
pivot_data <- pivot_data |>
mutate(
diff_2004_2008 = `2008` - `2004`,
diff_2008_2012 = `2012` - `2008`,
diff_2012_2016 = `2016` - `2012`,
diff_2016_2020 = `2020` - `2016`
)
# View the final data
pivot_data# A tibble: 3,654 × 14
# Groups: county_fips [3,157]
county_fips county_name state rural_urban_continuum_…¹ `2000` `2004` `2008`
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 01001 AUTAUGA ALABAMA 2 28.7 23.7 25.8
2 01003 BALDWIN ALABAMA 4 24.8 22.5 23.8
3 01005 BARBOUR ALABAMA 6 49.9 44.8 49.0
4 01007 BIBB ALABAMA 1 38.2 27.5 26.6
5 01009 BLOUNT ALABAMA 1 27.7 18.3 14.5
6 01011 BULLOCK ALABAMA 6 69.2 68.1 74.1
7 01013 BUTLER ALABAMA 6 46.2 40.6 43.1
8 01015 CALHOUN ALABAMA 3 40.6 33.3 33.2
9 01017 CHAMBERS ALABAMA 6 47.5 41.0 45.5
10 01019 CHEROKEE ALABAMA 8 44.7 33.6 23.7
# ℹ 3,644 more rows
# ℹ abbreviated name: ¹rural_urban_continuum_code
# ℹ 7 more variables: `2012` <dbl>, `2016` <dbl>, `2020` <dbl>,
# diff_2004_2008 <dbl>, diff_2008_2012 <dbl>, diff_2012_2016 <dbl>,
# diff_2016_2020 <dbl>
# Convert the FIPS code in unemployment_data to character
unemployment_data <- unemployment_data |>
mutate(fips_code = as.character(fips_code))
# Select 2020 vote percentage
selected_data <- pivot_data |>
select(county_fips, county_name, state, rural_urban_continuum_code, `2020`)
# Join unemployment data
joined_data <- selected_data |>
left_join(unemployment_data, by = c("county_fips" = "fips_code")) |>
select(county_fips, `2020`, county_name, rural_urban_continuum_code, unemployment_rate_2020)
# Join income data
final_data <- joined_data |>
left_join(income_data, by = "county_fips") |>
select(county_fips, county_name, rural_urban_continuum_code, `2020`, unemployment_rate_2020, median_household_income)Warning in left_join(joined_data, income_data, by = "county_fips"): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1 of `x` matches multiple rows in `y`.
ℹ Row 2 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
# View the final dataset
head(final_data)# A tibble: 6 × 6
# Groups: county_fips [2]
county_fips county_name rural_urban_continuum_…¹ `2020` unemployment_rate_2020
<chr> <chr> <dbl> <dbl> <dbl>
1 01001 AUTAUGA 2 27.0 NA
2 01001 AUTAUGA 2 27.0 NA
3 01001 AUTAUGA 2 27.0 NA
4 01001 AUTAUGA 2 27.0 NA
5 01003 BALDWIN 4 NA NA
6 01003 BALDWIN 4 NA NA
# ℹ abbreviated name: ¹rural_urban_continuum_code
# ℹ 1 more variable: median_household_income <dbl>