Project 2 - An Exploration of Voting and Census Data
1. Introduction
Topic
The topic I decided to explore is voting results over time, paired with demographic data pulled from the census. Since that was a tall order, I settled on the following variables:
Urban-rural codes as a grouping
Median income totals
Year used over time on the x axis
Party candidate totals and total vote
Background
I’m interested in how circumstances drive voting behavior and how districts and areas change over time. Most of us know by now that politics are tribal, but I’m more interested in how that tribalism shifts over time and in the factors that drive voters’ motivation, or lack thereof. This can give us insight into how, why, or where certain candidates become ideologically captured, and how voters respond when that capture stands to affect their standard of living or lifestyle. It’s easy to say certain groups vote against their self-interest, which is pretty presumptuous, because it’s just as easy to argue that they vote the way they do for the opposite reason: the conditions they face convince them they don’t stand to benefit from the opposing candidate either. It’s that constant tug of war that makes me want to look for trends across time and place in political leanings.
I think this is especially pressing because of the urban-rural divide: after the Great Recession, many rural areas did not recover as much as the rest of the country. The ‘gold standard’ trade deals didn’t help either, and together these set the conditions for rural attitudes and voting behavior. So I decided to explore some of the data to see what I could come up with.
Rural-Urban Codes for Reference
| Code | Description |
|---|---|
| Metro counties: | |
| 1 | Counties in metro areas of 1 million population or more |
| 2 | Counties in metro areas of 250,000 to 1 million population |
| 3 | Counties in metro areas of fewer than 250,000 population |
| Non-metro counties: | |
| 4 | Urban population of 20,000 or more, adjacent to a metro area |
| 5 | Urban population of 20,000 or more, not adjacent to a metro area |
| 6 | Urban population of 2,500 to 19,999, adjacent to a metro area |
| 7 | Urban population of 2,500 to 19,999, not adjacent to a metro area |
| 8 | Completely rural or less than 2,500 urban population, adjacent to a metro area |
| 9 | Completely rural or less than 2,500 urban population, not adjacent to a metro area |
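If readable labels are wanted in later plots, a small lookup can be built straight from the table above. The sketch below is a hypothetical helper; the label wording is my own shorthand.

```r
# Hypothetical helper: short labels for the RUCC codes in the reference table above
rucc_labels <- c(
  "1" = "Metro, 1M+",
  "2" = "Metro, 250K-1M",
  "3" = "Metro, <250K",
  "4" = "Nonmetro, urban 20K+, adjacent",
  "5" = "Nonmetro, urban 20K+, not adjacent",
  "6" = "Nonmetro, urban 2.5K-20K, adjacent",
  "7" = "Nonmetro, urban 2.5K-20K, not adjacent",
  "8" = "Rural, adjacent to metro",
  "9" = "Rural, not adjacent to metro"
)
# e.g. factor(rural_urban_continuum_code, levels = 1:9, labels = rucc_labels)
```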
2. Some Cited Articles
https://www.washingtonpost.com/politics/2022/03/11/what-political-shift-rural-counties-looks-like-since-2000
https://www.marylandmatters.org/2022/02/04/opinion-democrats-lose-90-of-rural-counties-in-america-here-is-why
3. Cleaning
A few things on cleaning. The usual work was done to clean up the headers. The two main datasets I used were the income and voter datasets, but I also had an education dataset that I only used for its rural-urban codes, which made a fun grouping variable given how many different regional characteristics there are. I got stuck a few times on the voter dataset because of the way the data is organized. After some back and forth with group_by, it turned out the problem only affects the last couple of election cycles because of how results are reported: the voting method column itemizes ‘early voting’ and ‘mail in’ as separate rows, but the ‘total votes’ per county for each candidate is repeated on every row. If I had been unfamiliar with the data I might not have noticed and just thought it was interesting, but the resulting totals were clearly wrong.
Load the libraries and set the working directory
I’m loading a few datasets here to explore now and use later.
library(tidyverse)
library(tidyr)
setwd("C:/Users/Administrator/Documents/Data110/Datasets/voting")
voting_data <- read_csv("countypres_2000-2020.csv")
education_data <- read_csv("Education.csv")
income_data <- read_csv("County_Income_.csv")
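One note on reading these in: the joins below key on the FIPS code, so it needs to stay a character column with its leading zeros intact. If read_csv ever guesses it as a number, forcing the type at read time avoids silent mismatches. A minimal sketch, assuming the column is named county_fips as in the voting file:

```r
# Sketch: force the FIPS column to character so leading zeros like "01001" survive
voting_data <- read_csv("countypres_2000-2020.csv",
                        col_types = cols(county_fips = col_character()))
```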
Clean up column names and view headers
names(voting_data) <- tolower(names(voting_data))
names(voting_data) <- gsub("[ ,()]", "_", names(voting_data))
head(voting_data)
# A tibble: 6 × 12
year state state_po county_name county_fips office candidate party
<dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 2000 ALABAMA AL AUTAUGA 01001 US PRESIDENT AL GORE DEMO…
2 2000 ALABAMA AL AUTAUGA 01001 US PRESIDENT GEORGE W. B… REPU…
3 2000 ALABAMA AL AUTAUGA 01001 US PRESIDENT RALPH NADER GREEN
4 2000 ALABAMA AL AUTAUGA 01001 US PRESIDENT OTHER OTHER
5 2000 ALABAMA AL BALDWIN 01003 US PRESIDENT AL GORE DEMO…
6 2000 ALABAMA AL BALDWIN 01003 US PRESIDENT GEORGE W. B… REPU…
# ℹ 4 more variables: candidatevotes <dbl>, totalvotes <dbl>, version <dbl>,
# mode <chr>
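As a quick sanity check on the reporting issue described in the Cleaning notes, the sketch below counts how many rows each county/candidate pair has per year; years where the mode column splits results into ‘early voting’, ‘mail in’, and so on will show more than one row per candidate.

```r
# Sketch: rows per county/candidate by year; values > 1 mean results are split by voting mode
voting_data |>
  count(year, county_fips, candidate) |>
  group_by(year) |>
  summarize(max_rows_per_candidate = max(n))
```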
names(education_data) <- tolower(names(education_data))
names(education_data) <- gsub("[ ,()]", "_", names(education_data))
head(education_data)
# A tibble: 6 × 55
federal_information_processing_standa…¹ state area_name 2003_rural-urban_con…²
<chr> <chr> <chr> <dbl>
1 00000 US United S… NA
2 01000 AL Alabama NA
3 01001 AL Autauga … 2
4 01003 AL Baldwin … 4
5 01005 AL Barbour … 6
6 01007 AL Bibb Cou… 1
# ℹ abbreviated names: ¹federal_information_processing_standard__fips__code,
# ²`2003_rural-urban_continuum_code`
# ℹ 51 more variables: `2003_urban_influence_code` <dbl>,
# `2013_rural-urban_continuum_code` <dbl>, `2013_urban_influence_code` <dbl>,
# less_than_a_high_school_diploma__1970 <dbl>,
# high_school_diploma_only__1970 <dbl>,
# `some_college__1-3_years___1970` <dbl>, …
names(income_data) <- tolower(names(income_data))
names(income_data) <- gsub("[ ,()]", "_", names(income_data))
head(income_data)
# A tibble: 6 × 6
county_fips postal name year poverty_estimate median_household_inc…¹
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 01001 AL Autauga Coun… 2008 5366 51622
2 01003 AL Baldwin Coun… 2008 17110 51957
3 01005 AL Barbour Coun… 2008 6399 30896
4 01007 AL Bibb County 2008 3753 41076
5 01009 AL Blount County 2008 7461 46086
6 01011 AL Bullock Coun… 2008 3074 26980
# ℹ abbreviated name: ¹median_household_income
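Since the same two cleaning lines are applied to all three datasets, they could be wrapped in a small helper. This is just a sketch of that refactor (janitor::clean_names() would be another option if that package is available):

```r
# Sketch: the same header cleaning as above, wrapped in a reusable function
clean_headers <- function(df) {
  names(df) <- tolower(names(df))
  names(df) <- gsub("[ ,()]", "_", names(df))
  df
}
# usage: income_data <- clean_headers(income_data)
```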
Make dataset for fips code and urban-rural codes
Since the urban-rural codes are a subset of one of my datasets, I’m going to pull them out into their own dataset and rejoin when necessary.
# New dataset for urban-rural codes
urban_rural_codes <- education_data |>
select("federal_information_processing_standard__fips__code",
`2003_rural-urban_continuum_code`,
`2003_urban_influence_code`,
`2013_rural-urban_continuum_code`,
`2013_urban_influence_code`)
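Before joining, it’s worth confirming the lookup really has one row per county, since a duplicate FIPS code would silently multiply rows in the join. A quick check (a sketch):

```r
# Sketch: list any FIPS code that appears more than once in the lookup (ideally none)
urban_rural_codes |>
  count(federal_information_processing_standard__fips__code) |>
  filter(n > 1)
```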
Clean voting data with a group by
After some exploration I realized my numbers were coming out wrong. A few states report separate rows for different voting modes, and calculations on those rows duplicate the total votes. So I grouped by year, county_fips, and party, summed the candidate votes for each party, and took the max of total votes so the county total isn’t counted more than once. It’s the lazier version of writing a nested query.
# Group by county_fips and party, and calculate the total_candidate_votes and total_votes
voting_cleaned <- voting_data |>
group_by(year,county_fips,party) |>
summarize(
total_candidate_votes = sum(candidatevotes),
total_votes = max(totalvotes),
county_name = max(county_name),
state = max(state)
)
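To verify the fix, one can check that within each county and year the summed party votes no longer exceed the reported county total. This is just a sketch of that check:

```r
# Sketch: list county-years where summed party votes exceed the reported county total
voting_cleaned |>
  group_by(year, county_fips) |>
  summarize(party_sum = sum(total_candidate_votes),
            county_total = max(total_votes), .groups = "drop") |>
  filter(party_sum > county_total)
```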
Join fips codes with voter data and income data
voting_results <- voting_cleaned |>
left_join(urban_rural_codes,
by = c("county_fips" = "federal_information_processing_standard__fips__code")) |>
# Pick the code vintage that matches the election year
mutate(rural_urban_continuum_code = ifelse(year < 2013,
`2003_rural-urban_continuum_code`,
`2013_rural-urban_continuum_code`),
urban_influence_code = ifelse(year < 2013, `2003_urban_influence_code`,
`2013_urban_influence_code`))
# Show summary to view NA
summary(voting_results)
      year      county_fips           party          total_candidate_votes
Min. :2000 Length:64144 Length:64144 Min. : 0
1st Qu.:2004 Class :character Class :character 1st Qu.: 196
Median :2012 Mode :character Mode :character Median : 1800
Mean :2010 Mean : 12206
3rd Qu.:2016 3rd Qu.: 6919
Max. :2020 Max. :3028885
total_votes county_name state
Min. : 0 Length:64144 Length:64144
1st Qu.: 4994 Class :character Class :character
Median : 10788 Mode :character Mode :character
Mean : 42207
3rd Qu.: 28485
Max. :4264365
2003_rural-urban_continuum_code 2003_urban_influence_code
Min. :1.000 Min. : 1.000
1st Qu.:3.000 1st Qu.: 2.000
Median :6.000 Median : 5.000
Mean :5.097 Mean : 5.403
3rd Qu.:7.000 3rd Qu.: 8.000
Max. :9.000 Max. :12.000
NA's :853 NA's :853
2013_rural-urban_continuum_code 2013_urban_influence_code
Min. :1.000 Min. : 1.000
1st Qu.:2.000 1st Qu.: 2.000
Median :6.000 Median : 5.000
Mean :4.974 Mean : 5.211
3rd Qu.:7.000 3rd Qu.: 8.000
Max. :9.000 Max. :12.000
NA's :853 NA's :853
rural_urban_continuum_code urban_influence_code
Min. :1.000 Min. : 1.000
1st Qu.:3.000 1st Qu.: 2.000
Median :6.000 Median : 5.000
Mean :5.053 Mean : 5.334
3rd Qu.:7.000 3rd Qu.: 8.000
Max. :9.000 Max. :12.000
NA's :853 NA's :853
There are some records with no rural-urban codes because they have no FIPS code. They are from Maine, Rhode Island, and DC, and they look like statewide write-ins.
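A quick way to see exactly which records those are (a sketch; ungroup() just avoids counting within the grouped summary):

```r
# Sketch: list the records that did not pick up a rural-urban code
voting_results |>
  ungroup() |>
  filter(is.na(rural_urban_continuum_code)) |>
  count(state, county_name, sort = TRUE)
```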
income_cleaned <- income_data |>
left_join(urban_rural_codes,
by = c("county_fips" = "federal_information_processing_standard__fips__code")) |>
# Pick the code vintage that matches the year
mutate(rural_urban_continuum_code = ifelse(year < 2013,
`2003_rural-urban_continuum_code`,
`2013_rural-urban_continuum_code`),
urban_influence_code = ifelse(year < 2013, `2003_urban_influence_code`,
`2013_urban_influence_code`))
# Peek at the first rows of the joined data
head(income_cleaned)
# A tibble: 6 × 12
county_fips postal name year poverty_estimate median_household_inc…¹
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 01001 AL Autauga Coun… 2008 5366 51622
2 01003 AL Baldwin Coun… 2008 17110 51957
3 01005 AL Barbour Coun… 2008 6399 30896
4 01007 AL Bibb County 2008 3753 41076
5 01009 AL Blount County 2008 7461 46086
6 01011 AL Bullock Coun… 2008 3074 26980
# ℹ abbreviated name: ¹median_household_income
# ℹ 6 more variables: `2003_rural-urban_continuum_code` <dbl>,
# `2003_urban_influence_code` <dbl>, `2013_rural-urban_continuum_code` <dbl>,
# `2013_urban_influence_code` <dbl>, rural_urban_continuum_code <dbl>,
# urban_influence_code <dbl>
4. Visualization of Income by Urban-Rural Codes
# Filter the dataset for the year 2020
filtered_data_2020 <- income_cleaned |>
filter(year == 2020)
# Boxplot for each rural-urban code
boxplot_2020 <- ggplot(filtered_data_2020, aes(x = factor(rural_urban_continuum_code),
y = median_household_income)) +
geom_boxplot() +
labs(x = "Rural-Urban Code", y = "Median Household Income") +
ggtitle("Boxplots of Median Household Income by Rural-Urban Code in 2020") +
theme_minimal()
print(boxplot_2020)
This is to satisfy the statistical analysis requirement. I tried a couple of different variations of box plots, but I found that focusing on one year, the most recent being the most pertinent, was the most interesting. Not to mention, I love the simplicity of the visualization. Unsurprisingly, code 1 is the most urban, with a huge spread between the min and max and a higher median. But you see a lot of interesting things happen. For example, codes 8 and 9 have the same population profile, but 8 is adjacent to a metro area and 9 is not. So if you are in a small rural area near a city, you see both the lower and higher end of the spectrum.
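If a formal test is wanted alongside the boxplots, a Kruskal-Wallis test is one option for checking whether median household income differs across the RUCC groups. This is only a sketch of an extra step, not something the plot above depends on:

```r
# Sketch: non-parametric test of income differences across rural-urban codes (2020 data)
kruskal.test(median_household_income ~ factor(rural_urban_continuum_code),
             data = filtered_data_2020)
```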
5. Three Visualizations using Highcharts
Highcharts Vote Percentage Visualization
This is largely a check that the data is good, comparing it against the Wikipedia percentage totals for the candidates.
library(highcharter)
# Party vote percentages
party_votes <- voting_results |>
group_by(year, party) |>
summarize(Percentage = round(sum(total_candidate_votes) / sum(total_votes) * 100, 2))
# Highcharter line graph
line_chart <- party_votes |>
hchart(type = "line", hcaes(x = year, y = Percentage, group = party)) |>
hc_title(text = "Party Vote Percentage Over Time") |>
hc_xAxis(title = list(text = "Year")) |>
hc_yAxis(title = list(text = "Percentage"))
# Display
line_chart
Democratic percentage over time using urban-rural codes
# Percentage of Democratic votes by year and urban-rural code
percentage_by_year_and_code <- voting_results |>
filter(party == "DEMOCRAT" & !is.na(rural_urban_continuum_code)) |>
group_by(year, rural_urban_continuum_code) |>
summarize(Democratic_percentage = round(sum(total_candidate_votes) / sum(total_votes) * 100, 2))
# Highchart line graph with legend
line_chart <- percentage_by_year_and_code |>
hchart(type = "line", hcaes(x = year,
y = Democratic_percentage,
group = rural_urban_continuum_code)) |>
# Legend
hc_title(text = "Democratic Percentage Over Time by RUCC Code") |>
hc_xAxis(title = list(text = "Year")) |>
hc_yAxis(title = list(text = "Percentage of Democratic Votes")) |>
hc_legend(enabled = TRUE)
# Display line chart
line_chart
The visualization above is interesting: the metro areas are largely flat, but Democrats took a major dive in rural areas. However, after 2016 every region type trended up, which I didn’t expect. I would almost have expected the lines to move in opposite directions as certain areas doubled down or polarization increased. Another thing I found interesting is that in the early years the grouping between codes 1 and 9 is a lot tighter than it is now.
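To eyeball the same numbers as a table, the percentages can be spread into one column per RUCC code with pivot_wider from tidyr (loaded earlier); a quick sketch:

```r
# Sketch: wide table of Democratic percentage, one column per rural-urban code
percentage_by_year_and_code |>
  pivot_wider(names_from = rural_urban_continuum_code,
              values_from = Democratic_percentage)
```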
Third-party percentage by RUCC Code
# Third-party percentage by year and rural-urban code
percentage_by_year_and_thirdparty <- voting_results |>
filter(!(party %in% c("DEMOCRAT", "REPUBLICAN"))
& !is.na(rural_urban_continuum_code)) |>
group_by(year, rural_urban_continuum_code) |>
summarize(third_party_percentage = round(sum(total_candidate_votes) / sum(total_votes) * 100, 2))
# Highchart line graph with a legend
line_chart <- percentage_by_year_and_thirdparty |>
hchart(type = "line", hcaes(x = year,
y = third_party_percentage,
group = rural_urban_continuum_code)) |>
# Legend
hc_title(text = "Third-Party Percentage Over Time by RUCC Code") |>
hc_xAxis(title = list(text = "Year")) |>
hc_yAxis(title = list(text = "Percentage of Votes")) |>
hc_legend(enabled = TRUE)
# Display line chart
line_chart
This one is fun as well. The line is somewhat flat, rising slightly around 2004 as people cast protest votes. But in 2016, with both major-party candidates severely unpopular, there was a major spike in votes for candidates outside the two main parties. Then as 2020 approached, no matter the region, both groups got back on board. There will always be debate over the extent to which a spoiler actually spoils an election, especially with two third-party candidates that essentially cancel each other out, but I think the sharp rise in 2016 to the highest level in quite some time, followed by the largest drop in quite some time, says something about our political climate. I’ll close with two more background articles: one demonstrating the unpopularity of the 2016 candidates, and one on the record turnout in 2020 (on both sides).
2016 - two of the most unpopular candidates
https://fivethirtyeight.com/features/americans-distaste-for-both-trump-and-clinton-is-record-breaking/
https://fivethirtyeight.com/features/historic-turnout-in-2020-not-so-far/