DATA 606 Data Project Proposal

Data Preparation

library(dplyr)
library(ggplot2)
library(psych)
library(knitr)
library(kableExtra)

# load data
election_results_df <- read.csv('https://raw.githubusercontent.com/zachalexander/data_606_cunysps/master/ProjectProposal/county_level_votes.csv')

education_data_df <- read.csv('https://raw.githubusercontent.com/zachalexander/data_606_cunysps/master/ProjectProposal/education_data_fip.csv')

# the dataset has many columns, I'm subsetting down to only columns of interest
election_results_df <- election_results_df %>% 
  select(combined_fips, county_name, state_abbr, per_dem, per_gop)

# I'm taking the rural/urban continuum codes and creating a separate variable 
# with the qualitative values
education_data_df <- education_data_df %>% 
  select(fips, lesscollege_pct, ruralurban_cc) %>% 
  mutate(ruralurban_grp = ifelse(ruralurban_cc == 1, 
  "Counties in metro areas of 1 million population or more", 
  ifelse(ruralurban_cc == 2, 
         "Counties in metro areas of 250,000 to 1 million population",
  ifelse(ruralurban_cc == 3, 
         "Counties in metro areas of fewer than 250,000 population",
  ifelse(ruralurban_cc == 4, 
         "Urban population of 20,000 or more, adjacent to a metro area",
  ifelse(ruralurban_cc == 5, 
         "Urban population of 20,000 or more, not adjacent to a metro area",
  ifelse(ruralurban_cc == 6, 
         "Urban population of 2,500 to 19,999, adjacent to a metro area",
  ifelse(ruralurban_cc == 7, 
         "Urban population of 2,500 to 19,999, not adjacent to a metro area",
  ifelse(ruralurban_cc == 8, 
         "Completely rural or less than 2,500 urban population, adjacent to a metro area",
  ifelse(ruralurban_cc == 9, 
         "Completely rural or less than 2,500 urban population, adjacent to a metro area", 
         NA))))))))))

# since the FIP codes are found in both data frames, I can use merge to 
# join the information from both data frames into one data frame while 
# creating a few columns to calculate the party winner by county as well 
# as the proportion of individuals in each county with a bachelor's degree or more

election_education_df <- merge(election_results_df, education_data_df, by.x = "combined_fips", by.y = "fips" ) %>%
  mutate(college_or_more_pct = (100 - lesscollege_pct), 
         party_winner = ifelse(per_gop > per_dem, 'Republican', 'Democrat')) %>% 
  select(combined_fips:per_gop, college_or_more_pct, party_winner, ruralurban_cc, ruralurban_grp)

# a quick look at the data frame
head(election_education_df)

Research question

You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.

Is educational attainment or ruralness predictive of the proportion of GOP votes by county in the 2016 presidential election?

I thought these two items would be interesting to look into in lieu of the upcoming 2020 election. I had read a lot of articles after 2016 about the “urban/rural divide”, and educational differences between Republicans and Democrats, and thought it would be interesting to explore the election data a bit more to see if these factors really do seem to have an affect on voter choice. If I do see large differences based on these factors, it would be neat to build some type of predictive model in the lead up to the 2020 election.

Cases

What are the cases, and how many are there?

The cases are the number of counties in the United States (the id is FIPS code) that have educational attainment percentages, ruralness continuum codes, and voting percentages available from the 2016 election. There are 3112 counties in this dataset that contain this information out of a total of 3141 counties in the United States. Although we are missing some county-level data, this dataset still comprises about 99% of the counties that make up the United States.

Data collection

Describe the method of data collection.

The education percentages were calculated by Stephen Pettigrew at Harvard University Dataverse. It appears that the percent with a college degree or more by county population are calculations based on a dataset created by IPUMS NHGIS, University of Minnesota. The rural/urban continuum codes (ruralurban_cc) variable, was adopted from the United States Department of Agriculture Economic Research Service in 2013. These variables were combined by Stephen Pettigrew into one dataset, which also stored other types of election data - outside the scope of this proposal.

Type of study

What type of study is this (observational/experiment)?

This is an observational study.

Data Source

If you collected the data, state self-collected. If not, provide a citation/link.

I found the dataset with 2016 election results on GitHub: https://github.com/tonmcg/US_County_Level_Election_Results_08-16/blob/master/2016_US_County_Level_Presidential_Results.csv

I found the dataset with the education and ruralness data on GitHub as well: https://github.com/MEDSL/2018-elections-unoffical/blob/master/election-context-2018.md

Although this second dataset houses 2018 election data, I am only using the American Community Survey education data (5-year estimates).

Dependent Variable

What is the response variable? Is it quantitative or qualitative?

The response variable is ‘per_gop’ (proportion of GOP vote out of total votes by county) and it is quantitative.

Independent Variable

You should have two independent variables, one quantitative and one qualitative.

My two independent variables are ‘college_or_more_pct’ (quantitative) and ‘ruralurban_grp’ (qualitative).

Relevant summary statistics

Provide summary statistics for each the variables. Also include appropriate visualizations related to your research question (e.g. scatter plot, boxplots, etc). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

First, I thought it would be interesting to just take a quick look at how many counties Donald Trump won in 2016:

table(election_education_df$party_winner)

## 
##   Democrat Republican 
##        487       2625

We can see that when solely looking at the vote proportions by county, Donald Trump won a much larger percentage of U.S. counties in 2016 (albeit, this doesn’t take into account population size, just the number of counties). It was great to see that this split was confirmed by the Associated Press (helps to check my data merges and wrangling weren’t prone to errors - the AP reports one extra “county” in Louisiana going to Donald Trump to make their count at 2626, but this extra “county” is a parish in Lousiana, so it’s up for interpretation).

Then, I wanted to get a better sense of the data, so I decided to plot a histogram of the percent of each county’s population that has a college degree or higher. We can see that the histogram is right skewed, with a center around 20%. The histogram is unimodal.

ggplot(election_education_df, aes(x=college_or_more_pct)) + geom_histogram()

describe(election_education_df$college_or_more_pct)

Next, I thought it would be helpful to plot a histogram of the percent of each county that voted for Donald Trump in 2016. Again, this distribution is unimodal, however it is left skewed with a center around 65%.

ggplot(election_education_df, aes(x=per_gop), bins = 30) + geom_histogram()

describe(election_education_df$per_gop)

To take a look at my rural variable, I first thoguht it would be helpful to see some summary statistics related to my qualitative variable ‘ruralurban_grp’:

describeBy(election_education_df$per_gop,
           group = election_education_df$ruralurban_grp,
           mat = TRUE)

I may be able to compare means across these 8 different groups, but I also thought it would be helpful to subset this data a bit more based on broader categories of ruralness. I decided to recode the ruralurban codes so that those that are characteristic of urban areas (‘ruralurban_cc’ = 1 - 3) would be grouped together, those characteristic of rural areas (‘ruralurban_cc’ = 7 - 9) would be grouped together, and those characteristic of suburban areas (‘ruralurban_cc’ = 4 - 6) would be grouped together. This’ll hopefully make mean comparisons easier later on! I saved these recodes in a new variable called ‘ruralurban_grp_3_way’.

# recoding the ruralurban_grp and ruralurban_cc variable into a 3_way grouping variable
election_education_df$ruralurban_grp_3_way <- ifelse(
  election_education_df$ruralurban_cc <= 3, 'Urban Counties',
  ifelse(election_education_df$ruralurban_cc >= 4 & 
        election_education_df$ruralurban_cc < 7, 'Suburban Counties',  
  ifelse(election_education_df$ruralurban_cc >= 7, 'Rural Counties', NA)))

After this was setup, I could then look at summary statistics of the ‘per_gop’ variable and ‘college_or_more_pct’ variable split by this new variable.

# summary statistics of the percent voting for Donald Trump split by ruralness
describeBy(election_education_df$per_gop,
           group = election_education_df$ruralurban_grp_3_way,
           mat = TRUE)

# summary statistics of the percent with a bachelor's degree or higher split by ruralness
describeBy(election_education_df$college_or_more_pct,
           group = election_education_df$ruralurban_grp_3_way,
           mat = TRUE)

Finally, I thought it would be interesting to plot these two tables as box plots. See below for the box plot of the of per_gop split by ruralness (3-way):

boxplot(per_gop~ruralurban_grp_3_way,data=election_education_df, 
   col = c("#ffb3b3", "#c6ecc6", "#b3d9ff"),
   main="Proportion voting for Donald Trump",
   xlab="Rural/Urban Category", 
   ylab="Percent GOP vote")

And, here is a boxplot for percent with a bachelor’s degree or higher by ruralness (3-way).

boxplot(college_or_more_pct~ruralurban_grp_3_way,data=election_education_df, 
   col = c("#ffb3b3", "#c6ecc6", "#b3d9ff"),
   main="Proportion with a college degree or higher",
   xlab="Rural/Urban Category", 
   ylab="Percent with at least a college degree")

Although this second boxplot is a bit outside the scope of the reseearch question, I thought it would be interesting to see if there are any noticeable differences in educational attainment across the ruralness continuum as well to build a bit more context.