Introduction

This homework assignment analyzes a dataset discussed in the FiveThirtyEight article titled “Why Many Americans Don’t Vote”. The dataset is based on a survey described as covering over 8,000 participants (the file loaded below contains 5,836 respondents), and it focuses on identifying patterns in individuals’ willingness or unwillingness to vote in US elections.

The original dataset includes fields for race, gender, income, age, and party association, a “weight” column (which appears to be a survey weight rather than a physical measurement), and of course the target variable “voter_category”. Individuals are bucketed into three voter categories, coded in the data as “rarely/never”, “sporadic”, and “always”.

Voter apathy is a central challenge for any functioning democracy, so analysis of this sort is critical to understanding why individuals may choose not to participate, as well as to identifying ways to increase engagement among those who rarely or never vote today.

Link to article:

https://projects.fivethirtyeight.com/non-voters-poll-2020-election/

Import Tidyverse and CSV data

Using the read_csv function, I load the downloaded dataset into R:

library("tidyverse")
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.4     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   2.0.1     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
nonvoters <- read_csv("nonvoters_data.csv")
## Rows: 5836 Columns: 119
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (5): educ, race, gender, income_cat, voter_category
## dbl (114): RespId, weight, Q1, Q2_1, Q2_2, Q2_3, Q2_4, Q2_5, Q2_6, Q2_7, Q2_...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Create initial subset dataframe

This dataset includes 119 columns. Six of these describe the survey respondent, including “weight” and age (“ppage”). The remaining columns are mostly coded responses to questions about individuals’ beliefs and personal feelings relating to the electoral process.

To begin my analysis, I chose to focus only on the “descriptive” features to see if there are any correlations between these and the target “voter_category”.

voters_subset <- nonvoters[, c("RespId", "weight", "ppage", "educ", "race", "gender", "income_cat", "voter_category")]

Change column names for clarity

Certain feature titles are not clear, such as “ppage”. I chose to rename the features for clarity as follows:

new_names <- c("id","weight","age","education","race","gender","income_category","voter_category")

colnames(voters_subset) <- new_names

voters_subset
## # A tibble: 5,836 × 8
##        id weight   age education    race   gender income_category voter_category
##     <dbl>  <dbl> <dbl> <chr>        <chr>  <chr>  <chr>           <chr>         
##  1 470001  0.752    73 College      White  Female $75-125k        always        
##  2 470002  1.03     90 College      White  Female $125k or more   always        
##  3 470003  1.08     53 College      White  Male   $125k or more   sporadic      
##  4 470007  0.682    58 Some college Black  Female $40-75k         sporadic      
##  5 480008  0.991    81 High school… White  Male   $40-75k         always        
##  6 480009  1.06     61 High school… White  Female $40-75k         rarely/never  
##  7 480010  1.15     80 High school… White  Female $125k or more   always        
##  8 470008  1.02     68 Some college Other… Female $75-125k        always        
##  9 470010  0.818    70 College      White  Male   $125k or more   always        
## 10 470011  1.17     83 Some college White  Male   $125k or more   always        
## # … with 5,826 more rows
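
As an alternative sketch, the subsetting and renaming steps above could be combined into a single dplyr select() call, which accepts `new_name = old_name` pairs. The toy tibble below is a stand-in for `nonvoters`: its three rows are copied from the printed output above, and the `Q1` values are placeholders made up for illustration. The names `toy_nonvoters` and `toy_subset` are hypothetical.

```r
library(dplyr)

# Toy stand-in for `nonvoters`: three rows copied from the printed output,
# plus a placeholder Q1 column (its values are made up for illustration).
toy_nonvoters <- tibble(
  RespId = c(470001, 470002, 470003),
  weight = c(0.752, 1.03, 1.08),
  ppage = c(73, 90, 53),
  educ = c("College", "College", "College"),
  race = c("White", "White", "White"),
  gender = c("Female", "Female", "Male"),
  income_cat = c("$75-125k", "$125k or more", "$125k or more"),
  voter_category = c("always", "always", "sporadic"),
  Q1 = c(1, 1, 1)
)

# select() keeps only the listed columns; `new = old` renames while selecting.
toy_subset <- toy_nonvoters %>%
  select(
    id = RespId,
    weight,
    age = ppage,
    education = educ,
    race,
    gender,
    income_category = income_cat,
    voter_category
  )

names(toy_subset)
```

Renaming by name rather than by position means the result does not depend on the order of the columns in the subset.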

Convert education and income_category to ordered factors

Of the descriptive features selected for this analysis, two are ordinal (education and income_category), so I convert them to ordered factors.

unique(voters_subset$income_category)
## [1] "$75-125k"       "$125k or more"  "$40-75k"        "Less than $40k"
str(voters_subset)
## tibble [5,836 × 8] (S3: tbl_df/tbl/data.frame)
##  $ id             : num [1:5836] 470001 470002 470003 470007 480008 ...
##  $ weight         : num [1:5836] 0.752 1.027 1.084 0.682 0.991 ...
##  $ age            : num [1:5836] 73 90 53 58 81 61 80 68 70 83 ...
##  $ education      : chr [1:5836] "College" "College" "College" "Some college" ...
##  $ race           : chr [1:5836] "White" "White" "White" "Black" ...
##  $ gender         : chr [1:5836] "Female" "Female" "Male" "Female" ...
##  $ income_category: chr [1:5836] "$75-125k" "$125k or more" "$125k or more" "$40-75k" ...
##  $ voter_category : chr [1:5836] "always" "always" "sporadic" "sporadic" ...
# factor() with explicit levels sets the order without altering the values;
# as.factor() followed by levels<- would relabel the alphabetically sorted levels.
voters_subset$education <- factor(voters_subset$education,
  levels = c("High school or less", "Some college", "College"), ordered = TRUE)

voters_subset$income_category <- factor(voters_subset$income_category,
  levels = c("Less than $40k", "$40-75k", "$75-125k", "$125k or more"), ordered = TRUE)
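
One point worth checking in this step: as.factor() sorts levels alphabetically, and assigning to levels() relabels those levels in place rather than reordering them. The minimal sketch below uses a toy vector (the variable names are hypothetical, the values are the three education categories) to contrast that with factor(..., levels = ...), which keeps each observation's value and only sets the order.

```r
# Toy vector using the three education values present in the data.
edu <- c("College", "Some college", "High school or less", "College")

# as.factor() orders levels alphabetically.
f <- as.factor(edu)
levels(f)

# Assigning to levels() renames levels 1, 2, 3 in place, so the data change.
relabelled <- f
levels(relabelled) <- c("High school or less", "Some college", "College")
as.character(relabelled)  # no longer matches `edu`

# factor() with explicit levels keeps the values and sets the order.
ordered_edu <- factor(edu,
  levels = c("High school or less", "Some college", "College"),
  ordered = TRUE)
as.character(ordered_edu)  # same values as `edu`

# Ordered factors support comparisons: is "Some college" below "College"?
ordered_edu[2] < ordered_edu[1]
```

A quick table() of the column before and after conversion is a simple way to confirm that the counts per category are unchanged.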

Conclusion

Following this exercise of subsetting the non-voters dataset, I plan to follow up with further statistical and graphical analysis of the chosen features to determine whether any of them are associated with the target variable. This analysis will inform the subsequent modeling process, where I will use the collected data to build a predictive model that classifies individuals as “rarely/never”, “sporadic”, or “always” voters.

After that, I will see whether I can improve on my results by including additional features that were excluded from this first subset.
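
As a rough sketch of what the planned follow-up analysis could look like, the base R snippet below computes the weighted share of each voter category within each education level. It assumes that `weight` is a survey weight, which the source does not state explicitly. The toy data frame `toy` is hypothetical and holds seven rows copied from the printed output above (the rows whose education value is not truncated), so the resulting shares are illustrative only.

```r
# Seven rows copied from the printed tibble (rows with untruncated education).
toy <- data.frame(
  education = c("College", "College", "College", "Some college",
                "Some college", "College", "Some college"),
  voter_category = c("always", "always", "sporadic", "sporadic",
                     "always", "always", "always"),
  weight = c(0.752, 1.03, 1.08, 0.682, 1.02, 0.818, 1.17)
)

# Sum the weights (assumed to be survey weights) in each education x category cell.
weighted_counts <- xtabs(weight ~ education + voter_category, data = toy)

# Convert to row-wise shares so each education level sums to 1.
weighted_shares <- prop.table(weighted_counts, margin = 1)

round(weighted_shares, 2)
```

Applied to the full `voters_subset`, the same two calls would give a first look at how voting frequency varies with education before any modeling.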