This homework assignment analyzes a dataset discussed in the FiveThirtyEight article titled “Why Many Americans Don’t Vote”. The dataset is based on a survey of over 8,000 participants (the file loaded below contains 5,836 rows) and focuses on identifying patterns in individuals’ willingness or unwillingness to vote in US elections.
The original dataset includes respondents’ race, gender, income, age, and party affiliation, a survey “weight” column, and the target variable “voter_category”. Individuals are bucketed into three voter categories, coded in the data as “rarely/never”, “sporadic”, and “always”.
Voter apathy is a central problem for any functioning democracy, so analysis of this sort is critical both to understanding why individuals choose not to participate and to identifying ways to increase engagement among today’s non-voters.
Link to article:
https://projects.fivethirtyeight.com/non-voters-poll-2020-election/
Using the read_csv function, I load the downloaded dataset into R:
library("tidyverse")
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.4 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 2.0.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
nonvoters <- read_csv("nonvoters_data.csv")
## Rows: 5836 Columns: 119
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): educ, race, gender, income_cat, voter_category
## dbl (114): RespId, weight, Q1, Q2_1, Q2_2, Q2_3, Q2_4, Q2_5, Q2_6, Q2_7, Q2_...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
This dataset includes over 100 features (119 columns in total). Six of these describe the survey respondent rather than their answers: “weight”, “ppage” (age), “educ”, “race”, “gender”, and “income_cat”. The remaining features are mostly coded responses to questions about individuals’ beliefs and personal feelings relating to the electoral process.
To begin my analysis, I chose to focus only on the “descriptive” features to see if there are any correlations between these and the target “voter_category”.
voters_subset <- nonvoters[, c("RespId", "weight", "ppage", "educ", "race", "gender", "income_cat", "voter_category")]
Certain feature titles are not clear, such as “ppage”. I chose to rename the features for clarity as follows:
new_names <- c("id","weight","age","education","race","gender","income_category","voter_category")
colnames(voters_subset) <- new_names
voters_subset
## # A tibble: 5,836 × 8
##        id weight   age education    race   gender income_category voter_category
##     <dbl>  <dbl> <dbl> <chr>        <chr>  <chr>  <chr>           <chr>
##  1 470001  0.752    73 College      White  Female $75-125k        always
##  2 470002  1.03     90 College      White  Female $125k or more   always
##  3 470003  1.08     53 College      White  Male   $125k or more   sporadic
##  4 470007  0.682    58 Some college Black  Female $40-75k         sporadic
##  5 480008  0.991    81 High school… White  Male   $40-75k         always
##  6 480009  1.06     61 High school… White  Female $40-75k         rarely/never
##  7 480010  1.15     80 High school… White  Female $125k or more   always
##  8 470008  1.02     68 Some college Other… Female $75-125k        always
##  9 470010  0.818    70 College      White  Male   $125k or more   always
## 10 470011  1.17     83 Some college White  Male   $125k or more   always
## # … with 5,826 more rows
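As an alternative to indexing by column name and then overwriting colnames(), the subset and rename can be done in one step with dplyr's select(), which accepts new_name = old_name pairs. The sketch below uses a small made-up data frame (`toy`, with illustrative values rather than real survey responses) so that it runs without the CSV file:

```r
library(dplyr)

# Toy stand-in for the survey data; the values are invented for illustration
toy <- data.frame(
  RespId = c(1, 2, 3),
  weight = c(0.8, 1.1, 1.0),
  ppage = c(73, 53, 61),
  educ = c("College", "Some college", "High school or less"),
  Q1 = c(1, 1, 2),
  voter_category = c("always", "sporadic", "rarely/never")
)

# select() keeps only the listed columns and renames them where
# a new_name = old_name pair is given
toy_subset <- toy %>%
  select(id = RespId, weight, age = ppage, education = educ, voter_category)

names(toy_subset)
```

Because each new name sits next to the column it replaces, this form avoids relying on the position of entries in a separate vector of names.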
Of the descriptive features selected for this analysis, two (education and income category) are ordinal, so I decided to convert them to factors with their levels in a meaningful order.
unique(voters_subset$income_category)
## [1] "$75-125k" "$125k or more" "$40-75k" "Less than $40k"
str(voters_subset)
## tibble [5,836 × 8] (S3: tbl_df/tbl/data.frame)
## $ id : num [1:5836] 470001 470002 470003 470007 480008 ...
## $ weight : num [1:5836] 0.752 1.027 1.084 0.682 0.991 ...
## $ age : num [1:5836] 73 90 53 58 81 61 80 68 70 83 ...
## $ education : chr [1:5836] "College" "College" "College" "Some college" ...
## $ race : chr [1:5836] "White" "White" "White" "Black" ...
## $ gender : chr [1:5836] "Female" "Female" "Male" "Female" ...
## $ income_category: chr [1:5836] "$75-125k" "$125k or more" "$125k or more" "$40-75k" ...
## $ voter_category : chr [1:5836] "always" "always" "sporadic" "sporadic" ...
# Set the level order via factor(levels = ...). Assigning to levels() after as.factor()
# would rename the alphabetically sorted levels and silently change the data.
voters_subset$education <- factor(voters_subset$education, levels = c("High school or less", "Some college", "College"))
voters_subset$income_category <- factor(voters_subset$income_category, levels = c("Less than $40k", "$40-75k", "$75-125k", "$125k or more"))
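One pitfall is worth spelling out when building ordinal factors from character columns: as.factor() sorts the levels alphabetically, and assigning a new vector to levels() renames those levels position by position rather than reordering them. The following sketch, on a toy vector that reuses the four income labels seen in the data, contrasts that with factor(levels = ...), which sets the order and leaves each value as it was:

```r
# Toy vector using the four income labels that appear in the data
income <- c("$75-125k", "$125k or more", "$40-75k", "Less than $40k")
income_order <- c("Less than $40k", "$40-75k", "$75-125k", "$125k or more")

# as.factor() sorts the levels alphabetically (locale dependent), not by income
f_alpha <- as.factor(income)
levels(f_alpha)

# Assigning to levels() renames the existing levels in place,
# so values can end up with a different label than they started with
f_renamed <- f_alpha
levels(f_renamed) <- income_order
data.frame(original = income, after_rename = as.character(f_renamed))

# factor() with explicit levels fixes the order without touching the values
f_explicit <- factor(income, levels = income_order)
data.frame(original = income, explicit = as.character(f_explicit))
```

A quick check such as table(original_column, converted_column) on the real data is a cheap way to confirm that every label still maps to itself after the conversion. Note that any value not listed in levels (for example because of a difference in capitalization) becomes NA.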
Having subset the voters dataset, I plan to follow up with further statistical and graphical analysis of the chosen features to determine whether any of them are associated with the target variable. This analysis will inform the subsequent modeling process, where I will use the collected data to build a predictive model that classifies individuals as “rarely/never”, “sporadic”, or “always” voters.
After that, I will see whether I can improve on my results by including additional features that were excluded from this first subset.
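As a rough preview of what that follow-up could look like, here is a minimal sketch of a weighted cross-tabulation of education against voter category. It runs on a small invented data frame (`toy`), and it assumes that the “weight” column is meant to be used as a survey weight, which this report has not yet confirmed:

```r
library(dplyr)

# Invented example rows; not real survey responses
toy <- data.frame(
  education = c("College", "College", "Some college", "Some college",
                "High school or less", "High school or less"),
  voter_category = c("always", "sporadic", "always", "rarely/never",
                     "sporadic", "rarely/never"),
  weight = c(1.0, 0.5, 0.8, 1.2, 0.9, 1.1)
)

# Sum the weights in each education x voter_category cell, then convert
# to the share of each education group falling in each voter category
shares <- toy %>%
  count(education, voter_category, wt = weight, name = "weighted_n") %>%
  group_by(education) %>%
  mutate(share = weighted_n / sum(weighted_n)) %>%
  ungroup()

shares
```

Comparing the share column across education groups gives a first, informal look at whether the feature is related to voting behavior before any formal test or model is fit.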