Overview

This analysis uses the dataset behind the FiveThirtyEight story “Why Many Americans Don’t Vote,” which explores why some Americans do not vote in presidential elections.

The original dataset contains information on a variety of demographic, socioeconomic, and attitudinal factors that may contribute to non-voting.

In this analysis, we select a subset of variables that can be used to explore the relationships between non-voting and key demographic and socioeconomic factors.

Data Cleaning and Transformation

First, let’s load the libraries we will be using:

library(dplyr)
library(readr)

Next, we will load the original dataset into R:

url <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/non-voters/nonvoters_data.csv"
nonvoters <- read_csv(url)

This dataset has 119 columns and 5836 rows.
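As a quick check on those dimensions, `dim()` returns the number of rows and then the number of columns. The sketch below runs on a two-row inline CSV, a made-up stand-in rather than the survey data, so that it works offline. It assumes readr 2.0 or later for `I()` and `show_col_types`. On the real object the call would simply be `dim(nonvoters)`.

```r
library(readr)

# Hypothetical two-row stand-in for the real CSV; I() marks the string as literal data
toy_csv <- I("RespId,weight,Q1\n1,0.75,1\n2,1.25,1\n")
toy <- read_csv(toy_csv, show_col_types = FALSE)

# dim() returns c(number_of_rows, number_of_columns)
toy_dim <- dim(toy)
toy_dim
```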

Here are the column names:

colnames(nonvoters)
##   [1] "RespId"         "weight"         "Q1"             "Q2_1"          
##   [5] "Q2_2"           "Q2_3"           "Q2_4"           "Q2_5"          
##   [9] "Q2_6"           "Q2_7"           "Q2_8"           "Q2_9"          
##  [13] "Q2_10"          "Q3_1"           "Q3_2"           "Q3_3"          
##  [17] "Q3_4"           "Q3_5"           "Q3_6"           "Q4_1"          
##  [21] "Q4_2"           "Q4_3"           "Q4_4"           "Q4_5"          
##  [25] "Q4_6"           "Q5"             "Q6"             "Q7"            
##  [29] "Q8_1"           "Q8_2"           "Q8_3"           "Q8_4"          
##  [33] "Q8_5"           "Q8_6"           "Q8_7"           "Q8_8"          
##  [37] "Q8_9"           "Q9_1"           "Q9_2"           "Q9_3"          
##  [41] "Q9_4"           "Q10_1"          "Q10_2"          "Q10_3"         
##  [45] "Q10_4"          "Q11_1"          "Q11_2"          "Q11_3"         
##  [49] "Q11_4"          "Q11_5"          "Q11_6"          "Q14"           
##  [53] "Q15"            "Q16"            "Q17_1"          "Q17_2"         
##  [57] "Q17_3"          "Q17_4"          "Q18_1"          "Q18_2"         
##  [61] "Q18_3"          "Q18_4"          "Q18_5"          "Q18_6"         
##  [65] "Q18_7"          "Q18_8"          "Q18_9"          "Q18_10"        
##  [69] "Q19_1"          "Q19_2"          "Q19_3"          "Q19_4"         
##  [73] "Q19_5"          "Q19_6"          "Q19_7"          "Q19_8"         
##  [77] "Q19_9"          "Q19_10"         "Q20"            "Q21"           
##  [81] "Q22"            "Q23"            "Q24"            "Q25"           
##  [85] "Q26"            "Q27_1"          "Q27_2"          "Q27_3"         
##  [89] "Q27_4"          "Q27_5"          "Q27_6"          "Q28_1"         
##  [93] "Q28_2"          "Q28_3"          "Q28_4"          "Q28_5"         
##  [97] "Q28_6"          "Q28_7"          "Q28_8"          "Q29_1"         
## [101] "Q29_2"          "Q29_3"          "Q29_4"          "Q29_5"         
## [105] "Q29_6"          "Q29_7"          "Q29_8"          "Q29_9"         
## [109] "Q29_10"         "Q30"            "Q31"            "Q32"           
## [113] "Q33"            "ppage"          "educ"           "race"          
## [117] "gender"         "income_cat"     "voter_category"

Select relevant columns

We will keep only the variables that are most relevant to our analysis:

  • ppage: Age of respondent
  • educ: Highest educational attainment category
  • race: Race of respondent, census categories
  • gender: Gender of respondent
  • income_cat: Household income category of respondent
  • voter_category: Voter category of respondent: always, sporadic, or rarely/never

nonvoters <- nonvoters %>%
  select(ppage, educ, race, gender, income_cat, voter_category)

Let’s view the first 10 rows:

head(nonvoters, 10)
## # A tibble: 10 × 6
##    ppage educ                race        gender income_cat    voter_category
##    <dbl> <chr>               <chr>       <chr>  <chr>         <chr>         
##  1    73 College             White       Female $75-125k      always        
##  2    90 College             White       Female $125k or more always        
##  3    53 College             White       Male   $125k or more sporadic      
##  4    58 Some college        Black       Female $40-75k       sporadic      
##  5    81 High school or less White       Male   $40-75k       always        
##  6    61 High school or less White       Female $40-75k       rarely/never  
##  7    80 High school or less White       Female $125k or more always        
##  8    68 Some college        Other/Mixed Female $75-125k      always        
##  9    70 College             White       Male   $125k or more always        
## 10    83 Some college        White       Male   $125k or more always
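It can also help to check which values a categorical column takes and how often each occurs. Here is a minimal sketch using `count()` on the ten rows printed above, retyped by hand so the example is self-contained. The name `preview` is introduced here for illustration only, and the counts describe these ten rows, not the full survey.

```r
library(dplyr)

# The educ and voter_category values from the ten rows printed above
preview <- tibble::tibble(
  educ = c("College", "College", "College", "Some college",
           "High school or less", "High school or less",
           "High school or less", "Some college", "College", "Some college"),
  voter_category = c("always", "always", "sporadic", "sporadic", "always",
                     "rarely/never", "always", "always", "always", "always")
)

# Tabulate each distinct voter category, most frequent first
voter_counts <- preview %>%
  count(voter_category, sort = TRUE)

voter_counts
```

On the full data, the same call would be `nonvoters %>% count(voter_category, sort = TRUE)`.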

Rename columns

Let’s use clearer column names and get rid of unnecessary abbreviations:

nonvoters <- nonvoters %>%
  rename(
    age = ppage,
    highest_education_level = educ,
    income_category = income_cat
  )

The renamed columns now look like this:

head(nonvoters, 10)
## # A tibble: 10 × 6
##      age highest_education_level race      gender income_category voter_category
##    <dbl> <chr>                   <chr>     <chr>  <chr>           <chr>         
##  1    73 College                 White     Female $75-125k        always        
##  2    90 College                 White     Female $125k or more   always        
##  3    53 College                 White     Male   $125k or more   sporadic      
##  4    58 Some college            Black     Female $40-75k         sporadic      
##  5    81 High school or less     White     Male   $40-75k         always        
##  6    61 High school or less     White     Female $40-75k         rarely/never  
##  7    80 High school or less     White     Female $125k or more   always        
##  8    68 Some college            Other/Mi… Female $75-125k        always        
##  9    70 College                 White     Male   $125k or more   always        
## 10    83 Some college            White     Male   $125k or more   always
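The education, income, and voter columns are stored as plain character strings, so R does not know that their categories have a natural order. One possible next step, not part of the original analysis, is to convert them to ordered factors. The sketch below uses the first five rows from the output above. The level orderings are assumptions, and `"Less than $40k"` in particular does not appear in the preview, so it should be checked against the full data before use.

```r
library(dplyr)

# First five rows from the output above, retyped so the sketch is self-contained
renamed <- tibble::tibble(
  highest_education_level = c("College", "College", "College",
                              "Some college", "High school or less"),
  income_category = c("$75-125k", "$125k or more", "$125k or more",
                      "$40-75k", "$40-75k"),
  voter_category = c("always", "always", "sporadic", "sporadic", "always")
)

# Assumed orderings, lowest to highest; "Less than $40k" is an assumption
education_levels <- c("High school or less", "Some college", "College")
income_levels <- c("Less than $40k", "$40-75k", "$75-125k", "$125k or more")
voter_levels <- c("rarely/never", "sporadic", "always")

renamed_fct <- renamed %>%
  mutate(
    highest_education_level = factor(highest_education_level,
                                     levels = education_levels, ordered = TRUE),
    income_category = factor(income_category,
                             levels = income_levels, ordered = TRUE),
    voter_category = factor(voter_category,
                            levels = voter_levels, ordered = TRUE)
  )

# Any value missing from the level vectors would have become NA
anyNA(renamed_fct)
```

Because `factor()` silently turns unlisted values into `NA`, the final `anyNA()` check is a cheap way to catch a misspelled or missing level.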

Conclusions

In conclusion, the non-voters dataset covers both the demographics and the attitudes of survey respondents, which makes it a useful basis for studying non-voting in the United States. By tidying the dataset, we created a smaller, more manageable subset that includes only the columns relevant to our analysis. We also renamed some of the columns to make them easier to interpret.

To extend and verify the work from the selected article, one could conduct more detailed analyses to investigate the reasons behind non-voting among different demographic groups. This could include conducting statistical tests to determine if there are significant differences in the reasons for non-voting between different age, race, or income groups. Additionally, it would be interesting to explore the effectiveness of various voter outreach and mobilization strategies, such as door-to-door canvassing, phone banking, or social media campaigns. Such analyses could help to identify effective strategies for increasing voter turnout in future elections.
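As a starting point for such group comparisons, one could compute the share of each voter category within each income group. The sketch below does this for the ten preview rows shown earlier, retyped by hand. The names `preview` and `shares` are illustrative, and the resulting shares are unweighted and describe only these ten rows.

```r
library(dplyr)

# Income and voter category for the ten preview rows shown earlier
preview <- tibble::tibble(
  income_category = c("$75-125k", "$125k or more", "$125k or more", "$40-75k",
                      "$40-75k", "$40-75k", "$125k or more", "$75-125k",
                      "$125k or more", "$125k or more"),
  voter_category = c("always", "always", "sporadic", "sporadic", "always",
                     "rarely/never", "always", "always", "always", "always")
)

# Count each income/voter combination, then convert counts to within-group shares
shares <- preview %>%
  count(income_category, voter_category) %>%
  group_by(income_category) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

shares
```

On the full dataset, `chisq.test(table(nonvoters$income_category, nonvoters$voter_category))` would be one way to test whether the two variables are associated. Note that the survey's `weight` column was dropped in the selection step, so weighted estimates would require keeping it.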