The article I chose is "Why Many Americans Don't Vote" by Amelia Thomson-DeVeaux, Jasmine Mithani, and Laura Bronner. The authors used a poll to study why many Americans do not vote. This is an important topic because, according to the article, between 35 and 60 percent of eligible voters do not cast a ballot in any given election. The article is available at https://projects.fivethirtyeight.com/non-voters-poll-2020-election/.
# Read the poll data from the CSV file hosted on GitHub
poll.data <- read.csv("https://raw.githubusercontent.com/SaneSky109/DATA607/main/poll_data.csv")
# Check the first few rows of the dataset using head() function
head(poll.data)
## RespId weight Q1 Q2_1 Q2_2 Q2_3 Q2_4 Q2_5 Q2_6 Q2_7 Q2_8 Q2_9 Q2_10 Q3_1 Q3_2
## 1 470001 0.7516 1 1 1 2 4 1 4 2 2 4 2 1 1
## 2 470002 1.0267 1 1 2 2 3 1 1 2 1 1 3 3 3
## 3 470003 1.0844 1 1 1 2 2 1 1 2 1 4 3 2 2
## 4 470007 0.6817 1 1 1 1 3 1 1 1 1 1 2 1 1
## 5 480008 0.9910 1 1 1 -1 1 1 1 1 1 1 1 4 -1
## 6 480009 1.0591 1 3 2 3 4 1 3 3 1 1 4 1 2
## Q3_3 Q3_4 Q3_5 Q3_6 Q4_1 Q4_2 Q4_3 Q4_4 Q4_5 Q4_6 Q5 Q6 Q7 Q8_1 Q8_2 Q8_3
## 1 4 4 3 2 2 1 2 2 2 2 1 2 1 3 4 2
## 2 4 3 3 2 2 2 2 3 3 1 1 2 2 2 3 2
## 3 3 3 2 2 2 2 3 3 2 3 1 1 1 3 2 1
## 4 4 4 2 1 1 2 2 2 2 2 1 3 1 3 2 2
## 5 1 1 2 4 1 1 1 1 1 1 1 2 2 1 3 2
## 6 -1 2 2 2 4 3 3 3 4 2 2 4 1 3 3 3
## Q8_4 Q8_5 Q8_6 Q8_7 Q8_8 Q8_9 Q9_1 Q9_2 Q9_3 Q9_4 Q10_1 Q10_2 Q10_3 Q10_4
## 1 1 1 1 1 2 4 2 2 4 4 2 2 2 2
## 2 2 2 2 3 2 2 1 1 3 4 2 2 2 2
## 3 1 2 2 2 2 1 1 2 4 4 2 2 1 2
## 4 2 2 2 2 2 2 1 2 4 4 2 2 2 2
## 5 3 3 3 4 2 2 1 4 3 4 2 2 2 2
## 6 2 3 3 2 2 2 -1 -1 -1 4 2 2 2 2
## Q11_1 Q11_2 Q11_3 Q11_4 Q11_5 Q11_6 Q14 Q15 Q16 Q17_1 Q17_2 Q17_3 Q17_4 Q18_1
## 1 2 2 2 2 2 2 5 1 1 1 1 1 3 2
## 2 2 2 1 2 2 2 1 1 2 2 2 2 3 2
## 3 2 2 1 2 1 2 5 2 1 1 3 1 1 2
## 4 1 2 2 2 1 2 5 1 4 1 1 1 1 2
## 5 2 2 1 2 2 2 1 5 1 2 2 4 4 2
## 6 2 2 2 1 2 2 -1 -1 -1 -1 -1 -1 -1 2
## Q18_2 Q18_3 Q18_4 Q18_5 Q18_6 Q18_7 Q18_8 Q18_9 Q18_10 Q19_1 Q19_2 Q19_3
## 1 2 2 2 2 2 2 2 2 2 -1 -1 1
## 2 2 2 2 2 2 2 2 2 2 -1 1 -1
## 3 2 2 2 2 2 1 2 2 2 -1 1 -1
## 4 2 2 2 2 2 2 2 2 2 -1 -1 1
## 5 2 2 2 2 2 2 2 2 2 -1 -1 -1
## 6 2 2 2 2 2 2 2 2 2 -1 -1 -1
## Q19_4 Q19_5 Q19_6 Q19_7 Q19_8 Q19_9 Q19_10 Q20 Q21 Q22 Q23 Q24 Q25 Q26 Q27_1
## 1 1 1 1 1 -1 -1 -1 1 1 NA 2 1 1 1 1
## 2 -1 -1 -1 -1 -1 -1 -1 1 1 NA 1 3 3 1 1
## 3 1 -1 -1 -1 1 1 -1 1 1 NA 2 1 2 1 1
## 4 -1 -1 -1 -1 1 -1 1 1 1 NA 2 1 2 1 1
## 5 -1 -1 -1 -1 -1 -1 -1 1 1 NA 1 3 1 1 1
## 6 -1 -1 -1 -1 -1 -1 -1 2 2 7 -1 4 3 4 2
## Q27_2 Q27_3 Q27_4 Q27_5 Q27_6 Q28_1 Q28_2 Q28_3 Q28_4 Q28_5 Q28_6 Q28_7 Q28_8
## 1 1 1 1 1 1 1 1 1 1 -1 -1 1 -1
## 2 1 1 1 1 1 1 -1 -1 -1 -1 1 -1 -1
## 3 1 1 1 1 1 1 -1 -1 -1 -1 -1 1 -1
## 4 1 1 1 1 1 1 1 -1 1 -1 -1 -1 -1
## 5 1 1 1 1 1 1 1 1 -1 1 -1 1 -1
## 6 2 2 2 2 2 NA NA NA NA NA NA NA NA
## Q29_1 Q29_2 Q29_3 Q29_4 Q29_5 Q29_6 Q29_7 Q29_8 Q29_9 Q29_10 Q30 Q31 Q32 Q33
## 1 NA NA NA NA NA NA NA NA NA NA 2 NA 1 NA
## 2 NA NA NA NA NA NA NA NA NA NA 3 NA NA 1
## 3 NA NA NA NA NA NA NA NA NA NA 2 NA 2 NA
## 4 NA NA NA NA NA NA NA NA NA NA 2 NA 1 NA
## 5 NA NA NA NA NA NA NA NA NA NA 1 -1 NA NA
## 6 -1 -1 -1 -1 -1 -1 -1 -1 1 -1 5 NA NA -1
## ppage educ race gender income_cat voter_category
## 1 73 College White Female $75-125k always
## 2 90 College White Female $125k or more always
## 3 53 College White Male $125k or more sporadic
## 4 58 Some college Black Female $40-75k sporadic
## 5 81 High school or less White Male $40-75k always
## 6 61 High school or less White Female $40-75k rarely/never
A quick look shows three problems: the column names are uninformative, the coded entries need clearer labels, and the number of columns can be reduced.
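Before recoding, it can help to see how often the special values occur in each column. The sketch below counts `-1` codes and `NA` cells per column on a small made-up data frame; `toy` and its values are illustrative only, and reading `-1` as a skipped or unanswered question is an assumption that should be checked against the survey codebook.

```r
# Toy stand-in for a few survey columns (values are invented for illustration).
toy <- data.frame(
  Q2_1 = c(1, 1, -1, 3),
  Q16  = c(1, -1, -1, 4),
  Q22  = c(NA, NA, 7, NA)
)

# Number of -1 codes per column; na.rm = TRUE skips cells that are already NA.
n_minus_one <- colSums(toy == -1, na.rm = TRUE)

# Number of NA cells per column.
n_na <- colSums(is.na(toy))

# One row per column, one count per kind of special value.
code_summary <- data.frame(minus_one = n_minus_one, missing = n_na)
print(code_summary)
```

The same two `colSums()` calls should work on `poll.data` directly, since they only rely on the columns being comparable to `-1`.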
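# Keep Q2_1-Q2_10, Q6, Q7, Q8_1-Q8_9, Q10_1-Q10_4, Q16, Q23, Q25, Q30,
# and the six demographic columns (ppage through voter_category), by position
poll.data.subset <- poll.data[, c(4:13, 27:37, 42:45, 54, 82, 84, 110, 114:119)]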
head(poll.data.subset)
## Q2_1 Q2_2 Q2_3 Q2_4 Q2_5 Q2_6 Q2_7 Q2_8 Q2_9 Q2_10 Q6 Q7 Q8_1 Q8_2 Q8_3 Q8_4
## 1 1 1 2 4 1 4 2 2 4 2 2 1 3 4 2 1
## 2 1 2 2 3 1 1 2 1 1 3 2 2 2 3 2 2
## 3 1 1 2 2 1 1 2 1 4 3 1 1 3 2 1 1
## 4 1 1 1 3 1 1 1 1 1 2 3 1 3 2 2 2
## 5 1 1 -1 1 1 1 1 1 1 1 2 2 1 3 2 3
## 6 3 2 3 4 1 3 3 1 1 4 4 1 3 3 3 2
## Q8_5 Q8_6 Q8_7 Q8_8 Q8_9 Q10_1 Q10_2 Q10_3 Q10_4 Q16 Q23 Q25 Q30 ppage
## 1 1 1 1 2 4 2 2 2 2 1 2 1 2 73
## 2 2 2 3 2 2 2 2 2 2 2 1 3 3 90
## 3 2 2 2 2 1 2 2 1 2 1 2 2 2 53
## 4 2 2 2 2 2 2 2 2 2 4 2 2 2 58
## 5 3 3 4 2 2 2 2 2 2 1 1 1 1 81
## 6 3 3 2 2 2 2 2 2 2 -1 -1 3 5 61
## educ race gender income_cat voter_category
## 1 College White Female $75-125k always
## 2 College White Female $125k or more always
## 3 College White Male $125k or more sporadic
## 4 Some college Black Female $40-75k sporadic
## 5 High school or less White Male $40-75k always
## 6 High school or less White Female $40-75k rarely/never
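The subset above picks columns by position, which is compact but hard to audit. A possible alternative is to select by name, sketched here on a toy data frame that borrows a few of the real column names; the values and the shortened column list are assumptions made for illustration.

```r
# Toy data frame that reuses a handful of the real column names;
# the values are invented for illustration.
toy <- data.frame(
  RespId = 1:3,
  Q2_1 = c(1, 2, 1),
  Q2_2 = c(1, 1, 3),
  Q3_1 = c(2, 2, 4),
  Q16 = c(1, 4, 2),
  ppage = c(73, 90, 53),
  voter_category = c("always", "sporadic", "rarely/never")
)

# Build the list of wanted columns from readable pieces.
keep <- c(
  grep("^Q2_", names(toy), value = TRUE),  # every Q2 sub-question
  "Q16", "ppage", "voter_category"
)

# Fail early if a name is misspelled or missing from the data.
stopifnot(all(keep %in% names(toy)))

toy_subset <- toy[, keep]
print(names(toy_subset))
```

Selecting by name keeps working if columns are reordered in the source file, whereas positional indices would silently pick up the wrong variables.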
After spending some time with the dataset and the article, I decided that these variables are the most useful for predicting the target variable, voter_category. Next, let's replace the unclear column names and coded entries with descriptive ones.
# Rename each variable, recode -1 to 1, and convert the numeric codes to labeled factors
names(poll.data.subset)[names(poll.data.subset) == 'Q2_1'] <- 'Importance_of_Voting'
poll.data.subset$Importance_of_Voting[poll.data.subset$Importance_of_Voting == -1]<-1
poll.data.subset$Importance_of_Voting<-factor(poll.data.subset$Importance_of_Voting,labels = c("very_important", "somewhat_important", "not_so_important", "not_at_all_important"))
names(poll.data.subset)[names(poll.data.subset) == 'Q2_2'] <- 'Importance_of_Jury_Duty'
poll.data.subset$Importance_of_Jury_Duty[poll.data.subset$Importance_of_Jury_Duty == -1]<-1
poll.data.subset$Importance_of_Jury_Duty<-factor(poll.data.subset$Importance_of_Jury_Duty,labels = c("very_important", "somewhat_important", "not_so_important", "not_at_all_important"))
names(poll.data.subset)[names(poll.data.subset) == 'Q2_3'] <- 'Importance_of_Following_Politics'
poll.data.subset$Importance_of_Following_Politics[poll.data.subset$Importance_of_Following_Politics == -1]<-1
poll.data.subset$Importance_of_Following_Politics<-factor(poll.data.subset$Importance_of_Following_Politics,labels = c("very_important", "somewhat_important", "not_so_important", "not_at_all_important"))
names(poll.data.subset)[names(poll.data.subset) == 'Q2_4'] <- 'Importance_of_the_Flag'
poll.data.subset$Importance_of_the_Flag[poll.data.subset$Importance_of_the_Flag == -1]<-1
poll.data.subset$Importance_of_the_Flag<-factor(poll.data.subset$Importance_of_the_Flag,labels = c("very_important", "somewhat_important", "not_so_important", "not_at_all_important"))
names(poll.data.subset)[names(poll.data.subset) == 'Q2_5'] <- 'Importance_of_US_Census'
poll.data.subset$Importance_of_US_Census[poll.data.subset$Importance_of_US_Census == -1]<-1
poll.data.subset$Importance_of_US_Census<-factor(poll.data.subset$Importance_of_US_Census,labels = c("very_important", "somewhat_important", "not_so_important", "not_at_all_important"))
names(poll.data.subset)[names(poll.data.subset) == 'Q2_6'] <- 'Importance_of_Saying_the_Pledge'
poll.data.subset$Importance_of_Saying_the_Pledge[poll.data.subset$Importance_of_Saying_the_Pledge == -1]<-1
poll.data.subset$Importance_of_Saying_the_Pledge<-factor(poll.data.subset$Importance_of_Saying_the_Pledge,labels = c("very_important", "somewhat_important", "not_so_important", "not_at_all_important"))
names(poll.data.subset)[names(poll.data.subset) == 'Q2_7'] <- 'Importance_of_Military_Support'
poll.data.subset$Importance_of_Military_Support[poll.data.subset$Importance_of_Military_Support == -1]<-1
poll.data.subset$Importance_of_Military_Support<-factor(poll.data.subset$Importance_of_Military_Support,labels = c("very_important", "somewhat_important", "not_so_important", "not_at_all_important"))
names(poll.data.subset)[names(poll.data.subset) == 'Q2_8'] <- 'Importance_of_Respecting_Opinions'
poll.data.subset$Importance_of_Respecting_Opinions[poll.data.subset$Importance_of_Respecting_Opinions == -1]<-1
poll.data.subset$Importance_of_Respecting_Opinions<-factor(poll.data.subset$Importance_of_Respecting_Opinions,labels = c("very_important", "somewhat_important", "not_so_important", "not_at_all_important"))
names(poll.data.subset)[names(poll.data.subset) == 'Q2_9'] <- 'Importance_of_Religion'
poll.data.subset$Importance_of_Religion[poll.data.subset$Importance_of_Religion == -1]<-1
poll.data.subset$Importance_of_Religion<-factor(poll.data.subset$Importance_of_Religion,labels = c("very_important", "somewhat_important", "not_so_important", "not_at_all_important"))
names(poll.data.subset)[names(poll.data.subset) == 'Q2_10'] <- 'Importance_of_Right_to_Protest'
poll.data.subset$Importance_of_Right_to_Protest[poll.data.subset$Importance_of_Right_to_Protest == -1]<-1
poll.data.subset$Importance_of_Right_to_Protest<-factor(poll.data.subset$Importance_of_Right_to_Protest,labels = c("very_important", "somewhat_important", "not_so_important", "not_at_all_important"))
names(poll.data.subset)[names(poll.data.subset) == 'Q6'] <- 'How_many_people_in_office_are_like_you'
poll.data.subset$How_many_people_in_office_are_like_you[poll.data.subset$How_many_people_in_office_are_like_you == -1]<-1
poll.data.subset$How_many_people_in_office_are_like_you<-factor(poll.data.subset$How_many_people_in_office_are_like_you, labels = c("a lot", "some", "a few", "none"))
names(poll.data.subset)[names(poll.data.subset) == 'Q7'] <- 'Opinion_on_Structure_of_US_Government'
poll.data.subset$Opinion_on_Structure_of_US_Government[poll.data.subset$Opinion_on_Structure_of_US_Government == -1]<-1
poll.data.subset$Opinion_on_Structure_of_US_Government<-factor(poll.data.subset$Opinion_on_Structure_of_US_Government, labels = c("a lot needs to change", "change is not really needed"))
names(poll.data.subset)[names(poll.data.subset) == 'Q8_1'] <- 'Trust_President'
poll.data.subset$Trust_President[poll.data.subset$Trust_President == -1]<-1
poll.data.subset$Trust_President<-factor(poll.data.subset$Trust_President, labels = c("a lot", "some", "not much", "not at all"))
names(poll.data.subset)[names(poll.data.subset) == 'Q8_2'] <- 'Trust_Congress'
poll.data.subset$Trust_Congress[poll.data.subset$Trust_Congress == -1]<-1
poll.data.subset$Trust_Congress<-factor(poll.data.subset$Trust_Congress, labels = c("a lot", "some", "not much", "not at all"))
names(poll.data.subset)[names(poll.data.subset) == 'Q8_3'] <- 'Trust_Supreme_Court'
poll.data.subset$Trust_Supreme_Court[poll.data.subset$Trust_Supreme_Court == -1]<-1
poll.data.subset$Trust_Supreme_Court<-factor(poll.data.subset$Trust_Supreme_Court, labels = c("a lot", "some", "not much", "not at all"))
names(poll.data.subset)[names(poll.data.subset) == 'Q8_4'] <- 'Trust_CDC'
poll.data.subset$Trust_CDC[poll.data.subset$Trust_CDC == -1]<-1
poll.data.subset$Trust_CDC<-factor(poll.data.subset$Trust_CDC, labels = c("a lot", "some", "not much", "not at all"))
names(poll.data.subset)[names(poll.data.subset) == 'Q8_5'] <- 'Trust_Elected_Officials'
poll.data.subset$Trust_Elected_Officials[poll.data.subset$Trust_Elected_Officials == -1]<-1
poll.data.subset$Trust_Elected_Officials<-factor(poll.data.subset$Trust_Elected_Officials, labels = c("a lot", "some", "not much", "not at all"))
names(poll.data.subset)[names(poll.data.subset) == 'Q8_6'] <- 'Trust_CIA_or_FBI'
poll.data.subset$Trust_CIA_or_FBI[poll.data.subset$Trust_CIA_or_FBI == -1]<-1
poll.data.subset$Trust_CIA_or_FBI<-factor(poll.data.subset$Trust_CIA_or_FBI, labels = c("a lot", "some", "not much", "not at all"))
names(poll.data.subset)[names(poll.data.subset) == 'Q8_7'] <- 'Trust_News_Media_Outlets'
poll.data.subset$Trust_News_Media_Outlets[poll.data.subset$Trust_News_Media_Outlets == -1]<-1
poll.data.subset$Trust_News_Media_Outlets<-factor(poll.data.subset$Trust_News_Media_Outlets, labels = c("a lot", "some", "not much", "not at all"))
names(poll.data.subset)[names(poll.data.subset) == 'Q8_8'] <- 'Trust_Police'
poll.data.subset$Trust_Police[poll.data.subset$Trust_Police == -1]<-1
poll.data.subset$Trust_Police<-factor(poll.data.subset$Trust_Police, labels = c("a lot", "some", "not much", "not at all"))
names(poll.data.subset)[names(poll.data.subset) == 'Q8_9'] <- 'Trust_US_Postal_Service'
poll.data.subset$Trust_US_Postal_Service[poll.data.subset$Trust_US_Postal_Service == -1]<-1
poll.data.subset$Trust_US_Postal_Service<-factor(poll.data.subset$Trust_US_Postal_Service, labels = c("a lot", "some", "not much", "not at all"))
names(poll.data.subset)[names(poll.data.subset) == 'Q10_1'] <- 'Recieve_Longterm_Disability'
poll.data.subset$Recieve_Longterm_Disability[poll.data.subset$Recieve_Longterm_Disability == -1]<-1
poll.data.subset$Recieve_Longterm_Disability<-factor(poll.data.subset$Recieve_Longterm_Disability, labels = c("Yes", "No"))
names(poll.data.subset)[names(poll.data.subset) == 'Q10_2'] <- 'Have_Chronic_Illness'
poll.data.subset$Have_Chronic_Illness[poll.data.subset$Have_Chronic_Illness == -1]<-1
poll.data.subset$Have_Chronic_Illness<-factor(poll.data.subset$Have_Chronic_Illness, labels = c("Yes", "No"))
names(poll.data.subset)[names(poll.data.subset) == 'Q10_3'] <- 'Unemployed_Longer_than_1Year'
poll.data.subset$Unemployed_Longer_than_1Year[poll.data.subset$Unemployed_Longer_than_1Year == -1]<-1
poll.data.subset$Unemployed_Longer_than_1Year<-factor(poll.data.subset$Unemployed_Longer_than_1Year, labels = c("Yes", "No"))
names(poll.data.subset)[names(poll.data.subset) == 'Q10_4'] <- 'Evicted_within_past_Year'
poll.data.subset$Evicted_within_past_Year[poll.data.subset$Evicted_within_past_Year == -1]<-1
poll.data.subset$Evicted_within_past_Year<-factor(poll.data.subset$Evicted_within_past_Year, labels = c("Yes", "No"))
names(poll.data.subset)[names(poll.data.subset) == 'Q16'] <- 'How_Easy_is_it_to_Vote_in_National_Elections'
poll.data.subset$How_Easy_is_it_to_Vote_in_National_Elections[poll.data.subset$How_Easy_is_it_to_Vote_in_National_Elections == -1]<-1
poll.data.subset$How_Easy_is_it_to_Vote_in_National_Elections<-factor(poll.data.subset$How_Easy_is_it_to_Vote_in_National_Elections, labels = c("Very easy", "Somewhat easy", "Somewhat difficult","Very difficult"))
names(poll.data.subset)[names(poll.data.subset) == 'Q23'] <- 'Presidential_Candidate_Vote_for_2020'
poll.data.subset$Presidential_Candidate_Vote_for_2020[poll.data.subset$Presidential_Candidate_Vote_for_2020 == -1]<-1
poll.data.subset$Presidential_Candidate_Vote_for_2020<-factor(poll.data.subset$Presidential_Candidate_Vote_for_2020, labels = c("Donald Trump", "Joe Biden", "Unsure"))
names(poll.data.subset)[names(poll.data.subset) == 'Q25'] <- 'Following_Presidential_Race_2020'
poll.data.subset$Following_Presidential_Race_2020[poll.data.subset$Following_Presidential_Race_2020 == -1]<-1
poll.data.subset$Following_Presidential_Race_2020<-factor(poll.data.subset$Following_Presidential_Race_2020, labels = c("Very closely", "Somewhat closely", "Not very closely","Not closely at all"))
names(poll.data.subset)[names(poll.data.subset) == 'Q30'] <- 'Political_Affiliation'
poll.data.subset$Political_Affiliation[poll.data.subset$Political_Affiliation == -1]<-1
poll.data.subset$Political_Affiliation<-factor(poll.data.subset$Political_Affiliation, labels = c("Republican", "Democrat", "Independent","Other","No preference"))
names(poll.data.subset)[names(poll.data.subset) == 'ppage'] <- 'Age'
names(poll.data.subset)[names(poll.data.subset) == 'educ'] <- 'Education'
names(poll.data.subset)[names(poll.data.subset) == 'income_cat'] <- 'Income'
# Show the improved colnames and labels
head(poll.data.subset)
## Importance_of_Voting Importance_of_Jury_Duty Importance_of_Following_Politics
## 1 very_important very_important somewhat_important
## 2 very_important somewhat_important somewhat_important
## 3 very_important very_important somewhat_important
## 4 very_important very_important very_important
## 5 very_important very_important very_important
## 6 not_so_important somewhat_important not_so_important
## Importance_of_the_Flag Importance_of_US_Census
## 1 not_at_all_important very_important
## 2 not_so_important very_important
## 3 somewhat_important very_important
## 4 not_so_important very_important
## 5 very_important very_important
## 6 not_at_all_important very_important
## Importance_of_Saying_the_Pledge Importance_of_Military_Support
## 1 not_at_all_important somewhat_important
## 2 very_important somewhat_important
## 3 very_important somewhat_important
## 4 very_important very_important
## 5 very_important very_important
## 6 not_so_important not_so_important
## Importance_of_Respecting_Opinions Importance_of_Religion
## 1 somewhat_important not_at_all_important
## 2 very_important very_important
## 3 very_important not_at_all_important
## 4 very_important very_important
## 5 very_important very_important
## 6 very_important very_important
## Importance_of_Right_to_Protest How_many_people_in_office_are_like_you
## 1 somewhat_important some
## 2 not_so_important some
## 3 not_so_important a lot
## 4 somewhat_important a few
## 5 very_important some
## 6 not_at_all_important none
## Opinion_on_Structure_of_US_Government Trust_President Trust_Congress
## 1 a lot needs to change not much not at all
## 2 change is not really needed some not much
## 3 a lot needs to change not much some
## 4 a lot needs to change not much some
## 5 change is not really needed a lot not much
## 6 a lot needs to change not much not much
## Trust_Supreme_Court Trust_CDC Trust_Elected_Officials Trust_CIA_or_FBI
## 1 some a lot a lot a lot
## 2 some some some some
## 3 a lot a lot some some
## 4 some some some some
## 5 some not much not much not much
## 6 not much some not much not much
## Trust_News_Media_Outlets Trust_Police Trust_US_Postal_Service
## 1 a lot some not at all
## 2 not much some some
## 3 some some a lot
## 4 some some some
## 5 not at all some some
## 6 some some some
## Recieve_Longterm_Disability Have_Chronic_Illness Unemployed_Longer_than_1Year
## 1 No No No
## 2 No No No
## 3 No No Yes
## 4 No No No
## 5 No No No
## 6 No No No
## Evicted_within_past_Year How_Easy_is_it_to_Vote_in_National_Elections
## 1 No Very easy
## 2 No Somewhat easy
## 3 No Very easy
## 4 No Very difficult
## 5 No Very easy
## 6 No Very easy
## Presidential_Candidate_Vote_for_2020 Following_Presidential_Race_2020
## 1 Joe Biden Very closely
## 2 Donald Trump Not very closely
## 3 Joe Biden Somewhat closely
## 4 Joe Biden Somewhat closely
## 5 Donald Trump Very closely
## 6 Donald Trump Not very closely
## Political_Affiliation Age Education race gender Income
## 1 Democrat 73 College White Female $75-125k
## 2 Independent 90 College White Female $125k or more
## 3 Democrat 53 College White Male $125k or more
## 4 Democrat 58 Some college Black Female $40-75k
## 5 Republican 81 High school or less White Male $40-75k
## 6 No preference 61 High school or less White Female $40-75k
## voter_category
## 1 always
## 2 always
## 3 sporadic
## 4 sporadic
## 5 always
## 6 rarely/never
The adjusted names and entries make the dataset much easier to read.
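The renaming code repeats the same three steps for every question, so it could be wrapped in a small function. The sketch below is one way to do that; `rename_and_label` and its `missing_to` argument are hypothetical names, not part of the original analysis. With `missing_to = 1` it reproduces the recoding used above, where `-1` is folded into the first category; with `missing_to = NA` it keeps those responses as missing, which may be preferable if `-1` marks a skipped question (an assumption about the survey coding).

```r
# Hypothetical helper: rename one column and turn its numeric codes into labels.
# missing_to = 1 reproduces the recoding used above; missing_to = NA
# keeps -1 responses as missing values instead.
rename_and_label <- function(df, old, new, labels, missing_to = 1) {
  stopifnot(old %in% names(df))
  x <- df[[old]]
  # Replace the -1 code, leaving any existing NA cells untouched.
  x[!is.na(x) & x == -1] <- missing_to
  # Explicit levels make the code-to-label mapping independent of
  # which codes happen to appear in the data.
  df[[old]] <- factor(x, levels = seq_along(labels), labels = labels)
  names(df)[names(df) == old] <- new
  df
}

trust_labels <- c("a lot", "some", "not much", "not at all")

# Toy data with invented values.
toy <- data.frame(Q8_1 = c(3, 2, -1, 1), Q8_2 = c(4, -1, 2, 2))

# Same behaviour as the code above: -1 becomes the first category.
toy_a <- rename_and_label(toy, "Q8_1", "Trust_President", trust_labels)

# Alternative: leave -1 as NA.
toy_b <- rename_and_label(toy, "Q8_2", "Trust_Congress", trust_labels,
                          missing_to = NA)

print(toy_a$Trust_President)
print(toy_b$Trust_Congress)
```

Passing `levels` explicitly also guards against a question where one answer option never occurs, a case in which `factor(x, labels = ...)` alone would stop with an error about mismatched label lengths.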
The original polling data needed to be adjusted to make the variables and responses easier to understand. I accomplished this through variable selection, renaming columns, and converting the coded categorical entries into labeled factors. The target variable is voter_category, and the enhanced dataset can now be analyzed to find meaningful insights about it.
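As a first step toward that analysis, one option is a cross-tabulation of a predictor against voter_category. The sketch below uses a six-row toy data frame with invented values and a reduced set of labels, and it ignores the survey weights, so it only illustrates the mechanics, not any result from the poll.

```r
# Toy data with invented values; only two importance levels are used here.
toy <- data.frame(
  Importance_of_Voting = factor(
    c("very_important", "very_important", "not_so_important",
      "very_important", "not_so_important", "not_so_important"),
    levels = c("very_important", "not_so_important")
  ),
  voter_category = factor(
    c("always", "sporadic", "rarely/never",
      "always", "rarely/never", "sporadic"),
    levels = c("always", "sporadic", "rarely/never")
  )
)

# Counts of each voter category within each importance level.
counts <- table(toy$Importance_of_Voting, toy$voter_category)

# Convert counts to row proportions so each importance level sums to 1.
row_share <- prop.table(counts, margin = 1)
print(round(row_share, 2))
```

On the real data, the same two calls could be applied to any of the labeled factors in `poll.data.subset`.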