Final Project Overview and Motivation

As the presidential election cycle starts to ramp up again, we thought it would be interesting to take a look back at the election data from 2016 in order to think more about potential factors that could affect the outcome of next year’s vote. Labeled by Politico as the “biggest upset in U.S. history”, a large narrative about a “divided” America continued to develop in the days following the 2016 presidential election. Many would argue that this narrative continues to dominate current news headlines and will be an influential factor in the way candidates run their campaigns over the next 12 months.

For our project, we are curious about what factors seem to “divide” America, and we explore several questions along those lines in the analysis below.


Our Datasets

We’ll be utilizing two different datasets for this project.


Dataset 1 – 2016 Presidential Election Data

The first contains the 2016 presidential election results for every United States county, along with data from several past elections.

knitr::opts_chunk$set(echo = TRUE)
# Load plyr before the tidyverse so that dplyr's verbs, attached later, take precedence
require(plyr)
require(tidyverse)  # attaches dplyr and tidyr, among other packages
require(knitr)
require(kableExtra)
# County-level election results, mirrored in our project repository
election_data <- read.csv('https://raw.githubusercontent.com/zachalexander/data607_cunysps/master/FinalProject/election_data.csv')
kable(head(election_data, n = 15L), align = rep('c', 5)) %>% 
  kable_styling(bootstrap_options = c("striped", "responsive", "condensed"), full_width = FALSE) %>% 
  kableExtra::scroll_box(width = "100%", height = "500px")
(Scrollable table output: the first 15 rows of election_data, one row per county (all Alabama counties here), with columns for state, county, FIPS code, 2016 and 2012 presidential vote totals (trump16, clinton16, otherpres16, romney12, obama12), 2016 Senate, House, and gubernatorial totals, and county demographics such as total_population, white_pct, black_pct, hispanic_pct, median_hh_inc, and rural_pct.)

The dataset was found on GitHub, here: https://github.com/tonmcg/US_County_Level_Election_Results_08-16

Information about this dataset: The 2016 election results were obtained from Townhall.com using a web-scraping package, Beautiful Soup, which is referenced in the readme file of the GitHub user above. Beautiful Soup is a Python HTML parser that assists in scraping data from websites. In this case, because Townhall.com published the results in an easy-to-scrape HTML table format, the GitHub user was able to obtain them quite easily with this package. Websites like Townhall.com have designated data teams that produce the real-time election results on the night of the election.
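
For context, the same kind of table scrape can be written in R with the rvest package. Here is a minimal sketch, assuming a hypothetical results-page URL (not the exact page the GitHub user scraped):

require(rvest)

# Hypothetical example: parse the first HTML <table> on a results page into a data frame
page <- read_html("https://townhall.com/election/2016/president/")
results <- page %>%
  html_element("table") %>%
  html_table()
head(results)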


Dataset 2 – Public Religion Research Institute (PRRI)

The second dataset that we’ll work with contains data related to values, along with a respondent identifier that captures each person's state of residence, which we can use to connect back to our election results. This dataset was found on the Public Religion Research Institute (PRRI) website and contains a large number of value-related questions, covering respondents’ views on immigration, gun control laws, health care, and much more.

Information about this dataset: This survey was designed and conducted entirely by PRRI. Their website states that it is the “eighth annual multi-issue survey of its kind”.

Survey Methodology

  • The survey was conducted between October 18th and October 30th, 2017.
  • Data collection was based on stratified, single-stage, random-digit-dialing (RDD) sampling of landline telephone households and randomly generated cell phone numbers.
  • In the end, they were able to survey 2,019 individuals (810 via landline and 1,209 via cell phone).
  • The sample represents responses from adults 18 years or older living in the United States.

Data Wrangling


Reading data files from GitHub

require(haven)  # read_sav() reads SPSS .sav files
# Pull the PRRI American Values Survey file directly from GitHub
spss_file <- file.path('https://github.com/zachalexander/data607_cunysps/blob/master/FinalProject/PRRI-2017-American-Values-Survey.sav?raw=true')
avs <- read_sav(spss_file)
kable(head(avs, n = 15L), align = rep('c', 5)) %>% 
  kable_styling(bootstrap_options = c("striped", "responsive", "condensed"), full_width = FALSE) %>% 
  kableExtra::scroll_box(width = "100%", height = "500px")
(Scrollable table output: the first 15 rows of the American Values Survey microdata, one row per respondent, with identifier and demographic columns such as case_id, weight, state, region, age, educ, income, race, and party, followed by coded responses to survey items q1 through q25.)

Subsetting the data

To start, we subsetted the election data, selecting the columns relevant to the 2016 election: the state column and the county-level vote totals for the Republican, Democratic, and other candidates. Then we created a total column that sums the votes per county. Next, we used the group_by() and summarise_all() functions to aggregate the votes by state, leaving a data frame with one row of voter information per state. Finally, we added a percentage column with the percent of voters in each state who voted for Donald Trump, dividing the number of Trump votes by the total number of votes in each state.

# Keep the state column plus the 2016 presidential vote totals
Trump_sub <- election_data[, c(2, 5, 6, 7)]
# Total votes cast per county
Trump_sub$total16 <- Trump_sub$trump16 + Trump_sub$clinton16 + Trump_sub$otherpres16
# Aggregate the county-level votes up to the state level
Trump_map <- Trump_sub %>% group_by(state) %>% summarise_all(sum)
# Trump's share of each state's total vote
Trump_map$trump_per <- Trump_map$trump16 / Trump_map$total16
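
For reference, the same wrangling could be condensed into a single dplyr pipeline; here is a sketch equivalent to the steps above:

# Equivalent one-pipeline version: select the 2016 columns, aggregate to the
# state level, then compute each state's total votes and Trump's share
Trump_map <- election_data %>%
  dplyr::select(state, trump16, clinton16, otherpres16) %>%
  group_by(state) %>%
  summarise_all(sum) %>%
  dplyr::mutate(total16 = trump16 + clinton16 + otherpres16,
                trump_per = trump16 / total16)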

For our next analyses and plots, we had to load in a few mapping packages, as well as the tidyverse package for more data wrangling.

require(sf)
require(maps)
require(RColorBrewer)
require(tools)

In order to create our first map, we used the maps package to load the geometry data needed to draw a map of the United States, including the state boundaries, and projected the map to the correct coordinate plane. Then we subset our election data a bit further to remove the vote totals for Alaska and Hawaii: very few American Values Survey respondents came from these states, so we felt that any analyses and comparisons would not be representative of them.

# Load state boundary geometries from the maps package as an sf object
states <- st_as_sf(map("state", plot = FALSE, fill = TRUE))
states$ID <- toTitleCase(states$ID)
# Drop Alaska and Hawaii to match the continental-U.S. map geometries
USA_votes <- Trump_map[which(Trump_map$state != "Alaska" & Trump_map$state != "Hawaii"),]
# Note: this assignment relies on both tables being sorted alphabetically by state
states$trump_per <- USA_votes$trump_per

# Bin each state's Trump share into one of four groups; the else-if chain
# ensures every value, including exact boundary cases, lands in a bucket
for(i in 1:length(states$trump_per)){
  if(states$trump_per[i] > 0.55){
    states$color[i] <- "1 - Greater than 55% voted for Trump"
  } else if(states$trump_per[i] > 0.50){
    states$color[i] <- "2 - Between 50% and 55% voted for Trump"
  } else if(states$trump_per[i] > 0.45){
    states$color[i] <- "3 - Between 45% and 50% voted for Trump"
  } else {
    states$color[i] <- "4 - Less than 45% voted for Trump"
  }
}
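
The same binning can be written more compactly with base R's cut(), which makes the interval boundaries explicit; this sketch matches the loop above:

# Right-closed intervals: (0, 0.45], (0.45, 0.5], (0.5, 0.55], (0.55, 1]
states$color <- cut(states$trump_per,
                    breaks = c(0, 0.45, 0.50, 0.55, 1),
                    labels = c("4 - Less than 45% voted for Trump",
                               "3 - Between 45% and 50% voted for Trump",
                               "2 - Between 50% and 55% voted for Trump",
                               "1 - Greater than 55% voted for Trump"))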

2016 Presidential Election Data


After getting the data into a state where we could start plotting the percentages, we used ggplot to create a map showing the percentage of voters who voted for Donald Trump in the 2016 election. Since we were focusing on Trump votes, our initial pass showed how far above or below 50% Donald Trump's vote share fell in each state. At this point we saw that the differences in many states were fairly small, so the contrast in our image was not ideal. To improve this, we created a second plot that bins the percent of Trump votes into 4 groups: ‘more than 55% for Trump’, ‘between 50% and 55% for Trump’, ‘between 45% and 50% for Trump’, and ‘less than 45% for Trump’. This map provided a much better picture of the broader statewide trends. Click between the two tabs below to see both options.

From these maps, we determined that the percentage of votes for Trump varied drastically across states and regions of the United States. In the next phase, we thought it would be interesting to see whether certain values could help explain these differences in voting percentages for Donald Trump.

Percentages

Broader Trend


American Values Survey


Next, we needed to do some data wrangling in order to isolate the responses used later for data analysis and mapping. We wrote a for() loop that builds a data frame for each question, containing each respondent's state and numerical answer. Additionally, we subset these data frames to include only definitive answers; in other words, we removed responses of “Don’t know” or “Refused to answer”.

question_list <- c("q17a", "q17b", "q17c", "q17d", "q17e", "q17f", "q17g", "q17h")
df_list <- c("Temp_df_a", "Temp_df_b", "Temp_df_c", "Temp_df_d", "Temp_df_e", "Temp_df_f", "Temp_df_g", "Temp_df_h")
q_num <- c(90, 92, 91, 93, 94, 95, 96, 97)  # column positions of q17a-h in avs

# For each part of question 17, keep only definitive answers (coded < 5)
# along with the respondent's state (column 4)
for(i in 1:length(q_num)){
  assign(df_list[i], avs[which(avs[[question_list[i]]] < 5), c(4, q_num[i])])
}
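
An alternative sketch, for what it's worth: the same subsets could live together in a named list rather than being assign()-ed into the global environment (answer_dfs is our own illustrative name):

# Keep the per-question subsets together in one named list
answer_dfs <- lapply(question_list, function(q) {
  avs[which(avs[[q]] < 5), c("state", q)]
})
names(answer_dfs) <- question_list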

After the data had been tidied into temporary dataframes and cleaned up, we were able to use some of the tidyverse functions to group and calculate average response values for each question by state. We did this using the lapply(), group_by(), and summarise_all() functions.

df_state_list <- c("state_q17a", "state_q17b", "state_q17c", "state_q17d", "state_q17e", "state_q17f", "state_q17g", "state_q17h")
dfs <- list(Temp_df_a, Temp_df_b, Temp_df_c, Temp_df_d, Temp_df_e, Temp_df_f, Temp_df_g, Temp_df_h)

grouped_data <- lapply(dfs, function(x){ x %>% group_by(state) %>% summarise_all(mean)})

for(i in 1:length(grouped_data)){
  assign(df_state_list[i], as.data.frame(grouped_data[[i]]))
}

Next, we merged all of the answers to question #17 into one dataframe in order to set up our file for mapping. By then joining these answers to the coordinate and geometry information, we were able to map the responses to these questions broken out by U.S. state.

library(openintro)  # provides abbr2state() for converting state abbreviations to full names

# Merge the eight per-question dataframes into one, keyed on state
q17_df <- Reduce(function(x, y) merge(x, y, all = TRUE), list(state_q17a, state_q17b, state_q17c, state_q17d, state_q17e, state_q17f, state_q17g, state_q17h))

q17_df <- q17_df[which(q17_df$state != "HI" & q17_df$state != "AK"),]
q17_df$state <- abbr2state(q17_df$state)
names(q17_df) <- c("ID", "q17a", "q17b", "q17c", "q17d", "q17e", "q17f", "q17g", "q17h")

# Attach the state-level survey averages to the map geometries
states <- merge(states, q17_df, by = "ID")

Interpretation of American Values Survey Questions

So, after all of the data wrangling we performed, how can we interpret the results? We decided to take a look at a few questions on pertinent topics to see if responses to them also varied across states. The first question we looked at was Question 17.


Question 17 – Do you strongly favor, favor, oppose or strongly oppose the following?

A

B

C

D

E

F

G

H

First analysis and regression model


After creating maps for each component of question #17, we could see that some items drew starker contrasts across states than others. For instance, there was generally broad support for allowing DACA recipients brought to the U.S. as children to gain legal resident status if they attend college or join the military. However, there was more disagreement across states over withdrawing from the nuclear agreement with Iran and over temporarily preventing people from some majority-Muslim countries from entering the United States.

At this point we decided to build our first regression model using the parts of question 17 as the variables and the percentage of votes won by Trump as the response.

elec_regr1 <- lm(trump_per ~  q17a + q17b + q17c + q17d + q17e + q17f + q17g + q17h , data = states)
summary(elec_regr1)
## 
## Call:
## lm(formula = trump_per ~ q17a + q17b + q17c + q17d + q17e + q17f + 
##     q17g + q17h, data = states)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.192144 -0.057717 -0.006362  0.040472  0.185307 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  0.10387    0.35658   0.291   0.7723  
## q17a         0.02848    0.07726   0.369   0.7144  
## q17b         0.05903    0.07537   0.783   0.4381  
## q17c         0.17792    0.06744   2.638   0.0118 *
## q17d        -0.15185    0.07908  -1.920   0.0620 .
## q17e        -0.10146    0.06234  -1.628   0.1115  
## q17f         0.07681    0.05615   1.368   0.1790  
## q17g         0.04865    0.07621   0.638   0.5269  
## q17h         0.07845    0.05163   1.519   0.1366  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08921 on 40 degrees of freedom
## Multiple R-squared:  0.529,  Adjusted R-squared:  0.4348 
## F-statistic: 5.616 on 8 and 40 DF,  p-value: 8.602e-05

The first model resulted in an \(R^{2}\) value of 0.529. Not too shabby for a first pass, but something that could hopefully be improved on. Before moving on to additional variables, we decided to apply some backward elimination to reduce our focus. This resulted in the following model with a slightly lower \(R^{2}\) of 0.5173, but fewer variables and a higher adjusted \(R^{2}\).
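
For reference, this kind of backward pass can also be automated with step(), though it selects on AIC rather than p-values, so it won't necessarily match a manual elimination exactly. A sketch:

# Automated backward elimination on the full question-17 model (AIC-based)
elec_regr1_reduced <- step(elec_regr1, direction = "backward", trace = FALSE)
summary(elec_regr1_reduced)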

elec_regr2 <- lm(trump_per ~  q17c + q17d + q17e + q17f + q17h , data = states)
summary(elec_regr2)
## 
## Call:
## lm(formula = trump_per ~ q17c + q17d + q17e + q17f + q17h, data = states)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.203295 -0.045223 -0.009582  0.043034  0.205947 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  0.33976    0.20697   1.642   0.1080   
## q17c         0.17803    0.05465   3.258   0.0022 **
## q17d        -0.11308    0.05841  -1.936   0.0595 . 
## q17e        -0.09842    0.05951  -1.654   0.1055   
## q17f         0.08291    0.05149   1.610   0.1147   
## q17h         0.07806    0.04322   1.806   0.0779 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0871 on 43 degrees of freedom
## Multiple R-squared:  0.5173, Adjusted R-squared:  0.4612 
## F-statistic: 9.217 on 5 and 43 DF,  p-value: 5.102e-06

At this point we decided to expand our field of variables and look at more questions. The next step was scrubbing the data from questions 22, 24 and 25 and incorporating them into our regression model.

Question 22 – Now as I read a few statements please tell me whether you completely agree, mostly agree, mostly disagree or completely disagree with each one?

A

B

C

D

E

F

G

Adding results from Q22 and Q25


As you can see from above, we created maps for each component of question #22 as well. Again, there were noticeable contrasts in responses to these value-based questions across states. For instance, there was generally broad agreement that people are not bothered when they come into contact with immigrants who speak little or no English. However, there was more disagreement across states over attributing the severity of recent natural disasters to global climate change, and over whether professional athletes should be required to stand during the national anthem at sporting events.

These responses were also helpful when building out our regression models later in our investigation. We decided to map out one more question from the survey (Q24) and to add the data for another (Q25) before creating another model.

Question 25:

“When you think about what it means to be American, which of the following comes closer to your own view?” (we ignored responses of “neither”, “both” and “Don’t Know/Refused”)

  1. Having a mix of different cultures and values from around the world
  2. Having a single culture grounded in Christian values

Question 24 – Do you think recent stories about women being sexually harassed and assaulted in the workplace are isolated incidents or are they part of a broader pattern of how women are often treated?

Second regression model set

Once we’d aggregated the data from these questions, we decided to take a look at an expanded regression model incorporating our new variables.

elec_regr3 <- lm(trump_per ~  q17c + q17d + q17e + q17f + q17h + q22a + q22b + q22c + q22d + q22e + q22f + q24 + q25, data = states)
summary(elec_regr3)
## 
## Call:
## lm(formula = trump_per ~ q17c + q17d + q17e + q17f + q17h + q22a + 
##     q22b + q22c + q22d + q22e + q22f + q24 + q25, data = states)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.22147 -0.04476  0.01023  0.04069  0.18626 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  0.60758    0.44145   1.376   0.1775  
## q17c         0.17187    0.06492   2.647   0.0121 *
## q17d        -0.18641    0.09576  -1.947   0.0596 .
## q17e        -0.08579    0.07242  -1.185   0.2441  
## q17f         0.03859    0.06500   0.594   0.5566  
## q17h         0.05032    0.05260   0.957   0.3453  
## q22a        -0.01970    0.04869  -0.404   0.6883  
## q22b         0.02579    0.04730   0.545   0.5890  
## q22c         0.01259    0.05970   0.211   0.8342  
## q22d        -0.03885    0.04599  -0.845   0.4039  
## q22e         0.03615    0.05672   0.637   0.5280  
## q22f         0.03407    0.06094   0.559   0.5797  
## q24         -0.17640    0.17272  -1.021   0.3141  
## q25          0.14444    0.15583   0.927   0.3603  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0888 on 35 degrees of freedom
## Multiple R-squared:  0.5917, Adjusted R-squared:   0.44 
## F-statistic: 3.901 on 13 and 35 DF,  p-value: 0.0006468

Adding our 8 new variables yielded a slight increase in the \(R^{2}\) value (about 0.07 over the previous model), but many of the variables showed high p-values and could therefore be eliminated from the model without significant loss to the \(R^{2}\) value. Our backward elimination here led us to remove all of the responses to the parts of question 22; they appear not to have been strong predictors of a vote for Trump in 2016, though question 24 did show a significant effect once the weaker variables were removed.

elec_regr4 <- lm(trump_per ~  q17c + q17d + q17e + q24 + q25, data = states)
summary(elec_regr4)
## 
## Call:
## lm(formula = trump_per ~ q17c + q17d + q17e + q24 + q25, data = states)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.249979 -0.050020  0.007008  0.050064  0.180692 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  0.80889    0.26212   3.086  0.00354 **
## q17c         0.18846    0.05385   3.500  0.00110 **
## q17d        -0.08684    0.05515  -1.575  0.12267   
## q17e        -0.11323    0.05644  -2.006  0.05113 . 
## q24         -0.33525    0.11133  -3.011  0.00434 **
## q25          0.21083    0.12797   1.647  0.10675   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08329 on 43 degrees of freedom
## Multiple R-squared:  0.5586, Adjusted R-squared:  0.5072 
## F-statistic: 10.88 on 5 and 43 DF,  p-value: 8.307e-07

At this point we took another look at the questions from the American Values Survey. We decided to avoid questions directly related to Donald Trump and party affiliation due to the potential bias they may have introduced. From what remained we decided to incorporate the responses to question 6 into our model as well. Below are plots of the responses organized by state.

Question 6 – Should the following topic be the highest priority, high but not the highest, or a lower priority?

A

B

C

D

E

F

Incorporating Q6 into the regression

This next regression model incorporated all of the responses to questions 6a-f.

elec_regr5 <- lm(trump_per ~  q17c + q17d + q17e + q24 + q25 + q6a + q6b + q6c + q6d + q6e + q6f, data = states)
summary(elec_regr5)
## 
## Call:
## lm(formula = trump_per ~ q17c + q17d + q17e + q24 + q25 + q6a + 
##     q6b + q6c + q6d + q6e + q6f, data = states)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.165018 -0.042201  0.000819  0.034778  0.206267 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.681799   0.448701   1.519 0.137621    
## q17c         0.197609   0.054766   3.608 0.000953 ***
## q17d        -0.077017   0.064896  -1.187 0.243304    
## q17e        -0.046973   0.066633  -0.705 0.485498    
## q24         -0.158915   0.127393  -1.247 0.220519    
## q25          0.034444   0.138029   0.250 0.804401    
## q6a          0.033999   0.049307   0.690 0.495024    
## q6b          0.149202   0.050256   2.969 0.005366 ** 
## q6c         -0.123089   0.055270  -2.227 0.032474 *  
## q6d         -0.076825   0.061716  -1.245 0.221472    
## q6e         -0.077962   0.045761  -1.704 0.097304 .  
## q6f         -0.004134   0.044065  -0.094 0.925787    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0752 on 35 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.6951, Adjusted R-squared:  0.5992 
## F-statistic: 7.253 on 11 and 35 DF,  p-value: 2.925e-06

Finally, a respectable \(R^{2}\) value! We cleaned up the model a little by removing some of the less impactful variables and generated some new models based on the results of our regression analyses. At this point we added the responses to two additional questions:

  1. Q20 - Do you see monuments to Confederate soldiers more as symbols of Southern pride or more as symbols of racism?
  2. Q21 - Do you think most reporters have a personal or political agenda, OR are most reporters trying to report the news fairly and accurately?

Additionally, based on our highest-performing model, we calculated predicted values of the percentage that voted for Trump in 2016 by state, and took a look at the residuals.

require(ggiraph)
require(ggiraphExtra)

states <- states %>% dplyr::select(ID, trump_per, q17a, q17b, q17d, q17f, q22e, q22f, q6b, q6c, q6e, q6f, q20, q21)

# compute predicted values by hand from the coefficients of our highest-performing lm (fit below)
for(i in 1:length(states$ID)){
  states$model[i] <- 0.83630 + 
    (0.07863 * states$q17a[i]) + 
    (-0.08399 * states$q17b[i]) + 
    (-0.14377 * states$q17d[i]) + 
    (0.09759 * states$q17f[i]) + 
    (-0.06089 * states$q22e[i]) + 
    (0.07426 * states$q22f[i]) + 
    (0.14610 * states$q6b[i]) +
    (-0.12655 * states$q6c[i]) +
    (-0.08417 * states$q6e[i]) +
    (-0.04930 * states$q6f[i]) +
    (-0.41984 * states$q20[i]) +
    (0.40825 * states$q21[i])
  
  states$diff[i] <- states$model[i] - states$trump_per[i]
}

model <- lm(formula = trump_per ~ q17a + q17b + q17d + q17f+ q22e+ q22f+ q6b+ q6c+ q6e+ q6f + q20+ q21, data = states)
summary(model)
## 
## Call:
## lm(formula = trump_per ~ q17a + q17b + q17d + q17f + q22e + q22f + 
##     q6b + q6c + q6e + q6f + q20 + q21, data = states)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.125656 -0.033505 -0.005109  0.027585  0.107003 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.50031    0.29191   1.714 0.095643 .  
## q17a         0.09861    0.04776   2.065 0.046642 *  
## q17b        -0.08374    0.05652  -1.482 0.147676    
## q17d        -0.11875    0.05411  -2.194 0.035132 *  
## q17f         0.11878    0.04961   2.394 0.022312 *  
## q22e        -0.06999    0.03854  -1.816 0.078203 .  
## q22f         0.08089    0.04001   2.022 0.051139 .  
## q6b          0.16623    0.03726   4.462 8.47e-05 ***
## q6c         -0.11416    0.04553  -2.508 0.017096 *  
## q6e         -0.08910    0.03589  -2.482 0.018146 *  
## q6f         -0.01926    0.03247  -0.593 0.556898    
## q20         -0.30816    0.07645  -4.031 0.000296 ***
## q21          0.36764    0.10182   3.611 0.000973 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05765 on 34 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.8259, Adjusted R-squared:  0.7645 
## F-statistic: 13.44 on 12 and 34 DF,  p-value: 1.33e-09
step(model, direction = "both", trace=FALSE ) 
## 
## Call:
## lm(formula = trump_per ~ q17a + q17b + q17d + q17f + q22e + q22f + 
##     q6b + q6c + q6e + q20 + q21, data = states)
## 
## Coefficients:
## (Intercept)         q17a         q17b         q17d         q17f         q22e         q22f          q6b          q6c          q6e          q20          q21  
##     0.41340      0.10186     -0.08226     -0.10941      0.11533     -0.06833      0.07729      0.16358     -0.11869     -0.08723     -0.28826      0.36884

Guided by the stepwise elimination above, we removed one more variable (q6f) from the model to boost the adjusted \(R^{2}\) a little more.

# refit the model with q6f removed, per the stepwise results above
model <- lm(formula = trump_per ~ q17a + q17b + q17d + q17f+ q22e+ q22f+ q6b+ q6c+ q6e + q20+ q21, data = states)
summary(model)
## 
## Call:
## lm(formula = trump_per ~ q17a + q17b + q17d + q17f + q22e + q22f + 
##     q6b + q6c + q6e + q20 + q21, data = states)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.12519 -0.02898 -0.00432  0.02730  0.10780 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.41340    0.25014   1.653 0.107344    
## q17a         0.10186    0.04700   2.167 0.037129 *  
## q17b        -0.08226    0.05594  -1.470 0.150364    
## q17d        -0.10941    0.05129  -2.133 0.040005 *  
## q17f         0.11533    0.04881   2.363 0.023825 *  
## q22e        -0.06833    0.03808  -1.794 0.081443 .  
## q22f         0.07729    0.03918   1.973 0.056482 .  
## q6b          0.16358    0.03665   4.464 8.01e-05 ***
## q6c         -0.11869    0.04446  -2.669 0.011440 *  
## q6e         -0.08723    0.03542  -2.463 0.018857 *  
## q20         -0.28826    0.06806  -4.235 0.000157 ***
## q21          0.36884    0.10085   3.657 0.000831 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05711 on 35 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.8241, Adjusted R-squared:  0.7688 
## F-statistic: 14.91 on 11 and 35 DF,  p-value: 3.914e-10
for(i in 1:length(states$ID)){
  states$model_two[i] <- 0.41340 + 
    (0.10186 * states$q17a[i]) + 
    (-0.08226 * states$q17b[i]) + 
    (-0.10941 * states$q17d[i]) + 
    (0.11533 * states$q17f[i]) + 
    (-0.06833 * states$q22e[i]) + 
    (0.07729 * states$q22f[i]) + 
    (0.16358 * states$q6b[i]) +
    (-0.11869 * states$q6c[i]) +
    (-0.08723 * states$q6e[i]) +
    (-0.28826 * states$q20[i]) +
    (0.36884 * states$q21[i])
  
  states$diff_two[i] <- states$model_two[i] - states$trump_per[i]
}
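
As an aside, hard-coding the coefficients isn't strictly necessary; the fitted lm object can generate the same predictions directly, and it returns NA for the two states with missing responses. A sketch (the 0.08 residual threshold is purely illustrative):

# Equivalent predictions straight from the fitted model object
states$model_two <- predict(model, newdata = states)
states$diff_two  <- states$model_two - states$trump_per

# Flag states whose predictions land far from the actual percentages
states$ID[which(abs(states$diff_two) > 0.08)]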

After calculating the predicted values with the updated model, we noticed that the predictions for a few states were still quite far from the actual values. To adjust for this, we decided to remove those states and re-run the model one final time.

## the residuals show that these were very far from 0, so decided to remove them to see if it would improve the model
states_out <- states %>% 
  filter(ID != "California" & ID != "Maryland" & ID != "Oklahoma" & ID != "Montana" & ID != "North Dakota")

## last model
model_three <- lm(formula = trump_per ~ q17a + q17b + q17d + q17f+ q22e+ q22f+ q6b+ q6c+ q6e + q20+ q21, data = states_out)
summary(model_three)
## 
## Call:
## lm(formula = trump_per ~ q17a + q17b + q17d + q17f + q22e + q22f + 
##     q6b + q6c + q6e + q20 + q21, data = states_out)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.06845 -0.02835 -0.01096  0.03545  0.08189 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.35615    0.20578   1.731  0.09313 .  
## q17a         0.08077    0.03892   2.075  0.04608 *  
## q17b        -0.07477    0.04593  -1.628  0.11336    
## q17d        -0.12467    0.04261  -2.926  0.00627 ** 
## q17f         0.12577    0.04121   3.052  0.00455 ** 
## q22e        -0.08823    0.03179  -2.776  0.00913 ** 
## q22f         0.06634    0.03237   2.049  0.04869 *  
## q6b          0.15742    0.03008   5.233 1.01e-05 ***
## q6c         -0.09137    0.03711  -2.462  0.01938 *  
## q6e         -0.06588    0.02989  -2.204  0.03484 *  
## q20         -0.26037    0.05611  -4.640 5.64e-05 ***
## q21          0.42384    0.08423   5.032 1.81e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04671 on 32 degrees of freedom
## Multiple R-squared:  0.8778, Adjusted R-squared:  0.8357 
## F-statistic: 20.89 on 11 and 32 DF,  p-value: 1.452e-11
for(i in 1:length(states_out$ID)){
  states_out$model_three[i] <- 0.35615 + 
    (0.08077 * states_out$q17a[i]) + 
    (-0.07477 * states_out$q17b[i]) + 
    (-0.12467 * states_out$q17d[i]) + 
    (0.12577 * states_out$q17f[i]) + 
    (-0.08823 * states_out$q22e[i]) + 
    (0.06634 * states_out$q22f[i]) + 
    (0.15742 * states_out$q6b[i]) +
    (-0.09137 * states_out$q6c[i]) +
    (-0.06588 * states_out$q6e[i]) +
    (-0.26037 * states_out$q20[i]) +
    (0.42384 * states_out$q21[i])
  
  states_out$diff_three[i] <- states_out$model_three[i] - states_out$trump_per[i]
}

model_data <- as.data.frame(states_out) %>% dplyr::select(ID, trump_per, model_three, diff_three)

With our final model yielding a relatively strong \(R^{2}\), we mapped out the residuals between the actual and predicted values for states with enough data:

Model against data for all of the USA

Finally, we were able to run our model against the data as a whole and come up with an estimate of Donald Trump's share of the popular vote in the 2016 election:

## [1] 0.4895869

48.96% is not far off from the 46.1% of the popular vote Trump received in the 2016 presidential election (https://en.wikipedia.org/wiki/2016_United_States_presidential_election).
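
The code behind this estimate isn't shown above. As a hypothetical reconstruction (we can't confirm the exact weighting used), one natural approach is to weight each state's predicted Trump share by its total 2016 vote count:

# Hypothetical sketch: approximate the national popular-vote share by weighting
# each state's predicted percentage (model_three) by its total 2016 votes.
# Assumes USA_votes (state vote totals) and states_out (predictions) from above.
national_est <- states_out %>%
  as.data.frame() %>%
  dplyr::left_join(USA_votes, by = c("ID" = "state")) %>%
  dplyr::summarise(trump_est = weighted.mean(model_three, total16, na.rm = TRUE))
national_est$trump_est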


Discussion

When we started building our regression model, it was interesting to see that certain values and questions from the survey seemed to be more predictive of the share of votes for Trump than others. Here are the questions, and their corresponding values, that made it into our final model:

Do you strongly favor, favor, oppose, or strongly oppose…

Do you completely agree, mostly agree, mostly disagree, or completely disagree…

Do you think that the following should be the highest priority, high but not highest priority, or lower priority?…

Do you see monuments to Confederate soldiers more as a symbol of Southern pride or more as symbols of racism?

Do you think most reporters have a personal or political agenda, OR are most reporters trying to report the news fairly and accurately?

As you can see, these value-based questions span a wide range of topics rather than centering on a single polarizing issue. This suggests that there truly was strong polarization across many different types of values across the country at the time of the 2016 election.


Conclusion

Elections often come down to the wire and are decided by a relatively small number of votes. Candidates spend millions of dollars on polling to get an idea of just how tight these margins are. We aimed to create a model that could predict polling preferences based on a select number of questions pertaining to major policy points and diversity. From a statistical point of view, the final result was a fairly strong model, with an \(R^{2}\) above 0.85. This is a prime example, however, of a case where the margins in the real world are much finer than in the mathematical world. The residuals of our model at times approached 0.1, which would mean misestimating Trump's share of the vote by 10 percentage points, and 10 points is the difference between a close race and a landslide.

While working through this process, we were able to examine values that seemed to be quite predictive of favoring votes for a particular candidate. For instance, it is interesting that responses to questions related to gun control, health care, our tax system, immigration, and building a wall along the southern border with Mexico all made it into our model. These topics have been at the forefront of the Trump campaign and his time in office.

When thinking about building out a similar model to predict the votes garnered by Hillary Clinton, we could re-examine these values that showed to be quite predictive of votes for Trump, and see if values outside the scope of this project could be added in later on that tend to hold more weight among left-leaning voters than right-leaning voters.

As mentioned at the outset, this election proved to be a very polarizing one, with people at times claiming that they would vote against a particular candidate rather than for one. The long list of values that make up our final model supports this, given that there were ample opportunities for voters to express why they felt their candidate was the right choice to lead the United States.


Addendum – Utilizing JSON Data for this Project

To utilize more types of data for our project, we also decided to make a webpage that shows a representation of the performance of our model relative to the actual percentage of votes for Trump by state in the 2016 election.

This webpage was made using a JavaScript framework (Angular), and the visualization was created using d3.js – a JavaScript charting library.

We converted the data to JSON format using the toJSON() function from the jsonlite library:

library(jsonlite)  # provides the toJSON() and prettify() functions used here

# Convert each row of model_data into its own JSON object, collected in an array
webdata <- toJSON(unname(split(model_data, 1:nrow(model_data))))

# # to view the json data uncomment below
# prettify(webdata)
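
To hand the data to the Angular app, the JSON string just needs to be written to disk. A minimal sketch (the file name "model_data.json" is illustrative, and the app's actual asset path may differ):

# Write the JSON string out as a static asset for the web app to load
write(webdata, file = "model_data.json")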

Then, we were able to utilize this dataset to create a d3.js graphic and webpage. Here is the final product.