As the presidential election cycle starts to ramp up again, we thought it would be interesting to look back at the election data from 2016 in order to think more about potential factors that could affect the outcome of next year’s vote. Politico labeled the 2016 result the “biggest upset in U.S. history”, and in the days that followed, a narrative about a “divided” America began to take hold. Many would argue that this narrative continues to dominate current news headlines and will be an influential factor in the way candidates run their campaigns over the next 12 months.
For our project, we are curious about what factors seem to “divide” America. We’ll explore questions such as:
We’ll be utilizing two different datasets for this project.
The first contains the 2016 presidential election results for every United States county, among other past election data.
knitr::opts_chunk$set(echo = TRUE)
require(tidyverse)
require(dplyr)
require(tidyr)
require(plyr)
require(knitr)
require(kableExtra)
election_data <- read.csv('https://raw.githubusercontent.com/zachalexander/data607_cunysps/master/FinalProject/election_data.csv')
kable(head(election_data, n = 15L), align = rep('c', 5)) %>%
kable_styling(bootstrap_options = c("striped", "responsive", "condensed"), full_width = FALSE) %>%
kableExtra::scroll_box(width = "100%", height = "500px")
X | state | county | fips | trump16 | clinton16 | otherpres16 | romney12 | obama12 | otherpres12 | demsen16 | repsen16 | othersen16 | demhouse16 | rephouse16 | otherhouse16 | demgov16 | repgov16 | othergov16 | repgov14 | demgov14 | othergov14 | total_population | cvap | white_pct | black_pct | hispanic_pct | nonwhite_pct | foreignborn_pct | female_pct | age29andunder_pct | age65andolder_pct | median_hh_inc | clf_unemploy_pct | lesshs_pct | lesscollege_pct | lesshs_whites_pct | lesscollege_whites_pct | rural_pct | ruralurban_cc |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Alabama | Autauga | 1001 | 18172 | 5936 | 865 | 17379 | 6363 | 190 | 6331 | 18220 | 62 | 7544 | 14315 | 2258 | NA | NA | NA | 9427 | 3638 | 0 | 55049 | 40690 | 75.68348 | 18.370906 | 2.5722538 | 24.316518 | 1.8383622 | 51.17622 | 40.03706 | 13.97846 | 53099 | 5.591657 | 12.417046 | 75.40723 | 10.002112 | 74.06560 | 42.00216 | 2 |
2 | Alabama | Baldwin | 1003 | 72883 | 18458 | 3874 | 66016 | 18424 | 898 | 19145 | 74021 | 248 | 0 | 76995 | 1991 | NA | NA | NA | 37650 | 8416 | 0 | 199510 | 151770 | 83.17879 | 9.225603 | 4.3666984 | 16.821212 | 3.2695103 | 51.19493 | 35.47441 | 18.71485 | 51365 | 6.286843 | 9.972418 | 70.45289 | 7.842227 | 68.40561 | 42.27910 | 3 |
3 | Alabama | Barbour | 1005 | 5454 | 4871 | 144 | 5550 | 5912 | 47 | 4777 | 5436 | 16 | 5297 | 4286 | 463 | NA | NA | NA | 3111 | 3651 | 0 | 26614 | 20375 | 45.88562 | 47.888329 | 4.3097618 | 54.114376 | 2.8593973 | 46.49808 | 37.66439 | 16.52889 | 33956 | 12.824738 | 26.235928 | 87.13221 | 19.579752 | 81.36475 | 67.78963 | 6 |
4 | Alabama | Bibb | 1007 | 6738 | 1874 | 207 | 6132 | 2202 | 86 | 2082 | 6612 | 17 | 1971 | 6670 | 15 | NA | NA | NA | 3525 | 1368 | 0 | 22572 | 17590 | 74.76520 | 21.212121 | 2.2239943 | 25.234804 | 1.3512316 | 46.46465 | 37.32943 | 14.88570 | 39776 | 7.146827 | 19.301587 | 88.00000 | 15.020490 | 87.47177 | 68.35261 | 1 |
5 | Alabama | Blount | 1009 | 22859 | 2156 | 573 | 20757 | 2970 | 279 | 2980 | 22169 | 48 | 2390 | 22367 | 47 | NA | NA | NA | 12074 | 2178 | 0 | 57704 | 42430 | 87.65770 | 1.557951 | 8.7272979 | 12.342299 | 4.2718009 | 50.48524 | 37.24005 | 17.19292 | 46212 | 5.953833 | 19.968585 | 86.95024 | 16.643368 | 86.16361 | 89.95150 | 1 |
6 | Alabama | Bullock | 1011 | 1140 | 3530 | 40 | 1251 | 4061 | 10 | 3364 | 1167 | 6 | 3517 | 889 | 80 | NA | NA | NA | 747 | 2440 | 0 | 10552 | 8195 | 21.68309 | 75.502274 | 0.1231994 | 78.316907 | 1.5163002 | 45.80174 | 37.93594 | 15.13457 | 29335 | 13.258520 | 33.437883 | 89.74499 | 11.780384 | 79.15778 | 51.37438 | 6 |
7 | Alabama | Butler | 1013 | 4901 | 3726 | 105 | 5087 | 4374 | 35 | 3663 | 4840 | 7 | 4088 | 3918 | 321 | NA | NA | NA | 3148 | 2741 | 0 | 20280 | 15425 | 52.78107 | 43.515779 | 1.2475345 | 47.218935 | 1.3954635 | 53.04734 | 37.05621 | 18.12623 | 34315 | 9.804827 | 18.940426 | 83.91999 | 14.604180 | 80.07600 | 71.23216 | 6 |
8 | Alabama | Calhoun | 1015 | 32865 | 13242 | 1757 | 30278 | 15511 | 468 | 14152 | 32976 | 69 | 14000 | 33175 | 75 | NA | NA | NA | 17688 | 9082 | 0 | 115883 | 88525 | 72.99777 | 20.331714 | 3.4362245 | 27.002235 | 2.3937937 | 51.89804 | 38.78308 | 15.93072 | 41954 | 11.681822 | 17.663137 | 82.34703 | 16.399511 | 81.64061 | 33.69683 | 3 |
9 | Alabama | Chambers | 1017 | 7843 | 5784 | 273 | 7626 | 6871 | 114 | 5845 | 7865 | 18 | 5796 | 7907 | 16 | NA | NA | NA | 3635 | 4587 | 0 | 34018 | 26480 | 56.74349 | 40.478570 | 0.4350638 | 43.256511 | 0.9906520 | 52.08713 | 35.87219 | 18.38732 | 36027 | 7.489945 | 19.736732 | 87.51572 | 15.295351 | 83.85879 | 49.14803 | 6 |
10 | Alabama | Cherokee | 1019 | 8953 | 1547 | 233 | 7506 | 2132 | 141 | 1915 | 8636 | 7 | 1702 | 8707 | 12 | NA | NA | NA | 5007 | 1868 | 0 | 25897 | 20505 | 91.87165 | 4.606711 | 1.5638877 | 8.128355 | 0.7259528 | 50.25292 | 32.71035 | 20.32282 | 38925 | 5.855731 | 18.717236 | 86.03830 | 19.256407 | 85.78324 | 85.73627 | 6 |
11 | Alabama | Chilton | 1021 | 15081 | 2911 | 377 | 13932 | 3397 | 133 | 3327 | 14582 | 25 | 2996 | 14723 | 12 | NA | NA | NA | 9028 | 2419 | 0 | 43817 | 31555 | 80.61027 | 10.009813 | 7.6248944 | 19.389735 | 5.5572038 | 50.79764 | 38.83424 | 15.24979 | 42594 | 7.938580 | 19.902472 | 85.14545 | 17.849792 | 84.38194 | 86.74472 | 1 |
12 | Alabama | Choctaw | 1023 | 4106 | 3109 | 77 | 4152 | 3786 | 30 | 2992 | 4035 | 8 | 4332 | 0 | 94 | NA | NA | NA | 2380 | 1947 | 0 | 13287 | 10515 | 56.20531 | 42.229247 | 0.5569354 | 43.794686 | 0.1655754 | 52.21645 | 33.40107 | 20.74960 | 32622 | 13.642814 | 21.147333 | 88.03980 | 17.688938 | 87.45893 | 100.00000 | 9 |
13 | Alabama | Clarke | 1025 | 7140 | 5749 | 142 | 7470 | 6334 | 47 | 5558 | 7158 | 14 | 5214 | 2848 | 127 | NA | NA | NA | 5000 | 3524 | 0 | 24847 | 19195 | 53.17745 | 45.494426 | 0.1730591 | 46.822554 | 0.4668572 | 52.75888 | 36.60402 | 18.06254 | 32735 | 17.083333 | 18.957819 | 87.86277 | 14.136998 | 82.08411 | 75.98034 | 7 |
14 | Alabama | Clay | 1027 | 5245 | 1237 | 142 | 4817 | 1777 | 68 | 1377 | 5147 | 4 | 1317 | 5231 | 4 | NA | NA | NA | 3199 | 1215 | 0 | 13483 | 10360 | 80.26404 | 14.744493 | 3.1076170 | 19.735964 | 1.8912705 | 51.55381 | 34.84388 | 19.40221 | 38815 | 6.591530 | 25.386407 | 88.93712 | 21.804320 | 87.52224 | 100.00000 | 9 |
15 | Alabama | Cleburne | 1029 | 5764 | 684 | 145 | 5272 | 971 | 62 | 847 | 5554 | 6 | 769 | 5644 | 4 | NA | NA | NA | 3129 | 717 | 0 | 14991 | 11295 | 92.79568 | 2.361417 | 2.3280635 | 7.204323 | 1.7477153 | 50.48362 | 36.14169 | 18.16423 | 36316 | 7.173601 | 25.815165 | 88.47193 | 24.262969 | 88.65948 | 100.00000 | 8 |
The dataset was found on GitHub, here: https://github.com/tonmcg/US_County_Level_Election_Results_08-16
Information about this dataset: The 2016 election results were obtained from Townhall.com using the Python web-scraping package beautifulsoup, which is referenced in the GitHub user’s readme file. Beautifulsoup is an HTML parser that assists in scraping data from websites. In this case, the GitHub user located the published election results and, thanks to their easy-to-scrape HTML table format, was able to obtain the results quite easily with this package. Websites like Townhall.com have designated data teams that produce the real-time election results on the night of the election.
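The original scraping was done in Python; purely as an illustration, a similar HTML results table could be pulled into R with the rvest package (the URL below is a placeholder, not the actual Townhall results page):
library(rvest)
# Hypothetical URL -- the real Townhall results pages are organized by state and year
results_url <- "https://example.com/2016-election-results"
# Read the page and pull the first HTML table into a data frame
results_page <- read_html(results_url)
results_table <- results_page %>%
  html_element("table") %>%
  html_table()
head(results_table)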
The second dataset we’ll work with contains values-related survey responses, with a respondent identifier that captures state of residence – which we can use to connect back to our election results. This dataset was found on the Public Religion Research Institute (PRRI) website and contains a large number of questions covering respondents’ views on immigration, gun control laws, health care, and much more.
Information about this dataset: This survey was designed and conducted entirely by PRRI. Their website states that it is the “eighth annual multi-issue survey of its kind”.
Survey Methodology
require(haven)
spss_file <- file.path('https://github.com/zachalexander/data607_cunysps/blob/master/FinalProject/PRRI-2017-American-Values-Survey.sav?raw=true')
avs <- read_sav(spss_file)
kable(head(avs, n = 15L), align = rep('c', 5)) %>%
kable_styling(bootstrap_options = c("striped", "responsive", "condensed"), full_width = FALSE) %>%
kableExtra::scroll_box(width = "100%", height = "500px")
case_id | week | weight | state | metro | region | division | ownhome | marital | employ | hhnum | adults | kids1217 | kids611 | kids06 | parent | age | age2 | educ2 | educ | income | race | partyrot | party | ideorot | ideo | regvote | othtel | sex | ident | l1 | l2a | c1 | c2 | c1a | usborn | relig | religopen | denom | denomopen | born | jewid | secular | jdenom | year | demh1 | demh2 | date | sexorient_1 | sexorient_2 | sexorient_3 | sexorient_4 | sexorient_5 | formrot | q1 | q2 | q3rot | q3 | q4rot | q4 | q5rot | q5 | q5ot | q6rot | q6a | q6b | q6c | q6d | q6e | q6f | partyln | partyst1 | partyst2 | q7 | q8 | q9 | q10 | q11 | q12 | q13rot | q13 | q14rot | q14a | q14b | q14c | q14d | q15 | q16 | q17rot | q17a | q17c | q17b | q17d | q17e | q17f | q17g | q17h | q18 | q19 | q20 | q20ot | q21 | q22rot | q22a | q22b | q22c | q22d | q22e | q22f | q22g | q23 | q24 | q25rot | q25 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10000001 | 842 | 0.3804 | SC | 3 | 3 | 5 | 1 | 6 | 3 | 2 | 2 | NA | NA | NA | 2 | 69 | NA | 6 | 4 | 2 | 1 | 1 | 2 | 1 | 4 | 1 | 1 | 1 | 1 | 2 | NA | NA | NA | NA | NA | 90 | NA | NA | 2 | 1 | NA | 17 | NA | NA | 171018 | 0 | 0 | 0 | 0 | 1 | 1 | 3 | 4 | 1 | 2 | NA | NA | 1 | 4 | ACB | 2 | 3 | 2 | NA | NA | NA | NA | NA | 1 | 4 | 1 | 3 | NA | NA | NA | 1 | 2 | ABCD | 4 | 4 | 1 | 3 | 3 | 1 | EGHCDAFB | 2 | 2 | 3 | 3 | 3 | 2 | 3 | 3 | 1 | 3 | 1 | 2 | DBEFGCA | 1 | 3 | 3 | 3 | 4 | 4 | 1 | 2 | 1 | 1 | 1 | ||||
10000002 | 842 | 0.5964 | GA | 1 | 3 | 5 | 1 | 6 | 3 | 2 | 2 | NA | NA | NA | 2 | 65 | NA | 4 | 3 | 98 | 1 | 1 | 3 | 1 | 1 | 2 | 0 | 2 | 1 | 1 | NA | NA | NA | 8 | NA | 99 | NA | NA | NA | NA | NA | 17 | NA | NA | 171018 | 1 | 0 | 0 | 0 | 0 | 2 | 3 | 4 | 1 | 2 | NA | NA | 4 | 3 | FED | NA | NA | NA | 1 | 1 | 2 | 2 | NA | NA | 4 | 1 | NA | 3 | NA | NA | 1 | 2 | DBCA | 3 | 4 | 1 | 4 | 3 | 1 | DBAFHECG | 2 | 3 | 2 | 3 | 2 | 2 | 3 | 3 | 3 | 1 | 2 | 1 | GBCDEFA | 3 | 4 | 3 | 3 | 2 | 4 | 2 | 1 | 1 | 1 | 2 | ||||
10000003 | 842 | 0.8891 | FL | 1 | 3 | 5 | 1 | 3 | 4 | 2 | 2 | NA | NA | NA | 2 | 99 | 4 | 4 | 3 | 4 | 2 | 2 | 1 | 2 | 3 | 1 | 0 | 2 | 1 | 1 | NA | NA | NA | 1 | NA | 1 | NA | 1 | NA | NA | NA | 17 | NA | NA | 171018 | 1 | 0 | 0 | 0 | 0 | 2 | 5 | 4 | 2 | 2 | NA | NA | 2 | 5 | EDF | NA | NA | NA | 1 | 2 | 3 | NA | 2 | NA | 4 | 2 | NA | 3 | 2 | 2 | 2 | 9 | ADBC | 3 | 4 | 1 | 3 | 3 | 1 | ACFBHGED | 3 | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 1 | 1 | 1 | 1 | BDFCGEA | 2 | 2 | 2 | 1 | 3 | 3 | 2 | 1 | 1 | 2 | 2 | ||||
10000004 | 842 | 0.9169 | FL | 1 | 3 | 5 | 1 | 3 | 3 | 2 | 2 | NA | NA | NA | 2 | 71 | NA | 5 | 3 | 4 | 1 | 2 | 2 | 2 | 4 | 1 | 0 | 1 | 1 | 1 | NA | NA | NA | 1 | NA | 96 | NA | NA | 2 | NA | NA | 17 | NA | NA | 171018 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 4 | 2 | 2 | NA | NA | 5 | 5 | ACB | 1 | 1 | 1 | NA | NA | NA | NA | NA | 1 | 4 | 2 | 3 | NA | NA | NA | 2 | 2 | DACB | 4 | 4 | 1 | 4 | 3 | 1 | DHGCEABF | 3 | 3 | 3 | 4 | 3 | 2 | 3 | 4 | 2 | 1 | 1 | 2 | GACEBFD | 1 | 3 | 3 | 3 | 2 | 3 | 2 | 2 | 1 | 2 | 1 | ||||
10000005 | 842 | 0.8904 | NJ | 3 | 1 | 2 | 2 | 1 | 6 | 1 | 1 | NA | NA | NA | 2 | 33 | NA | 3 | 2 | 1 | 2 | 1 | 2 | 1 | 4 | 1 | 0 | 1 | 1 | 1 | NA | NA | NA | 1 | NA | 22 | 13 | 2 | NA | NA | NA | 17 | 1 | 4 | 171018 | 1 | 0 | 0 | 0 | 0 | 2 | 4 | 2 | NA | NA | 1 | 1 | 3 | 3 | DFE | NA | NA | NA | 3 | 3 | 2 | NA | NA | 2 | 3 | 2 | NA | 3 | NA | NA | 1 | 2 | BDCA | 4 | 4 | 1 | 4 | 3 | 2 | BHCFAEGD | 2 | 2 | 4 | 3 | 2 | 4 | 1 | 2 | 1 | 1 | 1 | 2 | ADEBFGC | 1 | 2 | 2 | 1 | 4 | 4 | 1 | 2 | 1 | 1 | 1 | ||||
10000006 | 842 | 0.6692 | OR | 5 | 4 | 9 | 2 | 4 | 7 | 1 | 1 | NA | NA | NA | 2 | 55 | NA | 3 | 2 | 1 | 1 | 1 | 2 | 1 | 4 | 1 | 0 | 2 | 1 | 1 | NA | NA | NA | 1 | NA | 90 | NA | NA | 2 | 2 | NA | 17 | 1 | 4 | 171018 | 1 | 0 | 0 | 0 | 0 | 1 | 4 | 3 | 1 | 2 | NA | NA | 2 | 4 | ACB | 9 | 2 | 1 | NA | NA | NA | NA | NA | 2 | 3 | 1 | 3 | NA | NA | NA | 1 | 2 | CADB | 9 | 2 | 9 | 4 | 9 | 1 | CGFEDHAB | 2 | 1 | 4 | 3 | 3 | 2 | 2 | 3 | 1 | 1 | 2 | 1 | GABCDEF | 1 | 1 | 1 | 4 | 4 | 4 | 1 | 1 | 1 | 1 | 1 | ||||
10000007 | 842 | 0.3147 | OH | 3 | 2 | 3 | 1 | 3 | 1 | 1 | 1 | NA | NA | NA | 2 | 50 | NA | 4 | 3 | 8 | 1 | 2 | 3 | 2 | 1 | 1 | 4 | 1 | 1 | 1 | NA | NA | NA | 7 | NA | 99 | NA | NA | NA | NA | NA | 17 | 1 | 2 | 171018 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | NA | NA | 2 | 1 | 4 | 2 | ABC | 1 | 9 | 9 | NA | NA | NA | 1 | NA | NA | 1 | 9 | 9 | NA | 9 | 9 | 2 | 9 | BCDA | 1 | 1 | 1 | 1 | 9 | 2 | BGHEAFCD | 2 | 3 | 1 | 1 | 2 | 4 | 9 | 1 | 1 | 2 | 9 | 1 | ECBDFGA | 4 | 1 | 1 | 3 | 3 | 2 | 3 | 9 | 9 | 2 | 2 | ||||
10000008 | 842 | 0.8399 | MD | 2 | 3 | 5 | 1 | 3 | 1 | 2 | 2 | NA | NA | NA | 2 | 61 | NA | 7 | 5 | 12 | 1 | 1 | 2 | 1 | 3 | 1 | 0 | 1 | 1 | 1 | NA | NA | NA | 1 | NA | 99 | NA | NA | NA | NA | NA | 17 | 1 | 1 | 171018 | 1 | 0 | 0 | 0 | 0 | 2 | 1 | 4 | 1 | 2 | NA | NA | 4 | 4 | EDF | NA | NA | NA | 3 | 3 | 2 | NA | NA | 1 | 4 | 1 | NA | 3 | NA | NA | 1 | 2 | ABCD | 4 | 4 | 1 | 4 | 3 | 1 | ECDHFBAG | 1 | 9 | 3 | 3 | 3 | 2 | 4 | 3 | 1 | 1 | 1 | 2 | CAFGEBD | 1 | 3 | 9 | 4 | 4 | 4 | 1 | 2 | 1 | 1 | 1 | ||||
10000009 | 842 | 1.3808 | FL | 2 | 3 | 5 | 1 | 3 | 1 | 4 | 3 | 1 | 0 | 0 | 1 | 56 | NA | 5 | 3 | 13 | 1 | 1 | 3 | 1 | 3 | 1 | 0 | 2 | 1 | 1 | NA | NA | NA | 1 | NA | 2 | NA | NA | NA | NA | NA | 17 | 1 | 1 | 171018 | 1 | 0 | 0 | 0 | 0 | 2 | 4 | 4 | 1 | 2 | NA | NA | 1 | 5 | EFD | NA | NA | NA | 1 | 3 | 1 | 2 | NA | NA | 4 | 2 | NA | 3 | NA | NA | 1 | 2 | DACB | 4 | 3 | 1 | 3 | 3 | 1 | HEDGACBF | 2 | 1 | 4 | 1 | 2 | 4 | 1 | 4 | 4 | 3 | 1 | 2 | GACEBFD | 1 | 3 | 2 | 1 | 3 | 1 | 2 | 2 | 1 | 1 | 1 | ||||
10000011 | 842 | 0.6906 | MI | 3 | 2 | 3 | 1 | 3 | 3 | 2 | 2 | NA | NA | NA | 2 | 68 | NA | 4 | 3 | 6 | 1 | 2 | 3 | 2 | 4 | 1 | 0 | 2 | 1 | 1 | NA | NA | NA | 1 | NA | 95 | NA | NA | 2 | NA | NA | 17 | NA | NA | 171018 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 4 | 2 | 2 | NA | NA | 1 | 4 | CBA | 3 | 2 | 3 | NA | NA | NA | 2 | NA | NA | 4 | 1 | 3 | NA | NA | NA | 2 | 2 | BDCA | 4 | 4 | 1 | 4 | 3 | 1 | BCHAEGFD | 1 | 1 | 4 | 4 | 4 | 2 | 4 | 4 | 1 | 1 | 2 | 2 | AFEGBDC | 2 | 4 | 3 | 3 | 4 | 4 | 1 | 2 | 1 | 2 | 1 | ||||
10000012 | 842 | 1.4924 | NY | 3 | 1 | 2 | 1 | 1 | 6 | 9 | 9 | NA | NA | NA | 2 | 41 | NA | 3 | 2 | 1 | 1 | 2 | 2 | 2 | 5 | 1 | 0 | 1 | 1 | 1 | NA | NA | NA | 1 | NA | 2 | NA | NA | NA | NA | NA | 17 | 1 | 4 | 171018 | 0 | 0 | 0 | 0 | 1 | 2 | 5 | 3 | 2 | 1 | NA | NA | 4 | 3 | DEF | NA | NA | NA | 1 | 1 | 1 | NA | NA | 2 | 4 | 3 | NA | 2 | NA | NA | 2 | 3 | CDBA | 2 | 2 | 2 | 2 | 3 | 1 | CDGBHFEA | 4 | 4 | 4 | 4 | 4 | 4 | 1 | 4 | 2 | 3 | 2 | 2 | CDABEFG | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 2 | 2 | 2 | 1 | ||||
10000013 | 842 | 0.9353 | PA | 3 | 1 | 2 | 1 | 2 | 6 | 2 | 2 | NA | NA | NA | 2 | 74 | NA | 4 | 3 | 5 | 1 | 1 | 2 | 1 | 3 | 1 | 0 | 1 | 1 | 2 | NA | NA | NA | NA | NA | 15 | 3 | 2 | NA | NA | NA | 17 | NA | NA | 171018 | 1 | 0 | 0 | 0 | 0 | 1 | 2 | 4 | 1 | 1 | NA | NA | 4 | 5 | CAB | 3 | 3 | 1 | NA | NA | NA | NA | NA | 1 | 3 | 1 | 2 | NA | NA | NA | 1 | 2 | CBAD | 4 | 3 | 1 | 3 | 3 | 2 | GFCDAEBH | 4 | 3 | 3 | 4 | 9 | 4 | 3 | 4 | 1 | 1 | 1 | 2 | ABFGECD | 2 | 3 | 2 | 4 | 2 | 1 | 1 | 2 | 2 | 1 | 1 | ||||
10000014 | 842 | 0.6506 | NY | 1 | 1 | 2 | 2 | 6 | 3 | 1 | 1 | NA | NA | NA | 2 | 62 | NA | 3 | 2 | 1 | 8 | 2 | 2 | 2 | 3 | 1 | 0 | 2 | 1 | 1 | NA | NA | NA | 1 | NA | 25 | NA | 1 | NA | NA | NA | 17 | 1 | 4 | 171018 | 1 | 0 | 0 | 0 | 0 | 2 | 4 | 3 | 2 | 2 | NA | NA | 3 | 5 | FED | NA | NA | NA | 2 | 1 | 1 | NA | NA | 2 | 3 | 2 | NA | 3 | NA | NA | 2 | 2 | CBAD | 4 | 1 | 1 | 3 | 2 | 2 | GBCHAFED | 4 | 4 | 3 | 4 | 4 | 1 | 2 | 4 | 1 | 3 | 2 | 2 | GACEBFD | 2 | 2 | 2 | 3 | 2 | 3 | 3 | 2 | 1 | 2 | 2 | ||||
10000015 | 842 | 0.6039 | NY | 4 | 1 | 2 | 1 | 2 | 2 | 5 | 4 | 0 | 1 | 0 | 1 | 27 | NA | 4 | 3 | 6 | 1 | 1 | 3 | 1 | 3 | 1 | 0 | 1 | 1 | 1 | NA | NA | NA | 2 | NA | 20 | NA | NA | NA | NA | NA | 17 | 1 | 4 | 171018 | 1 | 0 | 0 | 0 | 0 | 2 | 3 | 4 | 1 | 2 | NA | NA | 5 | 4 | FED | NA | NA | NA | 3 | 3 | 1 | 2 | NA | NA | 4 | 3 | NA | 3 | NA | NA | 1 | 2 | DBCA | 4 | 4 | 1 | 4 | 3 | 1 | HADGECFB | 1 | 1 | 4 | 4 | 4 | 1 | 4 | 4 | 1 | 1 | 2 | 1 | CBEFAGD | 1 | 4 | 3 | 3 | 4 | 4 | 2 | 2 | 1 | 1 | 1 | ||||
10000016 | 842 | 1.7472 | NY | 4 | 1 | 2 | 1 | 3 | 4 | 4 | 4 | NA | NA | NA | 2 | 55 | NA | 3 | 2 | 6 | 1 | 2 | 3 | 2 | 4 | 1 | 0 | 2 | 1 | 1 | NA | NA | NA | 1 | NA | 2 | NA | NA | NA | NA | NA | 17 | 1 | 1 | 171018 | 1 | 0 | 0 | 0 | 0 | 2 | 2 | 4 | 2 | 2 | NA | NA | 5 | 4 | DFE | NA | NA | NA | 2 | 2 | 1 | 2 | NA | NA | 4 | 1 | NA | 3 | NA | NA | 2 | 2 | ACBD | 3 | 3 | 1 | 3 | 3 | 1 | EBADGFHC | 2 | 2 | 4 | 4 | 2 | 2 | 3 | 3 | 2 | 3 | 1 | 2 | ECDFGAB | 1 | 3 | 1 | 3 | 3 | 1 | 1 | 1 | 1 | 2 | 1 |
To start, we subsetted the election data, selecting the columns relevant to the 2016 election: the state column and the vote totals for Republicans, Democrats, and Independents by county. Then, we created a total column that sums the total votes per county. Next, we used the group_by() and summarise_all() functions to group and sum the votes by state, leaving a data frame with one row of voter information per state. Finally, we added a percentage column giving the percent of voters in each state who voted for Donald Trump, by dividing the number of Trump votes by the total number of votes cast in that state.
# Keep state plus the 2016 county-level vote totals (Trump, Clinton, other)
Trump_sub <- election_data[, c(2, 5, 6, 7)]
# Total votes cast per county
Trump_sub$total16 <- Trump_sub$trump16 + Trump_sub$clinton16 + Trump_sub$otherpres16
# Roll the county totals up to the state level
Trump_map <- Trump_sub %>% group_by(state) %>% summarise_all(sum)
# Trump's share of the statewide vote
Trump_map$trump_per <- Trump_map$trump16/Trump_map$total16
For our next analyses and plots, we had to load in a few mapping packages, as well as the tidyverse package for more data wrangling.
require(sf)
require(maps)
require(RColorBrewer)
require(tools)
In order to create our first map, we used the maps package to load the geometry data needed to draw a map of the United States. We loaded the proper geometries to draw the state boundaries and projected the map onto the correct coordinate plane. Then, we subsetted our election data a bit more to remove the vote totals for Alaska and Hawaii, since very few respondents from these states contributed to the American Values Survey – we felt that any analyses and comparisons would not be representative of those states.
# Load state boundary geometries as an sf object
states <- st_as_sf(map("state", plot = FALSE, fill = TRUE))
states$ID <- toTitleCase(states$ID)
# Drop Alaska and Hawaii, then attach each state's Trump vote share
# (this assumes both tables list the states in the same alphabetical order)
USA_votes <- Trump_map[which(Trump_map$state != "Alaska" & Trump_map$state != "Hawaii"),]
states$trump_per <- USA_votes$trump_per
# Bin each state's Trump vote share into one of four categories for mapping
for(i in 1:length(states$trump_per)){
  if(states$trump_per[i] - 0.5 > 0.05){
    states$color[i] <- "1 - Greater than 55% voted for Trump"
  } else if(states$trump_per[i] - 0.5 > 0){
    states$color[i] <- "2 - Between 50% and 55% voted for Trump"
  } else if(states$trump_per[i] - 0.5 > -0.05){
    states$color[i] <- "3 - Between 45 and 50% voted for Trump"
  } else {
    states$color[i] <- "4 - Less than 45% voted for Trump"
  }
}
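As an aside, the same binning can be written without a loop using cut() – an equivalent sketch rather than the code we actually ran:
# Equivalent binning with cut(); breaks chosen to match the four categories above
states$color <- cut(states$trump_per,
                    breaks = c(-Inf, 0.45, 0.50, 0.55, Inf),
                    labels = c("4 - Less than 45% voted for Trump",
                               "3 - Between 45 and 50% voted for Trump",
                               "2 - Between 50% and 55% voted for Trump",
                               "1 - Greater than 55% voted for Trump"))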
After getting the data into a state where we could start plotting the percentages, we used ggplot to create a map showing the percentage of voters who voted for Donald Trump in the 2016 election. Since we were focusing on Trump votes, our initial pass showed how many percentage points above or below 50% Donald Trump received in 2016 in each state. At this point we saw that the differences in many states were fairly small, so the contrast in our image was not ideal. To improve this, we created a second plot that binned the percent of Trump votes into four groups – ‘more than 55% for Trump’, ‘between 50% and 55% for Trump’, ‘between 45% and 50% for Trump’ and ‘less than 45% for Trump’. This map provided a much better picture of the broader state-level trends. Click between the two tabs below to see both options.
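The ggplot code for these maps runs in the tabbed output rather than being echoed here; a minimal sketch of the binned version, assuming the states object built above, might look like this:
# A sketch of the binned choropleth; fill colors and theme are illustrative choices
ggplot(states) +
  geom_sf(aes(fill = color)) +
  scale_fill_brewer(palette = "RdBu", name = "2016 vote share for Trump") +
  labs(title = "Percent of 2016 votes cast for Donald Trump, by state") +
  theme_minimal()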
From these maps, we determined that the percentage of votes per state varied quite drastically across different states and regions of the United States. In the next phase, we thought it would be interesting to see if certain values can be attributed to these differences in voting percentages for Donald Trump.
Next, we needed to do some data wrangling to isolate the responses we would use later for analysis and mapping. We wrote a for() loop to create multiple data frames, each containing the respondent’s state and their numerical answer. Additionally, we subsetted these datasets to include only definitive answers – in other words, we removed responses of “Don’t know” or “Refuse to answer”.
# Survey items from question 17 and the columns they occupy in the avs data
question_list <- c("q17a", "q17b", "q17c", "q17d", "q17e", "q17f", "q17g", "q17h")
df_list <- c("Temp_df_a", "Temp_df_b", "Temp_df_c", "Temp_df_d", "Temp_df_e", "Temp_df_f", "Temp_df_g", "Temp_df_h")
q_num <- c(90, 92, 91, 93, 94, 95, 96, 97)
# Keep only definitive answers (coded below 5) along with the state column (column 4)
for(i in 1:length(q_num)){
  assign(df_list[i], avs[which(avs[[question_list[i]]] < 5), c(4, q_num[i])])
}
After the data had been tidied into temporary dataframes and cleaned up, we used some of the tidyverse functions to group and calculate average response values for each question by state. We did this using the lapply(), group_by(), and summarise_all() functions.
df_state_list <- c("state_q17a", "state_q17b", "state_q17c", "state_q17d", "state_q17e", "state_q17f", "state_q17g", "state_q17h")
dfs <- list(Temp_df_a, Temp_df_b, Temp_df_c, Temp_df_d, Temp_df_e, Temp_df_f, Temp_df_g, Temp_df_h)
grouped_data <- lapply(dfs, function(x){ x %>% group_by(state) %>% summarise_all(mean)})
for(i in 1:length(grouped_data)){
assign(df_state_list[i], as.data.frame(grouped_data[[i]]))
}
Next, we merged all of the answers to question #17 into one dataframe in order to set up our file for mapping. By then joining these answers to the coordinate and geometric information, we were able to create some maps of answer responses to these questions broken out by U.S. state.
library(openintro)
q17_df <- Reduce(function(x, y) merge(x, y, all=TRUE), list(state_q17a, state_q17b, state_q17c, state_q17d, state_q17e, state_q17f, state_q17g, state_q17h))
q17_df <- q17_df[which(q17_df$state != "HI" & q17_df$state != "AK"),]
q17_df$state <- abbr2state(q17_df$state)
names(q17_df) <- c("ID", "q17a", "q17b", "q17c", "q17d", "q17e", "q17f", "q17g", "q17h")
states <- merge(states,q17_df,by="ID")
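The mapping chunks themselves aren’t echoed here; a sketch of how one of these maps could be drawn from the merged states object (question 17a chosen as an example):
# A sketch of one of the question-17 maps (q17a shown); styling is illustrative
ggplot(states) +
  geom_sf(aes(fill = q17a)) +
  scale_fill_distiller(palette = "RdBu", name = "Mean response to q17a") +
  labs(title = "Average response to question 17a, by state") +
  theme_minimal()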
So, after all of this data wrangling, how can we interpret it? We decided to look at a few questions on pertinent, relevant topics to see if responses also varied across states. The first question we looked at was Question 17.
After creating maps for each component of question #17, we could see that some items drew starker contrasts across states than others. For instance, there was generally broad support for allowing DACA recipients brought to the U.S. as children to gain legal resident status if they attend college or join the military. However, there was more disagreement across states over withdrawing from the nuclear agreement with Iran or temporarily preventing people from some majority-Muslim countries from entering the United States.
At this point we decided to build our first regression model using the parts of question 17 as the variables and the percentage of votes won by Trump as the response.
elec_regr1 <- lm(trump_per ~ q17a + q17b + q17c + q17d + q17e + q17f + q17g + q17h , data = states)
summary(elec_regr1)
##
## Call:
## lm(formula = trump_per ~ q17a + q17b + q17c + q17d + q17e + q17f +
## q17g + q17h, data = states)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.192144 -0.057717 -0.006362 0.040472 0.185307
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.10387 0.35658 0.291 0.7723
## q17a 0.02848 0.07726 0.369 0.7144
## q17b 0.05903 0.07537 0.783 0.4381
## q17c 0.17792 0.06744 2.638 0.0118 *
## q17d -0.15185 0.07908 -1.920 0.0620 .
## q17e -0.10146 0.06234 -1.628 0.1115
## q17f 0.07681 0.05615 1.368 0.1790
## q17g 0.04865 0.07621 0.638 0.5269
## q17h 0.07845 0.05163 1.519 0.1366
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08921 on 40 degrees of freedom
## Multiple R-squared: 0.529, Adjusted R-squared: 0.4348
## F-statistic: 5.616 on 8 and 40 DF, p-value: 8.602e-05
The first model resulted in an \(R^{2}\) value of 0.529. Not too shabby for a first pass, but something that could hopefully be improved on. Before moving on to additional variables, we decided to apply some backward elimination to reduce our focus. This resulted in the following model with a slightly lower \(R^{2}\) of 0.5173, but fewer variables and a higher adjusted \(R^{2}\).
elec_regr2 <- lm(trump_per ~ q17c + q17d + q17e + q17f + q17h , data = states)
summary(elec_regr2)
##
## Call:
## lm(formula = trump_per ~ q17c + q17d + q17e + q17f + q17h, data = states)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.203295 -0.045223 -0.009582 0.043034 0.205947
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.33976 0.20697 1.642 0.1080
## q17c 0.17803 0.05465 3.258 0.0022 **
## q17d -0.11308 0.05841 -1.936 0.0595 .
## q17e -0.09842 0.05951 -1.654 0.1055
## q17f 0.08291 0.05149 1.610 0.1147
## q17h 0.07806 0.04322 1.806 0.0779 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0871 on 43 degrees of freedom
## Multiple R-squared: 0.5173, Adjusted R-squared: 0.4612
## F-statistic: 9.217 on 5 and 43 DF, p-value: 5.102e-06
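As a side note, the elimination we did by hand could also be automated with step(), which drops terms by AIC – a sketch on the full question-17 model (not guaranteed to keep exactly the same terms we did):
# Sketch: automated backward elimination on the full question-17 model
elec_regr_step <- step(elec_regr1, direction = "backward", trace = FALSE)
summary(elec_regr_step)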
At this point we decided to expand our field of variables and look at more questions. The next step was scrubbing the data from questions 22, 24 and 25 and incorporating them into our regression model.
As you can see from above, we created maps for each component of question #22 as well. Again, there were noticeable contrasts in responses to these value-based questions across states. For instance, there was fairly broad agreement that people were not bothered when coming into contact with immigrants who speak little or no English. However, there was more disagreement across states over whether the severity of recent natural disasters can be attributed to global climate change, or whether professional athletes should be required to stand during the national anthem at sporting events.
These responses were also helpful when building out our regression models later in our investigation. We decided to map out one more question from the survey (Q24) and add the data for another (Q25) before creating another model.
Question 25:
“When you think about what it means to be American, which of the following comes closer to your own view?” (we ignored responses of “neither”, “both” and “Don’t Know/Refused”)
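The aggregation chunk for Q24 and Q25 isn’t shown here; a sketch of the Q25 step, assuming the two definitive answers are coded 1 and 2 (the codes for “neither”, “both”, and “Don’t Know/Refused” are assumptions), mirroring the question-17 workflow:
# Sketch: keep only the two definitive answer codes (assumed to be 1 and 2),
# then average the responses by state
state_q25 <- avs %>%
  dplyr::select(state, q25) %>%
  filter(q25 %in% c(1, 2)) %>%
  group_by(state) %>%
  summarise_all(mean)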
Once we’d aggregated the data from these questions, we decided to take a look at an expanded regression model incorporating our new variables.
elec_regr3 <- lm(trump_per ~ q17c + q17d + q17e + q17f + q17h + q22a + q22b + q22c + q22d + q22e + q22f + q24 + q25, data = states)
summary(elec_regr3)
##
## Call:
## lm(formula = trump_per ~ q17c + q17d + q17e + q17f + q17h + q22a +
## q22b + q22c + q22d + q22e + q22f + q24 + q25, data = states)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.22147 -0.04476 0.01023 0.04069 0.18626
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.60758 0.44145 1.376 0.1775
## q17c 0.17187 0.06492 2.647 0.0121 *
## q17d -0.18641 0.09576 -1.947 0.0596 .
## q17e -0.08579 0.07242 -1.185 0.2441
## q17f 0.03859 0.06500 0.594 0.5566
## q17h 0.05032 0.05260 0.957 0.3453
## q22a -0.01970 0.04869 -0.404 0.6883
## q22b 0.02579 0.04730 0.545 0.5890
## q22c 0.01259 0.05970 0.211 0.8342
## q22d -0.03885 0.04599 -0.845 0.4039
## q22e 0.03615 0.05672 0.637 0.5280
## q22f 0.03407 0.06094 0.559 0.5797
## q24 -0.17640 0.17272 -1.021 0.3141
## q25 0.14444 0.15583 0.927 0.3603
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0888 on 35 degrees of freedom
## Multiple R-squared: 0.5917, Adjusted R-squared: 0.44
## F-statistic: 3.901 on 13 and 35 DF, p-value: 0.0006468
Adding our 8 new variables yielded a slight increase in the \(R^{2}\) value (about 0.07), but many of the variables showed high p-values and could therefore be eliminated from the model without a significant loss in \(R^{2}\). Our backward elimination process here led us to remove all of the responses to the parts of question 22. It appears that the responses to that question were not strong predictors of a vote for Trump in 2016, though question 24 did turn out to be a significant predictor in the reduced model.
elec_regr4 <- lm(trump_per ~ q17c + q17d + q17e + q24 + q25, data = states)
summary(elec_regr4)
##
## Call:
## lm(formula = trump_per ~ q17c + q17d + q17e + q24 + q25, data = states)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.249979 -0.050020 0.007008 0.050064 0.180692
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.80889 0.26212 3.086 0.00354 **
## q17c 0.18846 0.05385 3.500 0.00110 **
## q17d -0.08684 0.05515 -1.575 0.12267
## q17e -0.11323 0.05644 -2.006 0.05113 .
## q24 -0.33525 0.11133 -3.011 0.00434 **
## q25 0.21083 0.12797 1.647 0.10675
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08329 on 43 degrees of freedom
## Multiple R-squared: 0.5586, Adjusted R-squared: 0.5072
## F-statistic: 10.88 on 5 and 43 DF, p-value: 8.307e-07
At this point we took another look at the questions from the American Values Survey. We decided to avoid questions directly related to Donald Trump and party affiliation due to the potential bias they may have introduced. From what remained we decided to incorporate the responses to question 6 into our model as well. Below are plots of the responses organized by state.
This next regression model incorporated all of the responses to questions 6a-f.
elec_regr5 <- lm(trump_per ~ q17c + q17d + q17e + q24 + q25 + q6a + q6b + q6c + q6d + q6e + q6f, data = states)
summary(elec_regr5)
##
## Call:
## lm(formula = trump_per ~ q17c + q17d + q17e + q24 + q25 + q6a +
## q6b + q6c + q6d + q6e + q6f, data = states)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.165018 -0.042201 0.000819 0.034778 0.206267
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.681799 0.448701 1.519 0.137621
## q17c 0.197609 0.054766 3.608 0.000953 ***
## q17d -0.077017 0.064896 -1.187 0.243304
## q17e -0.046973 0.066633 -0.705 0.485498
## q24 -0.158915 0.127393 -1.247 0.220519
## q25 0.034444 0.138029 0.250 0.804401
## q6a 0.033999 0.049307 0.690 0.495024
## q6b 0.149202 0.050256 2.969 0.005366 **
## q6c -0.123089 0.055270 -2.227 0.032474 *
## q6d -0.076825 0.061716 -1.245 0.221472
## q6e -0.077962 0.045761 -1.704 0.097304 .
## q6f -0.004134 0.044065 -0.094 0.925787
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0752 on 35 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.6951, Adjusted R-squared: 0.5992
## F-statistic: 7.253 on 11 and 35 DF, p-value: 2.925e-06
Finally, a respectable \(R^{2}\) value! We cleaned up the model a little by removing some of the less impactful variables and generated some new models based on the results of our regression analyses. At this point we also added the responses to two additional questions (Q20 and Q21):
Additionally, based on our highest performing model, we calculated predictive values of the percent that voted for Trump in 2016 by state, and took a look at the residuals.
require(ggiraph)
require(ggiraphExtra)
states <- states %>% dplyr::select(ID, trump_per, q17a, q17b, q17d, q17f, q22e, q22f, q6b, q6c, q6e, q6f, q20, q21)
# calculated predicted values by state using our highest-performing lm
for(i in 1:length(states$ID)){
states$model[i] <- 0.83630 +
(0.07863 * states$q17a[i]) +
(-0.08399 * states$q17b[i]) +
(-0.14377 * states$q17d[i]) +
(0.09759 * states$q17f[i]) +
(-0.06089 * states$q22e[i]) +
(0.07426 * states$q22f[i]) +
(0.14610 * states$q6b[i]) +
(-0.12655 * states$q6c[i]) +
(-0.08417 * states$q6e[i]) +
(-0.04930 * states$q6f[i]) +
(-0.41984 * states$q20[i]) +
(0.40825 * states$q21[i])
states$diff[i] <- states$model[i] - states$trump_per[i]
}
model <- lm(formula = trump_per ~ q17a + q17b + q17d + q17f+ q22e+ q22f+ q6b+ q6c+ q6e+ q6f + q20+ q21, data = states)
summary(model)
##
## Call:
## lm(formula = trump_per ~ q17a + q17b + q17d + q17f + q22e + q22f +
## q6b + q6c + q6e + q6f + q20 + q21, data = states)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.125656 -0.033505 -0.005109 0.027585 0.107003
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.50031 0.29191 1.714 0.095643 .
## q17a 0.09861 0.04776 2.065 0.046642 *
## q17b -0.08374 0.05652 -1.482 0.147676
## q17d -0.11875 0.05411 -2.194 0.035132 *
## q17f 0.11878 0.04961 2.394 0.022312 *
## q22e -0.06999 0.03854 -1.816 0.078203 .
## q22f 0.08089 0.04001 2.022 0.051139 .
## q6b 0.16623 0.03726 4.462 8.47e-05 ***
## q6c -0.11416 0.04553 -2.508 0.017096 *
## q6e -0.08910 0.03589 -2.482 0.018146 *
## q6f -0.01926 0.03247 -0.593 0.556898
## q20 -0.30816 0.07645 -4.031 0.000296 ***
## q21 0.36764 0.10182 3.611 0.000973 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05765 on 34 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.8259, Adjusted R-squared: 0.7645
## F-statistic: 13.44 on 12 and 34 DF, p-value: 1.33e-09
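As an alternative to hard-coding the coefficient estimates, the per-state fitted values and residuals could also be pulled straight from the fitted lm object – a sketch:
# Sketch: equivalent predictions and residuals from the fitted model object
states$model_fit <- predict(model, newdata = states)
states$model_diff <- states$model_fit - states$trump_per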
step(model, direction = "both", trace=FALSE )
##
## Call:
## lm(formula = trump_per ~ q17a + q17b + q17d + q17f + q22e + q22f +
## q6b + q6c + q6e + q20 + q21, data = states)
##
## Coefficients:
## (Intercept) q17a q17b q17d q17f q22e q22f q6b q6c q6e q20 q21
## 0.41340 0.10186 -0.08226 -0.10941 0.11533 -0.06833 0.07729 0.16358 -0.11869 -0.08723 -0.28826 0.36884
Using backward elimination, we decided to remove one more variable (q6f) from the model, which nudged the adjusted \(R^{2}\) up a little more.
# tested the model again
model <- lm(formula = trump_per ~ q17a + q17b + q17d + q17f+ q22e+ q22f+ q6b+ q6c+ q6e + q20+ q21, data = states)
summary(model)
##
## Call:
## lm(formula = trump_per ~ q17a + q17b + q17d + q17f + q22e + q22f +
## q6b + q6c + q6e + q20 + q21, data = states)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.12519 -0.02898 -0.00432 0.02730 0.10780
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.41340 0.25014 1.653 0.107344
## q17a 0.10186 0.04700 2.167 0.037129 *
## q17b -0.08226 0.05594 -1.470 0.150364
## q17d -0.10941 0.05129 -2.133 0.040005 *
## q17f 0.11533 0.04881 2.363 0.023825 *
## q22e -0.06833 0.03808 -1.794 0.081443 .
## q22f 0.07729 0.03918 1.973 0.056482 .
## q6b 0.16358 0.03665 4.464 8.01e-05 ***
## q6c -0.11869 0.04446 -2.669 0.011440 *
## q6e -0.08723 0.03542 -2.463 0.018857 *
## q20 -0.28826 0.06806 -4.235 0.000157 ***
## q21 0.36884 0.10085 3.657 0.000831 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05711 on 35 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.8241, Adjusted R-squared: 0.7688
## F-statistic: 14.91 on 11 and 35 DF, p-value: 3.914e-10
for(i in 1:length(states$ID)){
states$model_two[i] <- 0.41340 +
(0.10186 * states$q17a[i]) +
(-0.08226 * states$q17b[i]) +
(-0.10941 * states$q17d[i]) +
(0.11533 * states$q17f[i]) +
(-0.06833 * states$q22e[i]) +
(0.07729 * states$q22f[i]) +
(0.16358 * states$q6b[i]) +
(-0.11869 * states$q6c[i]) +
(-0.08723 * states$q6e[i]) +
(-0.28826 * states$q20[i]) +
(0.36884 * states$q21[i])
states$diff_two[i] <- states$model_two[i] - states$trump_per[i]
}
After calculating the predicted values using the updated model, we noticed that the predictions for a few states were still quite far from the actual values. To adjust for this, we removed those states and re-ran the model one final time.
## the residuals for these states were very far from 0, so we removed them to see if the model would improve
states_out <- states %>%
filter(ID != "California" & ID != "Maryland" & ID != "Oklahoma" & ID != "Montana" & ID != "North Dakota")
## last model
model_three <- lm(formula = trump_per ~ q17a + q17b + q17d + q17f+ q22e+ q22f+ q6b+ q6c+ q6e + q20+ q21, data = states_out)
summary(model_three)
##
## Call:
## lm(formula = trump_per ~ q17a + q17b + q17d + q17f + q22e + q22f +
## q6b + q6c + q6e + q20 + q21, data = states_out)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.06845 -0.02835 -0.01096 0.03545 0.08189
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.35615 0.20578 1.731 0.09313 .
## q17a 0.08077 0.03892 2.075 0.04608 *
## q17b -0.07477 0.04593 -1.628 0.11336
## q17d -0.12467 0.04261 -2.926 0.00627 **
## q17f 0.12577 0.04121 3.052 0.00455 **
## q22e -0.08823 0.03179 -2.776 0.00913 **
## q22f 0.06634 0.03237 2.049 0.04869 *
## q6b 0.15742 0.03008 5.233 1.01e-05 ***
## q6c -0.09137 0.03711 -2.462 0.01938 *
## q6e -0.06588 0.02989 -2.204 0.03484 *
## q20 -0.26037 0.05611 -4.640 5.64e-05 ***
## q21 0.42384 0.08423 5.032 1.81e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04671 on 32 degrees of freedom
## Multiple R-squared: 0.8778, Adjusted R-squared: 0.8357
## F-statistic: 20.89 on 11 and 32 DF, p-value: 1.452e-11
for(i in 1:length(states_out$ID)){
states_out$model_three[i] <- 0.35615 +
(0.08077 * states_out$q17a[i]) +
(-0.07477 * states_out$q17b[i]) +
(-0.12467 * states_out$q17d[i]) +
(0.12577 * states_out$q17f[i]) +
(-0.08823 * states_out$q22e[i]) +
(0.06634 * states_out$q22f[i]) +
(0.15742 * states_out$q6b[i]) +
(-0.09137 * states_out$q6c[i]) +
(-0.06588 * states_out$q6e[i]) +
(-0.26037 * states_out$q20[i]) +
(0.42384 * states_out$q21[i])
states_out$diff_three[i] <- states_out$model_three[i] - states_out$trump_per[i]
}
model_data <- as.data.frame(states_out) %>% dplyr::select(ID, trump_per, model_three, diff_three)
With our final model yielding a relatively strong \(R^{2}\), we mapped out the residuals between the actual and predicted values for states with enough data:
Finally, we were able to run our model against the data as a whole and come up with an estimate of Donald Trump’s share of the popular vote in the 2016 election:
## [1] 0.4895869
48.95% is not far off from the 46.1% of the popular vote that Trump actually received in the 2016 presidential election (https://en.wikipedia.org/wiki/2016_United_States_presidential_election).
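The chunk that produced this figure isn’t echoed above; one plausible way to compute such an estimate is to weight each state’s predicted share by its total 2016 vote count – a sketch under that assumption, which may not reproduce the figure exactly:
# Sketch: weight each state's predicted Trump share by its total 2016 votes
national_estimate <- USA_votes %>%
  inner_join(model_data, by = c("state" = "ID")) %>%
  dplyr::summarise(popular_vote = sum(model_three * total16) / sum(total16))
national_estimate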
When we started building our regression model, it was interesting to see that certain values and questions from the survey were more predictive of the share of votes for Trump than others. Here are the questions and their corresponding values that made it into our final model:
Do you strongly favor, favor, oppose, or strongly oppose…
Do you completely agree, mostly agree, mostly disagree, or completely disagree…
Do you think that the following should be the highest priority, high but not highest priority, or lower priority?…
Do you see monuments to Confederate soldiers more as a symbol of Southern pride or more as symbols of racism?
Do you think most reporters have a personal or political agenda, OR are most reporters trying to report the news fairly and accurately?
Interestingly, these value-based questions span a wide range of topics rather than focusing on a single polarizing issue. This suggests that, at the time of the 2016 election, there truly was strong polarization across many different types of values across the country.
Elections often come down to the wire and are decided by a relatively small number of votes. Candidates spend millions of dollars on polling to get an idea of just how tight these margins are. We aimed to create a model that could predict polling preferences based on a select number of questions pertaining to major policy points and diversity. From a statistical point of view, the final result was a fairly strong model, with an \(R^{2}\) above 0.85. This is a prime example, however, of a case where the margins in the real world are much finer than in the mathematical world. The residuals of our model at times approached 0.1, which would mean missing Trump’s share of the vote in a state by 10 percentage points – the difference between a close race and a landslide.
While working through this process, we were able to examine values that seemed to be quite predictive of favoring votes for a particular candidate. For instance, it is interesting that responses to questions related to gun control, health care, our tax system, immigration, and building a wall along the southern border with Mexico all made it into our model. These topics have been at the forefront of the Trump campaign and his time in office.
When thinking about building out a similar model to predict the votes garnered by Hillary Clinton, we could re-examine these values that showed to be quite predictive of votes for Trump, and see if values outside the scope of this project could be added in later on that tend to hold more weight among left-leaning voters than right-leaning voters.
As mentioned at the outset, this election proved to be a very polarizing one, with people at times claiming that they would vote against a particular candidate rather than for one. The long list of values that make up our final model supports this, given that there were ample opportunities for voters to express why they felt their candidate was the right choice to lead the United States.
To utilize more types of data for our project, we also decided to make a webpage that shows a representation of the performance of our model relative to the actual percentage of votes for Trump by state in the 2016 election.
This webpage was made using a Javascript framework (Angular), and the visualization was created using d3.js – a Javascript charting library.
We converted the data to JSON format utilizing the rjson and jsonlite libraries:
library(rjson)
library(jsonlite)
webdata <- toJSON(unname(split(model_data, 1:nrow(model_data))))
# # to view the json data uncomment below
# prettify(webdata)
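To hand this off to the web page, the JSON string just needs to be written to a file (the file name below is an assumption):
# write the JSON out for the Angular/d3.js visualization; file name is illustrative
writeLines(webdata, "model_data.json")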
Then, we were able to utilize this dataset to create a d3.js graphic and webpage. Here is the final product.