Packages

Tidyverse: https://www.tidyverse.org/

Introduction

The following dataset was pulled from data.fivethirtyeight.com and contains a series of polls for presidential primary candidates, broken out by state where applicable. I’d like to filter out unnecessary columns and focus primarily on State, Candidate, Dates, Sample Size, Party, and Percent to get this dataset ready for deeper analysis. The accompanying article is: https://fivethirtyeight.com/features/desantis-is-polling-well-against-trump-as-long-as-no-one-else-runs/

Article Specifics

This is more of an observational study than an analysis of a predictor/response relationship. State could probably be used as a predictor, since the data contains both REP and DEM candidates. My article is specifically about the nature of REP polling. Some of these polls frame the Republican primary as a race between all candidates, and others frame it as a race between just Trump and DeSantis. I wanted to look at all the candidates across parties and polls to give a “big picture” view for those who have not been keeping up with the primaries and support for each candidate.
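Since the article is about REP polling, one possible first step is to restrict the data to Republican rows before averaging. The block below is only a minimal sketch: `toy_polls` and its values are made up for illustration and are not taken from the real file, although the column names `party`, `candidate_name`, and `pct` match the dataset.

```r
library(dplyr)

# Toy rows standing in for the real poll data (values are invented for illustration)
toy_polls <- tibble(
  party          = c("REP", "REP", "DEM", "REP", "DEM"),
  candidate_name = c("Donald Trump", "Ron DeSantis", "Joe Biden",
                     "Donald Trump", "Kamala Harris"),
  pct            = c(50, 30, 40, 60, 20)
)

# Keep only Republican rows, then average each candidate's share
rep_avg <- toy_polls %>%
  filter(party == "REP") %>%
  group_by(candidate_name) %>%
  summarise(avgPct = mean(pct), .groups = "drop") %>%
  arrange(desc(avgPct))

rep_avg
```

The same pipeline should work on the real data by swapping `toy_polls` for the filtered poll table.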

Code Chunks

library(tidyverse)

# Read the raw poll data directly from GitHub
pollsOG <- read_csv("https://github.com/d-ev-craig/DATA607/raw/main/Week%201%20-%20Data%20Structures%20%26%20Basic%20Ops/president_primary_polls.csv")
## Rows: 6146 Columns: 43
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (23): pollster, sponsors, display_name, pollster_rating_name, fte_grade,...
## dbl (12): poll_id, pollster_id, pollster_rating_id, sponsor_candidate_id, qu...
## lgl  (7): tracking, internal, seat_name, election_date, nationwide_batch, ra...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
names(pollsOG)
##  [1] "poll_id"                   "pollster_id"              
##  [3] "pollster"                  "sponsor_ids"              
##  [5] "sponsors"                  "display_name"             
##  [7] "pollster_rating_id"        "pollster_rating_name"     
##  [9] "fte_grade"                 "methodology"              
## [11] "state"                     "start_date"               
## [13] "end_date"                  "sponsor_candidate_id"     
## [15] "sponsor_candidate"         "sponsor_candidate_party"  
## [17] "question_id"               "sample_size"              
## [19] "population"                "subpopulation"            
## [21] "population_full"           "tracking"                 
## [23] "created_at"                "notes"                    
## [25] "url"                       "source"                   
## [27] "internal"                  "partisan"                 
## [29] "race_id"                   "cycle"                    
## [31] "office_type"               "seat_number"              
## [33] "seat_name"                 "election_date"            
## [35] "stage"                     "nationwide_batch"         
## [37] "ranked_choice_reallocated" "ranked_choice_round"      
## [39] "party"                     "answer"                   
## [41] "candidate_id"              "candidate_name"           
## [43] "pct"
pollsOG
## # A tibble: 6,146 × 43
##    poll_id pollster_id pollster  spons…¹ spons…² displ…³ polls…⁴ polls…⁵ fte_g…⁶
##      <dbl>       <dbl> <chr>       <dbl> <chr>   <chr>     <dbl> <chr>   <chr>  
##  1   82068        1250 Trafalga…      NA <NA>    Trafal…     338 Trafal… A-     
##  2   82068        1250 Trafalga…      NA <NA>    Trafal…     338 Trafal… A-     
##  3   82068        1250 Trafalga…      NA <NA>    Trafal…     338 Trafal… A-     
##  4   82068        1250 Trafalga…      NA <NA>    Trafal…     338 Trafal… A-     
##  5   82068        1250 Trafalga…      NA <NA>    Trafal…     338 Trafal… A-     
##  6   82068        1250 Trafalga…      NA <NA>    Trafal…     338 Trafal… A-     
##  7   82068        1250 Trafalga…      NA <NA>    Trafal…     338 Trafal… A-     
##  8   82068        1250 Trafalga…      NA <NA>    Trafal…     338 Trafal… A-     
##  9   82068        1250 Trafalga…      NA <NA>    Trafal…     338 Trafal… A-     
## 10   82068        1250 Trafalga…      NA <NA>    Trafal…     338 Trafal… A-     
## # … with 6,136 more rows, 34 more variables: methodology <chr>, state <chr>,
## #   start_date <chr>, end_date <chr>, sponsor_candidate_id <dbl>,
## #   sponsor_candidate <chr>, sponsor_candidate_party <chr>, question_id <dbl>,
## #   sample_size <dbl>, population <chr>, subpopulation <chr>,
## #   population_full <chr>, tracking <lgl>, created_at <chr>, notes <chr>,
## #   url <chr>, source <dbl>, internal <lgl>, partisan <chr>, race_id <dbl>,
## #   cycle <dbl>, office_type <chr>, seat_number <dbl>, seat_name <lgl>, …
polls <- pollsOG %>% select(state,party,candidate_name,pct,start_date,end_date,pollster,methodology,office_type,stage)
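# Sort by poll percentage, highest first (the argument is `.by_group`, not `by_group`; it is omitted here so the sort runs across all candidates)
polls %>% group_by(candidate_name) %>% arrange(desc(pct))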
## # A tibble: 6,146 × 10
## # Groups:   candidate_name [101]
##    state       party candi…¹   pct start…² end_d…³ polls…⁴ metho…⁵ offic…⁶ stage
##    <chr>       <chr> <chr>   <dbl> <chr>   <chr>   <chr>   <chr>   <chr>   <chr>
##  1 <NA>        REP   Donald…  79   7/15/2… 7/16/2… Premise Online  U.S. P… prim…
##  2 Florida     REP   Donald…  76   10/31/… 11/2/2… Data f… Online  U.S. P… prim…
##  3 North Caro… REP   Donald…  75.6 11/30/… 12/2/2… Univer… Online  U.S. P… prim…
##  4 <NA>        REP   Donald…  75   1/18/2… 1/19/2… Harris… Online  U.S. P… prim…
##  5 Florida     REP   Donald…  75   10/31/… 11/2/2… Data f… Online  U.S. P… prim…
##  6 Georgia     REP   Donald…  73   12/30/… 1/3/20… Univer… Online  U.S. P… prim…
##  7 <NA>        REP   Donald…  71   12/14/… 12/15/… Harris… Online  U.S. P… prim…
##  8 <NA>        REP   Donald…  71   9/2/20… 9/5/20… Premise Online  U.S. P… prim…
##  9 <NA>        REP   Donald…  68   4/18/2… 4/20/2… Echelo… Online  U.S. P… prim…
## 10 <NA>        REP   Donald…  67.7 8/10/2… 8/12/2… Big Vi… Online  U.S. P… prim…
## # … with 6,136 more rows, and abbreviated variable names ¹​candidate_name,
## #   ²​start_date, ³​end_date, ⁴​pollster, ⁵​methodology, ⁶​office_type
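The output above shows `start_date` and `end_date` stored as character strings, which makes it awkward to sort or filter polls by time. A minimal sketch of converting them to real dates follows. It assumes the strings are in month/day/year order, and the toy values (including the two-digit years) are made up for illustration; `lubridate::mdy()` accepts both two-digit and four-digit years.

```r
library(dplyr)
library(lubridate)

# Toy date strings in month/day/year order (invented for illustration)
toy_polls <- tibble(
  start_date = c("7/15/22", "10/31/22"),
  end_date   = c("7/16/22", "11/2/22")
)

# Parse both columns to Date and count how many days each poll was in the field
toy_dated <- toy_polls %>%
  mutate(
    start_date = mdy(start_date),
    end_date   = mdy(end_date),
    field_days = as.integer(end_date - start_date) + 1L
  )

toy_dated
```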
avgPolls <- polls %>% group_by(candidate_name) %>% summarise(mean(pct))

avgPolls <- avgPolls %>% arrange(desc(`mean(pct)`))
avgPolls
## # A tibble: 101 × 2
##    candidate_name         `mean(pct)`
##    <chr>                        <dbl>
##  1 Donald Trump                 48.6 
##  2 Joe Biden                    34.1 
##  3 Ron DeSantis                 30.4 
##  4 Jerome Michael Segal         22   
##  5 Kamala Harris                21.2 
##  6 Michelle Obama               18.8 
##  7 Bernard Sanders              11.7 
##  8 Mike Pence                   11.0 
##  9 Donald Trump Jr.             11.0 
## 10 Hillary Rodham Clinton        9.91
## # … with 91 more rows
# Give the summary column a readable name
names(avgPolls)[2] <- "avgPct"

# Convert to a plain data frame and keep the ten highest averages
avgPolls <- as.data.frame(avgPolls)

avgPolls <- head(avgPolls, n = 10)
avgPolls
##            candidate_name    avgPct
## 1            Donald Trump 48.593778
## 2               Joe Biden 34.104535
## 3            Ron DeSantis 30.403425
## 4    Jerome Michael Segal 22.000000
## 5           Kamala Harris 21.151156
## 6          Michelle Obama 18.800000
## 7         Bernard Sanders 11.663939
## 8              Mike Pence 11.038068
## 9        Donald Trump Jr. 11.024638
## 10 Hillary Rodham Clinton  9.907536
g <- ggplot(data = avgPolls, aes(y = candidate_name, x = avgPct, fill = candidate_name))

g + geom_col() + xlab("Average Poll Percentage") + ylab("Candidate") + theme(legend.position = "none")

Conclusions

From here I would continue along the route of the article and start learning about each poll. How did they phrase the questions? I’d like to confirm the idea that DeSantis’ seemingly weak polling percentage could be overcome in a head-to-head race against Trump. To do that, I would need to create another column identifying which polls were “1v1” and which were “FFA”, and then compare only the “1v1” polls.
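One way the “1v1” vs “FFA” column might be built is by counting how many candidates appear under each poll question. The sketch below rests on two assumptions: that `question_id` (a column present in the raw data) identifies a single poll question, and that a question listing exactly two candidates is a head-to-head race. The toy rows and the `race_type` column name are hypothetical.

```r
library(dplyr)

# Toy rows: question 1 is a two-way race, question 2 has four candidates
# (values are invented for illustration)
toy_polls <- tibble(
  question_id    = c(1, 1, 2, 2, 2, 2),
  candidate_name = c("Donald Trump", "Ron DeSantis",
                     "Donald Trump", "Ron DeSantis",
                     "Mike Pence", "Donald Trump Jr."),
  pct            = c(45, 47, 50, 28, 8, 4)
)

# Label each question by the number of distinct candidates it lists
labelled <- toy_polls %>%
  group_by(question_id) %>%
  mutate(race_type = if_else(n_distinct(candidate_name) == 2, "1v1", "FFA")) %>%
  ungroup()

# Keep only the head-to-head questions for comparison
head_to_head <- labelled %>% filter(race_type == "1v1")

head_to_head
```

A candidate count is only a rough proxy. Confirming that a two-candidate question really was framed as Trump vs DeSantis would still mean checking the question wording for each poll.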