Tidyverse: https://www.tidyverse.org/
The following dataset was pulled from data.fivethirtyeight.com as a series of polls of presidential primary candidates, broken out by state where a poll is state-level. I’d like to filter out some unnecessary columns and focus primarily on State, Candidate, Dates, Sample Size, Party, and Percent to get this dataset ready for some deeper analysis. https://fivethirtyeight.com/features/desantis-is-polling-well-against-trump-as-long-as-no-one-else-runs/
This is more of an observational look at the data than a study of any predictor/response relationship. You could probably use the state as an indicator, since the data contains both REP and DEM candidates. My article is specifically about the nature of REP polling: some of these polls frame the Republican primary as a race among all candidates, and others frame it as a race between just Trump and DeSantis. I wanted to look at all the candidates across parties and polls to give the “big” picture to those who have not been keeping up with the primaries and support for each candidate.
library(tidyverse)
pollsOG <- read_csv("https://github.com/d-ev-craig/DATA607/raw/main/Week%201%20-%20Data%20Structures%20%26%20Basic%20Ops/president_primary_polls.csv")
## Rows: 6146 Columns: 43
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (23): pollster, sponsors, display_name, pollster_rating_name, fte_grade,...
## dbl (12): poll_id, pollster_id, pollster_rating_id, sponsor_candidate_id, qu...
## lgl (7): tracking, internal, seat_name, election_date, nationwide_batch, ra...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
names(pollsOG)
## [1] "poll_id" "pollster_id"
## [3] "pollster" "sponsor_ids"
## [5] "sponsors" "display_name"
## [7] "pollster_rating_id" "pollster_rating_name"
## [9] "fte_grade" "methodology"
## [11] "state" "start_date"
## [13] "end_date" "sponsor_candidate_id"
## [15] "sponsor_candidate" "sponsor_candidate_party"
## [17] "question_id" "sample_size"
## [19] "population" "subpopulation"
## [21] "population_full" "tracking"
## [23] "created_at" "notes"
## [25] "url" "source"
## [27] "internal" "partisan"
## [29] "race_id" "cycle"
## [31] "office_type" "seat_number"
## [33] "seat_name" "election_date"
## [35] "stage" "nationwide_batch"
## [37] "ranked_choice_reallocated" "ranked_choice_round"
## [39] "party" "answer"
## [41] "candidate_id" "candidate_name"
## [43] "pct"
pollsOG
## # A tibble: 6,146 × 43
## poll_id pollster_id pollster spons…¹ spons…² displ…³ polls…⁴ polls…⁵ fte_g…⁶
## <dbl> <dbl> <chr> <dbl> <chr> <chr> <dbl> <chr> <chr>
## 1 82068 1250 Trafalga… NA <NA> Trafal… 338 Trafal… A-
## 2 82068 1250 Trafalga… NA <NA> Trafal… 338 Trafal… A-
## 3 82068 1250 Trafalga… NA <NA> Trafal… 338 Trafal… A-
## 4 82068 1250 Trafalga… NA <NA> Trafal… 338 Trafal… A-
## 5 82068 1250 Trafalga… NA <NA> Trafal… 338 Trafal… A-
## 6 82068 1250 Trafalga… NA <NA> Trafal… 338 Trafal… A-
## 7 82068 1250 Trafalga… NA <NA> Trafal… 338 Trafal… A-
## 8 82068 1250 Trafalga… NA <NA> Trafal… 338 Trafal… A-
## 9 82068 1250 Trafalga… NA <NA> Trafal… 338 Trafal… A-
## 10 82068 1250 Trafalga… NA <NA> Trafal… 338 Trafal… A-
## # … with 6,136 more rows, 34 more variables: methodology <chr>, state <chr>,
## # start_date <chr>, end_date <chr>, sponsor_candidate_id <dbl>,
## # sponsor_candidate <chr>, sponsor_candidate_party <chr>, question_id <dbl>,
## # sample_size <dbl>, population <chr>, subpopulation <chr>,
## # population_full <chr>, tracking <lgl>, created_at <chr>, notes <chr>,
## # url <chr>, source <dbl>, internal <lgl>, partisan <chr>, race_id <dbl>,
## # cycle <dbl>, office_type <chr>, seat_number <dbl>, seat_name <lgl>, …
# Keep only the columns needed for the analysis
polls <- pollsOG %>% select(state, party, candidate_name, pct, start_date, end_date, pollster, methodology, office_type, stage)
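# arrange() ignores groups unless .by_group = TRUE (with the leading dot); by_group = TRUE had no effect, so this sorts by pct across all rows
polls %>% group_by(candidate_name) %>% arrange(desc(pct))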
## # A tibble: 6,146 × 10
## # Groups: candidate_name [101]
## state party candi…¹ pct start…² end_d…³ polls…⁴ metho…⁵ offic…⁶ stage
## <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 <NA> REP Donald… 79 7/15/2… 7/16/2… Premise Online U.S. P… prim…
## 2 Florida REP Donald… 76 10/31/… 11/2/2… Data f… Online U.S. P… prim…
## 3 North Caro… REP Donald… 75.6 11/30/… 12/2/2… Univer… Online U.S. P… prim…
## 4 <NA> REP Donald… 75 1/18/2… 1/19/2… Harris… Online U.S. P… prim…
## 5 Florida REP Donald… 75 10/31/… 11/2/2… Data f… Online U.S. P… prim…
## 6 Georgia REP Donald… 73 12/30/… 1/3/20… Univer… Online U.S. P… prim…
## 7 <NA> REP Donald… 71 12/14/… 12/15/… Harris… Online U.S. P… prim…
## 8 <NA> REP Donald… 71 9/2/20… 9/5/20… Premise Online U.S. P… prim…
## 9 <NA> REP Donald… 68 4/18/2… 4/20/2… Echelo… Online U.S. P… prim…
## 10 <NA> REP Donald… 67.7 8/10/2… 8/12/2… Big Vi… Online U.S. P… prim…
## # … with 6,136 more rows, and abbreviated variable names ¹candidate_name,
## # ²start_date, ³end_date, ⁴pollster, ⁵methodology, ⁶office_type
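As a side note on the sort above: in dplyr, `arrange()` ignores grouping unless `.by_group = TRUE` (with the leading dot) is passed, which is why the rows come out ordered by pct across all candidates. A minimal sketch on a made-up three-row tibble shows the difference. The `toy` data and its values are hypothetical and not taken from the polls file.

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- tibble(
  candidate_name = c("B", "A", "A"),
  pct = c(50, 10, 30)
)

# Sorts by pct across all rows; the grouping is ignored
overall <- toy %>% group_by(candidate_name) %>% arrange(desc(pct))

# Sorts by candidate_name first, then by pct within each candidate
by_candidate <- toy %>%
  group_by(candidate_name) %>%
  arrange(desc(pct), .by_group = TRUE)

overall$pct       # 50 30 10
by_candidate$pct  # 30 10 50
```

Either ordering may be what you want; the point is only that the two calls are not equivalent.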
avgPolls <- polls %>% group_by(candidate_name) %>% summarise(mean(pct))
# Sort on the summary column by name rather than passing a one-column tibble to desc()
avgPolls <- avgPolls %>% arrange(desc(`mean(pct)`))
avgPolls
## # A tibble: 101 × 2
## candidate_name `mean(pct)`
## <chr> <dbl>
## 1 Donald Trump 48.6
## 2 Joe Biden 34.1
## 3 Ron DeSantis 30.4
## 4 Jerome Michael Segal 22
## 5 Kamala Harris 21.2
## 6 Michelle Obama 18.8
## 7 Bernard Sanders 11.7
## 8 Mike Pence 11.0
## 9 Donald Trump Jr. 11.0
## 10 Hillary Rodham Clinton 9.91
## # … with 91 more rows
names(avgPolls)[2] <- "avgPct"
# Tibble columns are already plain vectors, so no unlist() step is needed
avgPolls <- as.data.frame(avgPolls)
avgPolls <- head(avgPolls, n = 10)
avgPolls
## candidate_name avgPct
## 1 Donald Trump 48.593778
## 2 Joe Biden 34.104535
## 3 Ron DeSantis 30.403425
## 4 Jerome Michael Segal 22.000000
## 5 Kamala Harris 21.151156
## 6 Michelle Obama 18.800000
## 7 Bernard Sanders 11.663939
## 8 Mike Pence 11.038068
## 9 Donald Trump Jr. 11.024638
## 10 Hillary Rodham Clinton 9.907536
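An average alone hides how many poll rows stand behind it, so a candidate who appears in only a handful of rows can rank alongside candidates who appear in many. A minimal sketch, assuming it would be useful to carry a row count next to the mean. The `nRows` column name and the toy values are my own illustration and are not part of the dataset.

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toyPolls <- tibble(
  candidate_name = c("A", "A", "A", "B"),
  pct = c(40, 50, 60, 22)
)

toyAvg <- toyPolls %>%
  group_by(candidate_name) %>%
  # n() counts the rows that went into each candidate's average
  summarise(avgPct = mean(pct), nRows = n()) %>%
  arrange(desc(avgPct))

toyAvg
```

With the count in place, a filter such as `nRows >= 5` (the threshold is arbitrary here) could drop thinly polled candidates before plotting.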
g <- ggplot(data = avgPolls, aes(y = candidate_name, x = avgPct, fill = candidate_name))
# geom_col() is the shorthand for geom_bar(stat = "identity")
g + geom_col() + xlab("Average Poll Percentage") + ylab("Candidate") + theme(legend.position = "none")
From here I would continue along the route of the article and start learning about each poll. How did they phrase the questions? I’d like to test the idea that DeSantis’ seemingly weak polling percentage could be overcome in a head-to-head race against Trump. To do that, I would need to create another column identifying which polls were “1v1” and which were “FFA” (free-for-all), and then compare only the “1v1” polls.
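One way to start on that identifier column, sketched under the assumption that each `question_id` in the original file corresponds to one poll question and that a question listing exactly two candidates counts as head to head. The `matchup` column name and the toy rows are hypothetical, and the real rule would need checking against how each poll was actually worded. Note that `question_id` would have to be kept in the `select()` step for this to work on `polls`.

```r
library(dplyr)

# Hypothetical toy rows, for illustration only
toyQuestions <- tibble(
  question_id = c(1, 1, 2, 2, 2),
  candidate_name = c("Candidate A", "Candidate B",
                     "Candidate A", "Candidate B", "Candidate C"),
  pct = c(48, 42, 45, 30, 8)
)

labelled <- toyQuestions %>%
  group_by(question_id) %>%
  # Assumption: exactly two distinct candidates means a head-to-head question
  mutate(matchup = if_else(n_distinct(candidate_name) == 2, "1v1", "FFA")) %>%
  ungroup()

# Keep only the head-to-head rows for comparison
headToHead <- labelled %>% filter(matchup == "1v1")
headToHead
```

A count-based rule like this would mislabel a two-candidate question that is not Trump vs. DeSantis, so a further filter on the candidate names may be needed.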