First, tidyverse!
library(tidyverse)
library(knitr)
Read in the Donald Trump favorability ratings data from FiveThirtyEight
favdata <- read_csv("https://projects.fivethirtyeight.com/polls/data/favorability_polls.csv", show_col_types=FALSE)
head(favdata)
## # A tibble: 6 × 37
## poll_id pollster_id pollster spons…¹ spons…² displ…³ polls…⁴ polls…⁵ fte_g…⁶
## <dbl> <dbl> <chr> <dbl> <chr> <chr> <dbl> <chr> <chr>
## 1 81309 1302 Echelon I… NA <NA> Echelo… 407 Echelo… B/C
## 2 81309 1302 Echelon I… NA <NA> Echelo… 407 Echelo… B/C
## 3 81267 568 YouGov 352 Econom… YouGov 391 YouGov B+
## 4 81267 568 YouGov 352 Econom… YouGov 391 YouGov B+
## 5 81267 568 YouGov 352 Econom… YouGov 391 YouGov B+
## 6 81296 458 Suffolk 135 USA To… Suffol… 323 Suffol… B+
## # … with 28 more variables: methodology <chr>, state <lgl>, start_date <chr>,
## # end_date <chr>, sponsor_candidate_id <lgl>, sponsor_candidate <lgl>,
## # sponsor_candidate_party <lgl>, question_id <dbl>, sample_size <dbl>,
## # population <chr>, subpopulation <lgl>, population_full <chr>,
## # tracking <lgl>, created_at <chr>, notes <chr>, url <chr>, source <dbl>,
## # internal <lgl>, partisan <chr>, politician_id <dbl>, politician <chr>,
## # favorable <dbl>, unfavorable <dbl>, alternate_answers <dbl>, …
colnames(favdata)
## [1] "poll_id" "pollster_id"
## [3] "pollster" "sponsor_ids"
## [5] "sponsors" "display_name"
## [7] "pollster_rating_id" "pollster_rating_name"
## [9] "fte_grade" "methodology"
## [11] "state" "start_date"
## [13] "end_date" "sponsor_candidate_id"
## [15] "sponsor_candidate" "sponsor_candidate_party"
## [17] "question_id" "sample_size"
## [19] "population" "subpopulation"
## [21] "population_full" "tracking"
## [23] "created_at" "notes"
## [25] "url" "source"
## [27] "internal" "partisan"
## [29] "politician_id" "politician"
## [31] "favorable" "unfavorable"
## [33] "alternate_answers" "very_favorable"
## [35] "somewhat_favorable" "somewhat_unfavorable"
## [37] "very_unfavorable"
Some of the most useful tidyverse functions come from dplyr, which we can use to easily select, group, and summarize useful information.
fav_unfav <- favdata |> select(favorable, unfavorable)
favdata <- favdata |> mutate(responding=rowSums(fav_unfav))
kable(head(favdata[c("favorable", "unfavorable", "responding")]))
| favorable | unfavorable | responding |
|---|---|---|
| 43.0 | 54.0 | 97.0 |
| 43.0 | 54.0 | 97.0 |
| 39.0 | 54.0 | 93.0 |
| 41.0 | 56.0 | 97.0 |
| 41.0 | 57.0 | 98.0 |
| 34.8 | 57.9 | 92.7 |
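As a minimal sketch of an alternative, the same column could be computed with plain arithmetic inside mutate(), which avoids the intermediate fav_unfav object. The three rows below are copied from the table above. The name fav_demo is hypothetical and used only for illustration.

```r
library(dplyr)

# Hypothetical toy tibble; values copied from rows 1, 3 and 6 of the table above
fav_demo <- tibble(
  favorable   = c(43.0, 39.0, 34.8),
  unfavorable = c(54.0, 54.0, 57.9)
)

# Sum the two columns row by row to get the share expressing either opinion
fav_demo <- fav_demo |> mutate(responding = favorable + unfavorable)
fav_demo$responding
```

Like rowSums() with its default arguments, this returns NA for any row where either column is missing.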
Get the average sample size, and the average percentage of respondents giving either a favorable or an unfavorable answer, per pollster.
averages <- favdata |> group_by(pollster) |> summarize(Avg_Sample_Size=mean(sample_size), Avg_Responding=mean(responding))
kable(averages)
| pollster | Avg_Sample_Size | Avg_Responding |
|---|---|---|
| ALG Research/Hart Research Associates | 805.0000 | 98.00000 |
| Anzalone Liszt Grove | 1000.0000 | 98.00000 |
| AP-NORC | 1143.0000 | 96.68200 |
| AtlasIntel | 5188.0000 | 98.00000 |
| Beacon Research | 2523.0000 | 96.00000 |
| Beacon Research/Shaw & Company | 1071.7500 | 98.50000 |
| Bullfinch | 1108.0000 | 98.00000 |
| Change Research | 1804.0000 | 93.66667 |
| Civiqs | 1609.0000 | 98.00000 |
| Cygnal Political | 5688.6667 | 97.65000 |
| East Carolina University | 1105.0000 | 89.40000 |
| Echelon Insights | 1026.5294 | 96.73529 |
| Emerson College Polling Society | 1011.5000 | 97.60000 |
| Fabrizio/Impact | 1437.6667 | 96.00000 |
| Gallup | 1018.0000 | 99.00000 |
| Global Strategy Group | 900.5000 | 98.00000 |
| Global Strategy Group/GBAO/Navigator Research | 1034.2045 | 98.29545 |
| Greenberg Quinlan Rosner | 1000.0000 | 91.50000 |
| Harris Poll | 1881.0000 | 93.30000 |
| HarrisX | 2825.0000 | 100.00000 |
| Hart Research Associates | 1000.0000 | 95.00000 |
| Hart Research Associates/Public Opinion Strategies | 1000.0000 | 88.22222 |
| Hill Research Consultants | 1000.0000 | 97.00000 |
| Hofstra University | 2000.0000 | 99.19000 |
| Ipsos | 1849.2778 | 95.83333 |
| Marist | 1179.7500 | 95.25000 |
| Marquette Law School | 1074.7500 | 97.25000 |
| McLaughlin | 1000.0000 | 99.00000 |
| Monmouth U. | 789.6667 | 89.66667 |
| Morning Consult | 2143.9926 | 96.71852 |
| PEM Management Corporation | 1000.0000 | 96.16667 |
| Pew | 6174.0000 | 98.00000 |
| Public Religion Research Institute | 2866.2500 | 97.75000 |
| Quinnipiac | 1319.0000 | 92.44444 |
| Rasmussen (Pulse Opinion Research) | 1003.2000 | 97.60000 |
| RealClear Opinion Research | 1885.5000 | 97.00000 |
| RMG Research | 1200.0000 | 97.00000 |
| Schoen Cooperman | 750.0000 | 96.00000 |
| Selzer | 1012.0000 | 95.00000 |
| Siena College/NYT Upshot | 958.0000 | 95.50000 |
| SocialSphere | 6565.0000 | 96.00000 |
| SSRS | 1003.0000 | 97.00000 |
| Suffolk | 1000.0000 | 94.32500 |
| Susquehanna | 800.0000 | 92.50000 |
| U. Massachusetts - Lowell | 1000.0000 | 90.00000 |
| Winston | 1027.2727 | 95.45455 |
| YouGov | 1379.3846 | 95.38861 |
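If any poll were missing its sample size, mean() would return NA for that pollster. The sketch below shows one way to guard against that and to record how many polls each average is based on. The polls_demo tibble, its values, and the N_Polls column are hypothetical and are not taken from the FiveThirtyEight file.

```r
library(dplyr)

# Hypothetical toy data: one poll is missing its sample size
polls_demo <- tibble(
  pollster    = c("A", "A", "B", "B"),
  sample_size = c(1000, NA, 800, 1200),
  responding  = c(97, 95, 90, 94)
)

averages_demo <- polls_demo |>
  group_by(pollster) |>
  summarize(
    N_Polls         = n(),                              # rows per pollster
    Avg_Sample_Size = mean(sample_size, na.rm = TRUE),  # ignore missing sizes
    Avg_Responding  = mean(responding, na.rm = TRUE)
  )
averages_demo
```

Keeping the poll count next to the averages makes it easier to see which pollsters are represented by only a handful of polls.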
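Plot the relationship between each pollster's average sample size and the average share of respondents giving an opinion. Note that responding is the sum of the favorable and unfavorable percentages, not a survey response rate in the usual sense. The fitted line suggests that pollsters with larger samples tend to report a higher share of respondents with an opinion. Maybe this is because those polls have more resources to contact all those respondents, but the plot alone cannot confirm that.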
ggplot(data=averages, aes(x=Avg_Sample_Size, y=Avg_Responding)) +
  geom_point(size=2, color='darkblue', shape=16) +
  geom_smooth(method = 'lm', se=FALSE, color='red')
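To put a number on the trend in the plot, one option is to fit the same kind of linear model that geom_smooth(method = 'lm') draws and to look at the slope and the correlation. The sketch below uses a small hypothetical data frame, avg_demo, whose six rows are copied from the averages table above. The subset is only for illustration, so its results should not be read as the result for all pollsters.

```r
# Hypothetical subset; values copied from the averages table above
# (AP-NORC, AtlasIntel, East Carolina University, Gallup, Pew,
#  U. Massachusetts - Lowell)
avg_demo <- data.frame(
  Avg_Sample_Size = c(1143, 5188, 1105, 1018, 6174, 1000),
  Avg_Responding  = c(96.682, 98, 89.4, 99, 98, 90)
)

# Ordinary least squares fit of responding share on sample size
fit <- lm(Avg_Responding ~ Avg_Sample_Size, data = avg_demo)
coef(fit)  # intercept and slope of the fitted line

# Pearson correlation between the two averages
r <- cor(avg_demo$Avg_Sample_Size, avg_demo$Avg_Responding)
r
```

On the full data, replacing avg_demo with averages would give the slope of the red line in the plot.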
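Nesting comes from the tidyr package in the tidyverse and pairs well with purrr. Used after dplyr's group_by(), nest() collapses the rows of each group into a single row whose data column holds a tibble with that group's rows.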
favdata_nested <- favdata |> group_by(pollster) |> nest()
favdata_nested
## # A tibble: 47 × 2
## # Groups: pollster [47]
## pollster data
## <chr> <list>
## 1 Echelon Insights <tibble [34 × 37]>
## 2 YouGov <tibble [208 × 37]>
## 3 Suffolk <tibble [4 × 37]>
## 4 Morning Consult <tibble [135 × 37]>
## 5 Ipsos <tibble [18 × 37]>
## 6 Winston <tibble [11 × 37]>
## 7 Harris Poll <tibble [20 × 37]>
## 8 Siena College/NYT Upshot <tibble [4 × 37]>
## 9 Susquehanna <tibble [4 × 37]>
## 10 AP-NORC <tibble [5 × 37]>
## # … with 37 more rows
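Once the data are nested, purrr's map functions can be applied to the list-column to compute something per pollster. The following is a minimal sketch on toy data. The polls_demo tibble, its values, and the n_polls and avg_favorable columns are hypothetical.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical toy data standing in for favdata
polls_demo <- tibble(
  pollster    = c("A", "A", "B"),
  favorable   = c(40, 44, 38),
  unfavorable = c(55, 53, 57)
)

nested_demo <- polls_demo |>
  group_by(pollster) |>
  nest() |>        # one row per pollster, remaining columns stored in `data`
  ungroup() |>
  mutate(
    n_polls       = map_int(data, nrow),                    # rows in each nested tibble
    avg_favorable = map_dbl(data, \(d) mean(d$favorable))   # per-pollster mean
  )
nested_demo
```

For simple summaries like these, group_by() with summarize() is shorter. Nesting becomes more useful when each group needs its own model or plot.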