Using one or more TidyVerse packages, and any dataset from fivethirtyeight.com or Kaggle, create a programming sample “vignette” that demonstrates how to use one or more of the capabilities of the selected TidyVerse package with your selected dataset.

First, tidyverse!

library(tidyverse)
library(knitr)

Read in the Donald Trump favorability ratings data from FiveThirtyEight

favdata <- read_csv("https://projects.fivethirtyeight.com/polls/data/favorability_polls.csv", show_col_types=FALSE)
head(favdata)
## # A tibble: 6 × 37
##   poll_id pollster_id pollster   spons…¹ spons…² displ…³ polls…⁴ polls…⁵ fte_g…⁶
##     <dbl>       <dbl> <chr>        <dbl> <chr>   <chr>     <dbl> <chr>   <chr>  
## 1   81309        1302 Echelon I…      NA <NA>    Echelo…     407 Echelo… B/C    
## 2   81309        1302 Echelon I…      NA <NA>    Echelo…     407 Echelo… B/C    
## 3   81267         568 YouGov         352 Econom… YouGov      391 YouGov  B+     
## 4   81267         568 YouGov         352 Econom… YouGov      391 YouGov  B+     
## 5   81267         568 YouGov         352 Econom… YouGov      391 YouGov  B+     
## 6   81296         458 Suffolk        135 USA To… Suffol…     323 Suffol… B+     
## # … with 28 more variables: methodology <chr>, state <lgl>, start_date <chr>,
## #   end_date <chr>, sponsor_candidate_id <lgl>, sponsor_candidate <lgl>,
## #   sponsor_candidate_party <lgl>, question_id <dbl>, sample_size <dbl>,
## #   population <chr>, subpopulation <lgl>, population_full <chr>,
## #   tracking <lgl>, created_at <chr>, notes <chr>, url <chr>, source <dbl>,
## #   internal <lgl>, partisan <chr>, politician_id <dbl>, politician <chr>,
## #   favorable <dbl>, unfavorable <dbl>, alternate_answers <dbl>, …
colnames(favdata)
##  [1] "poll_id"                 "pollster_id"            
##  [3] "pollster"                "sponsor_ids"            
##  [5] "sponsors"                "display_name"           
##  [7] "pollster_rating_id"      "pollster_rating_name"   
##  [9] "fte_grade"               "methodology"            
## [11] "state"                   "start_date"             
## [13] "end_date"                "sponsor_candidate_id"   
## [15] "sponsor_candidate"       "sponsor_candidate_party"
## [17] "question_id"             "sample_size"            
## [19] "population"              "subpopulation"          
## [21] "population_full"         "tracking"               
## [23] "created_at"              "notes"                  
## [25] "url"                     "source"                 
## [27] "internal"                "partisan"               
## [29] "politician_id"           "politician"             
## [31] "favorable"               "unfavorable"            
## [33] "alternate_answers"       "very_favorable"         
## [35] "somewhat_favorable"      "somewhat_unfavorable"   
## [37] "very_unfavorable"

Some of the most useful tidyverse functions come from dplyr, which we can use to easily select, group, and summarize useful information.

fav_unfav <- favdata |> select(favorable, unfavorable)
favdata <- favdata |> mutate(responding=rowSums(fav_unfav))
kable(head(favdata[c("favorable", "unfavorable", "responding")]))
| favorable | unfavorable | responding |
|----------:|------------:|-----------:|
| 43.0 | 54.0 | 97.0 |
| 43.0 | 54.0 | 97.0 |
| 39.0 | 54.0 | 93.0 |
| 41.0 | 56.0 | 97.0 |
| 41.0 | 57.0 | 98.0 |
| 34.8 | 57.9 | 92.7 |

Get the average sample size and the average share responding (favorable plus unfavorable) per pollster.

averages <- favdata |> group_by(pollster) |> summarize(Avg_Sample_Size=mean(sample_size), Avg_Responding=mean(responding))
kable(averages)
| pollster | Avg_Sample_Size | Avg_Responding |
|:---------|----------------:|---------------:|
| ALG Research/Hart Research Associates | 805.0000 | 98.00000 |
| Anzalone Liszt Grove | 1000.0000 | 98.00000 |
| AP-NORC | 1143.0000 | 96.68200 |
| AtlasIntel | 5188.0000 | 98.00000 |
| Beacon Research | 2523.0000 | 96.00000 |
| Beacon Research/Shaw & Company | 1071.7500 | 98.50000 |
| Bullfinch | 1108.0000 | 98.00000 |
| Change Research | 1804.0000 | 93.66667 |
| Civiqs | 1609.0000 | 98.00000 |
| Cygnal Political | 5688.6667 | 97.65000 |
| East Carolina University | 1105.0000 | 89.40000 |
| Echelon Insights | 1026.5294 | 96.73529 |
| Emerson College Polling Society | 1011.5000 | 97.60000 |
| Fabrizio/Impact | 1437.6667 | 96.00000 |
| Gallup | 1018.0000 | 99.00000 |
| Global Strategy Group | 900.5000 | 98.00000 |
| Global Strategy Group/GBAO/Navigator Research | 1034.2045 | 98.29545 |
| Greenberg Quinlan Rosner | 1000.0000 | 91.50000 |
| Harris Poll | 1881.0000 | 93.30000 |
| HarrisX | 2825.0000 | 100.00000 |
| Hart Research Associates | 1000.0000 | 95.00000 |
| Hart Research Associates/Public Opinion Strategies | 1000.0000 | 88.22222 |
| Hill Research Consultants | 1000.0000 | 97.00000 |
| Hofstra University | 2000.0000 | 99.19000 |
| Ipsos | 1849.2778 | 95.83333 |
| Marist | 1179.7500 | 95.25000 |
| Marquette Law School | 1074.7500 | 97.25000 |
| McLaughlin | 1000.0000 | 99.00000 |
| Monmouth U. | 789.6667 | 89.66667 |
| Morning Consult | 2143.9926 | 96.71852 |
| PEM Management Corporation | 1000.0000 | 96.16667 |
| Pew | 6174.0000 | 98.00000 |
| Public Religion Research Institute | 2866.2500 | 97.75000 |
| Quinnipiac | 1319.0000 | 92.44444 |
| Rasmussen (Pulse Opinion Research) | 1003.2000 | 97.60000 |
| RealClear Opinion Research | 1885.5000 | 97.00000 |
| RMG Research | 1200.0000 | 97.00000 |
| Schoen Cooperman | 750.0000 | 96.00000 |
| Selzer | 1012.0000 | 95.00000 |
| Siena College/NYT Upshot | 958.0000 | 95.50000 |
| SocialSphere | 6565.0000 | 96.00000 |
| SSRS | 1003.0000 | 97.00000 |
| Suffolk | 1000.0000 | 94.32500 |
| Susquehanna | 800.0000 | 92.50000 |
| U. Massachusetts - Lowell | 1000.0000 | 90.00000 |
| Winston | 1027.2727 | 95.45455 |
| YouGov | 1379.3846 | 95.38861 |
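
One thing to watch for with summarize() is that mean() returns NA for a whole group if any value in it is missing. The sketch below shows how na.rm = TRUE and n() could be added to the same group_by()/summarize() pattern. It uses a small made-up tibble (toy_polls and toy_averages are hypothetical names, and the values are invented) rather than the FiveThirtyEight data.

```r
library(dplyr)

# Hypothetical toy data standing in for favdata; values are invented for illustration
toy_polls <- tibble(
  pollster    = c("A", "A", "B", "B"),
  sample_size = c(1000, NA, 800, 1200),
  responding  = c(97, 95, 90, 94)
)

toy_averages <- toy_polls |>
  group_by(pollster) |>
  summarize(
    Polls           = n(),                              # number of rows (polls) per pollster
    Avg_Sample_Size = mean(sample_size, na.rm = TRUE),  # skip missing sample sizes
    Avg_Responding  = mean(responding, na.rm = TRUE)
  )

toy_averages
```

Including the poll count next to each average also makes it easier to see which averages rest on only a handful of polls.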

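Plot the relationship between average sample size and the average share of respondents giving an opinion (favorable plus unfavorable). Judging by the fitted line, pollsters with larger samples tend to have a higher share responding. Maybe that is because those polls have more resources to contact all those respondents?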

ggplot(data=averages, aes(x=Avg_Sample_Size, y=Avg_Responding)) +
  geom_point(size=2, color='darkblue', shape=16) +
  geom_smooth(method = 'lm', se=FALSE, color='red')
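
A fitted line can look convincing even when the relationship is weak, so it may be worth putting a number on it. The sketch below computes a Pearson correlation and the slope of the same kind of straight-line fit that geom_smooth(method = 'lm') draws. It runs on a made-up table (toy_averages is hypothetical and its values are invented) that only borrows the column names of `averages`.

```r
# Hypothetical averages table with the same column names as `averages`; values are invented
toy_averages <- data.frame(
  Avg_Sample_Size = c(800, 1000, 1500, 2500, 6000),
  Avg_Responding  = c(92, 97, 95, 96, 98)
)

# Pearson correlation between the two averages, skipping incomplete rows
r <- cor(toy_averages$Avg_Sample_Size, toy_averages$Avg_Responding,
         use = "complete.obs")

# Slope of a simple linear fit of share responding on sample size
fit <- lm(Avg_Responding ~ Avg_Sample_Size, data = toy_averages)
slope <- coef(fit)[["Avg_Sample_Size"]]

r
slope
```

Running the same two calls on `averages` would show whether the trend in the plot is strong or driven by a few very large polls.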

Nesting comes from the ‘tidyr’ tidyverse package and is often paired with ‘purrr’. After dplyr’s group_by(), nest() collapses each group’s rows into a single list-column of tibbles, leaving one row per group.

favdata_nested <- favdata |> group_by(pollster) |> nest()
favdata_nested
## # A tibble: 47 × 2
## # Groups:   pollster [47]
##    pollster                 data               
##    <chr>                    <list>             
##  1 Echelon Insights         <tibble [34 × 37]> 
##  2 YouGov                   <tibble [208 × 37]>
##  3 Suffolk                  <tibble [4 × 37]>  
##  4 Morning Consult          <tibble [135 × 37]>
##  5 Ipsos                    <tibble [18 × 37]> 
##  6 Winston                  <tibble [11 × 37]> 
##  7 Harris Poll              <tibble [20 × 37]> 
##  8 Siena College/NYT Upshot <tibble [4 × 37]>  
##  9 Susquehanna              <tibble [4 × 37]>  
## 10 AP-NORC                  <tibble [5 × 37]>  
## # … with 37 more rows
This can be very useful for, for example, fitting a separate linear regression per pollster to explore the relationship between favorability and some other variable, and then adding the fitted models to the nested table with mutate(). Because lm() needs one data frame at a time, purrr’s map() is used to apply it to each tibble in the data column. I would love to see this extended :).
favdata_nested <- favdata_nested |> mutate(regression=map(data, \(df) lm(sample_size ~ favorable, data=df)))
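
A minimal sketch of one way the per-pollster regression idea could work, assuming purrr's map() is used to fit one lm() per nested tibble and map_dbl() to pull out each slope. The data here is a made-up stand-in (toy_polls and toy_models are hypothetical names with invented values), not the FiveThirtyEight polls.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical toy data standing in for favdata; values are invented for illustration
toy_polls <- tibble(
  pollster    = rep(c("A", "B"), each = 4),
  favorable   = c(40, 42, 44, 46, 38, 41, 39, 43),
  sample_size = c(1000, 1100, 1200, 1300, 900, 950, 1000, 1200)
)

toy_models <- toy_polls |>
  group_by(pollster) |>
  nest() |>
  ungroup() |>
  mutate(
    # fit one model per pollster; each element of `data` is that pollster's tibble
    regression = map(data, \(df) lm(sample_size ~ favorable, data = df)),
    # pull the coefficient on `favorable` out of each fitted model
    slope = map_dbl(regression, \(fit) coef(fit)[["favorable"]])
  )

toy_models |> select(pollster, slope)
```

Keeping the models in a list-column means later steps, such as extracting residuals or R-squared, can reuse the same map() pattern without refitting.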
Note: a classification model may be more appropriate than a linear one to predict, say, the politician based on the polling information.