Tidyverse CREATE

Using one or more TidyVerse packages, and any dataset from fivethirtyeight.com or Kaggle, create a programming sample “vignette” that demonstrates how to use one or more of the capabilities of the selected TidyVerse package with your selected dataset.

First, load the tidyverse, plus knitr for table output.

library(tidyverse)
library(knitr)

Read in the Donald Trump favorability ratings data from FiveThirtyEight

favdata <- read_csv("https://projects.fivethirtyeight.com/polls/data/favorability_polls.csv", show_col_types=FALSE)
head(favdata)

## # A tibble: 6 × 37
##   poll_id pollster_id pollster   spons…¹ spons…² displ…³ polls…⁴ polls…⁵ fte_g…⁶
##     <dbl>       <dbl> <chr>        <dbl> <chr>   <chr>     <dbl> <chr>   <chr>  
## 1   81309        1302 Echelon I…      NA <NA>    Echelo…     407 Echelo… B/C    
## 2   81309        1302 Echelon I…      NA <NA>    Echelo…     407 Echelo… B/C    
## 3   81267         568 YouGov         352 Econom… YouGov      391 YouGov  B+     
## 4   81267         568 YouGov         352 Econom… YouGov      391 YouGov  B+     
## 5   81267         568 YouGov         352 Econom… YouGov      391 YouGov  B+     
## 6   81296         458 Suffolk        135 USA To… Suffol…     323 Suffol… B+     
## # … with 28 more variables: methodology <chr>, state <lgl>, start_date <chr>,
## #   end_date <chr>, sponsor_candidate_id <lgl>, sponsor_candidate <lgl>,
## #   sponsor_candidate_party <lgl>, question_id <dbl>, sample_size <dbl>,
## #   population <chr>, subpopulation <lgl>, population_full <chr>,
## #   tracking <lgl>, created_at <chr>, notes <chr>, url <chr>, source <dbl>,
## #   internal <lgl>, partisan <chr>, politician_id <dbl>, politician <chr>,
## #   favorable <dbl>, unfavorable <dbl>, alternate_answers <dbl>, …

colnames(favdata)

##  [1] "poll_id"                 "pollster_id"            
##  [3] "pollster"                "sponsor_ids"            
##  [5] "sponsors"                "display_name"           
##  [7] "pollster_rating_id"      "pollster_rating_name"   
##  [9] "fte_grade"               "methodology"            
## [11] "state"                   "start_date"             
## [13] "end_date"                "sponsor_candidate_id"   
## [15] "sponsor_candidate"       "sponsor_candidate_party"
## [17] "question_id"             "sample_size"            
## [19] "population"              "subpopulation"          
## [21] "population_full"         "tracking"               
## [23] "created_at"              "notes"                  
## [25] "url"                     "source"                 
## [27] "internal"                "partisan"               
## [29] "politician_id"           "politician"             
## [31] "favorable"               "unfavorable"            
## [33] "alternate_answers"       "very_favorable"         
## [35] "somewhat_favorable"      "somewhat_unfavorable"   
## [37] "very_unfavorable"

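Some of the most useful tidyverse functions come from dplyr, which we can use to easily select, group, and summarize useful information. Here we add a responding column: the percentage of respondents in each poll who gave either a favorable or an unfavorable answer.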

fav_unfav <- favdata |> select(favorable, unfavorable)
favdata <- favdata |> mutate(responding=rowSums(fav_unfav))
kable(head(favdata[c("favorable", "unfavorable", "responding")]))

favorable	unfavorable	responding
43.0	54.0	97.0
43.0	54.0	97.0
39.0	54.0	93.0
41.0	56.0	97.0
41.0	57.0	98.0
34.8	57.9	92.7
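The same row-wise sum can be written without the intermediate fav_unfav table. Here is a minimal sketch, assuming a made-up three-row tibble (toy is hypothetical, not part of the poll file) and using dplyr's across() to pick the columns inside mutate():

```r
library(dplyr)
library(tibble)

# Hypothetical toy data standing in for the favorable/unfavorable columns
toy <- tibble(
  favorable   = c(43, 39, 34.8),
  unfavorable = c(54, 54, 57.9)
)

# across() selects the columns inside mutate(), so no helper table is needed
toy <- toy |>
  mutate(responding = rowSums(across(c(favorable, unfavorable))))

toy
```

With only two columns, favorable + unfavorable would do the same job; rowSums(across(...)) becomes more convenient as the number of columns grows.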

Get the average sample size and the average percentage responding for each pollster.

averages <- favdata |> group_by(pollster) |> summarize(Avg_Sample_Size=mean(sample_size), Avg_Responding=mean(responding))
kable(averages)

pollster	Avg_Sample_Size	Avg_Responding
ALG Research/Hart Research Associates	805.0000	98.00000
Anzalone Liszt Grove	1000.0000	98.00000
AP-NORC	1143.0000	96.68200
AtlasIntel	5188.0000	98.00000
Beacon Research	2523.0000	96.00000
Beacon Research/Shaw & Company	1071.7500	98.50000
Bullfinch	1108.0000	98.00000
Change Research	1804.0000	93.66667
Civiqs	1609.0000	98.00000
Cygnal Political	5688.6667	97.65000
East Carolina University	1105.0000	89.40000
Echelon Insights	1026.5294	96.73529
Emerson College Polling Society	1011.5000	97.60000
Fabrizio/Impact	1437.6667	96.00000
Gallup	1018.0000	99.00000
Global Strategy Group	900.5000	98.00000
Global Strategy Group/GBAO/Navigator Research	1034.2045	98.29545
Greenberg Quinlan Rosner	1000.0000	91.50000
Harris Poll	1881.0000	93.30000
HarrisX	2825.0000	100.00000
Hart Research Associates	1000.0000	95.00000
Hart Research Associates/Public Opinion Strategies	1000.0000	88.22222
Hill Research Consultants	1000.0000	97.00000
Hofstra University	2000.0000	99.19000
Ipsos	1849.2778	95.83333
Marist	1179.7500	95.25000
Marquette Law School	1074.7500	97.25000
McLaughlin	1000.0000	99.00000
Monmouth U.	789.6667	89.66667
Morning Consult	2143.9926	96.71852
PEM Management Corporation	1000.0000	96.16667
Pew	6174.0000	98.00000
Public Religion Research Institute	2866.2500	97.75000
Quinnipiac	1319.0000	92.44444
Rasmussen (Pulse Opinion Research)	1003.2000	97.60000
RealClear Opinion Research	1885.5000	97.00000
RMG Research	1200.0000	97.00000
Schoen Cooperman	750.0000	96.00000
Selzer	1012.0000	95.00000
Siena College/NYT Upshot	958.0000	95.50000
SocialSphere	6565.0000	96.00000
SSRS	1003.0000	97.00000
Suffolk	1000.0000	94.32500
Susquehanna	800.0000	92.50000
U. Massachusetts - Lowell	1000.0000	90.00000
Winston	1027.2727	95.45455
YouGov	1379.3846	95.38861
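Some of these averages rest on only a handful of polls, and mean() returns NA as soon as one value is missing. As a hedged sketch on hypothetical toy data (toy_polls and its values are made up), the summary can also count polls per pollster with n() and skip missing values with na.rm = TRUE:

```r
library(dplyr)
library(tibble)

# Hypothetical toy polls; the NA mimics a poll with no reported sample size
toy_polls <- tibble(
  pollster    = c("A", "A", "B", "B"),
  sample_size = c(1000, 1200, 800, NA),
  responding  = c(97, 95, 90, 92)
)

toy_averages <- toy_polls |>
  group_by(pollster) |>
  summarize(
    n_polls         = n(),                              # rows per pollster
    Avg_Sample_Size = mean(sample_size, na.rm = TRUE),  # ignore missing sizes
    Avg_Responding  = mean(responding, na.rm = TRUE)
  )

toy_averages
```

The n_polls column makes it easier to judge how much weight each pollster's average deserves.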

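Plot the relationship between each pollster's average sample size and average percentage responding (favorable plus unfavorable). The fitted line suggests that pollsters with larger samples tend to have a somewhat higher share of respondents giving an opinion, although this is a pattern in pollster-level averages and does not show cause. One possible explanation is that those polls have more resources to contact respondents.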

ggplot(data=averages, aes(x=Avg_Sample_Size, y=Avg_Responding)) +
  geom_point(size=2, color='darkblue', shape=16) +
  geom_smooth(method = 'lm', se=FALSE, color='red')
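A trend line alone does not say how strong the relationship is. One way to put a number on it, sketched here on hypothetical averages (the toy_averages values are made up, not taken from the table above), is to fit the same linear model that geom_smooth(method = 'lm') draws and look at its slope next to the correlation:

```r
library(tibble)

# Hypothetical per-pollster averages, for illustration only
toy_averages <- tibble(
  Avg_Sample_Size = c(800, 1000, 1500, 3000, 6000),
  Avg_Responding  = c(92, 97, 95, 98, 96)
)

# The same model geom_smooth(method = "lm") fits behind the scenes
fit <- lm(Avg_Responding ~ Avg_Sample_Size, data = toy_averages)
slope <- coef(fit)[["Avg_Sample_Size"]]

# Pearson correlation between the two averages
r <- cor(toy_averages$Avg_Sample_Size, toy_averages$Avg_Responding)

slope
r
```

A slope close to zero or a weak correlation would be a reason to treat the visual trend with caution.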

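Nesting comes from the tidyr package (purrr supplies the map() functions often used alongside it). nest() collapses each group's rows into a single row holding a list-column of tibbles, which makes it a natural companion to dplyr's group_by().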

favdata_nested <- favdata |> group_by(pollster) |> nest()
favdata_nested

## # A tibble: 47 × 2
## # Groups:   pollster [47]
##    pollster                 data               
##    <chr>                    <list>             
##  1 Echelon Insights         <tibble [34 × 37]> 
##  2 YouGov                   <tibble [208 × 37]>
##  3 Suffolk                  <tibble [4 × 37]>  
##  4 Morning Consult          <tibble [135 × 37]>
##  5 Ipsos                    <tibble [18 × 37]> 
##  6 Winston                  <tibble [11 × 37]> 
##  7 Harris Poll              <tibble [20 × 37]> 
##  8 Siena College/NYT Upshot <tibble [4 × 37]>  
##  9 Susquehanna              <tibble [4 × 37]>  
## 10 AP-NORC                  <tibble [5 × 37]>  
## # … with 37 more rows
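Nesting does not lose any rows: unnest() reverses it. A minimal sketch on a hypothetical three-row tibble (toy_polls is made up for illustration) shows the round trip:

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical toy data: two polls from pollster A, one from pollster B
toy_polls <- tibble(
  pollster  = c("A", "A", "B"),
  favorable = c(43, 41, 39)
)

# One row per pollster; the data list-column holds each group's rows
toy_nested <- toy_polls |>
  group_by(pollster) |>
  nest()

# unnest() expands the list-column back into ordinary rows
toy_restored <- toy_nested |>
  unnest(data) |>
  ungroup()

toy_restored
```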

This can be very useful for, say, fitting a linear regression per pollster to estimate the relationship between favorability and some other variable, and then adding the fitted models to the nested table with mutate(). I don’t quite have the code down, but would love to see this extended :).

Benjamin Inbar

2022-10-29

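One way to do this is with purrr's map(), which fits one model per pollster and stores the results in a list-column:

favdata_nested <- favdata_nested |> mutate(regression=map(data, \(df) lm(sample_size ~ favorable, data=df)))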
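Once each pollster has its own fitted model, single numbers can be pulled out of the models in the same pipeline. The following is a sketch under stated assumptions: toy_polls is hypothetical data built so the slopes are easy to check by hand, and map_dbl() with coef() is just one of several ways to extract a coefficient.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(tibble)

# Hypothetical toy data, constructed to be exactly linear within each pollster
toy_polls <- tibble(
  pollster    = c("A", "A", "A", "B", "B", "B"),
  favorable   = c(40, 42, 44, 35, 36, 37),
  sample_size = c(1000, 1200, 1400, 800, 850, 900)
)

toy_models <- toy_polls |>
  group_by(pollster) |>
  nest() |>
  mutate(
    # map() returns a list: one lm object per nested tibble
    regression = map(data, \(df) lm(sample_size ~ favorable, data = df)),
    # map_dbl() returns a plain numeric column: one slope per model
    slope = map_dbl(regression, \(m) coef(m)[["favorable"]])
  )

toy_models
```

Keeping the models in a list-column next to their data means the slopes, the residuals, or anything else can be extracted later without refitting.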

Note: a classification model may be more appropriate than a linear regression if the goal is to predict, say, the politician from the polling information.