16.9 Exercises

1. Create this table. Now for each poll use the CLT to create a 95% confidence interval for the spread reported by each poll. Call the resulting object cis with columns lower and upper for the limits of the confidence intervals. Use the select function to keep the columns state, startdate, end date, pollster, grade, spread, lower, upper.

library(tidyverse)
## Warning: package 'stringr' was built under R version 4.3.3
## Warning: package 'lubridate' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dslabs)
data("polls_us_election_2016")
polls<-polls_us_election_2016 |> filter(state != "U.S." & enddate>="2016-10-31") |> mutate(spread= rawpoll_clinton/100-rawpoll_trump/100)
cis <- polls %>% mutate(X_hat=(spread+1)/2, se=2*sqrt(X_hat*(1-X_hat)/samplesize), lower=spread-qnorm(0.975)*se, upper=spread+qnorm(0.975)*se) %>%
select(state, startdate, enddate, pollster, grade, spread, lower, upper)

2. You can add the final result to the cis table you just created using the right_join function like this. Now determine how often the 95% confidence interval includes the actual result.

add<-results_us_election_2016 |> mutate(actual_spread=clinton/100-trump/100) |> select(state, actual_spread)
cis_data<-cis %>% mutate(state=as.character(state)) %>% left_join(add, by="state")
p_hits<-cis_data %>% mutate(hit=lower<=actual_spread & upper>=actual_spread) %>% summarize(proportion_hits=mean(hit))

3. Repeat this, but show the proportion of hits for each pollster. Show only pollsters with more than 5 polls and order them from best to worst. Show the number of polls conducted by each pollster and the FiveThirtyEight grade of each pollster. Hint: use n=n(), grade=grade[1] in the call to summarize.

add<-results_us_election_2016 %>% mutate(actual_spread= clinton/100-trump/100) %>% select(state, actual_spread)
ci_data<-cis %>% mutate(state=as.character(state)) %>% left_join(add, by="state")

p_hits<-ci_data %>% mutate(hit=lower<=actual_spread & upper>= actual_spread) %>% summarize(proportion_hits=mean(hit))
p_hits
##   proportion_hits
## 1         0.66133

4. Repeat exercise 3, but instead of pollster, stratify by state. Note that here we can’t show grades.

add<-results_us_election_2016 %>% mutate(actual_spread=clinton/100-trump/100) %>% select(state, actual_spread)
ci_data<-cis %>% mutate(state=as.character(state)) %>% left_join(add, by="state")
p_hits<-ci_data %>% mutate(hit=lower<=actual_spread & upper>=actual_spread) %>% group_by(state) %>% filter(n()>=5) %>% summarize(proportion_hits=mean(hit), n=n()) %>% arrange(desc(proportion_hits))

5. Make a barplot based on the result of exercise 4. Use coord_flip.

p_hits %>% mutate(state=reorder(state, proportion_hits)) %>% 
  ggplot(aes(state, proportion_hits)) + geom_bar(stat="identity") + coord_flip()

p_hits
## # A tibble: 51 × 3
##    state        proportion_hits     n
##    <chr>                  <dbl> <int>
##  1 Connecticut            1        13
##  2 Delaware               1        12
##  3 Rhode Island           1        10
##  4 New Mexico             0.941    17
##  5 Washington             0.933    15
##  6 Oregon                 0.929    14
##  7 Illinois               0.923    13
##  8 Nevada                 0.923    26
##  9 Maine                  0.917    12
## 10 Montana                0.917    12
## # ℹ 41 more rows

6. Add two columns to the cis table by computing, for each poll, the difference between the predicted spread and the actual spread, and define a column hit that is true if the signs are the same. Hint: use the function sign. Call the object resids.

cis<-cis%>%mutate(state=as.character(state))%>%left_join(add, by="state")
head(cis)
##            state  startdate    enddate                pollster grade spread
## 1     New Mexico 2016-11-06 2016-11-06                Zia Poll  <NA>   0.02
## 2       Virginia 2016-11-03 2016-11-04   Public Policy Polling    B+   0.05
## 3           Iowa 2016-11-01 2016-11-04        Selzer & Company    A+  -0.07
## 4      Wisconsin 2016-10-26 2016-10-31    Marquette University     A   0.06
## 5 North Carolina 2016-11-04 2016-11-06           Siena College     A   0.00
## 6        Georgia 2016-11-06 2016-11-06 Landmark Communications     B  -0.03
##          lower         upper actual_spread
## 1 -0.001331221  0.0413312213         0.083
## 2 -0.005634504  0.1056345040         0.054
## 3 -0.139125210 -0.0008747905        -0.094
## 4  0.004774064  0.1152259363        -0.007
## 5 -0.069295191  0.0692951912        -0.036
## 6 -0.086553820  0.0265538203        -0.051
resids<-cis%>%mutate(error=spread-ci_data$actual_spread, hit=sign(spread)==sign(ci_data$actual_spread))
tail(resids)
##            state  startdate    enddate                pollster grade  spread
## 807         Utah 2016-10-04 2016-11-06                  YouGov     B -0.0910
## 808         Utah 2016-10-25 2016-10-31 Google Consumer Surveys     B -0.0121
## 809 South Dakota 2016-10-28 2016-11-02                   Ipsos    A- -0.1875
## 810   Washington 2016-10-21 2016-11-02                   Ipsos    A-  0.0838
## 811         Utah 2016-11-01 2016-11-07 Google Consumer Surveys     B -0.1372
## 812       Oregon 2016-10-21 2016-11-02                   Ipsos    A-  0.0905
##             lower       upper actual_spread   error  hit
## 807 -0.1660704570 -0.01592954        -0.180  0.0890 TRUE
## 808 -0.1373083389  0.11310834        -0.180  0.1679 TRUE
## 809 -0.3351563485 -0.03984365        -0.298  0.1105 TRUE
## 810 -0.0004028265  0.16800283         0.162 -0.0782 TRUE
## 811 -0.2519991224 -0.02240088        -0.180  0.0428 TRUE
## 812 -0.0019261469  0.18292615         0.110 -0.0195 TRUE

7. Create a plot like in exercise 5, but for the proportion of times the sign of the spread agreed.

resids<-cis %>% mutate(error=spread-ci_data$actual_spread, hit=sign(spread)== sign(ci_data$actual_spread))
p_hits<-resids %>% group_by(state) %>% summarize(proportion_hits=mean(hit), n=n())
p_hits %>% mutate(state=reorder(state, proportion_hits)) %>% ggplot(aes(state, proportion_hits)) + geom_bar(stat="identity") + coord_flip()

8. In exercise 7, we see that for most states the polls had it right 100% of the time. For only 9 states did the polls miss more than 25% of the time. In particular, notice that in Wisconsin every single poll got it wrong. In Pennsylvania and Michigan more than 90% of the polls had the signs wrong. Make a histogram of the errors. What is the median of these errors?

hist(resids$error)

median(resids$error)
## [1] 0.037

9. We see that at the state level, the median error was 3% in favor of Clinton. The distribution is not centered at 0, but at 0.03. This is the general bias we described in the section above. Create a boxplot to see if the bias was general to all states or it affected some states differently. Use filter(grade %in% c(“A+”,“A”,“A-”,“B+”) | is.na(grade))) to only include pollsters with high grades.

head(resids)
##            state  startdate    enddate                pollster grade spread
## 1     New Mexico 2016-11-06 2016-11-06                Zia Poll  <NA>   0.02
## 2       Virginia 2016-11-03 2016-11-04   Public Policy Polling    B+   0.05
## 3           Iowa 2016-11-01 2016-11-04        Selzer & Company    A+  -0.07
## 4      Wisconsin 2016-10-26 2016-10-31    Marquette University     A   0.06
## 5 North Carolina 2016-11-04 2016-11-06           Siena College     A   0.00
## 6        Georgia 2016-11-06 2016-11-06 Landmark Communications     B  -0.03
##          lower         upper actual_spread  error   hit
## 1 -0.001331221  0.0413312213         0.083 -0.063  TRUE
## 2 -0.005634504  0.1056345040         0.054 -0.004  TRUE
## 3 -0.139125210 -0.0008747905        -0.094  0.024  TRUE
## 4  0.004774064  0.1152259363        -0.007  0.067 FALSE
## 5 -0.069295191  0.0692951912        -0.036  0.036 FALSE
## 6 -0.086553820  0.0265538203        -0.051  0.021  TRUE
resids %>% filter(grade %in% c("A+","A","A-","B+")|is.na(grade)) %>% mutate(state= reorder(state, error)) %>% ggplot(aes(state, error)) + geom_boxplot() + geom_point()

10. Some of these states only have a few polls. Repeat exercise 9, but only include states with 5 good polls or more. Hint: use group_by, filter then ungroup. You will see that the West (Washington, New Mexico, California) underestimated Hillary’s performance, while the Midwest (Michigan, Pennsylvania, Wisconsin, Ohio, Missouri) overestimated it. In our simulation, we did not model this behavior since we added general bias, rather than a regional bias. Note that some pollsters may now be modeling correlation between similar states and estimating this correlation from historical data. To learn more about this, you can learn about random effects and mixed models.

resids %>% filter(grade %in% c("A+","A","A-","B+") | is.na(grade)) %>% group_by(state) %>% filter(n() >= 5) %>% ungroup() %>%mutate(state=reorder(state, error)) %>% ggplot(aes(state, error)) + geom_boxplot() + geom_point()