16.3 Exercises

We have been using urn models to motivate the use of probability models. Most data science applications are not related to data obtained from urns. More common are data that come from individuals. The reason probability plays a role here is that the data come from a random sample. The random sample is taken from a population, and the urn serves as an analogy for the population.
Let’s revisit the heights dataset. Suppose we consider the males in our course the population.
library(dslabs)
library(dplyr, warn.conflicts = FALSE)
data(heights)
x <- heights |> filter(sex == "Male") |> pull(height)
1. Mathematically speaking, x is our population. Using the urn analogy, we have an urn with the values of x in it. What are the average and standard deviation of our population?
mean(x)
## [1] 69.31475
sd(x)
## [1] 3.611024
2. Call the population average computed above \(μ\) and the standard deviation \(σ\). Now take a sample of size 50, with replacement, and construct an estimate for \(μ\) and \(σ\).
set.seed(1)
n <- 50
X <- sample(x, n, replace = TRUE)
mean(X)
## [1] 70.47293
sd(X)
## [1] 3.426742
3. What does the theory tell us about the sample average \(\overline{X}\) and how it is related to \(μ\)? B. It is a random variable with expected value \(μ\) and standard error \(σ/\sqrt{N}\).
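The theory is easy to check empirically. Here is a quick Monte Carlo sketch, reusing x and n from above (the number of replications B is an arbitrary choice):

B <- 10^4
X_bar <- replicate(B, mean(sample(x, n, replace = TRUE)))
mean(X_bar)  # should be close to mean(x), the population average mu
sd(X_bar)    # should be close to sd(x)/sqrt(n), the standard error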
4. So how is this useful? We are going to use an oversimplified yet illustrative example. Suppose we want to know the average height of our male students, but we only get to measure 50 of the 708. We will use \(\overline{X}\) as our estimate. We know from the answer to exercise 3 that the standard error of our estimate \(\overline{X}\) is \(σ/\sqrt{N}\). We want to compute this, but we don’t know \(σ\). Based on what is described in this section, compute an estimate of \(σ\).
s <- sd(X)
5. Now that we have an estimate of \(σ\), let’s call our estimate \(s\). Construct a 95% confidence interval for \(μ\).
se <- s/sqrt(n)
ci <- c(qnorm(0.025, mean(X), se), qnorm(0.975, mean(X), se))
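Equivalently, since qnorm(p, m, s) is just m + s * qnorm(p), the interval can be written in the familiar estimate plus-or-minus form:

mean(X) + c(-1, 1) * qnorm(0.975) * se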
6. Now run a Monte Carlo simulation in which you compute 10,000 confidence intervals as you have just done. What proportion of these intervals include \(μ\)?
mu <- mean(x)
set.seed(1)
n <- 50
B <- 10^4
result <- replicate(B, {
  X <- sample(x, n, replace = TRUE)
  se <- sd(X)/sqrt(n)
  interval <- c(qnorm(0.025, mean(X), se), qnorm(0.975, mean(X), se))
  between(mu, interval[1], interval[2])
})
mean(result)  # proportion of intervals that include mu
7. In this section, we talked about pollster bias. We used visualization to motivate the presence of such bias. Here we will give it a more rigorous treatment. We will consider two pollsters that conducted daily polls. We will look at national polls for the month before the election.

We want to answer the question: is there a pollster bias? Make a plot showing the spreads for each pollster.
library(ggplot2)
data(polls_us_election_2016)
polls <- polls_us_election_2016 |>
  filter(pollster %in% c("Rasmussen Reports/Pulse Opinion Research",
                         "The Times-Picayune/Lucid") &
           enddate >= "2016-10-15" & state == "U.S.") |>
  mutate(spread = rawpoll_clinton/100 - rawpoll_trump/100)
polls |> ggplot(aes(pollster, spread)) + geom_boxplot() + geom_point()
8. The data do seem to suggest there is a difference. However, these data are subject to variability. Perhaps the differences we observe are due to chance.

The urn model theory says nothing about pollster effect. Under the urn model, both pollsters have the same expected value: the election day difference, which we call \(d\). To answer the question “is there a pollster bias?”, we will model the observed data \(Y_{i,j}\) in the following way:

\(Y_{i,j} = d + b_{i} + ε_{i,j}\)

with \(i = 1, 2\) indexing the two pollsters, \(b_{i}\) the bias for pollster \(i\), and \(ε_{i,j}\) poll-to-poll chance variability. We assume the \(ε_{i,j}\) are independent of each other, have expected value \(0\), and standard deviation \(σ_{i}\) regardless of \(j\).
Which of the following best represents our question? C. Is \(b_{1}≠b_{2}\)?
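To build intuition for this model, here is a minimal simulation; the values of \(d\), \(b_{i}\), and \(σ_{i}\) below are arbitrary assumptions for illustration, not estimates from the data:

# Simulate Y_ij = d + b_i + eps_ij with assumed parameter values
d <- 0.03                # hypothetical election day spread
b <- c(-0.01, 0.04)      # hypothetical pollster biases b_1, b_2
sigma <- c(0.02, 0.03)   # hypothetical poll-to-poll standard deviations
N <- c(16, 24)           # number of polls per pollster
Y_1 <- d + b[1] + rnorm(N[1], 0, sigma[1])
Y_2 <- d + b[2] + rnorm(N[2], 0, sigma[2])
mean(Y_2) - mean(Y_1)    # fluctuates around b_2 - b_1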
9. On the right side of this model, only \(ε_{i,j}\) is a random variable; the other two terms are constants. What is the expected value of \(Y_{1,j}\)? \(E[Y_{1,j}] = d + b_{1}\)
10. Suppose we define \(\overline{Y}_{1}\) as the average of poll results from the first pollster, \(Y_{1,1},\dots,Y_{1,N_{1}}\), with \(N_{1}\) the number of polls conducted by the first pollster:

polls |>
  filter(pollster == "Rasmussen Reports/Pulse Opinion Research") |>
  summarize(N_1 = n())
## N_1
## 1 16
What is the expected value of \(\overline{Y}_{1}\)? \(E[\overline{Y}_{1}] = d + b_{1}\)
11. What is the standard error of \(\overline{Y}_{1}\)? \(SE[\overline{Y}_{1}] = σ_{1}/\sqrt{N_{1}}\)
12. Suppose we define \(\overline{Y}_{2}\) as the average of poll results from the second pollster, \(Y_{2,1},\dots,Y_{2,N_{2}}\), with \(N_{2}\) the number of polls conducted by the second pollster. What is the expected value of \(\overline{Y}_{2}\)? \(E[\overline{Y}_{2}] = d + b_{2}\)
13. What is the standard error of \(\overline{Y}_{2}\)? \(SE[\overline{Y}_{2}] = σ_{2}/\sqrt{N_{2}}\)
14. Using what we learned by answering the questions above, what is the expected value of \(\overline{Y}_{2}-\overline{Y}_{1}\)? \(E[\overline{Y}_{2}-\overline{Y}_{1}]=b_{2}-b_{1}\)
15. Using what we learned by answering the questions above, what is the standard error of \(\overline{Y}_{2}-\overline{Y}_{1}\)? Because the two averages are independent, their variances add: \(SE[\overline{Y}_{2}-\overline{Y}_{1}] = \sqrt{σ^{2}_{2}/N_{2} + σ^{2}_{1}/N_{1}}\)
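That the variances add rather than subtract is easy to verify by simulation; a quick sketch with arbitrary assumed parameters:

s1 <- 0.02; s2 <- 0.03; N1 <- 16; N2 <- 24   # assumed values
diffs <- replicate(10^4, mean(rnorm(N2, 0, s2)) - mean(rnorm(N1, 0, s1)))
sd(diffs)                 # empirical SE of the difference
sqrt(s2^2/N2 + s1^2/N1)   # theoretical SE: variances add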
16. The answer to the question above depends on \(σ_{2}\) and \(σ_{1}\), which we don’t know. We learned that we can estimate these with the sample standard deviation. Write code that computes these two estimates.
head(polls)
## state startdate enddate pollster grade
## 1 U.S. 2016-11-05 2016-11-07 The Times-Picayune/Lucid <NA>
## 2 U.S. 2016-11-02 2016-11-06 Rasmussen Reports/Pulse Opinion Research C+
## 3 U.S. 2016-11-01 2016-11-03 Rasmussen Reports/Pulse Opinion Research C+
## 4 U.S. 2016-11-04 2016-11-06 The Times-Picayune/Lucid <NA>
## 5 U.S. 2016-10-31 2016-11-02 Rasmussen Reports/Pulse Opinion Research C+
## 6 U.S. 2016-11-03 2016-11-05 The Times-Picayune/Lucid <NA>
## samplesize population rawpoll_clinton rawpoll_trump rawpoll_johnson
## 1 2521 lv 45 40 5
## 2 1500 lv 45 43 4
## 3 1500 lv 44 44 4
## 4 2584 lv 45 40 5
## 5 1500 lv 42 45 4
## 6 2526 lv 45 40 5
## rawpoll_mcmullin adjpoll_clinton adjpoll_trump adjpoll_johnson
## 1 NA 45.13966 42.26495 3.679914
## 2 NA 45.56041 43.13745 4.418502
## 3 NA 44.66353 44.28981 4.331246
## 4 NA 45.22830 42.30433 3.770880
## 5 NA 42.67626 45.41689 4.239500
## 6 NA 45.31041 42.34422 3.779955
## adjpoll_mcmullin spread
## 1 NA 0.05
## 2 NA 0.02
## 3 NA 0.00
## 4 NA 0.05
## 5 NA -0.03
## 6 NA 0.05
sigma <- polls |> group_by(pollster) |> summarize(s = sd(spread))
sigma
## # A tibble: 2 × 2
## pollster s
## <fct> <dbl>
## 1 Rasmussen Reports/Pulse Opinion Research 0.0177
## 2 The Times-Picayune/Lucid 0.0268
17. What does the CLT tell us about the distribution of \(\overline{Y}_{2}-\overline{Y}_{1}\)? Note that \(\overline{Y}_{2}\) and \(\overline{Y}_{1}\) are sample averages, so if we assume \(N_{2}\) and \(N_{1}\) are large enough, each is approximately normal. The difference of two independent normal random variables is also normal.
18. We have constructed a random variable that has expected value \(b_{2}-b_{1}\), the pollster bias difference. If our model holds, then this random variable has an approximately normal distribution, and we know its standard error. The standard error depends on \(σ_{1}\) and \(σ_{2}\), but we can plug in the sample standard deviations we computed above. We started off by asking: is \(b_{2}-b_{1}\) different from 0? Use all the information we have learned above to construct a 95% confidence interval for the difference \(b_{2}-b_{1}\).
result <- polls |> group_by(pollster) |>
  summarize(avg = mean(spread), s = sd(spread), N = n())
result
## # A tibble: 2 × 4
## pollster avg s N
## <fct> <dbl> <dbl> <int>
## 1 Rasmussen Reports/Pulse Opinion Research 0.000625 0.0177 16
## 2 The Times-Picayune/Lucid 0.0529 0.0268 24
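The summary table has all the pieces; one way to finish the interval computation (qnorm(0.975) supplies the usual 1.96 multiplier):

est <- result$avg[2] - result$avg[1]
sehat <- sqrt(result$s[2]^2/result$N[2] + result$s[1]^2/result$N[1])
est + c(-1, 1) * qnorm(0.975) * sehat   # 95% confidence interval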
19. The confidence interval tells us there is a relatively strong pollster effect, resulting in a difference of about 5%. Random variability does not seem to explain it. We can compute a p-value to quantify how unlikely it is that chance alone explains it. What is the p-value?
est <- result$avg[2] - result$avg[1]
sehat <- sqrt(result$s[2]^2/result$N[2] + result$s[1]^2/result$N[1])
2 * (1 - pnorm(est/sehat, 0, 1))
## [1] 1.030287e-13
20. The statistic formed by dividing our estimate of \(b_{2}-b_{1}\) by its estimated standard error is called the t-statistic: \(\frac{\overline{Y}_{2}-\overline{Y}_{1}}{\sqrt{s^{2}_{2}/N_{2} + s^{2}_{1}/N_{1}}}\)
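As an aside, base R’s t.test performs this two-pollster comparison directly (Welch’s two-sample version by default); at this point polls still contains only the two pollsters:

t.test(spread ~ pollster, data = polls)  # tests whether the two means differ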
Now notice that we have more than two pollsters. We can also test for pollster effect using all pollsters, not just two. The idea is to compare the variability across polls to variability within polls. We can actually construct statistics to test for effects and approximate their distribution. The area of statistics that does this is called Analysis of Variance or ANOVA. We do not cover it here, but ANOVA provides a very useful set of tools to answer questions such as: is there a pollster effect?
For this exercise, create a new table. Compute the average and standard deviation for each pollster and examine the variability across the averages and how it compares to the variability within the pollsters, summarized by the standard deviation.
polls <- polls_us_election_2016 |>
  filter(enddate >= "2016-10-15" & state == "U.S.") |>
  group_by(pollster) |>
  filter(n() >= 5) |>
  mutate(spread = rawpoll_clinton/100 - rawpoll_trump/100) |>
  ungroup()
var <- polls |> group_by(pollster) |> summarize(avg = mean(spread), s = sd(spread))
var
## # A tibble: 11 × 3
## pollster avg s
## <fct> <dbl> <dbl>
## 1 ABC News/Washington Post 0.0373 0.0339
## 2 CVOTER International 0.0279 0.0180
## 3 Google Consumer Surveys 0.0303 0.0185
## 4 Gravis Marketing 0.016 0.0152
## 5 IBD/TIPP 0.00105 0.0168
## 6 Ipsos 0.0553 0.0195
## 7 Morning Consult 0.0414 0.0146
## 8 Rasmussen Reports/Pulse Opinion Research 0.000625 0.0177
## 9 The Times-Picayune/Lucid 0.0529 0.0268
## 10 USC Dornsife/LA Times -0.0213 0.0207
## 11 YouGov 0.0398 0.00709
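As a pointer to the ANOVA approach mentioned above, base R’s aov function formalizes exactly this comparison of between-pollster to within-pollster variability; shown only as a sketch, since ANOVA is not covered here:

# One-way ANOVA: does the mean spread differ across pollsters?
fit <- aov(spread ~ pollster, data = polls)
summary(fit)  # the F-statistic compares between- to within-pollster variability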