16.3 Exercises

We have been using urn models to motivate the use of probability models. Most data science applications are not related to data obtained from urns. More common are data that come from individuals. The reason probability plays a role here is that the data come from a random sample. The random sample is taken from a population, and the urn serves as an analogy for the population.
Let’s revisit the heights dataset. Suppose we consider the males in our course the population.
library(dslabs)
library(dplyr, warn.conflicts = FALSE)
data(heights)
x <- heights |> filter(sex == "Male") |> pull(height)
1. Mathematically speaking, x is our population. Using the urn analogy, we have an urn with the values of x in it. What are the average and standard deviation of our population?
mean(x)
## [1] 69.31475
sd(x)
## [1] 3.611024
2. Call the population average computed above \(μ\) and the standard deviation \(σ\). Now take a sample of size 50, with replacement, and construct an estimate for \(μ\) and \(σ\).
set.seed(1)
n <- 50
X <- sample(x, n, replace = TRUE)
mean(X)
## [1] 70.47293
sd(X)
## [1] 3.426742
3. What does the theory tell us about the sample average \(\overline{X}\) and how it is related to \(μ\)? B. It is a random variable with expected value \(μ\) and standard error \(σ/\sqrt{N}\).
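The theory is easy to check empirically. Here is a quick Monte Carlo sketch, reusing x and n from above (the number of replications B is an arbitrary choice):

B <- 10^4
X_bar <- replicate(B, mean(sample(x, n, replace = TRUE)))
mean(X_bar)  # should be close to mean(x), the population average mu
sd(X_bar)    # should be close to sd(x)/sqrt(n), the standard error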
4. So how is this useful? We are going to use an oversimplified yet illustrative example. Suppose we want to know the average height of our male students, but we only get to measure 50 of the 708. We will use \(\overline{X}\) as our estimate. We know from the answer to exercise 3 that the standard error of our estimate \(\overline{X}\) is \(σ/\sqrt{N}\). We want to compute this, but we don’t know \(σ\). Based on what is described in this section, compute an estimate of \(σ\).
s <- sd(X)
5. Now that we have an estimate of \(σ\), let’s call our estimate \(s\). Construct a 95% confidence interval for \(μ\).
se <- s/sqrt(n)
ci <- c(qnorm(0.025, mean(X), se), qnorm(0.975, mean(X), se))
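Equivalently, since qnorm(p, m, s) is just m + s * qnorm(p), the interval can be written in the familiar estimate plus-or-minus form:

mean(X) + c(-1, 1) * qnorm(0.975) * se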
6. Now run a Monte Carlo simulation in which you compute 10,000 confidence intervals as you have just done. What proportion of these intervals include \(μ\)?
mu <- mean(x)
set.seed(1)
n <- 50
B <- 10^4
result <- replicate(B, {
  X <- sample(x, n, replace = TRUE)
  se <- sd(X)/sqrt(n)
  interval <- c(qnorm(0.025, mean(X), se), qnorm(0.975, mean(X), se))
  between(mu, interval[1], interval[2])
})
mean(result)  # proportion of intervals that include mu
7. In this section, we talked about pollster bias. We used visualization to motivate the presence of such bias. Here we will give it a more rigorous treatment. We will consider two pollsters that conducted daily polls. We will look at national polls for the month before the election.

We want to answer the question: is there a pollster bias? Make a plot showing the spreads for each pollster.
library(ggplot2)
data(polls_us_election_2016)
polls <- polls_us_election_2016 |>
  filter(pollster %in% c("Rasmussen Reports/Pulse Opinion Research",
                         "The Times-Picayune/Lucid") &
           enddate >= "2016-10-15" & state == "U.S.") |>
  mutate(spread = rawpoll_clinton/100 - rawpoll_trump/100)
polls |> ggplot(aes(pollster, spread)) + geom_boxplot() + geom_point()
8. The data do seem to suggest there is a difference. However, these data are subject to variability. Perhaps the differences we observe are due to chance.

The urn model theory says nothing about pollster effect. Under the urn model, both pollsters have the same expected value: the election day difference, which we call \(d\). To answer the question “is there a pollster bias?”, we will model the observed data \(Y_{i,j}\) in the following way:

\(Y_{i,j} = d + b_{i} + ε_{i,j}\)

with \(i = 1, 2\) indexing the two pollsters, \(b_{i}\) the bias for pollster \(i\), and \(ε_{i,j}\) poll-to-poll chance variability. We assume the \(ε_{i,j}\) are independent of each other, have expected value \(0\), and standard deviation \(σ_{i}\) regardless of \(j\).
Which of the following best represents our question? C. Is \(b_{1}≠b_{2}\)?
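To build intuition for this model, here is a minimal simulation; the values of \(d\), \(b_{i}\), and \(σ_{i}\) below are arbitrary assumptions for illustration, not estimates from the data:

# Simulate Y_ij = d + b_i + eps_ij with assumed parameter values
d <- 0.03                # hypothetical election day spread
b <- c(-0.01, 0.04)      # hypothetical pollster biases b_1, b_2
sigma <- c(0.02, 0.03)   # hypothetical poll-to-poll standard deviations
N <- c(16, 24)           # number of polls per pollster
Y_1 <- d + b[1] + rnorm(N[1], 0, sigma[1])
Y_2 <- d + b[2] + rnorm(N[2], 0, sigma[2])
mean(Y_2) - mean(Y_1)    # fluctuates around b_2 - b_1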
9. On the right side of this model, only \(ε_{i,j}\) is a random variable; the other two terms are constants. What is the expected value of \(Y_{1,j}\)? \(E[Y_{1,j}] = d + b_{1}\)
10. Suppose we define \(\overline{Y}_{1}\) as the average of poll results from the first pollster, \(Y_{1,1},\dots,Y_{1,N_{1}}\), with \(N_{1}\) the number of polls conducted by the first pollster:

polls |>
  filter(pollster == "Rasmussen Reports/Pulse Opinion Research") |>
  summarize(N_1 = n())
## N_1
## 1 16
What is the expected value of \(\overline{Y}_{1}\)? \(E[\overline{Y}_{1}] = d + b_{1}\)
11. What is the standard error of \(\overline{Y}_{1}\)? \(SE[\overline{Y}_{1}] = σ_{1}/\sqrt{N_{1}}\)
12. Suppose we define \(\overline{Y}_{2}\) as the average of poll results from the second pollster, \(Y_{2,1},\dots,Y_{2,N_{2}}\), with \(N_{2}\) the number of polls conducted by the second pollster. What is the expected value of \(\overline{Y}_{2}\)? \(E[\overline{Y}_{2}] = d + b_{2}\)
13. What is the standard error of \(\overline{Y}_{2}\)? \(SE[\overline{Y}_{2}] = σ_{2}/\sqrt{N_{2}}\)
14. Using what we learned by answering the questions above, what is the expected value of \(\overline{Y}_{2}-\overline{Y}_{1}\)? \(E[\overline{Y}_{2}-\overline{Y}_{1}]=b_{2}-b_{1}\)
15. Using what we learned by answering the questions above, what is the standard error of \(\overline{Y}_{2}-\overline{Y}_{1}\)? Because the two averages are independent, their variances add: \(SE[\overline{Y}_{2}-\overline{Y}_{1}] = \sqrt{σ^{2}_{2}/N_{2} + σ^{2}_{1}/N_{1}}\)
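That the variances add rather than subtract is easy to verify by simulation; a quick sketch with arbitrary assumed parameters:

s1 <- 0.02; s2 <- 0.03; N1 <- 16; N2 <- 24   # assumed values
diffs <- replicate(10^4, mean(rnorm(N2, 0, s2)) - mean(rnorm(N1, 0, s1)))
sd(diffs)                 # empirical SE of the difference
sqrt(s2^2/N2 + s1^2/N1)   # theoretical SE: variances add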
16. The answer to the question above depends on \(σ_{2}\) and \(σ_{1}\), which we don’t know. We learned that we can estimate these with the sample standard deviation. Write code that computes these two estimates.
head(polls)
## state startdate enddate pollster grade
## 1 U.S. 2016-11-05 2016-11-07 The Times-Picayune/Lucid <NA>
## 2 U.S. 2016-11-02 2016-11-06 Rasmussen Reports/Pulse Opinion Research C+
## 3 U.S. 2016-11-01 2016-11-03 Rasmussen Reports/Pulse Opinion Research C+
## 4 U.S. 2016-11-04 2016-11-06 The Times-Picayune/Lucid <NA>
## 5 U.S. 2016-10-31 2016-11-02 Rasmussen Reports/Pulse Opinion Research C+
## 6 U.S. 2016-11-03 2016-11-05 The Times-Picayune/Lucid <NA>
## samplesize population rawpoll_clinton rawpoll_trump rawpoll_johnson
## 1 2521 lv 45 40 5
## 2 1500 lv 45 43 4
## 3 1500 lv 44 44 4
## 4 2584 lv 45 40 5
## 5 1500 lv 42 45 4
## 6 2526 lv 45 40 5
## rawpoll_mcmullin adjpoll_clinton adjpoll_trump adjpoll_johnson
## 1 NA 45.13966 42.26495 3.679914
## 2 NA 45.56041 43.13745 4.418502
## 3 NA 44.66353 44.28981 4.331246
## 4 NA 45.22830 42.30433 3.770880
## 5 NA 42.67626 45.41689 4.239500
## 6 NA 45.31041 42.34422 3.779955
## adjpoll_mcmullin spread
## 1 NA 0.05
## 2 NA 0.02
## 3 NA 0.00
## 4 NA 0.05
## 5 NA -0.03
## 6 NA 0.05
sigma <- polls |> group_by(pollster) |> summarize(s = sd(spread))
sigma
## # A tibble: 2 × 2
## pollster s
## <fct> <dbl>
## 1 Rasmussen Reports/Pulse Opinion Research 0.0177
## 2 The Times-Picayune/Lucid 0.0268
17. What does the CLT tell us about the distribution of \(\overline{Y}_{2}-\overline{Y}_{1}\)? Note that \(\overline{Y}_{2}\) and \(\overline{Y}_{1}\) are sample averages, so if we assume \(N_{2}\) and \(N_{1}\) are large enough, each is approximately normal. The difference of two independent normal random variables is also normal.
18. We have constructed a random variable that has expected value \(b_{2}-b_{1}\), the pollster bias difference. If our model holds, then this random variable has an approximately normal distribution, and we know its standard error. The standard error depends on \(σ_{1}\) and \(σ_{2}\), but we can plug in the sample standard deviations we computed above. We started off by asking: is \(b_{2}-b_{1}\) different from 0? Use all the information we have learned above to construct a 95% confidence interval for the difference \(b_{2}-b_{1}\).
result <- polls |> group_by(pollster) |>
  summarize(avg = mean(spread), s = sd(spread), N = n())
result
## # A tibble: 2 × 4
## pollster avg s N
## <fct> <dbl> <dbl> <int>
## 1 Rasmussen Reports/Pulse Opinion Research 0.000625 0.0177 16
## 2 The Times-Picayune/Lucid 0.0529 0.0268 24
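The summary table has all the pieces; one way to finish the interval computation (qnorm(0.975) supplies the usual 1.96 multiplier):

est <- result$avg[2] - result$avg[1]
sehat <- sqrt(result$s[2]^2/result$N[2] + result$s[1]^2/result$N[1])
est + c(-1, 1) * qnorm(0.975) * sehat   # 95% confidence interval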
19. The confidence interval tells us there is a relatively strong pollster effect, resulting in a difference of about 5%. Random variability does not seem to explain it. We can compute a p-value to quantify how unlikely it is that chance alone explains it. What is the p-value?
est <- result$avg[2] - result$avg[1]
sehat <- sqrt(result$s[2]^2/result$N[2] + result$s[1]^2/result$N[1])
2 * (1 - pnorm(est/sehat, 0, 1))
## [1] 1.030287e-13
20. The statistic formed by dividing our estimate of \(b_{2}-b_{1}\) by its estimated standard error is called the t-statistic: \(\frac{\overline{Y}_{2}-\overline{Y}_{1}}{\sqrt{s^{2}_{2}/N_{2} + s^{2}_{1}/N_{1}}}\)
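As an aside, base R’s t.test performs this two-pollster comparison directly (Welch’s two-sample version by default); at this point polls still contains only the two pollsters:

t.test(spread ~ pollster, data = polls)  # tests whether the two means differ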
Now notice that we have more than two pollsters. We can also test for pollster effect using all pollsters, not just two. The idea is to compare the variability across polls to variability within polls. We can actually construct statistics to test for effects and approximate their distribution. The area of statistics that does this is called Analysis of Variance or ANOVA. We do not cover it here, but ANOVA provides a very useful set of tools to answer questions such as: is there a pollster effect?
For this exercise, create a new table. Compute the average and standard deviation for each pollster and examine the variability across the averages and how it compares to the variability within the pollsters, summarized by the standard deviation.
polls <- polls_us_election_2016 |>
  filter(enddate >= "2016-10-15" & state == "U.S.") |>
  group_by(pollster) |>
  filter(n() >= 5) |>
  mutate(spread = rawpoll_clinton/100 - rawpoll_trump/100) |>
  ungroup()
var <- polls |> group_by(pollster) |> summarize(avg = mean(spread), s = sd(spread))
var
## # A tibble: 11 × 3
## pollster avg s
## <fct> <dbl> <dbl>
## 1 ABC News/Washington Post 0.0373 0.0339
## 2 CVOTER International 0.0279 0.0180
## 3 Google Consumer Surveys 0.0303 0.0185
## 4 Gravis Marketing 0.016 0.0152
## 5 IBD/TIPP 0.00105 0.0168
## 6 Ipsos 0.0553 0.0195
## 7 Morning Consult 0.0414 0.0146
## 8 Rasmussen Reports/Pulse Opinion Research 0.000625 0.0177
## 9 The Times-Picayune/Lucid 0.0529 0.0268
## 10 USC Dornsife/LA Times -0.0213 0.0207
## 11 YouGov 0.0398 0.00709
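As a pointer to the ANOVA approach mentioned above, base R’s aov function formalizes exactly this comparison of between-pollster to within-pollster variability; shown only as a sketch, since ANOVA is not covered here:

# One-way ANOVA: does the mean spread differ across pollsters?
fit <- aov(spread ~ pollster, data = polls)
summary(fit)  # the F-statistic compares between- to within-pollster variability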