Introduction

Quantities \(\bar{x}\) and \(\hat{p}\) are good point estimates of the population mean and population proportion, respectively. These point estimates vary from one sample to another. First, we will investigate this variability. Second, we will use the package infer to gain a deeper understanding of simulation-based inference. To get started, load the packages tidyverse and infer.

library(tidyverse)
library(infer)

Below is a basic custom theme. Feel free to try it out when you use ggplot(). Simply add it as a layer to your plot: rather than using theme_bw(), you can use theme_custom(). This custom theme increases the point size of the plot title, axis titles, and axis text.

theme_custom <- function() {
  theme_bw() +
  theme(axis.title = element_text(size = 16), 
        title = element_text(size = 20),
        axis.text.x = element_text(size = 12),
        axis.text.y = element_text(size = 12),
        plot.caption = element_text(size = 10))
}
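
For example, here is the custom theme applied to a quick scatterplot of the built-in mtcars data:

# example: custom theme as the final layer of a basic scatterplot
ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point() +
  labs(x = "Weight (1,000 lbs)", y = "Miles per gallon") +
  theme_custom()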

Variability in estimates

Data

We will work with data from NASA on 23 shuttle launches. The data set is available on the UCI Machine Learning Repository website. Below we read in the data and convert it to a tibble.

# save url
url <- paste("https://archive.ics.uci.edu/ml/",
             "machine-learning-databases/space-shuttle/",
             "o-ring-erosion-only.data", sep = "")

# read data and set as tibble
nasa <- as_tibble(read.table(url, header = FALSE))

Variables in nasa:

  1. Number of O-rings at risk on a given flight
  2. Number experiencing thermal distress
  3. Launch temperature (degrees F)
  4. Leak-check pressure (psi)
  5. Temporal order of flight

Let’s rename the variables in nasa with the dplyr function rename().

nasa <- nasa %>% 
          rename(total_rings = V1,
                 damage_count = V2,
                 temperature = V3,
                 leak_check_pressure = V4,
                 flight_number = V5)

glimpse(nasa)
Observations: 23
Variables: 5
$ total_rings         <int> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
$ damage_count        <int> 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 2, ...
$ temperature         <int> 66, 70, 69, 68, 67, 72, 73, 70, 57, 63, 70...
$ leak_check_pressure <int> 50, 50, 50, 50, 50, 50, 100, 100, 200, 200...
$ flight_number       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...

Sampling variability in \(\bar{x}\)

Take the 23 launch temperature values as our population. Below, we use the function sample() to draw a sample of size 2 from the launch temperatures, sampling with replacement, and compute the sample mean.

temp2 <- sample(x = nasa$temperature, size = 2, replace = TRUE)
mean(temp2)
[1] 70

Let’s turn the above code into a function with no arguments. The function should return the sample mean.

sample_temp2 <- function() {
  x <- sample(x = nasa$temperature, size = 2, replace = TRUE)
  return(mean(x))
}

Test sample_temp2():

sample_temp2()
[1] 68

To see how these sample means vary from sample to sample, we generate 10,000 samples of size 2 and compute the mean of each. To visualize this variability, we plot a histogram of the 10,000 \(\bar{x}\) values.

# generate data
mean_temps <- replicate(n = 10000, expr = sample_temp2())
mean_temps <- tibble(temp = mean_temps)

# histogram of sample means
mean_temps %>% 
  ggplot(mapping = aes(x = temp)) +
  geom_histogram(binwidth = 2, fill = "#FC3D21", color = "#0B3D91") +
  labs(x = "Sample mean temperature", y = "Frequency",
       title = expression(paste("Variability in ", bar(x))),
       subtitle = "among 10,000 samples",
       caption = "Each sample is of size 2") +
  theme_custom()

Modify the above code chunks so that the sampling with replacement draws samples of size 23 instead of size 2. Don’t overwrite the above chunks.
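
As a sketch of one approach (the names sample_temp23 and mean_temps23 are our own), write a size-23 analogue of sample_temp2():

# sample 23 launch temperatures with replacement and return the mean
sample_temp23 <- function() {
  x <- sample(x = nasa$temperature, size = 23, replace = TRUE)
  mean(x)
}

# 10,000 sample means, each from a sample of size 23
mean_temps23 <- tibble(temp = replicate(n = 10000, expr = sample_temp23()))

The histogram code above then works unchanged on mean_temps23.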

Sampling variability in \(\hat{p}\)

Investigate the variability in the point estimate \(\hat{p}\). Suppose the proportion of shuttle launches that result in some O-ring damage is 0.25. Create a histogram to visualize the variability of \(\hat{p}\) for a sample size of \(n = 23\).
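
A minimal sketch, assuming each launch independently results in some O-ring damage with probability 0.25 (the helper sample_prop23 and the outcome labels are our own):

# draw 23 launches and compute the proportion with damage
sample_prop23 <- function() {
  x <- sample(x = c("damage", "no damage"), size = 23,
              replace = TRUE, prob = c(0.25, 0.75))
  mean(x == "damage")
}

# 10,000 sample proportions
props <- tibble(p_hat = replicate(n = 10000, expr = sample_prop23()))

# histogram of the sample proportions; bins align with multiples of 1/23
props %>% 
  ggplot(mapping = aes(x = p_hat)) +
  geom_histogram(binwidth = 1 / 23) +
  labs(x = "Sample proportion", y = "Frequency",
       title = expression(paste("Variability in ", hat(p)))) +
  theme_custom()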

Simulation-based inference

Data come from Gallup’s most recent survey of the country, conducted Sept. 27-Nov. 28, 2018, a few months before Juan Guaido was sworn in as interim president of Venezuela on Jan. 23, 2019. Guaido is the head of the country’s opposition and the president of the Venezuelan National Assembly. His assumption of the presidency was a direct challenge to Nicolas Maduro, who has presided over the country since former Venezuelan President Hugo Chavez died in 2013.

From the survey of 1,000 adults, only 53% reported having enough money for adequate shelter. Does this provide convincing evidence that a majority of Venezuelan adults can afford an adequate shelter at the 5% significance level? What about at the 1% significance level?

Hypotheses

State the null and alternative hypotheses given the problem above.
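
As a guide, a claim that a majority has some trait translates into hypotheses of the form \(H_0: p = 0.50\) versus \(H_A: p > 0.50\), where \(p\) is the population proportion with that trait.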

Below we create a data set to use with the package infer. We will create a data frame shelter that has “yes” or “no” outcomes for the question: Do you have enough money for adequate shelter?

# create data frame shelter
shelter <- tibble(value = rep(x = c("no", "yes"), times = c(470, 530)))

# verify 53% are yes as per the survey results from above
shelter %>% 
  group_by(value) %>% 
  summarise(proportion = n() / nrow(shelter))

Simulated null distribution

Simulate the null distribution using a sequence of functions from package infer. Follow the template in the notes and take a look at the help for each function. Plot a histogram of the null distribution and place a vertical line at the value of the observed sample proportion of 0.53.
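
A minimal sketch of the infer pipeline under \(H_0: p = 0.50\) (the seed, rep count, and binwidth are our choices; older versions of infer use type = "simulate" where newer ones use type = "draw"):

set.seed(1986)

# simulated null distribution of the sample proportion
null_dist <- shelter %>% 
  specify(response = value, success = "yes") %>% 
  hypothesize(null = "point", p = 0.5) %>% 
  generate(reps = 10000, type = "draw") %>% 
  calculate(stat = "prop")

# histogram of the null distribution, with the observed proportion marked
null_dist %>% 
  ggplot(mapping = aes(x = stat)) +
  geom_histogram(binwidth = 0.005) +
  geom_vline(xintercept = 0.53, color = "red") +
  labs(x = "Sample proportion", y = "Frequency",
       title = "Simulated null distribution") +
  theme_custom()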

Compute the p-value

Use the simulated null distribution to compute the p-value. Recall that the p-value is the probability of observing data at least as favorable to the alternative hypothesis as our current data set, given that the null hypothesis is true.
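
For example, with null_dist from the sketch above, infer’s get_p_value() computes the share of simulated proportions at least as large as the observed 0.53:

# one-sided p-value: probability of p-hat >= 0.53, given H0 is true
null_dist %>% 
  get_p_value(obs_stat = 0.53, direction = "greater")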

Conclusion

State your conclusion to “Does this provide convincing evidence that a majority of Venezuelan adults can afford an adequate shelter at the 5% significance level? What about at the 1% significance level?” Recall that we reject the null hypothesis when the p-value is less than the significance level. Be sure to put your conclusion in plain English and in the context of the problem.