Packages

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(openintro)

## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata

library(infer)
library(shiny)

## 
## Attaching package: 'shiny'
## 
## The following object is masked from 'package:infer':
## 
##     observe

library(png)
library(jpeg)

global_monitor <- tibble(
  scientist_work = c(rep("Benefits", 80000), rep("Doesn't benefit", 20000))
)

us_adults <- tibble(
  climate_change_affects = c(rep("Yes", 62000), rep("No", 38000))
)

ggplot(us_adults, aes(x = climate_change_affects)) +
  geom_bar() +
  labs(
    x = "", y = "",
    title = "Do you think climate change is affecting your local community?"
  ) +
  coord_flip()

us_adults %>%
  count(climate_change_affects) %>%
  mutate(p = n / sum(n))

n <- 60
samp <- us_adults %>%
  sample_n(size = n)

Exercise 1

What percent of the adults in your sample think climate change affects their local community? Hint: Just like we did with the population, we can calculate the proportion of those in this sample who think climate change affects their local community.

62% of the adults in my sample think climate change affects their local community.
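The sample proportion can be computed the same way as the population proportion. Below is a minimal base R sketch, written so that it runs without loading any packages. The seed and the names `pop`, `samp_demo`, and `p_hat` are illustrative assumptions and not part of the lab, so the value it prints will generally differ from the one reported above.

```r
# Rebuild the population from the lab: 62,000 "Yes" and 38,000 "No".
pop <- c(rep("Yes", 62000), rep("No", 38000))

set.seed(606)  # hypothetical seed, used only to make this sketch reproducible
samp_demo <- sample(pop, size = 60)  # simple random sample of 60 adults

# Share of the sample answering "Yes"
p_hat <- mean(samp_demo == "Yes")
p_hat
```

Multiplying `p_hat` by 100 gives the percentage asked for in the exercise.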

Exercise 2

Would you expect another student’s sample proportion to be identical to yours? Would you expect it to be similar? Why or why not?

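No, I would not expect it to be identical. Each student draws a different random sample of 60 adults, so the sample proportions will vary. I would expect them to be similar, though, because every sample comes from the same population and the proportions should cluster around the population value of 0.62.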

Exercise 3

In the interpretation above, we used the phrase “95% confident”. What does “95% confidence” mean?

It means that if we were to repeat this sampling and estimation process many times, we would expect the resulting confidence intervals to contain the true population proportion in approximately 95% of those repetitions. In other words, we acknowledge some uncertainty in our estimate, but we are reasonably confident that the true value lies within the reported bounds.
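The repeated-sampling idea can be checked with a small simulation. The sketch below is a base R stand-in for the infer pipeline, not the lab's own code: it draws many samples of 60, builds a percentile bootstrap interval from each, and records how often the interval contains the population proportion. The seed, the choice of 200 intervals, and all variable names are assumptions made for illustration.

```r
set.seed(606)  # hypothetical seed for reproducibility
pop <- c(rep("Yes", 62000), rep("No", 38000))
p_true <- mean(pop == "Yes")  # population proportion, 0.62
n_intervals <- 200            # assumed number of repeated samples

covers <- replicate(n_intervals, {
  x <- sample(pop, size = 60) == "Yes"  # one sample of 60, TRUE = "Yes"
  # 1000 bootstrap resamples of that sample, each reduced to a proportion
  boot_props <- replicate(1000, mean(sample(x, replace = TRUE)))
  ci <- quantile(boot_props, c(0.025, 0.975))  # 95% percentile interval
  ci[[1]] <= p_true && p_true <= ci[[2]]       # does it capture 0.62?
})

mean(covers)  # share of intervals that captured the true proportion
```

The printed share should land near 0.95 but will rarely equal it exactly, since it is itself an estimate based on a finite number of intervals.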

samp %>%
  specify(response = climate_change_affects, success = "Yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.95)

sample_props50 <- global_monitor %>%
  rep_sample_n(size = 50, reps = 15000, replace = TRUE) %>%
  count(scientist_work) %>%
  mutate(p_hat = n / sum(n)) %>%
  filter(scientist_work == "Doesn't benefit")

ggplot(data = sample_props50, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.02) +
  labs(
    x = "p_hat (Doesn't benefit)",
    title = "Sampling distribution of p_hat",
    subtitle = "Sample size = 50, Number of samples = 15000"
  )

Exercise 4

Does your confidence interval capture the true population proportion of US adults who think climate change affects their local community? If you are working on this lab in a classroom, does your neighbor’s interval capture this value?

No, it doesn’t, and neither does my classmate’s.

Exercise 5

Each student should have gotten a slightly different confidence interval. What proportion of those intervals would you expect to capture the true population mean? Why?

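I would expect about 95% of the intervals to capture the true population proportion, because the confidence level describes the long-run share of intervals, built this way from repeated samples, that contain the true value.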

Exercise 6

Given a sample size of 60, 1000 bootstrap samples for each interval, and 50 confidence intervals constructed (the default values for the above app), what proportion of your confidence intervals include the true population proportion? Is this proportion exactly equal to the confidence level? If not, explain why. Make sure to include your plot in your answer.

samp %>%
  specify(response = climate_change_affects, success = "Yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.95)

samp %>%
  specify(response = climate_change_affects, success = "Yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.5)

downloads_path <- file.path(Sys.getenv("USERPROFILE"), "Downloads")
image_path <- file.path(downloads_path, "Q6.png")
img <- readPNG(image_path)

# Display the image
plot(2:1, type = "n", xlab = "", ylab = "")
rasterImage(img, 1, 1, 2, 2)

Exercise 7

Choose a different confidence level than 95%. Would you expect a confidence interval at this level to be wider or narrower than the confidence interval you calculated at the 95% confidence level? Explain your reasoning.

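I chose an 80% confidence level. I expect this interval to be narrower than the 95% interval: it only needs to span the middle 80% of the bootstrap distribution rather than the middle 95%, so its bounds sit closer to the sample proportion.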

samp %>%
  specify(response = climate_change_affects, success = "Yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.80)
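To see how the confidence level affects width, one option is to compute percentile intervals at several levels from a single bootstrap distribution. This is a base R sketch under stated assumptions rather than the infer code used in the lab: the seed, the set of levels, and the names `boot_props`, `conf_levels`, and `widths` are all illustrative.

```r
set.seed(606)  # hypothetical seed for reproducibility
pop <- c(rep("Yes", 62000), rep("No", 38000))
x <- sample(pop, size = 60) == "Yes"  # one sample of 60, TRUE = "Yes"

# One bootstrap distribution of the sample proportion (1000 resamples)
boot_props <- replicate(1000, mean(sample(x, replace = TRUE)))

conf_levels <- c(0.50, 0.80, 0.95, 0.99)
widths <- sapply(conf_levels, function(level) {
  alpha <- 1 - level
  ci <- quantile(boot_props, c(alpha / 2, 1 - alpha / 2))  # percentile bounds
  unname(ci[2] - ci[1])                                    # interval width
})

data.frame(level = conf_levels, width = widths)
```

Because each interval is cut from the same distribution, a higher level can only push the bounds outward, so the widths never decrease as the level rises.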

Exercise 8

Using code from the infer package and data from the one sample you have (samp), find a confidence interval for the proportion of US Adults who think climate change is affecting their local community with a confidence level of your choosing (other than 95%) and interpret it.

samp %>%
  specify(response = climate_change_affects, success = "Yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.99)

Exercise 9

Using the app, calculate 50 confidence intervals at the confidence level you chose in the previous question, and plot all intervals on one plot, and calculate the proportion of intervals that include the true population proportion. How does this percentage compare to the confidence level selected for the intervals?

Using a 70% confidence level instead of 95%, I would expect a smaller share of the intervals to capture the true population proportion, roughly 70% of them, because each 70% interval covers a narrower range of values.

downloads_path <- file.path(Sys.getenv("USERPROFILE"), "Downloads")
image_path2 <- file.path(downloads_path, "Q9.png")
img2 <- readPNG(image_path2)

# Display the image
plot(2:1, type = "n", xlab = "", ylab = "")
rasterImage(img2, 1, 1, 2, 2)

Exercise 10

Lastly, try one more (different) confidence level. First, state how you expect the width of this interval to compare to previous ones you calculated. Then, calculate the bounds of the interval using the infer package and data from samp and interpret it. Finally, use the app to generate many intervals and calculate the proportion of intervals that capture the true population proportion.

I expect this interval to be wider than the previous 70% interval, since I am now using a 90% confidence level: capturing the true proportion more often requires a wider range of values.

samp %>%
  specify(response = climate_change_affects, success = "Yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.90)

downloads_path <- file.path(Sys.getenv("USERPROFILE"), "Downloads")
image_path3 <- file.path(downloads_path, "Q10.png")
img3 <- readPNG(image_path3)

# Display the image
plot(2:1, type = "n", xlab = "", ylab = "")
rasterImage(img3, 1, 1, 2, 2)

Exercise 11

Using the app, experiment with different sample sizes and comment on how the widths of intervals change as sample size changes (increases and decreases).

Larger samples give narrower confidence intervals, much as more puzzle pieces give a clearer picture, because the standard error of the sample proportion shrinks as the sample size grows. Smaller samples give wider intervals.
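The relationship between sample size and width can be made concrete with the standard error of a proportion, sqrt(p(1 - p) / n). The sketch below uses the normal approximation as a stand-in for the bootstrap intervals shown in the app, which is an assumption of this example; it takes p = 0.62 from the population and the sample sizes 60, 100, 500, and 1000 used in this lab.

```r
p <- 0.62                   # population proportion from the lab
n <- c(60, 100, 500, 1000)  # sample sizes tried in the lab

se <- sqrt(p * (1 - p) / n)             # standard error of the sample proportion
approx_width <- 2 * qnorm(0.975) * se   # approximate width of a 95% interval

data.frame(n = n, se = round(se, 4), approx_width_95 = round(approx_width, 4))
```

Since the standard error scales with 1 / sqrt(n), halving the width requires roughly four times as many observations.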

downloads_path <- file.path(Sys.getenv("USERPROFILE"), "Downloads")
image_path4 <- file.path(downloads_path, "11(100).png")


img4 <- readPNG(image_path4)
plot(2:1, type = "n", xlab = "Sample Size of 100", ylab = "")
rasterImage(img4, 1, 1, 2, 2 )

downloads_path <- file.path(Sys.getenv("USERPROFILE"), "Downloads")
image_path5 <- file.path(downloads_path, "11(500).png")


img5 <- readPNG(image_path5)
plot(2:1, type = "n", xlab = "Sample Size of 500", ylab = "")
rasterImage(img5, 1, 1, 2, 2 )

downloads_path <- file.path(Sys.getenv("USERPROFILE"), "Downloads")
image_path6 <- file.path(downloads_path, "11(1000).png")


img6 <- readPNG(image_path6)
plot(2:1, type = "n", xlab = "Sample Size of 1000", ylab = "")
rasterImage(img6, 1, 1, 2, 2 )

Exercise 12

Finally, given a sample size (say, 60), how does the width of the interval change as you increase the number of bootstrap samples. Hint: Does changing the number of bootstrap samples affect the standard error?

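Increasing the number of bootstrap samples does not systematically narrow the confidence interval. The standard error is determined by the sample size (here 60), not by the number of resamples; more bootstrap samples only make the estimated standard error and the interval bounds more stable from run to run.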
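One way to check the effect of the number of bootstrap samples is to hold the sample fixed and vary only the number of resamples. The following is a base R sketch, not the app's implementation: the seed, the helper `ci_width()`, and the variable names are hypothetical, while the resample counts match the 1000, 5000, and 10000 used in the screenshots.

```r
set.seed(606)  # hypothetical seed for reproducibility
pop <- c(rep("Yes", 62000), rep("No", 38000))
x <- sample(pop, size = 60) == "Yes"  # one fixed sample of 60, TRUE = "Yes"

# Width of a 95% percentile bootstrap interval for a given number of resamples
ci_width <- function(reps) {
  boot_props <- replicate(reps, mean(sample(x, replace = TRUE)))
  ci <- quantile(boot_props, c(0.025, 0.975))
  unname(ci[2] - ci[1])
}

reps_grid <- c(1000, 5000, 10000)
widths <- sapply(reps_grid, ci_width)

data.frame(reps = reps_grid, width = round(widths, 3))
```

The widths should stay close to one another across the three settings, since every resample still has 60 observations and the standard error depends on that size rather than on how many resamples are drawn.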

downloads_path <- file.path(Sys.getenv("USERPROFILE"), "Downloads")
image_path7 <- file.path(downloads_path, "Q12(1000)boot.png")


img7 <- readPNG(image_path7)
plot(2:1, type = "n", xlab = "Bootstrap of 1000", ylab = "")
rasterImage(img7, 1, 1, 2, 2)

downloads_path <- file.path(Sys.getenv("USERPROFILE"), "Downloads")
image_path8 <- file.path(downloads_path, "Q12(5000)boot.png")


img8 <- readPNG(image_path8)
plot(2:1, type = "n", xlab = "Bootstrap of 5000", ylab = "")
rasterImage(img8, 1, 1, 2, 2)

downloads_path <- file.path(Sys.getenv("USERPROFILE"), "Downloads")
image_path9 <- file.path(downloads_path, "Q12(10000)boot.png")


img9 <- readPNG(image_path9)
plot(2:1, type = "n", xlab = "Bootstrap of 10000", ylab = "")
rasterImage(img9, 1, 1, 2, 2)

Data 606 lab 5 b

Mikhail Broomes

2023-10-12