library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.0 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(infer)
us_adults <- tibble(
  climate_change_affects = c(rep("Yes", 62000), rep("No", 38000))
)
n <- 60
samp <- us_adults %>%
  sample_n(size = n)
samp %>%
  count(climate_change_affects) %>%
  mutate(p = n / sum(n))
## # A tibble: 2 × 3
##   climate_change_affects     n     p
##   <chr>                  <int> <dbl>
## 1 No                        29 0.483
## 2 Yes                       31 0.517
Because the sample was drawn at random from the population, I expected the sample proportion to be close to the population proportion. In this sample, 51.7% answered “Yes” compared with 62% in the population, so the sample is roughly similar but contains somewhat more “No” responses than expected.
I would expect another student’s sample proportion to be similar to mine but not identical. Because each sample is drawn at random from the population, some variation between samples is bound to occur.
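To illustrate how much sample proportions can differ from one student to the next, here is a minimal sketch that draws several samples of size 60 from the same population. It uses base R so it runs without extra packages; the seed, the number of samples (5), and the names `population` and `p_hats` are assumptions made for illustration only.

```r
set.seed(1)  # arbitrary seed, chosen only so the sketch is reproducible

# Same population as above: 62,000 "Yes" and 38,000 "No"
population <- c(rep("Yes", 62000), rep("No", 38000))

# Draw 5 independent samples of size 60 (hypothetical "students")
# and record the proportion of "Yes" in each one
p_hats <- replicate(5, mean(sample(population, size = 60) == "Yes"))
p_hats
```

Each value is one student’s sample proportion. They should scatter around 0.62 without matching it exactly.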
samp %>%
  specify(response = climate_change_affects, success = "Yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.95)
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1    0.400    0.633
A 95% confidence level means that the method used to build the interval captures the true population proportion in about 95% of repeated samples. Informally, we are 95% confident that the true population proportion lies within this particular range.
The confidence interval (0.400 to 0.633) captures the true population proportion of 0.62. About 95% of intervals built this way capture it, so my neighbor’s interval most likely does too, although that is not guaranteed.
I’d expect about 95% of those intervals to capture the true population proportion. Because each sample is drawn at random, a few intervals will miss it by chance.
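The claim that roughly 95% of intervals capture the true proportion can be checked with a small simulation. The sketch below is a base R approximation of the `infer` pipeline above, using a percentile bootstrap. The helper `boot_ci`, the 200 simulated samples, and the 500 bootstrap replicates are assumptions chosen to keep it short and fast, not part of the lab.

```r
set.seed(42)  # arbitrary seed for reproducibility

population <- c(rep("Yes", 62000), rep("No", 38000))
p_true <- mean(population == "Yes")  # 0.62 by construction
n      <- 60    # sample size used in this lab
n_samp <- 200   # hypothetical number of students, one sample each

# Percentile bootstrap interval for the proportion of TRUE values in x
# (hypothetical helper, written for this sketch)
boot_ci <- function(x, level = 0.95, reps = 500) {
  boot_props <- replicate(reps, mean(sample(x, replace = TRUE)))
  alpha <- 1 - level
  quantile(boot_props, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
}

# For each simulated sample, check whether its interval contains p_true
captured <- replicate(n_samp, {
  x  <- sample(population, size = n) == "Yes"
  ci <- boot_ci(x)
  ci[1] <= p_true && p_true <= ci[2]
})

mean(captured)  # share of intervals that captured the true proportion
```

The resulting share should land near 0.95, though bootstrap intervals from samples this small can fall a little short of the nominal level.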
n <- 60
samp <- us_adults %>%
  sample_n(size = n)
If I chose a confidence level higher than 95%, I would expect a wider confidence interval than the one achieved at 95%, and a narrower one for a lower level. My reasoning is that a confidence level of 100%, for example, would require the interval to contain every possible value of the population parameter.
samp %>%
  specify(response = climate_change_affects, success = "Yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = .1)
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1     0.55    0.567
Based on this calculation, I am 10% confident that the actual population proportion is between 0.55 and 0.567.
Out of the 50 confidence intervals collected, only 4 included the true population proportion. At a 10% confidence level, I expected around 5 of the 50 intervals to capture it, so this result makes sense.
I will set the confidence level to 50%. I’d expect the intervals to be wider than those at 10% but narrower than those at 95%. The app shows 21 intervals that capture the true population proportion.
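The expected count can be worked out directly. The sketch below assumes each of the 50 intervals captures the true proportion independently with probability equal to the confidence level, so the number of captures follows a binomial distribution. That model and the variable names are assumptions made for illustration.

```r
n_intervals <- 50    # intervals drawn in the app
conf_level  <- 0.10  # confidence level used for each interval
observed    <- 4     # intervals that captured the true proportion

# Expected number of capturing intervals under the binomial model
expected_count <- n_intervals * conf_level  # 50 * 0.10 = 5

# Typical spread of that count (binomial standard deviation)
sd_count <- sqrt(n_intervals * conf_level * (1 - conf_level))  # sqrt(4.5)

# Probability of seeing `observed` or fewer capturing intervals
prob_obs_or_fewer <- pbinom(observed, size = n_intervals, prob = conf_level)

c(expected = expected_count, sd = sd_count, p_le_obs = prob_obs_or_fewer)
```

Since 4 is within one standard deviation of 5, the observed count is well within ordinary chance variation under this model.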
samp %>%
  specify(response = climate_change_affects, success = "Yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = .5)
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1    0.517    0.617
The intervals widen as the confidence level increases and narrow as it decreases.
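The relationship between confidence level and width can be seen by computing intervals at several levels from a single bootstrap distribution. This is a minimal base R sketch: it reuses the counts from the first sample shown earlier (31 “Yes”, 29 “No”) as illustrative data, and the seed and variable names are assumptions.

```r
set.seed(7)  # arbitrary seed for reproducibility

# Illustrative sample: 31 "Yes" (TRUE) and 29 "No" (FALSE)
samp_yes <- c(rep(TRUE, 31), rep(FALSE, 29))

# One bootstrap distribution of the sample proportion
boot_props <- replicate(1000, mean(sample(samp_yes, replace = TRUE)))

# Percentile interval width at each confidence level
conf_levels <- c(0.10, 0.50, 0.95)
widths <- sapply(conf_levels, function(level) {
  alpha <- 1 - level
  ci <- quantile(boot_props, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
  ci[2] - ci[1]
})

data.frame(level = conf_levels, width = widths)
```

Because the percentile interval at a higher level always contains the one at a lower level, the widths can only grow as the level rises.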
When I varied the number of bootstrap samples and ran multiple trials, I noticed larger deviations from the expected number of intervals containing the true proportion. For example, using only 50 bootstrap samples at a 50% confidence level, one trial gave 33/50 intervals containing the proportion and another gave 14/50.
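One plausible contributor to this instability is that interval endpoints estimated from few bootstrap replicates are themselves noisy. The sketch below, in base R, recomputes the lower endpoint of a 50% percentile interval many times from the same illustrative sample (31 “Yes”, 29 “No”, as in the first sample above), once with 50 replicates and once with 1000. The helper `lower_50` and the 200 repetitions are assumptions for illustration.

```r
set.seed(3)  # arbitrary seed for reproducibility

# Illustrative sample: 31 "Yes" (TRUE) and 29 "No" (FALSE)
samp_yes <- c(rep(TRUE, 31), rep(FALSE, 29))

# Lower endpoint of a 50% percentile bootstrap interval
# (hypothetical helper, written for this sketch)
lower_50 <- function(x, reps) {
  boot_props <- replicate(reps, mean(sample(x, replace = TRUE)))
  quantile(boot_props, probs = 0.25, names = FALSE)
}

# Repeat the whole bootstrap 200 times for each replicate count
lower_reps_50   <- replicate(200, lower_50(samp_yes, reps = 50))
lower_reps_1000 <- replicate(200, lower_50(samp_yes, reps = 1000))

# Spread of the estimated endpoint under each setting
c(sd_reps_50 = sd(lower_reps_50), sd_reps_1000 = sd(lower_reps_1000))
```

A larger standard deviation with 50 replicates means the interval shifts more from run to run, which could make the count of capturing intervals less stable across trials.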