library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.2
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.2 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.1.2
## Warning: package 'stringr' was built under R version 4.1.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(openintro)
## Warning: package 'openintro' was built under R version 4.1.2
## Loading required package: airports
## Warning: package 'airports' was built under R version 4.1.2
## Loading required package: cherryblossom
## Warning: package 'cherryblossom' was built under R version 4.1.2
## Loading required package: usdata
## Warning: package 'usdata' was built under R version 4.1.2
library(infer)
## Warning: package 'infer' was built under R version 4.1.2
us_adults <- tibble(
  climate_change_affects = c(rep("Yes", 62000), rep("No", 38000))
)
us_adults
## # A tibble: 100,000 x 1
## climate_change_affects
## <chr>
## 1 Yes
## 2 Yes
## 3 Yes
## 4 Yes
## 5 Yes
## 6 Yes
## 7 Yes
## 8 Yes
## 9 Yes
## 10 Yes
## # ... with 99,990 more rows
ggplot(us_adults, aes(x = climate_change_affects)) +
  geom_bar(fill = "blue") +
  labs(
    x = "", y = "",
    title = "Do you think climate change is affecting your local community?"
  ) +
  coord_flip()
us_adults %>%
  count(climate_change_affects) %>%
  mutate(p = n / sum(n))
## # A tibble: 2 x 3
## climate_change_affects n p
## <chr> <int> <dbl>
## 1 No 38000 0.38
## 2 Yes 62000 0.62
n <- 60
samp <- us_adults %>%
  sample_n(size = n)
samp
## # A tibble: 60 x 1
## climate_change_affects
## <chr>
## 1 Yes
## 2 Yes
## 3 No
## 4 Yes
## 5 No
## 6 Yes
## 7 Yes
## 8 No
## 9 No
## 10 Yes
## # ... with 50 more rows
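# 62% is the population proportion computed above, not the sample proportion. The percent of adults in the sample
# who think climate change affects their local community has to be computed from samp, and it changes each time a
# new sample is drawn because the code shown sets no seed. The bootstrap intervals reported below are centered near
# 0.71, which suggests a sample proportion of roughly 71% to 72% in this run.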
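The percentage asked about here is a property of the sample, so it has to be computed from the sampled rows rather than read off the population table. A minimal sketch in base R, assuming a hypothetical seed and the hypothetical names `population_demo`, `samp_demo` and `p_hat_demo` (none of them appear in the report):

```r
# Hypothetical seed, used only so this sketch is reproducible.
set.seed(2022)

# Same population as us_adults: 62,000 "Yes" and 38,000 "No".
population_demo <- c(rep("Yes", 62000), rep("No", 38000))

# Simple random sample of 60 adults, drawn without replacement.
samp_demo <- sample(population_demo, size = 60)

# Sample proportion of "Yes" answers (p_hat).
p_hat_demo <- mean(samp_demo == "Yes")

print(table(samp_demo))
print(p_hat_demo)
```

With the tidyverse loaded, the same number should come from `samp %>% count(climate_change_affects) %>% mutate(p_hat = n / sum(n))`, mirroring the population calculation earlier in the report.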
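Would you expect another sample's proportion to be identical to yours? Would you expect it to be similar? Why or why not?
# No, we would not expect the sample proportions to be identical, because each random sample of 60 contains different
# adults. They should be similar, though, since every sample is drawn from the same population.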
# With just one sample, the best estimate of the proportion of US adults who think climate change affects their
# local community would be the sample proportion, usually denoted as p^ (here we are calling it p_hat).
# That serves as a good point estimate, but it would be useful to also communicate how uncertain you are of
# that estimate. This uncertainty can be quantified using a confidence interval.
# One way of calculating a confidence interval for a population proportion is based on the Central Limit Theorem.
# Another way is using simulation, or to be more specific, using bootstrapping.
samp %>%
  specify(response = climate_change_affects, success = "Yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.95)
## # A tibble: 1 x 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 0.6 0.817
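For comparison with the bootstrap interval, the Central Limit Theorem approach mentioned earlier can be sketched in base R. The report does not print the sample counts, so the 43 "Yes" answers out of 60 used here are an assumed, illustrative value, and the names `x_yes_clt`, `n_clt`, `p_hat_clt` and `ci_clt` are hypothetical:

```r
# Assumed, illustrative counts: 43 "Yes" answers in a sample of 60.
x_yes_clt <- 43
n_clt     <- 60
p_hat_clt <- x_yes_clt / n_clt

# Success-failure check: the normal approximation is usually considered
# reasonable when there are at least 10 successes and 10 failures.
stopifnot(x_yes_clt >= 10, n_clt - x_yes_clt >= 10)

# Standard error of the sample proportion and the 95% critical value.
se_clt <- sqrt(p_hat_clt * (1 - p_hat_clt) / n_clt)
z_star <- qnorm(0.975)

# CLT-based interval: p_hat +/- z* x SE.
ci_clt <- c(lower = p_hat_clt - z_star * se_clt,
            upper = p_hat_clt + z_star * se_clt)
print(round(ci_clt, 3))
```

Under that assumed count the two methods give intervals of similar width, which is what one would expect when the success-failure condition holds.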
# Does the proportion of adults in the sample who think climate change affects their local community represent the thinking of the whole population? A single sample proportion rarely equals the population proportion exactly, and this is where the confidence interval comes into play.
# The 95% bootstrap confidence interval above runs from 0.6 to 0.817, so we are 95% confident that the proportion of all US adults who think climate change affects their local community lies between 60% and 81.7%. Here the interval does contain the true value of 0.62.
# The interval from 60% to 81.7% has 70.85% at its exact center. Adding 10.85 percentage points to 70.85% gives the upper limit of 81.7%, and subtracting 10.85 percentage points from 70.85% gives the lower limit of 60%.
# This +/-10.85 percentage point half-width of the 95% confidence interval is called the 'margin of error'.
# Adding and subtracting the margin of error from the point estimate gives the upper and lower limits of the confidence interval. "95% confident" describes the method: if we repeatedly took samples and built an interval from each, about 95% of those intervals would contain the actual population proportion. It is not a 95% probability that the population proportion falls inside this particular interval, because the population proportion is a fixed value.
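The center and margin of error of an interval follow directly from its endpoints. A small sketch using the endpoints printed by `get_ci(level = 0.95)` above (0.6 and 0.817); the variable names are hypothetical:

```r
# Endpoints copied from the 95% bootstrap interval printed above.
lower_95 <- 0.6
upper_95 <- 0.817

# Midpoint of the interval and its half-width (the margin of error).
center_95 <- (lower_95 + upper_95) / 2
margin_95 <- (upper_95 - lower_95) / 2

print(c(center = center_95, margin = margin_95))
```

A percentile bootstrap interval is not forced to be symmetric around the sample proportion, so the midpoint computed this way may differ slightly from the sample proportion itself.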
# Given a sample size of 60, 1000 bootstrap samples for each interval, and 50 confidence intervals constructed (the default values for the above app), what proportion of your confidence intervals include the true population proportion? Is this proportion exactly equal to the confidence level? If not, explain why. Make sure to include your plot in your answer.
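The question above can also be explored without the app. The following is a rough base R sketch, not the app's own code: it draws 50 samples of size 60 from the same 62,000 / 38,000 population, builds a 95% percentile bootstrap interval from each with 1000 resamples, and counts how many intervals contain 0.62. The seed and all variable names are hypothetical.

```r
# Hypothetical seed, used only so this sketch is reproducible.
set.seed(42)

p_true      <- 0.62   # true population proportion of "Yes"
n_samp      <- 60     # sample size
n_intervals <- 50     # number of confidence intervals to construct
n_boot      <- 1000   # bootstrap samples per interval
conf_level  <- 0.95
tail_prob   <- (1 - conf_level) / 2

one_interval <- function() {
  # Number of "Yes" answers in a sample of 60 drawn without replacement
  # from 62,000 "Yes" and 38,000 "No" adults.
  x_yes <- rhyper(1, m = 62000, n = 38000, k = n_samp)
  # Resampling 60 answers with replacement from that sample gives a
  # Binomial(60, x_yes / 60) count of "Yes" in each bootstrap sample.
  boot_props <- rbinom(n_boot, size = n_samp, prob = x_yes / n_samp) / n_samp
  # Percentile interval: middle 95% of the bootstrap proportions.
  quantile(boot_props, probs = c(tail_prob, 1 - tail_prob), names = FALSE)
}

# One row per interval: column 1 is the lower limit, column 2 the upper limit.
intervals <- t(replicate(n_intervals, one_interval()))
covered   <- intervals[, 1] <= p_true & p_true <= intervals[, 2]
coverage  <- mean(covered)
print(coverage)
```

With only 50 intervals the observed coverage is always a multiple of 0.02, so it cannot equal 0.95 exactly, and it varies from run to run because both the samples and the bootstrap resamples are random.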
samp %>%
  specify(response = climate_change_affects, success = "Yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.50)
## # A tibble: 1 x 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 0.683 0.75
# This is interpreted as saying that we are 50% confident that the true population proportion falls
# between 68.3% and 75%. At the 50% confidence level the interval is much narrower than at 95%.
# 71.65% is the center of the interval. Adding and subtracting a margin of error of 3.35 percentage points
# gives the upper and lower limits of the confidence interval.
# The interval gets narrower because the confidence level is lower. Note that this narrower interval no longer
# contains the true value of 0.62.
samp %>%
  specify(response = climate_change_affects, success = "Yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.75)
## # A tibble: 1 x 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 0.65 0.783
# At the 75% confidence level the interval runs from 0.65 to 0.783. It is narrower than the 95% interval
# (0.6 to 0.817) and wider than the 50% interval (0.683 to 0.75).
# Any level below 95% gives a narrower interval, but a narrower interval is less likely to contain the true
# population proportion. Here the true value of 0.62 falls below the lower limit of 0.65.
samp %>%
  specify(response = climate_change_affects, success = "Yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.35)
## # A tibble: 1 x 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 0.683 0.75
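# At the 35% confidence level the reported interval is 0.683 to 0.75, the same as the 50% interval in this run.
# A 35% interval would normally be narrower; the two most likely match here because each call draws fresh
# bootstrap samples and the bootstrap proportions from a sample of 60 can only be multiples of 1/60.
# At such a low confidence level the interval is narrow but much less likely to capture the true population
# proportion, and here it misses the true value of 0.62.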
# The higher the confidence level, the wider the intervals become. As the sample size gets smaller, the intervals get wider.
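The link between confidence level and width is easiest to see when every interval is read off the same bootstrap distribution. A base R sketch under stated assumptions: the 43 "Yes" answers out of 60 are an illustrative count that the report does not print, and the seed and variable names are hypothetical.

```r
# Hypothetical seed, used only so this sketch is reproducible.
set.seed(7)

# Assumed, illustrative sample: 43 "Yes" answers out of 60.
x_yes_w <- 43
n_w     <- 60

# 1000 bootstrap proportions: resampling 60 answers with replacement
# gives a Binomial(60, 43 / 60) count of "Yes" each time.
boot_props_w <- rbinom(1000, size = n_w, prob = x_yes_w / n_w) / n_w

# Width of the percentile interval at each confidence level,
# all computed from the same bootstrap distribution.
conf_levels <- c(0.35, 0.50, 0.75, 0.95)
widths <- sapply(conf_levels, function(level) {
  tail_prob <- (1 - level) / 2
  ci <- quantile(boot_props_w, probs = c(tail_prob, 1 - tail_prob), names = FALSE)
  ci[2] - ci[1]
})

print(data.frame(level = conf_levels, width = round(widths, 3)))
```

Because all four intervals share one set of resamples, the widths can never decrease as the level rises. Neighbouring levels can still tie, since the bootstrap proportions only take values that are multiples of 1/60.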
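# With the sample size fixed at 60, a larger number of bootstrap samples gives a smoother and more stable picture of the bootstrap distribution, so the interval limits vary less from run to run. It does not systematically make the interval wider or narrower.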
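The effect of the number of bootstrap samples can be checked by rebuilding the same interval limit many times and measuring how much it moves. A base R sketch, again assuming an illustrative count of 43 "Yes" answers out of 60 (not printed in the report) and hypothetical names and seed:

```r
# Hypothetical seed, used only so this sketch is reproducible.
set.seed(99)

# Assumed, illustrative sample: 43 "Yes" answers out of 60.
x_yes_s <- 43
n_s     <- 60

# Upper limit of a 95% percentile bootstrap interval built from `reps` resamples.
upper_limit <- function(reps) {
  boot_props <- rbinom(reps, size = n_s, prob = x_yes_s / n_s) / n_s
  quantile(boot_props, probs = 0.975, names = FALSE)
}

# Rebuild the upper limit 200 times with few and with many resamples,
# then compare how much it varies from run to run.
sd_few  <- sd(replicate(200, upper_limit(50)))
sd_many <- sd(replicate(200, upper_limit(5000)))

print(c(sd_with_50_reps = sd_few, sd_with_5000_reps = sd_many))
```

The spread of the limit should shrink as the number of resamples grows, while its typical location stays about the same.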