
Load Packages:

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.1.2

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.2     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1

## Warning: package 'ggplot2' was built under R version 4.1.2

## Warning: package 'stringr' was built under R version 4.1.2

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(openintro)

## Warning: package 'openintro' was built under R version 4.1.2

## Loading required package: airports

## Warning: package 'airports' was built under R version 4.1.2

## Loading required package: cherryblossom

## Warning: package 'cherryblossom' was built under R version 4.1.2

## Loading required package: usdata

## Warning: package 'usdata' was built under R version 4.1.2

library(infer)

## Warning: package 'infer' was built under R version 4.1.2

In this lab, you will assume that 62% is the true population proportion of US adults who think climate change affects their local community, and learn how sample proportions can vary from sample to sample. We will assume a total population size of 100,000 (even though that’s smaller than the population size of all US adults). This means 62,000 (62%) of the adult population think climate change impacts their community, and the remaining 38,000 do not think so.

us_adults <- tibble(
  climate_change_affects = c(rep("Yes", 62000), rep("No", 38000))
)

us_adults

## # A tibble: 100,000 x 1
##    climate_change_affects
##    <chr>                 
##  1 Yes                   
##  2 Yes                   
##  3 Yes                   
##  4 Yes                   
##  5 Yes                   
##  6 Yes                   
##  7 Yes                   
##  8 Yes                   
##  9 Yes                   
## 10 Yes                   
## # ... with 99,990 more rows

We can quickly visualize the distribution of these responses using a bar plot.

ggplot(us_adults, aes(x = climate_change_affects)) +
  geom_bar(fill = 'blue') +
  labs(
    x = "", y = "",
    title = "Do you think climate change is affecting your local community?"
  ) +
  coord_flip()

We can also obtain summary statistics to confirm we constructed the data frame correctly.

us_adults %>%
  count(climate_change_affects) %>%
  mutate(p = n / sum(n))

## # A tibble: 2 x 3
##   climate_change_affects     n     p
##   <chr>                  <int> <dbl>
## 1 No                     38000  0.38
## 2 Yes                    62000  0.62

Next, take a simple random sample of size 60 from the population.

n <- 60
samp <- us_adults %>%
  sample_n(size = n)

samp

## # A tibble: 60 x 1
##    climate_change_affects
##    <chr>                 
##  1 Yes                   
##  2 Yes                   
##  3 No                    
##  4 Yes                   
##  5 No                    
##  6 Yes                   
##  7 Yes                   
##  8 No                    
##  9 No                    
## 10 Yes                   
## # ... with 50 more rows

Exercise 1:

What percent of the adults in your sample think climate change affects their local community?

# 62% of the adults in the sample think that climate change affects their local community.
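
One way to check an answer like this is to compute the proportion directly from a sample. The sketch below uses base R and rebuilds the population from the lab; the names `population`, `responses`, and `p_hat` and the seed are assumptions made for illustration, and the lab itself does not set a seed, so the value will not match `samp`.

```r
# Recreate the population from the lab: 62,000 "Yes" and 38,000 "No".
population <- c(rep("Yes", 62000), rep("No", 38000))

# Arbitrary seed so the example is reproducible (an assumption; the lab
# does not set one).
set.seed(1)

# Simple random sample of 60 responses, drawn without replacement.
responses <- sample(population, size = 60)

# Proportion of "Yes" in the sample: the point estimate p_hat.
p_hat <- mean(responses == "Yes")
p_hat
```

For the tibble used in this lab, the same check can be done by applying the `count()` and `mutate()` pipeline shown earlier to `samp` instead of `us_adults`.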

Exercise 2:

Would you expect another student’s sample proportion to be identical to yours? Would you expect it to be similar?

Why or why not?

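# No, we would not expect the sample proportions to be identical, because each student draws a different random sample. They should be similar, though, because every sample comes from the same population.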

# With just one sample, the best estimate of the proportion of US adults who think climate change affects their 
# local community would be the sample proportion, usually denoted as p^ (here we are calling it p_hat). 
# That serves as a good point estimate, but it would be useful to also communicate how uncertain you are of 
# that estimate. This uncertainty can be quantified using a confidence interval.

# One way of calculating a confidence interval for a population proportion is based on the Central Limit Theorem.
# Another way is using simulation, or to be more specific, using bootstrapping.
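
The Central Limit Theorem approach mentioned above can be sketched in a few lines of base R. This is the usual normal-approximation (Wald) interval, p_hat plus or minus z times sqrt(p_hat * (1 - p_hat) / n). The function name `wald_ci` and the counts passed to it are hypothetical and are not taken from `samp`.

```r
# Normal-approximation (Wald) interval for a proportion.
# x = number of successes, n = sample size, level = confidence level.
wald_ci <- function(x, n, level = 0.95) {
  p_hat <- x / n
  # Estimated standard error of the sample proportion.
  se <- sqrt(p_hat * (1 - p_hat) / n)
  # Critical value from the standard normal distribution.
  z <- qnorm(1 - (1 - level) / 2)
  c(lower = p_hat - z * se, upper = p_hat + z * se)
}

# Hypothetical counts for illustration: 40 "Yes" out of 60.
wald_ci(x = 40, n = 60)
```

The approximation is usually considered reasonable when the sample contains at least about 10 successes and 10 failures.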

Exercise 3:

In the interpretation above, we used the phrase “95% confident”. What does “95% confidence” mean?

This code finds the 95 percent confidence interval for the proportion of US adults who think climate change affects their local community.

samp %>%
  specify(response = climate_change_affects, success = "Yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.95)

## # A tibble: 1 x 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1      0.6    0.817

Even though we don’t know what the full population looks like, we’re 95% confident that the true proportion of US adults who think climate change affects their local community is between the two bounds reported as the result of this pipeline.
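
The bootstrap pipeline resamples the observed responses with replacement, computes the proportion of "Yes" in each resample, and takes the middle 95% of those proportions. A minimal base R sketch of that idea follows. The names `bootstrap_ci` and `responses` are hypothetical, the 40/20 split is made up for illustration, and the percentile method is assumed to match what `get_ci()` does by default.

```r
# Percentile bootstrap interval for a proportion, written in base R.
bootstrap_ci <- function(responses, success = "Yes", reps = 1000, level = 0.95) {
  n <- length(responses)
  # Resample the observed data with replacement and record each proportion.
  boot_props <- replicate(
    reps,
    mean(sample(responses, size = n, replace = TRUE) == success)
  )
  # Keep the middle `level` share of the bootstrap proportions.
  alpha <- 1 - level
  quantile(boot_props, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
}

# Hypothetical sample for illustration: 40 "Yes" and 20 "No".
set.seed(42)  # arbitrary seed, for reproducibility only
responses <- c(rep("Yes", 40), rep("No", 20))
bootstrap_ci(responses)
```

Because the resamples are random, the bounds change slightly from run to run unless a seed is set.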

Confidence levels:

Exercise 4:

Does your confidence interval capture the true population proportion of US adults who think climate change affects their local community? If you are working on this lab in a classroom, does your neighbor’s interval capture this value?

# Yes. The 95% confidence interval from this sample runs from 0.600 to 0.817, and it contains the true population proportion of 0.62.

# A neighbor's interval is built from a different random sample, so its bounds will differ, and it may or may not capture 0.62. If we repeatedly took samples and built an interval from each, we would expect about 95% of those intervals to capture the true proportion.

# The midpoint of the interval is 70.85%. Adding 10.85 percentage points to the midpoint gives the upper limit of the interval (81.7%), and subtracting 10.85 percentage points gives the lower limit (60.0%).

# This half-width of 10.85 percentage points is called the 'margin of error'.
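
The midpoint and margin of error of an interval follow from its two bounds. The short calculation below uses the bounds printed in the 95% output earlier in this document (0.6 and 0.817); the variable names are only illustrative. A percentile bootstrap interval need not be exactly symmetric around the sample proportion, so the midpoint is only an approximation to p_hat.

```r
# Bounds copied from the 95% bootstrap interval printed earlier.
lower <- 0.600
upper <- 0.817

# Centre of the interval.
midpoint <- (lower + upper) / 2

# Half the width of the interval: the margin of error.
margin_of_error <- (upper - lower) / 2

c(midpoint = midpoint, margin_of_error = margin_of_error)
```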

Exercise 5:

Each student should have gotten a slightly different confidence interval. What proportion of those intervals would you expect to capture the true population mean? Why?

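# About 95% of the intervals should capture the true population proportion. The 95% confidence level describes the method, not any single interval: if many samples are taken and an interval is built from each, roughly 95% of those intervals will contain the true value. Each interval is a point estimate plus or minus its margin of error, so the bounds differ from student to student.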
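
What "95% confident" means can be checked by simulation: draw many samples from the known population, build a 95% interval from each, and count how often the interval captures 0.62. The sketch below is one way to do this in base R. The number of intervals (1000 rather than the 50 used by the app), the seed, and all variable names are assumptions. To keep it fast, it draws the counts of "Yes" directly instead of resampling rows, which gives the same distributions.

```r
set.seed(2024)        # arbitrary seed, for reproducibility only
n_intervals <- 1000   # number of repeated samples (an assumption)
samp_size <- 60       # sample size, as in the lab
reps <- 1000          # bootstrap resamples per interval
true_p <- 0.62        # true population proportion

# Number of "Yes" answers in each simple random sample of 60 drawn without
# replacement from 62,000 "Yes" and 38,000 "No" (hypergeometric draws).
yes_counts <- rhyper(n_intervals, m = 62000, n = 38000, k = samp_size)

# For a sample with x "Yes" answers, a bootstrap resample contains
# Binomial(samp_size, x / samp_size) "Yes" answers, so rbinom() stands in
# for resampling the rows with replacement.
covers <- vapply(yes_counts, function(x) {
  boot_props <- rbinom(reps, size = samp_size, prob = x / samp_size) / samp_size
  ci <- quantile(boot_props, probs = c(0.025, 0.975), names = FALSE)
  ci[1] <= true_p && true_p <= ci[2]
}, logical(1))

# Share of intervals that capture the true proportion.
mean(covers)
```

The resulting share should land near 0.95, though it will rarely equal 0.95 exactly, and percentile intervals from samples of size 60 can fall slightly short of the nominal level.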

Exercise 6:

Given a sample size of 60, 1000 bootstrap samples for each interval, and 50 confidence intervals constructed (the default values for the above app), what proportion of your confidence intervals include the true population proportion? Is this proportion exactly equal to the confidence level? If not, explain why. Make sure to include your plot in your answer.

samp %>%
  specify(response = climate_change_affects, success = "Yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.50)

## # A tibble: 1 x 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1    0.683     0.75

# At the 50% confidence level, the interval runs from 68.3% to 75.0%: we are 50% confident that the true
# population proportion falls in this range. Lowering the confidence level narrows the interval.
# The midpoint of the interval is about 71.7%, and the margin of error is about 3.3 percentage points on either side.
# Note that this narrower interval does not contain the true proportion of 0.62.

Exercise 7:

Choose a different confidence level than 95%. Would you expect a confidence interval at this level to be wider or narrower than the confidence interval you calculated at the 95% confidence level? Explain your reasoning.

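# With a confidence level lower than 95%, the interval gets narrower: a smaller range of values is enough if the interval is allowed to miss the true proportion more often.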
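
The effect of the confidence level on width can be seen by computing several percentile intervals from one bootstrap distribution. The sketch below does this in base R. The sample of 40 "Yes" and 20 "No", the seed, the list of levels, and the variable names are all assumptions made for illustration.

```r
set.seed(7)  # arbitrary seed, for reproducibility only

# Hypothetical sample for illustration: 40 "Yes" and 20 "No".
responses <- c(rep("Yes", 40), rep("No", 20))
n <- length(responses)

# One bootstrap distribution of sample proportions, reused for every level.
boot_props <- replicate(
  2000,
  mean(sample(responses, size = n, replace = TRUE) == "Yes")
)

conf_levels <- c(0.35, 0.50, 0.75, 0.95, 0.99)

# Width of the percentile interval at each confidence level.
widths <- sapply(conf_levels, function(level) {
  alpha <- 1 - level
  ci <- quantile(boot_props, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
  ci[2] - ci[1]
})

data.frame(conf_level = conf_levels, width = widths)
```

Since each higher level takes a wider slice of the same distribution, the widths can only stay the same or grow as the level rises. With a sample of 60 the possible proportions are multiples of 1/60, so two nearby levels can produce identical bounds.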

Exercise 8:

Using code from the infer package and data from the one sample you have (samp), find a confidence interval for the proportion of US adults who think climate change is affecting their local community with a confidence level of your choosing (other than 95%) and interpret it.

samp %>%
  specify(response = climate_change_affects, success = "Yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.75)

## # A tibble: 1 x 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1     0.65    0.783

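# At the 75% confidence level, the interval runs from 0.650 to 0.783, which is narrower than the 95% interval
# (0.600 to 0.817). We are 75% confident that the true proportion of US adults who think climate change affects
# their local community lies in this range. The lower confidence level gives a narrower interval, but one that
# is more likely to miss the true proportion.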

Exercise 9:

Using the app, calculate 50 confidence intervals at the confidence level you chose in the previous question, and plot all intervals on one plot, and calculate the proportion of intervals that include the true population proportion. How does this percentage compare to the confidence level selected for the intervals?

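# The intervals are still narrower than at the 95% confidence level. The proportion of the 50 intervals that
# capture the true population proportion should be close to the chosen confidence level, but it need not equal
# it exactly, because 50 intervals is a small number.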

Exercise 10:

Lastly, try one more (different) confidence level. First, state how you expect the width of this interval to compare to previous ones you calculated. Then, calculate the bounds of the interval using the infer package and data from samp and interpret it. Finally, use the app to generate many intervals and calculate the proportion of intervals that capture the true population proportion.

samp %>%
  specify(response = climate_change_affects, success = "Yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.35)

## # A tibble: 1 x 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1    0.683     0.75

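# The interval is very narrow (0.683 to 0.750) because the confidence level is only 35%.
# At such a low confidence level, there is a high chance that the interval misses the true proportion,
# and here it does not contain 0.62.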

Exercise 11:

Using the app, experiment with different sample sizes and comment on how the widths of intervals change as sample size changes (increases and decreases).

# As the sample size increases, the intervals get narrower; as the sample size decreases, the intervals get wider. Separately, the higher the confidence level, the wider the intervals become.
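
The link between sample size and interval width can be illustrated without the app. The sketch below draws one sample of each size from the lab's population and computes a 95% percentile bootstrap interval for each. The sample sizes, the seed, and the variable names are assumptions chosen for illustration.

```r
set.seed(11)  # arbitrary seed, for reproducibility only

# Population from the lab: 62,000 "Yes" and 38,000 "No".
population <- c(rep("Yes", 62000), rep("No", 38000))

sample_sizes <- c(30, 60, 240, 960)

# Width of a 95% percentile bootstrap interval for one sample of each size.
widths <- sapply(sample_sizes, function(n) {
  one_sample <- sample(population, size = n)
  boot_props <- replicate(
    1000,
    mean(sample(one_sample, size = n, replace = TRUE) == "Yes")
  )
  ci <- quantile(boot_props, probs = c(0.025, 0.975), names = FALSE)
  ci[2] - ci[1]
})

data.frame(sample_size = sample_sizes, width = widths)
```

The standard error of a sample proportion is proportional to 1 / sqrt(n), so quadrupling the sample size should roughly halve the width.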

Exercise 12:

Finally, given a sample size (say, 60), how does the width of the interval change as you increase the number of bootstrap samples? Hint: Does changing the number of bootstrap samples affect the standard error?

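# With the sample size fixed at 60, increasing the number of bootstrap samples does not systematically change the width of the interval, because the standard error depends on the sample size, not on the number of bootstrap samples. More bootstrap samples give a smoother, more stable estimate of the bootstrap distribution.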
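
The hint about the standard error can be explored directly. The sketch below bootstraps one fixed sample of 60 with different numbers of resamples and compares the resulting standard error estimates with the formula sqrt(p_hat * (1 - p_hat) / n). The sample of 40 "Yes" and 20 "No", the seed, and the variable names are assumptions made for illustration.

```r
set.seed(3)  # arbitrary seed, for reproducibility only

# Hypothetical sample for illustration: 40 "Yes" and 20 "No".
responses <- c(rep("Yes", 40), rep("No", 20))
n <- length(responses)
p_hat <- mean(responses == "Yes")

rep_counts <- c(100, 1000, 10000)

# Bootstrap estimate of the standard error for each number of resamples.
boot_se <- sapply(rep_counts, function(reps) {
  boot_props <- replicate(
    reps,
    mean(sample(responses, size = n, replace = TRUE) == "Yes")
  )
  sd(boot_props)
})

# Standard error from the formula, which depends only on p_hat and n.
theoretical_se <- sqrt(p_hat * (1 - p_hat) / n)

data.frame(reps = rep_counts, boot_se = boot_se, theoretical_se = theoretical_se)
```

All three estimates should sit close to the formula value. Adding resamples makes the estimate less noisy but does not shrink it.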

Foundations for statistical inference - Confidence intervals