# From previous steps
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
gss2016 <- gss %>%
  filter(year == 2016)
library(ggplot2)
ggplot(gss2016, aes(x = consci)) +
  geom_bar()
# Compute the proportion with high confidence
p_hat <- gss2016 %>%
  summarize(prop_high = mean(consci == "High")) %>%
  pull()
# From previous step
library(infer)
##
## Attaching package: 'infer'
## The following object is masked _by_ '.GlobalEnv':
##
## gss
boot1 <- gss2016 %>%
  specify(response = consci, success = "High") %>%
  generate(reps = 1, type = "bootstrap")
# Compute the proportion with high confidence in the bootstrap resample
boot1 %>%
  summarize(prop_high = mean(consci == "High")) %>%
  pull()
## [1] 0.4266667
# From previous steps
boot_dist <- gss2016 %>%
  specify(response = consci, success = "High") %>%
  generate(reps = 500, type = "bootstrap") %>%
  calculate(stat = "prop")
# Plot the bootstrap distribution
ggplot(boot_dist, aes(x = stat)) +
  geom_density()
# Estimate the standard error as the SD of the bootstrap statistics
SE <- boot_dist %>%
  summarize(se = sd(stat)) %>%
  pull()
# Create CI
c(p_hat - 2 * SE, p_hat + 2 * SE)
## [1] 0.3535923 0.5130744
If we had used 5000 instead of 500 bootstrap samples, which of the following would be true?
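To see the effect yourself, here is a minimal sketch (assuming gss2016 and the pipeline above are still in your session) that repeats the bootstrap with 5,000 replicates. Extra replicates mainly smooth out simulation noise in the estimate of the standard error; the standard error itself is driven by the sample size.
# A sketch: repeat the bootstrap with 5000 replicates (assumes gss2016 and SE exist)
boot_dist_5000 <- gss2016 %>%
  specify(response = consci, success = "High") %>%
  generate(reps = 5000, type = "bootstrap") %>%
  calculate(stat = "prop")
SE_5000 <- boot_dist_5000 %>%
  summarize(se = sd(stat)) %>%
  pull()
# Compare with the 500-replicate estimate
c(SE, SE_5000)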
[Video]
We learned that for a 95% confidence interval (a confidence level of 0.95), if we were to take many samples of the same size and compute an interval from each, we would expect 95% of the resulting intervals to contain the parameter. Based on the set of confidence intervals plotted here, what is your best guess at the confidence level used in these intervals?
As in the video, the population proportion is represented by the p in the cloud and the dotted line, and each confidence interval is represented by a segment that extends out from its p-hat. Intervals that capture the true value are shown in green; those that miss it are shown in red.
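To make the coverage interpretation concrete, here is a small simulation sketch; the true proportion p, the sample size n, and the number of intervals are illustrative assumptions rather than values from the exercise. It draws repeated samples, builds a 2-SE interval from each, and reports the fraction of intervals that capture p.
# A sketch: empirical coverage of 2-SE intervals (p, n, and reps are assumed values)
set.seed(123)
p <- 0.45      # hypothetical true proportion
n <- 150       # hypothetical sample size
reps <- 100    # number of intervals to simulate
covered <- replicate(reps, {
  x <- rbinom(1, size = n, prob = p)     # number of successes in one sample
  phat_sim <- x / n
  se_sim <- sqrt(phat_sim * (1 - phat_sim) / n)
  (p >= phat_sim - 2 * se_sim) & (p <= phat_sim + 2 * se_sim)
})
mean(covered)  # should land near 0.95 for a 95% confidence level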
# From previous steps
boot_dist_small <- gss2016_small %>%
  specify(response = consci, success = "High") %>%
  generate(reps = 500, type = "bootstrap") %>%
  calculate(stat = "prop")
# Compute and save estimate of first SE
SE_small_n <- boot_dist_small %>%
  summarize(se = sd(stat)) %>%
  pull()
boot_dist_smaller <- gss2016_smaller %>%
  specify(response = consci, success = "High") %>%
  generate(reps = 500, type = "bootstrap") %>%
  calculate(stat = "prop")
# Compute and save estimate of second SE
SE_smaller_n <- boot_dist_smaller %>%
  summarize(se = sd(stat)) %>%
  pull()
# Compare the results for each dataset size
message("gss2016_small has ", nrow(gss2016_small), " rows and standard error ", SE_small_n)
## gss2016_small has 50 rows and standard error 0.0714496333104684
message("gss2016_smaller has ", nrow(gss2016_smaller), " rows and standard error ", SE_smaller_n)
## gss2016_smaller has 10 rows and standard error 0.163956101461375
# From previous steps
ggplot(gss2016, aes(x = meta_region)) +
  geom_bar()
boot_dist <- gss2016 %>%
  specify(response = meta_region, success = "pacific") %>%
  generate(reps = 500, type = "bootstrap") %>%
  calculate(stat = "prop")
# Calculate std error
SE_low_p <- boot_dist %>%
  summarize(se = sd(stat)) %>%
  pull()
# Compare SEs
c(SE_low_p, SE)
## [1] 0.02537620 0.03987052
[Video]
# From previous step
n <- nrow(gss2016)
p_hat <- gss2016 %>%
  summarize(prop_pacific = mean(meta_region == "pacific")) %>%
  pull()
# Check the success-failure conditions (at least 10 expected successes and failures)
n * p_hat >= 10
## [1] TRUE
n * (1 - p_hat) >= 10
## [1] TRUE
# Calculate SE
SE_approx <- sqrt(p_hat * (1 - p_hat) / n)
# Form 95% CI
c(p_hat - 2 * SE_approx, p_hat + 2 * SE_approx)
## [1] 0.05101021 0.14898979
Now you have seen two ways to construct a confidence interval for a proportion. Which of the following about the two methods is true?
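One way to compare them is to build the same 95% interval for the Pacific proportion both ways; this sketch assumes p_hat, SE_low_p, and SE_approx from the steps above are still in memory.
# A sketch: the same 95% CI for the Pacific proportion built both ways
# (assumes p_hat, SE_low_p, and SE_approx from the previous steps)
ci_boot <- c(p_hat - 2 * SE_low_p, p_hat + 2 * SE_low_p)
ci_approx <- c(p_hat - 2 * SE_approx, p_hat + 2 * SE_approx)
rbind(bootstrap = ci_boot, approximation = ci_approx)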
[Video]
Michael is a hybrid thinker and doer—a byproduct of being a StrengthsFinder “Learner” over time. With 20+ years of engineering, design, and product experience, he helps organizations identify market needs, mobilize internal and external resources, and deliver delightful digital customer experiences that align with business goals. He has been entrusted with problem-solving for brands—ranging from Fortune 500 companies to early-stage startups to not-for-profit organizations.
Michael earned his BS in Computer Science from New York Institute of Technology and his MBA from the University of Maryland, College Park. He is also a candidate to receive his MS in Applied Analytics from Columbia University.
LinkedIn | Twitter | www.michaelmallari.com/data | www.columbia.edu/~mm5470