library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.0 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(infer)
yrbss %>% count(text_while_driving_30d)
## # A tibble: 9 × 2
##   text_while_driving_30d     n
##   <chr>                  <int>
## 1 0                       4792
## 2 1-2                      925
## 3 10-19                    373
## 4 20-29                    298
## 5 3-5                      493
## 6 30                       827
## 7 6-9                      311
## 8 did not drive           4646
## 9 <NA>                     918
no_helmet <- yrbss %>% filter(helmet_12m == "never")
no_helmet <- no_helmet %>% mutate(text_ind = ifelse(text_while_driving_30d == "30", "yes", "no"))
no_helmet %>% count(text_ind)
## # A tibble: 3 × 2
##   text_ind     n
##   <chr>    <int>
## 1 no        6040
## 2 yes        463
## 3 <NA>       474
There were 463 respondents who texted while driving every day in the past 30 days and also never wore a helmet.
no_helmet$text_ind <- replace_na(no_helmet$text_ind, "no")
no_helmet %>%
  specify(response = text_ind, success = "yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.95)
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1   0.0606   0.0725
Margin_of_Error <- .5*(.07209-.06077)
Margin_of_Error
## [1] 0.00566
The margin of error is about 0.0057, which is half the width of that confidence interval.
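Because the bootstrap is random, the interval changes slightly on every run unless a seed is set, so bounds typed in by hand can drift away from the printed ones. The following is a minimal base-R sketch of the same idea, not the infer pipeline itself: it rebuilds the indicator from the counts in the table above, fixes an arbitrary seed (an assumption made only for reproducibility), and computes the margin of error directly from the interval.

```r
# Base-R sketch of a percentile bootstrap CI for a proportion.
set.seed(2022)  # arbitrary seed (an assumption) so the interval is reproducible

# Counts from the table above: 463 "yes"; 6040 "no" plus 474 NA recoded to "no"
text_ind <- rep(c(1, 0), times = c(463, 6040 + 474))

# Resample with replacement 1000 times and record each sample proportion
boot_props <- replicate(1000, mean(sample(text_ind, replace = TRUE)))

# 95% percentile interval and its half-width (the margin of error)
ci <- quantile(boot_props, probs = c(0.025, 0.975), names = FALSE)
margin_of_error <- (ci[2] - ci[1]) / 2

ci
margin_of_error
```

Computing the half-width from `ci` rather than retyping the bounds keeps the margin of error consistent with whatever interval the run produced.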
females <- yrbss %>% filter(gender == "female")
females <- females %>% mutate(tall = ifelse(height > 1.6, "yes", "no"))
females$tall <- replace_na(females$tall, "no")
females %>%
  specify(response = tall, success = "yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.95)
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1    0.481    0.505
males <- yrbss %>% filter(gender == "male")
males <- males %>% mutate(tall = ifelse(height > 1.6, "yes", "no"))
males$tall <- replace_na(males$tall, "no")
males %>%
  specify(response = tall, success = "yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.95)
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1    0.872    0.888
I chose 1.6 m as the benchmark for being tall and computed the proportion separately for women and men. We are 95% confident that between 48% and 51% of women and between 87% and 89% of men in the population are taller than 1.6 m. Taking half the width of each interval shown above, the margins of error for women and men are about 0.012 and 0.008 respectively.
n <- 1000
p <- seq(from = 0, to = 1, by = 0.01)
me <- 2 * sqrt(p * (1 - p)/n)
dd <- data.frame(p = p, me = me)
ggplot(data = dd, aes(x = p, y = me)) +
  geom_line() +
  labs(x = "Population Proportion", y = "Margin of Error")
As the population proportion increases from 0, the margin of error increases, peaks at a population proportion of 0.5, then falls back toward 0 as the proportion approaches 1.
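The shape of the curve can also be checked numerically instead of read off the plot. This short sketch reuses the same `n`, `p`, and `me` definitions as the plotting code; the names `p_at_max`, `max_me`, and `me_at_ends` are introduced here only for illustration.

```r
n <- 1000
p <- seq(from = 0, to = 1, by = 0.01)
me <- 2 * sqrt(p * (1 - p) / n)

# Proportion at which the margin of error is largest
p_at_max <- p[which.max(me)]

# Largest margin of error; p * (1 - p) is at most 0.25, so this is 2 * sqrt(0.25 / n)
max_me <- max(me)

# Margin of error at the two ends of the range (p = 0 and p = 1)
me_at_ends <- me[c(1, length(me))]

p_at_max
max_me
me_at_ends
```

The peak sits at 0.5 because the product p(1 - p) is largest there, and the margin of error vanishes at both ends because the product is zero when p is 0 or 1.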
The sampling distribution appears approximately normal, centered at about 0.1 and ranging from roughly 0.02 to 0.15.
The distribution becomes taller and narrower as p moves toward either extreme. When p is 0.5, the distribution is the most spread out.
As n increases, the distribution gets taller and narrower. With a sample size of 10000, the sample proportions barely deviate from the mean at all.
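These observations can be reproduced without the interactive app. The sketch below is a base-R simulation under assumed settings: `sim_sd()` and `se()` are hypothetical helpers, and the seed, the 5000 repetitions, and the particular (n, p) pairs are arbitrary choices for illustration. It compares the simulated spread of the sample proportions with the theoretical standard error \(\sqrt{p(1-p)/n}\).

```r
set.seed(1)  # arbitrary seed, chosen only for reproducibility

# Hypothetical helper: draw `reps` samples of size n from a population with
# proportion p and return the standard deviation of the sample proportions.
sim_sd <- function(n, p, reps = 5000) {
  p_hats <- rbinom(reps, size = n, prob = p) / n
  sd(p_hats)
}

# Theoretical standard error of a sample proportion, for comparison
se <- function(n, p) sqrt(p * (1 - p) / n)

results <- data.frame(
  n = c(300, 300, 10000),
  p = c(0.1, 0.5, 0.1)
)
results$simulated_sd   <- mapply(sim_sd, results$n, results$p)
results$theoretical_se <- se(results$n, results$p)

results
```

In the resulting table the spread is largest for p = 0.5 at a fixed n, and it shrinks sharply when n grows from 300 to 10000, which matches the taller, narrower histograms described above.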
sleep <- yrbss %>% filter(school_night_hours_sleep == "10+")
sleep <- sleep %>% mutate(strengthTrained7 = ifelse(strength_training_7d == '7', "yes", "no"))
sleep$strengthTrained7 <- replace_na(sleep$strengthTrained7, "no")
sleep %>%
  specify(response = strengthTrained7, success = "yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.95)
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1    0.218    0.317
Our null hypothesis is that there is no association between sleeping 10+ hours on school nights and strength training 7 days a week; the alternative is that there is an association between sleeping and strength training. Based on the interval, we are 95% confident that 22% to 32% of those who sleep 10+ hours strength train every day of the week. I would say that there is close to no correlation between the two.
A Type I error occurs when we reject a null hypothesis that is actually true, for example because our sample produced an interval that misses the actual population proportion. Since we are using a 95% confidence level, the probability of making this kind of error is 5%.
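The 5% figure can be illustrated with a small simulation. This is a sketch under stated assumptions: the true proportion `true_p`, the sample size `n`, the number of repetitions, and the seed are all made-up values, and the interval is the normal-approximation formula rather than the bootstrap used above. It counts how often a 95% interval fails to contain the true proportion.

```r
set.seed(42)  # arbitrary seed, chosen only for reproducibility

true_p <- 0.25   # assumed population proportion (illustrative only)
n      <- 500    # assumed sample size (illustrative only)
reps   <- 2000   # number of simulated samples

# Sample proportion from each simulated sample
p_hat <- rbinom(reps, size = n, prob = true_p) / n

# Normal-approximation 95% margin of error for each sample
me <- 1.96 * sqrt(p_hat * (1 - p_hat) / n)

# An interval "misses" when the true proportion falls outside it
missed <- (true_p < p_hat - me) | (true_p > p_hat + me)
miss_rate <- mean(missed)

miss_rate
```

Over many repetitions the miss rate should settle near 0.05, which is the long-run error rate implied by a 95% confidence level.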
Since the margin of error depends on both p and n, and we don’t know what p is, n has to be relatively large for us to meet the guidelines. The margin of error is \(ME=z\cdot \sqrt{\frac{p(1-p)}{n}}\). Solving for n gives \(n=\frac{z^{2}\,p(1-p)}{ME^{2}}\), and setting p to 0.5 as our worst-case scenario and z to 1.96, we get 9604 as