Exercise 1

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.0      ✔ stringr 1.4.1 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(infer)

yrbss %>% count(text_while_driving_30d)
## # A tibble: 9 × 2
##   text_while_driving_30d     n
##   <chr>                  <int>
## 1 0                       4792
## 2 1-2                      925
## 3 10-19                    373
## 4 20-29                    298
## 5 3-5                      493
## 6 30                       827
## 7 6-9                      311
## 8 did not drive           4646
## 9 <NA>                     918

Exercise 2

no_helmet <- yrbss %>% filter(helmet_12m == "never")

no_helmet <- no_helmet %>% mutate(text_ind = ifelse(text_while_driving_30d == "30", "yes", "no"))

no_helmet %>% count(text_ind)
## # A tibble: 3 × 2
##   text_ind     n
##   <chr>    <int>
## 1 no        6040
## 2 yes        463
## 3 <NA>       474

There were 463 respondents who never wear a helmet and who texted while driving every day in the past 30 days. Another 474 respondents who never wear a helmet have a missing texting response.
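The `<NA>` row in the count above arises because `ifelse()` returns `NA` wherever its condition is `NA`. A minimal base-R sketch on a made-up vector (the values in `x` are illustrative and are not taken from `yrbss`):

```r
# Toy responses; NA stands in for a skipped survey question (illustrative values)
x <- c("30", "0", NA, "1-2", "30")

# x == "30" evaluates to NA where x is NA, and ifelse() passes that NA through
text_ind <- ifelse(x == "30", "yes", "no")

# useNA = "ifany" makes table() report the NA group next to "no" and "yes"
counts <- table(text_ind, useNA = "ifany")
print(counts)
```

Whether to recode those missing values as "no" or to drop them is a modelling choice, and it changes the denominator of the proportion.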

Exercise 3

no_helmet$text_ind <- replace_na(no_helmet$text_ind, "no")

no_helmet %>%
  specify(response = text_ind, success = "yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.95)
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1   0.0606   0.0725
Margin_of_Error <- .5*(.07209-.06077)
Margin_of_Error
## [1] 0.00566

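The margin of error is about 0.006 (0.00566), which is half the width of the confidence interval. The endpoints used in the calculation differ slightly from the interval printed above, as expected when the bootstrap is rerun without a fixed seed.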
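As a rough cross-check, the margin of error can be computed directly from a bootstrap interval instead of typing the endpoints by hand. The sketch below uses base R rather than `infer`, rebuilds the response vector from the counts reported in Exercise 2 (463 "yes"; 6040 + 474 "no" after recoding `NA`), and fixes an arbitrary seed. It is an illustration under those assumptions, not the exact procedure `infer` runs internally.

```r
set.seed(1)  # assumed seed so the resampling is reproducible

# Rebuild the response vector from the reported counts:
# 463 "yes" and 6040 + 474 = 6514 "no" once NA is recoded to "no"
text_ind <- rep(c("yes", "no"), times = c(463, 6514))

# Percentile bootstrap: resample with replacement, record the proportion of "yes"
boot_props <- replicate(1000, mean(sample(text_ind, replace = TRUE) == "yes"))

# 95% percentile interval and its half-width (the margin of error)
ci <- quantile(boot_props, probs = c(0.025, 0.975))
margin_of_error <- unname(diff(ci)) / 2

print(ci)
print(margin_of_error)
```

Because the interval comes from random resampling, the endpoints and the margin of error change a little from run to run unless a seed is set.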

Exercise 4

females <- yrbss %>% filter(gender == "female")
females <- females %>% mutate(tall = ifelse(height > 1.6, "yes", "no"))
females$tall <- replace_na(females$tall, "no")


females %>%
  specify(response = tall, success = "yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.95)
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1    0.481    0.505
males <- yrbss %>% filter(gender == "male")
males <- males %>% mutate(tall = ifelse(height > 1.6, "yes", "no"))
males$tall <- replace_na(males$tall, "no")


males %>%
  specify(response = tall, success = "yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.95)
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1    0.872    0.888

I chose 1.6 m as the benchmark for being tall and compared female and male respondents (respondents with a missing height are counted as not tall). We are 95% confident that about 48% to 51% of women and 87% to 89% of men in the population are taller than 1.6 m. Taking half the width of each printed interval, the margins of error are about 0.012 for women and 0.008 for men.
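A margin of error can be read off a percentile interval as half its width. The sketch below applies that to the rounded endpoints printed above; `half_width` is a helper name introduced here for illustration, and a fresh bootstrap run would give slightly different numbers.

```r
# Rounded interval endpoints copied from the printed output above
ci_female <- c(lower = 0.481, upper = 0.505)
ci_male   <- c(lower = 0.872, upper = 0.888)

# Hypothetical helper: margin of error as half the interval width
half_width <- function(ci) unname(ci["upper"] - ci["lower"]) / 2

me_female <- half_width(ci_female)  # about 0.012
me_male   <- half_width(ci_male)    # about 0.008

print(c(female = me_female, male = me_male))
```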

Exercise 5

n <- 1000

p <- seq(from = 0, to = 1, by = 0.01)
me <- 2 * sqrt(p * (1 - p)/n)

dd <- data.frame(p = p, me = me)
ggplot(data = dd, aes(x = p, y = me)) + 
  geom_line() +
  labs(x = "Population Proportion", y = "Margin of Error")

As the population proportion p increases from 0, the margin of error increases, peaks at p = 0.5, and then falls back toward 0 as p approaches 1.
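The location of the peak can be confirmed numerically from the same grid used for the plot. This sketch reuses `n`, `p`, and `me` exactly as defined above; `p_at_max` and `max_me` are names introduced here.

```r
n <- 1000
p <- seq(from = 0, to = 1, by = 0.01)
me <- 2 * sqrt(p * (1 - p) / n)

# Proportion at which the margin of error is largest, and that largest value
p_at_max <- p[which.max(me)]   # 0.5
max_me   <- max(me)            # about 0.0316

# Margin of error at the two ends of the grid (p = 0 and p = 1)
me_ends <- me[c(1, length(me))]

print(c(p_at_max = p_at_max, max_me = max_me))
print(me_ends)
```

The peak at p = 0.5 follows from p(1 - p) being largest there, which is why 0.5 serves as the worst case when planning a sample size.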

Exercise 6

The sampling distribution looks approximately normal, centered at 0.1, with values ranging from about 0.02 to 0.15.

Exercise 7

The distribution becomes taller and narrower as p moves toward either end of its range (0 or 1). When p is 0.5, the distribution is at its most spread out.

Exercise 8

As n increases, the distribution gets taller and narrower. With a sample size of 10,000, the sample proportions barely deviate from the mean at all.
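The narrowing with larger n can be reproduced with a small simulation. The sketch below is an assumption-laden illustration: the true proportion `p = 0.1`, the three sample sizes, the seed, and the helper `sim_p_hat` are all chosen here and are not taken from the lab's app.

```r
set.seed(42)  # assumed seed for reproducibility

# Hypothetical helper: simulate `reps` sample proportions for sample size n
sim_p_hat <- function(n, p, reps = 5000) rbinom(reps, size = n, prob = p) / n

p     <- 0.1                    # assumed true proportion
sizes <- c(100, 1000, 10000)    # assumed sample sizes

# Spread of the simulated proportions next to the theoretical sqrt(p(1-p)/n)
sd_sim    <- sapply(sizes, function(n) sd(sim_p_hat(n, p)))
sd_theory <- sqrt(p * (1 - p) / sizes)

print(data.frame(n = sizes, sd_sim = sd_sim, sd_theory = sd_theory))
```

Each tenfold increase in n shrinks the standard deviation by a factor of about sqrt(10), which is why the distribution at n = 10,000 looks almost like a spike at p.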

Exercise 9

sleep <- yrbss %>% filter(school_night_hours_sleep == "10+")
sleep <- sleep %>% mutate(strengthTrained7 = ifelse(strength_training_7d == '7', "yes", "no"))
sleep$strengthTrained7 <- replace_na(sleep$strengthTrained7, "no")

sleep %>%
  specify(response = strengthTrained7, success = "yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.95)
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1    0.218    0.317

Our null hypothesis is that there is no relationship between sleeping 10+ hours on school nights and strength training 7 days a week; the alternative is that there is a relationship between the two. Based on the interval, we are 95% confident that 22% to 32% of those who sleep 10+ hours strength train every day of the week. I would say that there is close to no correlation between the two.

Exercise 10

A Type 1 error occurs when we reject a null hypothesis that is actually true, that is, when our sample points to a difference that does not exist in the population. Since we are using a 95% confidence level (a significance level of 0.05), the probability of detecting a change simply by chance is 5%.
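The 5% figure can be checked by simulation: draw many samples from a population where the null hypothesis is true and count how often a two-sided test at the 0.05 level rejects it. In this sketch the null proportion `p0 = 0.25`, the sample size `n = 500`, and the seed are assumptions made for illustration, and the test is a plain normal-approximation z-test rather than anything from `infer`.

```r
set.seed(123)  # assumed seed for reproducibility

p0   <- 0.25    # assumed true (null) proportion
n    <- 500     # assumed sample size
reps <- 10000   # number of simulated samples

# Sample proportions drawn from a population where the null is true
p_hat <- rbinom(reps, size = n, prob = p0) / n

# Two-sided z-test of H0: p = p0 at the 0.05 significance level
z      <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
reject <- abs(z) > qnorm(0.975)

# Share of samples that (wrongly) reject the true null
type1_rate <- mean(reject)
print(type1_rate)
```

The simulated rate lands close to 0.05 rather than exactly on it, because the count of successes is discrete and the simulation itself has sampling noise.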

Exercise 11

Since the margin of error depends on both p and n, and we don’t know what p is, n has to be relatively large for us to meet the guidelines. The margin of error is \(ME=z\cdot \sqrt{\frac{p(1-p)}{n}}\). Solving for n, with p set to 0.5 as the worst-case scenario and z set to 1.96, we get 9604 as