library(tidyverse)
library(openintro)
library(infer)
set.seed(74226)I will be analyzing the same dataset as in the previous lab, where you delved into a sample from the Youth Risk Behavior Surveillance System (YRBSS) survey, which uses data from high schoolers to help discover health patterns. The dataset is called yrbss.
Please see the following table displaying the count of the different categories
yrbss %>%
count(text_while_driving_30d, sort=TRUE)## # A tibble: 9 × 2
## text_while_driving_30d n
## <chr> <int>
## 1 0 4792
## 2 did not drive 4646
## 3 1-2 925
## 4 <NA> 918
## 5 30 827
## 6 3-5 493
## 7 10-19 373
## 8 6-9 311
## 9 20-29 298
did not drive: 4646
danger <- yrbss %>%
filter(helmet_12m=="never") %>%
filter(!is.na(text_while_driving_30d)) %>%
mutate(text_ind_everyday = ifelse(text_while_driving_30d == "30", "yes", "no"))
danger %>%
count(text_ind_everyday)## # A tibble: 2 × 2
## text_ind_everyday n
## <chr> <int>
## 1 no 6040
## 2 yes 463
The proportion of students who have texted everyday in the past 30 days & have not worn a helmet out of all students who have not worn a helmet is 464/6504 = 7.13%
danger %>%
specify(response = text_ind_everyday, success = "yes") %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "prop") %>%
get_ci(level = 0.95)## # A tibble: 1 × 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 0.0646 0.0772
The 95% Confidence Interval for the proportion of students who text and drive everyday of all students who do not wear a helmet is between 6.5% and 7.7%.
The margin of error in this case is .006 or .06%
glimpse(yrbss)## Rows: 13,583
## Columns: 13
## $ age <int> 14, 14, 15, 15, 15, 15, 15, 14, 15, 15, 15, 1…
## $ gender <chr> "female", "female", "female", "female", "fema…
## $ grade <chr> "9", "9", "9", "9", "9", "9", "9", "9", "9", …
## $ hispanic <chr> "not", "not", "hispanic", "not", "not", "not"…
## $ race <chr> "Black or African American", "Black or Africa…
## $ height <dbl> NA, NA, 1.73, 1.60, 1.50, 1.57, 1.65, 1.88, 1…
## $ weight <dbl> NA, NA, 84.37, 55.79, 46.72, 67.13, 131.54, 7…
## $ helmet_12m <chr> "never", "never", "never", "never", "did not …
## $ text_while_driving_30d <chr> "0", NA, "30", "0", "did not drive", "did not…
## $ physically_active_7d <int> 4, 2, 7, 0, 2, 1, 4, 4, 5, 0, 0, 0, 4, 7, 7, …
## $ hours_tv_per_school_day <chr> "5+", "5+", "5+", "2", "3", "5+", "5+", "5+",…
## $ strength_training_7d <int> 0, 0, 0, 0, 1, 0, 2, 0, 3, 0, 3, 0, 0, 7, 7, …
## $ school_night_hours_sleep <chr> "8", "6", "<5", "6", "9", "8", "9", "6", "<5"…
yrbss %>%
count(hours_tv_per_school_day, sort=TRUE)## # A tibble: 8 × 2
## hours_tv_per_school_day n
## <chr> <int>
## 1 2 2705
## 2 <1 2168
## 3 3 2139
## 4 do not watch 1840
## 5 1 1750
## 6 5+ 1595
## 7 4 1048
## 8 <NA> 338
yrbss %>%
count(school_night_hours_sleep, sort=TRUE)## # A tibble: 8 × 2
## school_night_hours_sleep n
## <chr> <int>
## 1 7 3461
## 2 8 2692
## 3 6 2658
## 4 5 1480
## 5 <NA> 1248
## 6 <5 965
## 7 9 763
## 8 10+ 316
tv_time<- yrbss %>%
filter(!is.na(hours_tv_per_school_day)) %>%
mutate(tv_ind_everyday = ifelse(hours_tv_per_school_day == "<1", "yes", "no"))
tv_time %>%
count(tv_ind_everyday)## # A tibble: 2 × 2
## tv_ind_everyday n
## <chr> <int>
## 1 no 11077
## 2 yes 2168
tv_time %>%
specify(response = tv_ind_everyday, success = "yes") %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "prop") %>%
get_ci(level = 0.95)## # A tibble: 1 × 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 0.157 0.170
Proportion of Interest: students who reported watching less than 1 hr of tv per school day: (2168/(2168+11077))*100%= 16.37%
95% Confidence Interval generated: [.1575 -
.1698]
We are 95% confident that the proportion of students who watch less than
an hr of tv per schoolday is between 15.75% and 16.98%.
Margin of Error: .00615 or .0615%
sleep_time<- yrbss %>%
filter(!is.na(school_night_hours_sleep)) %>%
mutate(sleep_ind_everyday = ifelse(school_night_hours_sleep == "<5", "yes", "no"))
sleep_time %>%
count(sleep_ind_everyday)## # A tibble: 2 × 2
## sleep_ind_everyday n
## <chr> <int>
## 1 no 11370
## 2 yes 965
sleep_time %>%
specify(response = sleep_ind_everyday, success = "yes") %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "prop") %>%
get_ci(level = 0.95)## # A tibble: 1 × 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 0.0738 0.0831
Proportion of Interest: students who reported less than 5hrs of sleep on school nights:(965/(965+11370))x100%= 7.82%
95% Confidence Interval generated: [.07377 -
.0831]
We are 95% confident that the proportion of students who watch less than
an hr of tv per schoolday is between 7.38% and 8.31%.
Margin of Error: .00465 or .0465%
n <- 1000
p <- seq(from = 0, to = 1, by = 0.01)
me <- 2 * sqrt(p * (1 - p)/n)dd <- data.frame(p = p, me = me)
ggplot(data = dd, aes(x = p, y = me)) +
geom_line() +
labs(x = "Population Proportion", y = "Margin of Error")The margin of error increases as the population proportion increases. Margin of error is greatest at the population of 50%. The greatest that the numerator could ever be when calculating standard error by proportion is .5 x .5.
# Insert code for Exercise 6 here
set.seed(12)
n <- 300
p <- 0.1
reps=1000
samples <- replicate(reps, rbinom(1,n,p))
hist(samples,
#ylim = c(0, 1.4),
col = "steelblue",
freq = F,
breaks = 25)
curve(dnorm(x, mean=n*p, sd=sqrt(p*(1-p)*n)),
col = "red",
lwd = "2",
add = T)sample.proportions <- samples / n
hist(sample.proportions,
#ylim = c(0, 1.4),
col = "steelblue",
freq = F,
breaks = 25)
curve(dnorm(x, mean=p, sd=sqrt(p*(1-p)/n)),
col = "red",
lwd = "2",
add = T)The distribution of sampling proportions with sampling size of 300 is mostly bell-curved and symmetrical -
Center is at .10 and standard deviation of the sample proportions is .017. Spread conforms fairly uniformly to the normal bell curve shape but with slightly larger tails.
# Insert code for Exercise 7 here
set.seed(1)
n <- 300
p <- 0.5
samples <- replicate(reps, rbinom(1,n,p))
hist(samples,
#ylim = c(0, 1.4),
col = "steelblue",
freq = F,
breaks = 25)
curve(dnorm(x, mean=n*p, sd=sqrt(p*(1-p)*n)),
col = "red",
lwd = "2",
add = T)sample.proportions <- samples / n
hist(sample.proportions,
#ylim = c(0, 1.4),
col = "steelblue",
freq = F,
breaks = 25)
curve(dnorm(x, mean=p, sd=sqrt(p*(1-p)/n)),
col = "red",
lwd = "2",
add = T)As p increases, spread gets wider. Overall shape more similar to a normal distribution.
# Insert code for Exercise 8 here
set.seed(2)
n <- 1000
p <- 0.1
samples <- replicate(reps, rbinom(1,n,p))
hist(samples,
#ylim = c(0, 1.4),
col = "steelblue",
freq = F,
breaks = 25)
curve(dnorm(x, mean=n*p, sd=sqrt(p*(1-p)*n)),
col = "red",
lwd = "2",
add = T)sample.proportions <- samples / n
hist(sample.proportions,
#ylim = c(0, 1.4),
col = "steelblue",
freq = F,
breaks = 25)
curve(dnorm(x, mean=p, sd=sqrt(p*(1-p)/n)),
col = "red",
lwd = "2",
add = T)As sample size increases,symmetry increases.The center is still 0.1 or the p value. Enough sampling proportions/means reasonably approximating normal distribution shape.
Null Hypothesis: There is no difference in strength training between students that sleep more than 10+ hours and those who don’t.
Alternative: There is a difference in strength training between students that sleep more than 10+ hours and those who don’t.
We are 95% confident that the student proportion of those students that sleep more than 10+ hours are between 0.224 and 0.317.
good_sleep <- yrbss %>%
filter(school_night_hours_sleep == "10+")%>%
filter(!is.na(strength_training_7d)) %>%
mutate(strength = ifelse(strength_training_7d == "7", "yes", "no"))
good_sleep %>%
specify(response = strength, success = "yes") %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "prop") %>%
get_ci(level = 0.95)## # A tibble: 1 × 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 0.224 0.317
There would be a 5% chance of detecting a change. A type 1 error is a false positive. Typically what happens is when researchers incorrectly reject the true null hypothesis.They would report their findings and state that the finds are significant when they’re really insignificant.
ME = 1.96 × SE=1.96 × sqrtp(1−p)/n
n = (0.3)2/(0.01/1.96)2
n= 3457