In this lab, we will explore and visualize the data using the tidyverse suite of packages, and perform statistical inference using infer. The data can be found in the companion package for OpenIntro resources, openintro.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(infer)
What are the counts within each category for the number of days these students have texted while driving within the past 30 days?
data(yrbss)
glimpse(yrbss)
## Rows: 13,583
## Columns: 13
## $ age <int> 14, 14, 15, 15, 15, 15, 15, 14, 15, 15, 15, 1…
## $ gender <chr> "female", "female", "female", "female", "fema…
## $ grade <chr> "9", "9", "9", "9", "9", "9", "9", "9", "9", …
## $ hispanic <chr> "not", "not", "hispanic", "not", "not", "not"…
## $ race <chr> "Black or African American", "Black or Africa…
## $ height <dbl> NA, NA, 1.73, 1.60, 1.50, 1.57, 1.65, 1.88, 1…
## $ weight <dbl> NA, NA, 84.37, 55.79, 46.72, 67.13, 131.54, 7…
## $ helmet_12m <chr> "never", "never", "never", "never", "did not …
## $ text_while_driving_30d <chr> "0", NA, "30", "0", "did not drive", "did not…
## $ physically_active_7d <int> 4, 2, 7, 0, 2, 1, 4, 4, 5, 0, 0, 0, 4, 7, 7, …
## $ hours_tv_per_school_day <chr> "5+", "5+", "5+", "2", "3", "5+", "5+", "5+",…
## $ strength_training_7d <int> 0, 0, 0, 0, 1, 0, 2, 0, 3, 0, 3, 0, 0, 7, 7, …
## $ school_night_hours_sleep <chr> "8", "6", "<5", "6", "9", "8", "9", "6", "<5"…
yrbss %>%
count(text_while_driving_30d) %>%
arrange(desc(n))
## # A tibble: 9 × 2
## text_while_driving_30d n
## <chr> <int>
## 1 0 4792
## 2 did not drive 4646
## 3 1-2 925
## 4 <NA> 918
## 5 30 827
## 6 3-5 493
## 7 10-19 373
## 8 6-9 311
## 9 20-29 298
As we can see in the output, 4792 high school students claimed to never have texted while driving (0) in the past 30 days, 925 claimed to have done it 1-2 times, 493 did it 3-5 times, 311 6-9 times, 373 10-19 times, 298 20-29 times, 827 claimed to have done it every day of the last 30 days, 4646 didn't drive in the past month, and 918 didn't answer or didn't know. The great majority either claimed to never text while driving or simply didn't drive in that period. However, there are still high school students who text while driving (827 every single day!).
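To put a number behind the "great majority" claim, a quick extension of the same pipeline (a sketch, not part of the original output) adds a percentage column; the top two categories alone cover roughly 70% of all respondents.
yrbss %>%
  count(text_while_driving_30d) %>%
  mutate(pct = round(100 * n / sum(n), 1)) %>%
  arrange(desc(n))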
What is the proportion of people who have texted while driving every day in the past 30 days and never wear helmets?
no_helmet <- yrbss %>%
filter(helmet_12m == "never")
no_helmet <- no_helmet %>%
mutate(text_ind = ifelse(text_while_driving_30d == "30", "yes", "no"))
no_helmet %>%
count(text_ind) %>%
mutate(proportion = 100 * (n / sum(n)))
## # A tibble: 3 × 3
## text_ind n proportion
## <chr> <int> <dbl>
## 1 no 6040 86.6
## 2 yes 463 6.64
## 3 <NA> 474 6.79
By looking at the output table, we can see that 86.6% of students who never wear a helmet claim not to have texted while driving every day, while 6.64% never wear a helmet and did text while driving every day; the remaining 6.79% did not answer.
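Note that the 6.64% treats the 474 missing responses as part of the denominator. If we instead drop the NAs first (one reasonable way to handle non-response), the proportion among those who answered rises to roughly 7.1%; a quick sketch:
no_helmet %>%
  filter(!is.na(text_ind)) %>%
  count(text_ind) %>%
  mutate(proportion = 100 * (n / sum(n)))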
What is the margin of error for the estimate of the proportion of non-helmet wearers that have texted while driving each day for the past 30 days based on this survey?
no_helmet %>%
filter(text_ind %in% c("yes", "no")) %>%
specify(response = text_ind, success = "yes") %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "prop") %>%
get_ci(level = 0.95) %>%
mutate(ME = (upper_ci - lower_ci) / 2)
## # A tibble: 1 × 3
## lower_ci upper_ci ME
## <dbl> <dbl> <dbl>
## 1 0.0649 0.0770 0.00608
The margin of error, as we can see in the output, is about 0.0061, or 0.61 percentage points (half the width of the 95% interval).
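As a sanity check on the bootstrap, the normal-approximation margin of error, ME = 1.96 * sqrt(p_hat * (1 - p_hat) / n), can be computed directly from the counts above; a sketch using the respondent counts 6040 and 463, which lands close to the bootstrap value:
n_resp <- 6040 + 463 # respondents with a "yes"/"no" text_ind
p_hat <- 463 / n_resp # sample proportion of daily texters
1.96 * sqrt(p_hat * (1 - p_hat) / n_resp) # approx. 0.0062, close to the bootstrap ME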
Using the infer package, calculate confidence intervals for two other categorical variables (you'll need to decide which level to call "success"), and report the associated margins of error. Interpret the interval in context of the data. It may be helpful to create new data sets for each of the two variables first, and then use these data sets to construct the confidence intervals.
unique(yrbss$hours_tv_per_school_day)
## [1] "5+" "2" "3" "do not watch" "<1"
## [6] "4" "1" NA
unique(yrbss$school_night_hours_sleep)
## [1] "8" "6" "<5" "9" "10+" "7" "5" NA
# success is watching 2 hours of TV or less per school day
tv_hours_df <- yrbss %>%
filter(!is.na(hours_tv_per_school_day)) %>%
mutate(success = if_else(hours_tv_per_school_day %in% c("do not watch","<1","1","2"), "yes", "no")) %>%
mutate(success = factor(success, levels = c("no","yes")))
table(tv_hours_df$success)
##
## no yes
## 4782 8463
prop.table(table(tv_hours_df$success))
##
## no yes
## 0.3610419 0.6389581
# Manual bootstrap for TV
set.seed(1234)
n_reps <- 1000
bootstrap_props <- replicate(n_reps, {
sample_data <- sample(tv_hours_df$success, size = nrow(tv_hours_df), replace = TRUE)
mean(sample_data == "yes")
})
ci_tv_low <- data.frame(
lower_ci = quantile(bootstrap_props, 0.025),
upper_ci = quantile(bootstrap_props, 0.975)
) %>%
mutate(ME = (upper_ci - lower_ci) / 2)
p_hat_tv <- mean(tv_hours_df$success == "yes")
ci_tv_low
## lower_ci upper_ci ME
## 2.5% 0.6304247 0.6474896 0.008532465
p_hat_tv
## [1] 0.6389581
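Since the exercise asks for the infer package, here is a sketch of the equivalent pipeline for the TV variable (the same pattern applies to the sleep variable below); with a different random stream the interval will differ slightly from the manual bootstrap but should be essentially the same:
set.seed(1234)
tv_hours_df %>%
  specify(response = success, success = "yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.95)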
# success is sleeping 8h or more
sleep_hours_df <- yrbss %>%
filter(!is.na(school_night_hours_sleep)) %>%
mutate(success = if_else(school_night_hours_sleep %in% c("8","9","10+"), "yes", "no")) %>%
mutate(success = factor(success, levels = c("no","yes")))
table(sleep_hours_df$success)
##
## no yes
## 8564 3771
prop.table(table(sleep_hours_df$success))
##
## no yes
## 0.6942846 0.3057154
# Manual bootstrap for sleep
set.seed(1234)
bootstrap_props_sleep <- replicate(n_reps, {
sample_data <- sample(sleep_hours_df$success, size = nrow(sleep_hours_df), replace = TRUE)
mean(sample_data == "yes")
})
ci_sleep_enough <- data.frame(
lower_ci = quantile(bootstrap_props_sleep, 0.025),
upper_ci = quantile(bootstrap_props_sleep, 0.975)
) %>%
mutate(ME = (upper_ci - lower_ci) / 2)
p_hat_sleep <- mean(sleep_hours_df$success == "yes")
ci_sleep_enough
## lower_ci upper_ci ME
## 2.5% 0.2975253 0.3135813 0.008027969
p_hat_sleep
## [1] 0.3057154
In the selected sample, the proportion of high school students who watch 2 or fewer hours of TV per school day is 0.6390, or 63.90%. We are 95% confident that between 63.04% and 64.75% of students watch 2 or fewer hours of TV per school day (a margin of error of about 0.85 percentage points), which suggests that most students don't watch more than 2 hours of TV per day. In the selected sample, the proportion of students who sleep 8 or more hours on school nights is 0.3057, or 30.57%. We are also 95% confident that between 29.75% and 31.36% of students sleep at least 8 hours on school nights (a margin of error of about 0.80 percentage points).
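Both bootstrap margins of error line up with the normal approximation 1.96 * sqrt(p_hat * (1 - p_hat) / n); a quick check using the sample sizes from the tables above (13245 for TV, 12335 for sleep):
1.96 * sqrt(0.639 * (1 - 0.639) / 13245) # TV: approx. 0.0082
1.96 * sqrt(0.306 * (1 - 0.306) / 12335) # sleep: approx. 0.0081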
Describe the relationship between p and me. Include the margin of error vs. population proportion plot you constructed in your answer. For a given sample size, for which value of p is margin of error maximized?
n <- 1000
# variable p and margin of error
p <- seq(from = 0, to = 1, by = 0.01)
me <- 2 * sqrt(p * (1 - p)/n)
# define data frame and plot
dd <- data.frame(p = p, me = me)
ggplot(data = dd, aes(x = p, y = me)) +
geom_line() +
labs(x = "Population Proportion", y = "Margin of Error")
By analyzing the plot, we can see that the relationship between p and me is a parabola-like curve, symmetric around p = 0.5. The margin of error is maximized at p = 0.5 and decreases as p approaches 0 or 1.
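This follows from the formula me = 2 * sqrt(p(1 - p)/n): the factor p(1 - p) peaks at p = 0.5, where its derivative 1 - 2p vanishes. A one-line check on the grid confirms it:
dd %>% filter(me == max(me)) # p = 0.5, me = 2 * sqrt(0.25/1000), approx. 0.0316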
Describe the sampling distribution of sample proportions at n=300 and p=0.1. Be sure to note the center, spread, and shape.
n <- 300
p <- 0.10
reps <- 1000
simulation <- tibble(p_hat = rbinom(reps, size = n, prob = p) / n)
summary <- simulation %>%
summarise(
mean_p_hat = mean(p_hat),
np = n * p,
np_2 = n * (1-p),
sd_p_hat = sd(p_hat),
q025 = quantile(p_hat, 0.025),
q975 = quantile(p_hat, 0.975)
)
summary
## # A tibble: 1 × 6
## mean_p_hat np np_2 sd_p_hat q025 q975
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.0993 30 270 0.0174 0.0667 0.137
me_theoretical <- 2 * sqrt(p * (1 - p)/n) # approx. 95% margin of error, i.e., 2 standard errors
me_theoretical
## [1] 0.03464102
ggplot(data = simulation, aes(x = p_hat)) +
geom_histogram(binwidth = 0.005, boundary = 0, closed = "left", color = "white") +
geom_vline(xintercept = p) +
labs(
title = "Sampling distribution for n = 300 and p = 0.10",
x = "Sample Proportion",
y = "Count"
) +
theme_classic()
By looking at the graph and the summary, we can see that the sampling distribution of sample proportions is centered at 0.0993, very close to 0.1 (the true population proportion); its spread, measured by the standard deviation of the simulated proportions, is 0.0174, matching the theoretical standard error sqrt(p(1 - p)/n) of about 0.0173; and its shape is approximately normal (bell-shaped), since it passes the success-failure condition with np = 30 and n(1 - p) = 270 (both at least 10).
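To make "approximately normal" concrete, we can compare the simulated 2.5% and 97.5% quantiles (0.0667 and 0.137) with the normal-approximation interval p ± 1.96 * sqrt(p(1 - p)/n); a quick sketch:
p + c(-1, 1) * 1.96 * sqrt(p * (1 - p) / n) # approx. (0.066, 0.134), close to the simulated quantiles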
Keep n constant and change p. How do the shape, center, and spread of the sampling distribution vary as p changes? You might want to adjust min and max for the x-axis for a better view of the distribution.
n <- 300
p_values <- c(0.05, 0.1, 0.3, 0.5, 0.7, 0.9)
reps <- 1000
comparison_data <- data.frame()
for (p in p_values) {
sample_props <- replicate(reps, {
sample_data <- rbinom(n, 1, p)
mean(sample_data)
})
temp_data <- data.frame(
p_hat = sample_props,
p_value = paste("p =", p)
)
comparison_data <- rbind(comparison_data, temp_data)
}
ggplot(comparison_data, aes(x = p_hat)) +
geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "blue", alpha = 0.7) +
facet_wrap(~p_value, scales = "free") +
labs(
title = "Sampling Distribution for Different Values of p",
subtitle = paste("Sample size n =", n, "held constant"),
x = "Sample Proportion",
y = "Density"
) +
theme_classic()
# Calculate stats for each p; reframe() returns one row per simulated sample
# here, because np and n(1-p) are computed per replicate
stats_summary <- comparison_data %>%
group_by(p_value) %>%
reframe(
mean = mean(p_hat),
sd = sd(p_hat),
theoretical_se = sqrt(unique(parse_number(p_value)) * (1 - unique(parse_number(p_value))) / n),
np = n * p_hat,
np_2 = n * (1-p_hat)
)
print(stats_summary)
## # A tibble: 6,000 × 6
## p_value mean sd theoretical_se np np_2
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 p = 0.05 0.0500 0.0123 0.0126 13 287
## 2 p = 0.05 0.0500 0.0123 0.0126 12 288
## 3 p = 0.05 0.0500 0.0123 0.0126 16 284
## 4 p = 0.05 0.0500 0.0123 0.0126 14 286
## 5 p = 0.05 0.0500 0.0123 0.0126 21 279
## 6 p = 0.05 0.0500 0.0123 0.0126 14 286
## 7 p = 0.05 0.0500 0.0123 0.0126 17 283
## 8 p = 0.05 0.0500 0.0123 0.0126 17 283
## 9 p = 0.05 0.0500 0.0123 0.0126 11 289
## 10 p = 0.05 0.0500 0.0123 0.0126 14 286
## # ℹ 5,990 more rows
We can see in the summary table that the mean of the sample proportions is always equal (or very close) to the population proportion. The standard error, as expected, is largest at p = 0.5 and decreases as p gets closer to 0 or 1. The distribution remains approximately normal for all values of p shown, since every case passes the success-failure condition (np and n(1 - p) both at least 10).
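The change in shape can also be quantified; a small sketch computes the sample skewness within each facet (near zero at p = 0.5, slightly positive near 0 and slightly negative near 1):
comparison_data %>%
  group_by(p_value) %>%
  summarise(skew = mean((p_hat - mean(p_hat))^3) / sd(p_hat)^3)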
Now also change n. How does n appear to affect the distribution of p̂?
p <- 0.1
n_values <- c(50, 100, 300, 500, 1000)
reps <- 1000
# Create comparison plot
comparison_data_n <- data.frame()
for (n in n_values) {
sample_props <- replicate(reps, {
sample_data <- rbinom(n, 1, p)
mean(sample_data)
})
temp_data <- data.frame(
p_hat = sample_props,
n_value = paste("n =", n)
)
comparison_data_n <- rbind(comparison_data_n, temp_data)
}
ggplot(comparison_data_n, aes(x = p_hat)) +
geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "darkgreen", alpha = 0.7) +
facet_wrap(~n_value, scales = "free") +
labs(
title = "Sampling Distribution for Different n",
subtitle = paste("Population proportion p =", p, "held constant"),
x = "Sample Proportion",
y = "Density"
) +
theme_classic()
# Calculate statistics for each n (parse n from the group label: after the
# for loop, the loop variable n holds only its final value, 1000)
stats_summary_n <- comparison_data_n %>%
group_by(n_value) %>%
reframe(
mean = mean(p_hat),
sd = sd(p_hat),
theoretical_se = sqrt(p * (1 - p) / unique(parse_number(n_value))),
np = unique(parse_number(n_value)) * p_hat,
np_2 = unique(parse_number(n_value)) * (1 - p_hat)
)
print(stats_summary_n)
## # A tibble: 5,000 × 6
## n_value mean sd theoretical_se np np_2
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 n = 100 0.0983 0.0297 0.03 8 92
## 2 n = 100 0.0983 0.0297 0.03 2 98
## 3 n = 100 0.0983 0.0297 0.03 7 93
## 4 n = 100 0.0983 0.0297 0.03 7 93
## 5 n = 100 0.0983 0.0297 0.03 9 91
## 6 n = 100 0.0983 0.0297 0.03 9 91
## 7 n = 100 0.0983 0.0297 0.03 11 89
## 8 n = 100 0.0983 0.0297 0.03 8 92
## 9 n = 100 0.0983 0.0297 0.03 18 82
## 10 n = 100 0.0983 0.0297 0.03 13 87
## # ℹ 4,990 more rows
We can see that the mean continues to be equal or very close to the population proportion regardless of the sample size. The standard error decreases as n increases, so a larger sample size gives a more concentrated distribution (less variability), as is clear from comparing the plots across sample sizes.
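The spread shrinks in proportion to 1/sqrt(n), so quadrupling the sample size halves the standard error; a one-line check of the theoretical values for the five sample sizes used above:
sqrt(0.1 * 0.9 / c(50, 100, 300, 500, 1000)) # approx. 0.042, 0.030, 0.017, 0.013, 0.009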
Is there convincing evidence that those who sleep 10+ hours per day are more likely to strength train every day of the week? As always, write out the hypotheses for any tests you conduct and outline the status of the conditions for inference. If you find a significant difference, also quantify this difference with a confidence interval.
df <- yrbss %>%
filter(!is.na(school_night_hours_sleep),
!is.na(strength_training_7d)) %>%
mutate(
sleep_grp = if_else(school_night_hours_sleep == "10+", "10+", "other"),
train_daily = if_else(strength_training_7d == "7", "yes", "no")
) %>%
mutate(
sleep_grp = factor(sleep_grp, levels = c("other","10+")),
train_daily = factor(train_daily, levels = c("no","yes"))
) %>%
select(sleep_grp, train_daily)
# sample p
desc <- df %>%
count(sleep_grp, train_daily) %>%
group_by(sleep_grp) %>%
mutate(p_hat = n / sum(n)) %>%
ungroup()
desc
## # A tibble: 4 × 4
## sleep_grp train_daily n p_hat
## <fct> <fct> <int> <dbl>
## 1 other no 9949 0.836
## 2 other yes 1958 0.164
## 3 10+ no 228 0.731
## 4 10+ yes 84 0.269
p_hats <- df %>%
group_by(sleep_grp) %>%
summarize(p_hat = mean(train_daily == "yes"), n = n(), .groups = "drop")
p_hats
## # A tibble: 2 × 3
## sleep_grp p_hat n
## <fct> <dbl> <int>
## 1 other 0.164 11907
## 2 10+ 0.269 312
obs_diff <- df %>%
specify(response = train_daily, explanatory = sleep_grp, success = "yes") %>%
calculate(stat = "diff in props", order = c("10+","other"))
obs_diff
## Response: train_daily (factor)
## Explanatory: sleep_grp (factor)
## # A tibble: 1 × 1
## stat
## <dbl>
## 1 0.105
# hypothesis test (permutation)
set.seed(1234)
null_dist <- df %>%
specify(response = train_daily, explanatory = sleep_grp, success = "yes") %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in props", order = c("10+","other"))
p_val <- get_p_value(null_dist, obs_stat = obs_diff, direction = "greater")
## Warning: Please be cautious in reporting a p-value of 0. This result is an approximation
## based on the number of `reps` chosen in the `generate()` step.
## ℹ See `get_p_value()` (`?infer::get_p_value()`) for more information.
p_val
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0
# bootstrap CI
set.seed(1234)
boot_dist <- df %>%
specify(response = train_daily, explanatory = sleep_grp, success = "yes") %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "diff in props", order = c("10+","other"))
ci_95 <- get_ci(boot_dist, level = 0.95, type = "percentile")
ci_95
## # A tibble: 1 × 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 0.0579 0.154
The goal was to test whether students who sleep 10+ hours per night are more likely to strength train every day than students who sleep less. The hypotheses are H0: p_10+ = p_other vs. HA: p_10+ > p_other. The conditions for inference are reasonable here: observations are independent (a sample of students, each belonging to only one group), and the success-failure condition holds in both groups (84 and 228 in the 10+ group, 1958 and 9949 in the other, all at least 10). The estimated difference in proportions (p̂_10+ − p̂_other) is 0.105, and the simulated p-value is 0: none of the 1000 permutations produced a difference as large as the observed one, so p < 0.001. We therefore have strong evidence that students who sleep 10+ hours are more likely to strength train every day. The 95% bootstrap CI for the difference in proportions, (0.058, 0.154), does not include zero, confirming a meaningful difference in the population.
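As a cross-check on the permutation test (not part of the original infer workflow), base R's two-proportion test on the counts from the table above reaches the same conclusion:
# 84 daily trainers out of 312 in the 10+ group vs 1958 out of 11907 in the other
prop.test(x = c(84, 1958), n = c(312, 11907), alternative = "greater")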
Let’s say there has been no difference in likeliness to strength train every day of the week for those who sleep 10+ hours. What is the probability that you could detect a change (at a significance level of 0.05) simply by chance? Hint: Review the definition of the Type 1 error.
If there is really no difference between the groups, the probability of detecting a change (rejecting the null hypothesis) simply by chance is exactly the significance level, α = 0.05. This is the definition of a Type 1 error: rejecting a true null hypothesis. The significance level is set precisely to control this probability, so about 5% of such tests would appear significant through randomness alone.
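We can verify this by simulation: draw both groups from populations with the same true proportion (0.17, roughly the pooled training rate in our sample) and count how often a one-sided z-test at α = 0.05 rejects; the rejection rate should hover near 5%. A sketch:
set.seed(1234)
rejections <- replicate(1000, {
  p1 <- rbinom(1, 312, 0.17) / 312 # group sizes taken from our data
  p2 <- rbinom(1, 11907, 0.17) / 11907
  se <- sqrt(0.17 * 0.83 * (1/312 + 1/11907)) # SE under the (true) null
  (p1 - p2) / se > qnorm(0.95) # reject at alpha = 0.05, one-sided
})
mean(rejections) # approx. 0.05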
Suppose you’re hired by the local government to estimate the proportion of residents that attend a religious service on a weekly basis. According to the guidelines, the estimate must have a margin of error no greater than 1% with 95% confidence. You have no idea what to expect for p. How many people would you have to sample to ensure that you are within the guidelines? Hint: Refer to your plot of the relationship between p and margin of error. This question does not require using a dataset.
# p = 0.5 since that is max variance
# z score for 95% CI is 1.96
p <- 0.5
z <- 1.96
me <- 0.01
n_needed <- (z^2 * p * (1 - p)) / (me^2)
n_needed
## [1] 9604
The formula for the margin of error is ME = z * sqrt(p(1-p)/n), so we only have to isolate n, giving n = z^2 * p * (1 - p) / ME^2. I pick p = 0.5 because that is the value that maximizes p(1 - p), and hence the required sample size, when we have no idea what to expect for p. As the output shows, we would have to sample at least 9604 residents to stay within the guidelines.
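With the exact normal quantile instead of the rounded 1.96, the raw value is 9603.65, which rounds up (n must be a whole number) to the same answer; a quick check:
ceiling((qnorm(0.975)^2 * 0.5 * 0.5) / 0.01^2) # 9604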