In this lab, we will explore categorical data using tidyverse for data manipulation and visualization, and infer for statistical inference. The dataset comes from the Youth Risk Behavior Surveillance System (YRBSS), which surveys high school students to uncover health-related behaviors.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(infer)
## Warning: package 'infer' was built under R version 4.4.3
library(dplyr)
library(ggplot2)
library(tsibble)
## Registered S3 method overwritten by 'tsibble':
## method from
## as_tibble.grouped_df dplyr
##
## Attaching package: 'tsibble'
##
## The following object is masked from 'package:openintro':
##
## tourism
##
## The following object is masked from 'package:lubridate':
##
## interval
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, union
data('yrbss', package = 'openintro')
no_helmet <- yrbss %>%
filter(helmet_12m == "never")
It is beneficial to understand the distribution of Texting while Driving using a table displaying a count of responses from each category. Display statistics of students reported texting while driving.
table(yrbss$text_while_driving_30d)
##
## 0 1-2 10-19 20-29 3-5
## 4792 925 373 298 493
## 30 6-9 did not drive
## 827 311 4646
prop_table <- table(no_helmet$text_while_driving_30d)
prop_table["30"] / sum(prop_table)
## 30
## 0.07119791
To perform this analysis, we filter the dataset to contain students who never wear helmets and generate a new indicator variable (text_ind) for those who text daily.
no_helmet <- no_helmet %>%
mutate(text_ind = ifelse(text_while_driving_30d == "30", "yes", "no"))
Estimating the population proportion of non-helmet wearers who text daily while driving. We will construct a 95% confidence interval using Bootstrap resampling.
Select only non-helmet wearers and remove rows with missing texting data
no_helmet <- yrbss %>%
filter(helmet_12m == "never") %>%
drop_na(text_while_driving_30d)
no_helmet <- no_helmet %>%
mutate(text_ind = ifelse(text_while_driving_30d == "30", "yes", "no"))
ci <- no_helmet %>%
specify(response = text_ind, success = "yes") %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "prop") %>%
get_ci(level = 0.95)
The margin of error (ME) appears to be half the width of the confidence interval.
me <- (ci$upper_ci - ci$lower_ci) / 2
print(me)
## [1] 0.00638167
The given standard error formula is: SE = sqrt(p * (1 - p) / n) The margin of error for a 95% confidence interval is: ME = 1.96 * SE = 1.96 * sqrt(p * (1 - p) / n)
n <- 1000
p <- seq(from = 0, to = 1, by = 0.01)
me <- 2 * sqrt(p * (1 - p) / n)
dd <- data.frame(p = p, me = me)
ggplot(data = dd, aes(x = p, y = me)) +
geom_line() +
labs(x = "Population Proportion", y = "Margin of Error")
Identifying the relationship between ME and p: ME is maximized when p = 0.5, which demonstrates a lesser precise estimate when uncertainty is highest. As p moves towards 0 or 1, variability decreases, leading to a lower ME.
The sampling distribution should be approximately normal for inference on proportions. # np ≥ 10 and n(1 - p) ≥ 10
Simulating different values of p while keeping n = 300 to examine the impact on distribution shape
n <- 300
p_values <- c(0.1, 0.3, 0.5, 0.7, 0.9)
sampling_distributions <- data.frame()
for (p in p_values) {
sample_props <- rbinom(10000, size = n, prob = p) / n
temp_df <- data.frame(p_hat = sample_props, p = as.factor(p))
sampling_distributions <- rbind(sampling_distributions, temp_df)
}
ggplot(sampling_distributions, aes(x = p_hat, fill = p)) +
geom_density(alpha = 0.5) +
labs(title = "Effect of Changing p on the Sampling Distribution",
x = "Sample Proportion (p-hat)",
y = "Density") +
theme_minimal()
# Updated code with boxplot
ggplot(sampling_distributions, aes(x = as.factor(p), y = p_hat, fill = as.factor(p))) +
geom_boxplot() +
labs(title = "Effect of Changing p on the Sampling Distribution",
x = "Population Proportion (p)",
y = "Sample Proportion (p-hat)") +
theme_minimal() +
scale_fill_discrete(name = "Population Proportion")
When p = 0.5, the distribution is widest and most symmetric. As p moves closer to 0 or 1, the distribution becomes narrower and skewed. Increasing n makes the distribution more concentrated around p.
Comparing Two Proportions Using Hypothesis Testing Testing if students who sleep 10+ hours per day are more likely to strength train daily.
Null Hypothesis (H0): There is no relationship between sleep duration and strength training. Alternative Hypothesis (H1): Students who sleep 10+ hours per day are more likely to strength train daily.
Conduct hypothesis test In this step will specify response and explanatory variable
Check the structure of the dataset
str(yrbss)
## tibble [13,583 × 13] (S3: tbl_df/tbl/data.frame)
## $ age : int [1:13583] 14 14 15 15 15 15 15 14 15 15 ...
## $ gender : chr [1:13583] "female" "female" "female" "female" ...
## $ grade : chr [1:13583] "9" "9" "9" "9" ...
## $ hispanic : chr [1:13583] "not" "not" "hispanic" "not" ...
## $ race : chr [1:13583] "Black or African American" "Black or African American" "Native Hawaiian or Other Pacific Islander" "Black or African American" ...
## $ height : num [1:13583] NA NA 1.73 1.6 1.5 1.57 1.65 1.88 1.75 1.37 ...
## $ weight : num [1:13583] NA NA 84.4 55.8 46.7 ...
## $ helmet_12m : chr [1:13583] "never" "never" "never" "never" ...
## $ text_while_driving_30d : chr [1:13583] "0" NA "30" "0" ...
## $ physically_active_7d : int [1:13583] 4 2 7 0 2 1 4 4 5 0 ...
## $ hours_tv_per_school_day : chr [1:13583] "5+" "5+" "5+" "2" ...
## $ strength_training_7d : int [1:13583] 0 0 0 0 1 0 2 0 3 0 ...
## $ school_night_hours_sleep: chr [1:13583] "8" "6" "<5" "6" ...
strength_ind
if it does
not already existstrength_training_7d
# Load data (if not already loaded)
data("yrbss")
# Clean the data by filtering rows for sleep category "10+ hours" and removing missing values
sleep_10plus_clean <- yrbss %>%
filter(school_night_hours_sleep >= 10) %>%
filter(!is.na(strength_training_7d) & strength_training_7d != "")
# Create a line plot of strength training frequency for 10+ hours of sleep
ggplot(sleep_10plus_clean, aes(x = strength_training_7d)) +
geom_bar(stat = "count", fill = "lightblue") + # Count of occurrences per strength training frequency
labs(
title = "Line Plot of Strength Training Frequency for 10+ Hours Sleep",
x = "Strength Training Frequency (Days in 7 Days)",
y = "Count of Students"
) +
theme_minimal() +
geom_line(stat = "count", aes(group = 1), color = "red", size = 1)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
me_target <- 0.01
# Using the formula for margin of error
n_required <- (1.96^2 * 0.25) / (me_target^2)
n_required
## [1] 9604
If the p-value is small (e.g., < 0.05), we have evidence that strength training and sleep duration are related. If the p-value is large, we fail to reject the null hypothesis, indicating no meaningful association.