Lab Inference for Categorical Data

In this lab, we will explore categorical data using tidyverse for data manipulation and visualization, and infer for statistical inference. The dataset comes from the Youth Risk Behavior Surveillance System (YRBSS), which surveys high school students to uncover health-related behaviors.

Load Packages

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(infer)
## Warning: package 'infer' was built under R version 4.4.3
library(dplyr)
library(ggplot2)
library(tsibble)
## Registered S3 method overwritten by 'tsibble':
##   method               from 
##   as_tibble.grouped_df dplyr
## 
## Attaching package: 'tsibble'
## 
## The following object is masked from 'package:openintro':
## 
##     tourism
## 
## The following object is masked from 'package:lubridate':
## 
##     interval
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, union

Load dataset

data('yrbss', package = 'openintro')

We will analyze the yrbss dataset. Specifically, we want to examine how frequently students have texted while driving over the past 30 days

no_helmet <- yrbss %>%
  filter(helmet_12m == "never")

It is beneficial to understand the distribution of Texting while Driving using a table displaying a count of responses from each category. Display statistics of students reported texting while driving.

table(yrbss$text_while_driving_30d)
## 
##             0           1-2         10-19         20-29           3-5 
##          4792           925           373           298           493 
##            30           6-9 did not drive 
##           827           311          4646

The next step is to compute the proportion of students in the “text daily while driving” group.

prop_table <- table(no_helmet$text_while_driving_30d)
prop_table["30"] / sum(prop_table)
##         30 
## 0.07119791

Proportion of Non-Helmet Wearers Who Text While Driving Daily

To perform this analysis, we filter the dataset to contain students who never wear helmets and generate a new indicator variable (text_ind) for those who text daily.

Create binary variable for texting while driving every day

no_helmet <- no_helmet %>%
  mutate(text_ind = ifelse(text_while_driving_30d == "30", "yes", "no"))

Inference on Proportions

Estimating the population proportion of non-helmet wearers who text daily while driving. We will construct a 95% confidence interval using Bootstrap resampling.

Select only non-helmet wearers and remove rows with missing texting data

no_helmet <- yrbss %>%
  filter(helmet_12m == "never") %>%  
  drop_na(text_while_driving_30d)  

Create binary variable again for bootstrapping

no_helmet <- no_helmet %>%
  mutate(text_ind = ifelse(text_while_driving_30d == "30", "yes", "no"))

Perform bootstrap inference for proportion

ci <- no_helmet %>%
  specify(response = text_ind, success = "yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.95)

Margin of Error Calculation

The margin of error (ME) appears to be half the width of the confidence interval.

me <- (ci$upper_ci - ci$lower_ci) / 2
print(me)
## [1] 0.00638167

Understanding the influence of sample size and population proportion on the margin of error.

The given standard error formula is: SE = sqrt(p * (1 - p) / n) The margin of error for a 95% confidence interval is: ME = 1.96 * SE = 1.96 * sqrt(p * (1 - p) / n)

Define sample size

n <- 1000

Create a sequence of population proportions

p <- seq(from = 0, to = 1, by = 0.01)

Calculate margin of error for each proportion

me <- 2 * sqrt(p * (1 - p) / n)

Visualizing the relationship using ggplot

dd <- data.frame(p = p, me = me)
ggplot(data = dd, aes(x = p, y = me)) + 
  geom_line() +
  labs(x = "Population Proportion", y = "Margin of Error") 

Observation

Identifying the relationship between ME and p: ME is maximized when p = 0.5, which demonstrates a lesser precise estimate when uncertainty is highest. As p moves towards 0 or 1, variability decreases, leading to a lower ME.

Success-Failure Condition:

The sampling distribution should be approximately normal for inference on proportions. # np ≥ 10 and n(1 - p) ≥ 10

Exploring the Sampling Distribution

Simulating different values of p while keeping n = 300 to examine the impact on distribution shape

n <- 300
p_values <- c(0.1, 0.3, 0.5, 0.7, 0.9)
sampling_distributions <- data.frame()
for (p in p_values) {
  sample_props <- rbinom(10000, size = n, prob = p) / n  
  temp_df <- data.frame(p_hat = sample_props, p = as.factor(p))
  sampling_distributions <- rbind(sampling_distributions, temp_df)
}

Visualizing the effect of changing p on the sampling distribution

ggplot(sampling_distributions, aes(x = p_hat, fill = p)) +
  geom_density(alpha = 0.5) +
  labs(title = "Effect of Changing p on the Sampling Distribution",
       x = "Sample Proportion (p-hat)",
       y = "Density") +
  theme_minimal()

# Updated code with boxplot
ggplot(sampling_distributions, aes(x = as.factor(p), y = p_hat, fill = as.factor(p))) +
  geom_boxplot() +
  labs(title = "Effect of Changing p on the Sampling Distribution",
       x = "Population Proportion (p)",
       y = "Sample Proportion (p-hat)") +
  theme_minimal() +
  scale_fill_discrete(name = "Population Proportion")

Findings:

When p = 0.5, the distribution is widest and most symmetric. As p moves closer to 0 or 1, the distribution becomes narrower and skewed. Increasing n makes the distribution more concentrated around p.

Comparing Two Proportions Using Hypothesis Testing Testing if students who sleep 10+ hours per day are more likely to strength train daily.

Define hypotheses:

Null Hypothesis (H0): There is no relationship between sleep duration and strength training. Alternative Hypothesis (H1): Students who sleep 10+ hours per day are more likely to strength train daily.

Prepare dataset

Conduct hypothesis test In this step will specify response and explanatory variable

Check the structure of the dataset

str(yrbss)
## tibble [13,583 × 13] (S3: tbl_df/tbl/data.frame)
##  $ age                     : int [1:13583] 14 14 15 15 15 15 15 14 15 15 ...
##  $ gender                  : chr [1:13583] "female" "female" "female" "female" ...
##  $ grade                   : chr [1:13583] "9" "9" "9" "9" ...
##  $ hispanic                : chr [1:13583] "not" "not" "hispanic" "not" ...
##  $ race                    : chr [1:13583] "Black or African American" "Black or African American" "Native Hawaiian or Other Pacific Islander" "Black or African American" ...
##  $ height                  : num [1:13583] NA NA 1.73 1.6 1.5 1.57 1.65 1.88 1.75 1.37 ...
##  $ weight                  : num [1:13583] NA NA 84.4 55.8 46.7 ...
##  $ helmet_12m              : chr [1:13583] "never" "never" "never" "never" ...
##  $ text_while_driving_30d  : chr [1:13583] "0" NA "30" "0" ...
##  $ physically_active_7d    : int [1:13583] 4 2 7 0 2 1 4 4 5 0 ...
##  $ hours_tv_per_school_day : chr [1:13583] "5+" "5+" "5+" "2" ...
##  $ strength_training_7d    : int [1:13583] 0 0 0 0 1 0 2 0 3 0 ...
##  $ school_night_hours_sleep: chr [1:13583] "8" "6" "<5" "6" ...

Create a categorical variable strength_ind if it does not already exist

Assuming you want to create this based on a condition from strength_training_7d

# Load data (if not already loaded)
data("yrbss")

# Clean the data by filtering rows for sleep category "10+ hours" and removing missing values
sleep_10plus_clean <- yrbss %>%
  filter(school_night_hours_sleep >= 10) %>%
  filter(!is.na(strength_training_7d) & strength_training_7d != "")

# Create a line plot of strength training frequency for 10+ hours of sleep
ggplot(sleep_10plus_clean, aes(x = strength_training_7d)) +
  geom_bar(stat = "count", fill = "lightblue") +  # Count of occurrences per strength training frequency
  labs(
    title = "Line Plot of Strength Training Frequency for 10+ Hours Sleep",
    x = "Strength Training Frequency (Days in 7 Days)",
    y = "Count of Students"
  ) +
  theme_minimal() +
  geom_line(stat = "count", aes(group = 1), color = "red", size = 1)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Desired margin of error

me_target <- 0.01

# Using the formula for margin of error
n_required <- (1.96^2 * 0.25) / (me_target^2)
n_required
## [1] 9604

Conclusion

If the p-value is small (e.g., < 0.05), we have evidence that strength training and sleep duration are related. If the p-value is large, we fail to reject the null hypothesis, indicating no meaningful association.