In this lab, we will explore and visualize the data using the tidyverse suite of packages, and perform statistical inference using infer. The data can be found in the companion package for OpenIntro resources, openintro.
Let’s load the packages.
Every two years, the Centers for Disease Control and Prevention conduct the Youth Risk Behavior Surveillance System (YRBSS) survey, where it takes data from high schoolers (9th through 12th grade), to analyze health patterns. You will work with a selected group of variables from a random sample of observations during one of the years the YRBSS was conducted.
Load the yrbss data set into your workspace.
There are observations on 13 different variables, some categorical and some numerical. The meaning of each variable can be found by bringing up the help file:
## [1] 13583 13
## [1] 13583
## [1] 13
The cases in this data set are individual high school students who participated in the YRBSS survey. Each row represents one student’s responses to the survey questions. There are 13,583 students in this sample.
Remember that you can answer this question by viewing the data in the data viewer or by using the following command:
## Rows: 13,583
## Columns: 13
## $ age <int> 14, 14, 15, 15, 15, 15, 15, 14, 15, 15, 15, 1…
## $ gender <chr> "female", "female", "female", "female", "fema…
## $ grade <chr> "9", "9", "9", "9", "9", "9", "9", "9", "9", …
## $ hispanic <chr> "not", "not", "hispanic", "not", "not", "not"…
## $ race <chr> "Black or African American", "Black or Africa…
## $ height <dbl> NA, NA, 1.73, 1.60, 1.50, 1.57, 1.65, 1.88, 1…
## $ weight <dbl> NA, NA, 84.37, 55.79, 46.72, 67.13, 131.54, 7…
## $ helmet_12m <chr> "never", "never", "never", "never", "did not …
## $ text_while_driving_30d <chr> "0", NA, "30", "0", "did not drive", "did not…
## $ physically_active_7d <int> 4, 2, 7, 0, 2, 1, 4, 4, 5, 0, 0, 0, 4, 7, 7, …
## $ hours_tv_per_school_day <chr> "5+", "5+", "5+", "2", "3", "5+", "5+", "5+",…
## $ strength_training_7d <int> 0, 0, 0, 0, 1, 0, 2, 0, 3, 0, 3, 0, 0, 7, 7, …
## $ school_night_hours_sleep <chr> "8", "6", "<5", "6", "9", "8", "9", "6", "<5"…
You will first start with analyzing the weight of the participants in
kilograms: weight.
Using visualization and summary statistics, describe the distribution
of weights. The summary function can be useful.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 29.94 56.25 64.41 67.91 76.20 180.99 1004
## [1] 1004
There are 1004 students who didn’t have their weight recorded. Because of that, they’ll be left out of any weight-related analysis since we don’t have their data.
Next, consider the possible relationship between a high schooler’s weight and their physical activity. Plotting the data is a useful first step because it helps us quickly visualize trends, identify strong associations, and develop research questions.
First, let’s create a new variable physical_3plus, which
will be coded as either “yes” if they are physically active for at least
3 days a week, and “no” if not.
physical_3plus and
weight. Is there a relationship between these two
variables? What did you expect and why?yrbss %>%
drop_na(weight) %>% # remove missing weights so the plot isn't messy
ggplot(aes(x = physical_3plus, y = weight, fill = physical_3plus)) +
geom_boxplot() +
labs(
x = "Physically Active 3+ Days/Week",
y = "Weight (kg)",
title = "Distribution of Weight by Physical Activity Level"
) +
theme_minimal() +
theme(legend.position = "none")From the boxplot, it looks like students who are active 3 or more days a week tend to weigh a little more on average than those who aren’t as active. This actually makes sense to me because if someone is working out or playing sports more often, they might have more muscle, which can add to their weight. The two groups still overlap a lot though, so the difference isn’t huge just by looking at the plot — but there is a small shift upward for the more active group. Overall, students who are physically active at least 3 days a week show a slightly higher median weight, suggesting a small positive relationship between physical activity and weight.
The box plots show how the medians of the two distributions compare,
but we can also compare the means of the distributions using the
following to first group the data by the physical_3plus
variable, and then calculate the mean weight in these
groups using the mean function while ignoring missing
values by setting the na.rm argument to
TRUE.
## # A tibble: 3 × 2
## physical_3plus mean_weight
## <chr> <dbl>
## 1 no 66.7
## 2 yes 68.4
## 3 <NA> 69.9
There is an observed difference, but is this difference statistically significant? In order to answer this question we will conduct a hypothesis test.
summarize
command above by defining a new variable with the definition
n().## # A tibble: 3 × 2
## physical_3plus n
## <chr> <int>
## 1 no 4404
## 2 yes 8906
## 3 <NA> 273
Yes, the conditions for inference are met. The sample appears to be randomly collected, and it’s reasonable to treat each student as independent since one student’s weight doesn’t influence another’s. Both activity groups have large sample sizes (well above 30), so the Central Limit Theorem kicks in, meaning the distribution of the sample mean should be close to normal. Because of this, it’s appropriate to run inference comparing the two mean weights.
Null hypothesis (H₀): There is no difference in the average weight of students who exercise at least 3 times a week compared to those who don’t. In other words, working out more doesn’t change the average weight.
Alternative hypothesis (Hₐ): There is a difference in the average weight between students who exercise at least 3 times a week and those who don’t.
In Symbols:* H₀: μ_yes = μ_no Hₐ: μ_yes ≠ μ_no This is a two-tailed test because we’re not assuming the group that exercises more weighs more or less — just that the means could be different.
Next, we will introduce a new function, hypothesize,
that falls into the infer workflow. You will use this
method for conducting hypothesis tests.
But first, we need to initialize the test, which we will save as
obs_diff.
obs_diff <- yrbss %>%
drop_na(physical_3plus) %>%
specify(weight ~ physical_3plus) %>%
calculate(stat = "diff in means", order = c("yes", "no"))Notice how you can use the functions specify and
calculate again like you did for calculating confidence
intervals. Here, though, the statistic you are searching for is the
difference in means, with the order being
yes - no != 0.
After you have initialized the test, you need to simulate the test on
the null distribution, which we will save as null.
null_dist <- yrbss %>%
drop_na(physical_3plus) %>%
specify(weight ~ physical_3plus) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in means", order = c("yes", "no"))Here, hypothesize is used to set the null hypothesis as
a test for independence. In one sample cases, the null
argument can be set to “point” to test a hypothesis relative to a point
estimate.
Also, note that the type argument within
generate is set to permute, whichis the
argument when generating a null distribution for a hypothesis test.
We can visualize this null distribution with the following code:
null permutations have a difference
of at least obs_stat?# Pull the observed statistic as a number
obs_val <- as.numeric(obs_diff$stat)
# Right-tail count: null stats >= observed (useful if you were doing a "greater" test)
right_tail_count <- sum(null_dist$stat >= obs_val)
# Two-sided count: |null stat| >= |observed|
two_sided_count <- sum(abs(null_dist$stat) >= abs(obs_val))
# Print both
list(
observed_difference = obs_val,
right_tail_count = right_tail_count,
two_sided_count = two_sided_count,
total_permutations = nrow(null_dist)
)## $observed_difference
## [1] 1.774584
##
## $right_tail_count
## [1] 0
##
## $two_sided_count
## [1] 0
##
## $total_permutations
## [1] 1000
My observed difference in mean weight between the two groups (those active 3+ days vs. not) was about 1.77 kg. When I compared this to the 1,000 permutation samples under the null hypothesis, 0 out of 1,000 of those simulated differences were as large as mine (in either direction).
Since none of the permuted differences were this extreme, this suggests that getting a result like mine just by random chance is very unlikely if there was truly no relationship.
Now that the test is initialized and the null distribution formed,
you can calculate the p-value for your hypothesis test using the
function get_p_value.
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0
This the standard workflow for performing hypothesis tests.
# Observed difference in means
obs_diff_means <- yrbss %>%
drop_na(physical_3plus) %>%
specify(weight ~ physical_3plus) %>%
calculate(stat = "diff in means", order = c("yes", "no"))
# Bootstrap distribution
boot_dist_means <- yrbss %>%
drop_na(physical_3plus) %>%
specify(weight ~ physical_3plus) %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "diff in means", order = c("yes", "no"))
# 95% CI for the mean difference
ci_diff_means <- boot_dist_means %>%
get_ci(level = 0.95)
obs_diff_means## Response: weight (numeric)
## Explanatory: physical_3plus (factor)
## # A tibble: 1 × 1
## stat
## <dbl>
## 1 1.77
## # A tibble: 1 × 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 1.12 2.44
obs_val <- as.numeric(obs_diff_means$stat)
ci_lower <- ci_diff_means$lower_ci
ci_upper <- ci_diff_means$upper_ci
c(obs_val = obs_val, ci_lower = ci_lower, ci_upper = ci_upper)## obs_val ci_lower ci_upper
## 1.774584 1.115769 2.436856
This confidence interval suggests that students who are physically active at least 3 days a week tend to weigh about 1.1 to 2.4 kg more on average than those who are less active. Since the entire interval is above 0, this supports the idea that the difference is real and not due to chance. In simple terms, more active students tend to weigh slightly more, and we’re 95% confident the true difference falls within this range.
height) and interpret it in context.library(dplyr)
library(infer)
# Bootstrap CI for the population mean height
height_boot <- yrbss %>%
drop_na(height) %>%
specify(response = height) %>%
generate(reps = 2000, type = "bootstrap") %>%
calculate(stat = "mean")
ci_height_95 <- height_boot %>% get_ci(level = 0.95)
# Also show the sample mean for context
height_mean <- yrbss %>% summarize(mean_height = mean(height, na.rm = TRUE))
list(
sample_mean_m = height_mean$mean_height,
ci_lower_95 = ci_height_95$lower_ci,
ci_upper_95 = ci_height_95$upper_ci
)## $sample_mean_m
## [1] 1.691241
##
## $ci_lower_95
## [1] 1.689452
##
## $ci_upper_95
## [1] 1.693023
The 95% CI for average height is roughly Lower 1.689491 and upper 1.693033 meters, centered around a sample mean of about 1.691241 meters. In plain English, we’re pretty confident the true average height of students in this survey lives in that range.
ci_height_90 <- height_boot %>% get_ci(level = 0.90)
list(
ci_lower_90 = ci_height_90$lower_ci,
ci_upper_90 = ci_height_90$upper_ci
)## $ci_lower_90
## [1] 1.689734
##
## $ci_upper_90
## [1] 1.692731
The 90% CI is narrower than the 95% CI (that’s expected). Less confidence = tighter range. So the 90% interval gives a more precise estimate, but with a slightly higher chance of missing the true mean.
# Observed difference in means (yes - no)
obs_diff_height <- yrbss %>%
drop_na(height, physical_3plus) %>%
specify(height ~ physical_3plus) %>%
calculate(stat = "diff in means", order = c("yes", "no"))
# Null distribution via permutation under independence
null_height <- yrbss %>%
drop_na(height, physical_3plus) %>%
specify(height ~ physical_3plus) %>%
hypothesize(null = "independence") %>%
generate(reps = 2000, type = "permute") %>%
calculate(stat = "diff in means", order = c("yes", "no"))
# Two-sided p-value
pval_height <- null_height %>%
get_p_value(obs_stat = obs_diff_height, direction = "two_sided")
list(
observed_diff_m = as.numeric(obs_diff_height$stat),
p_value = pval_height$p_value
)## $observed_diff_m
## [1] 0.03762589
##
## $p_value
## [1] 0
There is strong evidence that average height is different between students who exercise at least three times a week and those who don’t. The observed difference in average height was about 0.04 meters (around 4 cm), with the more active students being slightly taller on average. The p-value was reported as 0 (effectively < 0.001), which is well below my 0.05 significance level. This means the difference we’re seeing is very unlikely to be due to random chance.
So, I would reject the null hypothesis and conclude that students who exercise more tend to be a bit taller than those who don’t — although the difference is small in real-world terms.
hours_tv_per_school_day
there are.tv_levels <- yrbss %>%
distinct(hours_tv_per_school_day) %>%
arrange(hours_tv_per_school_day)
n_tv_levels <- tv_levels %>% filter(!is.na(hours_tv_per_school_day)) %>% nrow()
list(
num_levels = n_tv_levels,
levels = tv_levels$hours_tv_per_school_day
)## $num_levels
## [1] 7
##
## $levels
## [1] "1" "2" "3" "4" "5+"
## [6] "<1" "do not watch" NA
There are 7 different options in the dataset for hours_tv_per_school_day.
# Create a sleep grouping variable: 8+ hours vs. < 8 hours
yrbss_q12 <- yrbss %>%
mutate(
sleep8 = ifelse(!is.na(school_night_hours_sleep) & as.numeric(school_night_hours_sleep) >= 8, "8+ hrs", "<8 hrs"),
weight_kg = weight
) %>%
filter(!is.na(sleep8), !is.na(weight_kg))
# Quick sample sizes (condition check)
yrbss_q12 %>% count(sleep8)## # A tibble: 2 × 2
## sleep8 n
## <chr> <int>
## 1 8+ hrs 3210
## 2 <8 hrs 8255
# Observed difference in mean weight (8+ minus <8)
obs_sleep_weight <- yrbss_q12 %>%
specify(weight_kg ~ sleep8) %>%
calculate(stat = "diff in means", order = c("8+ hrs", "<8 hrs"))
# Bootstrap distribution
boot_sleep_weight <- yrbss_q12 %>%
specify(weight_kg ~ sleep8) %>%
generate(reps = 2000, type = "bootstrap") %>%
calculate(stat = "diff in means", order = c("8+ hrs", "<8 hrs"))
# 95% Confidence Interval for the difference
ci_sleep_weight <- boot_sleep_weight %>%
get_ci(level = 0.95)
obs_sleep_weight## Response: weight_kg (numeric)
## Explanatory: sleep8 (factor)
## # A tibble: 1 × 1
## stat
## <dbl>
## 1 -0.911
## # A tibble: 1 × 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 -1.59 -0.240
Research Question: Do students who get at least 8 hours of sleep on a school night weigh more or less, on average, than students who sleep under 8 hours?
Hypotheses (in plain words): H0 (No difference): Sleep and weight are not related. Students who sleep 8+ hours weigh the same on average as those who sleep less.
HA (Difference exists): Students who sleep 8+ hours weigh a different amount on average than students who sleep less.
Check conditions: The sample is large enough in both sleep groups, and the data points represent individual students, so we can treat them as independent.
The sample sizes for both groups are well above 30, so the conditions for running this test look fine.
Results: The data shows that students who sleep 8+ hours weigh about 0.9 kg less on average than those who sleep under 8 hours. The 95% confidence interval for the difference is from −1.58 kg to −0.260 kg, and it does not include 0. That means this difference is not just due to chance.
Conclusion: Based on this sample, students who get 8+ hours of sleep actually weigh a bit less than students who sleep less. Since the entire confidence interval was negative, it backs up the idea that there’s a real difference here—not just random luck. So in this case, more sleep is linked with slightly lower weight for high school students.