Getting Started

Load packages

In this lab, we will explore and visualize the data using the tidyverse suite of packages, and perform statistical inference using infer. The data can be found in the companion package for OpenIntro resources, openintro.

Let’s load the packages.

library(tidyverse)
library(openintro)
library(infer)

The data

Every two years, the Centers for Disease Control and Prevention conduct the Youth Risk Behavior Surveillance System (YRBSS) survey, where it takes data from high schoolers (9th through 12th grade), to analyze health patterns. You will work with a selected group of variables from a random sample of observations during one of the years the YRBSS was conducted.

Load the yrbss data set into your workspace.

data('yrbss', package='openintro')

There are observations on 13 different variables, some categorical and some numerical. The meaning of each variable can be found by bringing up the help file:

?yrbss
  1. What are the cases in this data set? How many cases are there in our sample?
# View the dimensions of the dataset
dim(yrbss)
## [1] 13583    13
# Or, to label the output more clearly:
num_rows <- nrow(yrbss)
num_cols <- ncol(yrbss)

num_rows
## [1] 13583
num_cols
## [1] 13

The cases in this data set are individual high school students who participated in the YRBSS survey. Each row represents one student’s responses to the survey questions. There are 13,583 students in this sample.

Remember that you can answer this question by viewing the data in the data viewer or by using the following command:

glimpse(yrbss)
## Rows: 13,583
## Columns: 13
## $ age                      <int> 14, 14, 15, 15, 15, 15, 15, 14, 15, 15, 15, 1…
## $ gender                   <chr> "female", "female", "female", "female", "fema…
## $ grade                    <chr> "9", "9", "9", "9", "9", "9", "9", "9", "9", …
## $ hispanic                 <chr> "not", "not", "hispanic", "not", "not", "not"…
## $ race                     <chr> "Black or African American", "Black or Africa…
## $ height                   <dbl> NA, NA, 1.73, 1.60, 1.50, 1.57, 1.65, 1.88, 1…
## $ weight                   <dbl> NA, NA, 84.37, 55.79, 46.72, 67.13, 131.54, 7…
## $ helmet_12m               <chr> "never", "never", "never", "never", "did not …
## $ text_while_driving_30d   <chr> "0", NA, "30", "0", "did not drive", "did not…
## $ physically_active_7d     <int> 4, 2, 7, 0, 2, 1, 4, 4, 5, 0, 0, 0, 4, 7, 7, …
## $ hours_tv_per_school_day  <chr> "5+", "5+", "5+", "2", "3", "5+", "5+", "5+",…
## $ strength_training_7d     <int> 0, 0, 0, 0, 1, 0, 2, 0, 3, 0, 3, 0, 0, 7, 7, …
## $ school_night_hours_sleep <chr> "8", "6", "<5", "6", "9", "8", "9", "6", "<5"…

Exploratory data analysis

You will first start with analyzing the weight of the participants in kilograms: weight.

Using visualization and summary statistics, describe the distribution of weights. The summary function can be useful.

summary(yrbss$weight)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   29.94   56.25   64.41   67.91   76.20  180.99    1004
  1. How many observations are we missing weights from?
# Count missing weight values
sum(is.na(yrbss$weight))
## [1] 1004

There are 1004 students who didn’t have their weight recorded. Because of that, they’ll be left out of any weight-related analysis since we don’t have their data.

Next, consider the possible relationship between a high schooler’s weight and their physical activity. Plotting the data is a useful first step because it helps us quickly visualize trends, identify strong associations, and develop research questions.

First, let’s create a new variable physical_3plus, which will be coded as either “yes” if they are physically active for at least 3 days a week, and “no” if not.

yrbss <- yrbss %>% 
  mutate(physical_3plus = ifelse(yrbss$physically_active_7d > 2, "yes", "no"))
  1. Make a side-by-side boxplot of physical_3plus and weight. Is there a relationship between these two variables? What did you expect and why?
yrbss %>%
drop_na(weight) %>%  # remove missing weights so the plot isn't messy
ggplot(aes(x = physical_3plus, y = weight, fill = physical_3plus)) +
geom_boxplot() +
labs(
x = "Physically Active 3+ Days/Week",
y = "Weight (kg)",
title = "Distribution of Weight by Physical Activity Level"
) +
theme_minimal() +
theme(legend.position = "none")

From the boxplot, it looks like students who are active 3 or more days a week tend to weigh a little more on average than those who aren’t as active. This actually makes sense to me because if someone is working out or playing sports more often, they might have more muscle, which can add to their weight. The two groups still overlap a lot though, so the difference isn’t huge just by looking at the plot — but there is a small shift upward for the more active group. Overall, students who are physically active at least 3 days a week show a slightly higher median weight, suggesting a small positive relationship between physical activity and weight.

The box plots show how the medians of the two distributions compare, but we can also compare the means of the distributions using the following to first group the data by the physical_3plus variable, and then calculate the mean weight in these groups using the mean function while ignoring missing values by setting the na.rm argument to TRUE.

yrbss %>%
  group_by(physical_3plus) %>%
  summarise(mean_weight = mean(weight, na.rm = TRUE))
## # A tibble: 3 × 2
##   physical_3plus mean_weight
##   <chr>                <dbl>
## 1 no                    66.7
## 2 yes                   68.4
## 3 <NA>                  69.9

There is an observed difference, but is this difference statistically significant? In order to answer this question we will conduct a hypothesis test.

Inference

  1. Are all conditions necessary for inference satisfied? Comment on each. You can compute the group sizes with the summarize command above by defining a new variable with the definition n().
yrbss %>%
  group_by(physical_3plus) %>%
  summarise(n = n())
## # A tibble: 3 × 2
##   physical_3plus     n
##   <chr>          <int>
## 1 no              4404
## 2 yes             8906
## 3 <NA>             273

Yes, the conditions for inference are met. The sample appears to be randomly collected, and it’s reasonable to treat each student as independent since one student’s weight doesn’t influence another’s. Both activity groups have large sample sizes (well above 30), so the Central Limit Theorem kicks in, meaning the distribution of the sample mean should be close to normal. Because of this, it’s appropriate to run inference comparing the two mean weights.

  1. Write the hypotheses for testing if the average weights are different for those who exercise at least times a week and those who don’t.

Null hypothesis (H₀): There is no difference in the average weight of students who exercise at least 3 times a week compared to those who don’t. In other words, working out more doesn’t change the average weight.

Alternative hypothesis (Hₐ): There is a difference in the average weight between students who exercise at least 3 times a week and those who don’t.

In Symbols:* H₀: μ_yes = μ_no Hₐ: μ_yes ≠ μ_no This is a two-tailed test because we’re not assuming the group that exercises more weighs more or less — just that the means could be different.

Next, we will introduce a new function, hypothesize, that falls into the infer workflow. You will use this method for conducting hypothesis tests.

But first, we need to initialize the test, which we will save as obs_diff.

obs_diff <- yrbss %>%
  drop_na(physical_3plus) %>%
  specify(weight ~ physical_3plus) %>%
  calculate(stat = "diff in means", order = c("yes", "no"))

Notice how you can use the functions specify and calculate again like you did for calculating confidence intervals. Here, though, the statistic you are searching for is the difference in means, with the order being yes - no != 0.

After you have initialized the test, you need to simulate the test on the null distribution, which we will save as null.

null_dist <- yrbss %>%
  drop_na(physical_3plus) %>%
  specify(weight ~ physical_3plus) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("yes", "no"))

Here, hypothesize is used to set the null hypothesis as a test for independence. In one sample cases, the null argument can be set to “point” to test a hypothesis relative to a point estimate.

Also, note that the type argument within generate is set to permute, whichis the argument when generating a null distribution for a hypothesis test.

We can visualize this null distribution with the following code:

ggplot(data = null_dist, aes(x = stat)) +
  geom_histogram()

  1. How many of these null permutations have a difference of at least obs_stat?
# Pull the observed statistic as a number
obs_val <- as.numeric(obs_diff$stat)

# Right-tail count: null stats >= observed (useful if you were doing a "greater" test)
right_tail_count <- sum(null_dist$stat >= obs_val)

# Two-sided count: |null stat| >= |observed|
two_sided_count <- sum(abs(null_dist$stat) >= abs(obs_val))

# Print both
list(
observed_difference = obs_val,
right_tail_count = right_tail_count,
two_sided_count = two_sided_count,
total_permutations = nrow(null_dist)
)
## $observed_difference
## [1] 1.774584
## 
## $right_tail_count
## [1] 0
## 
## $two_sided_count
## [1] 0
## 
## $total_permutations
## [1] 1000

My observed difference in mean weight between the two groups (those active 3+ days vs. not) was about 1.77 kg. When I compared this to the 1,000 permutation samples under the null hypothesis, 0 out of 1,000 of those simulated differences were as large as mine (in either direction).

Since none of the permuted differences were this extreme, this suggests that getting a result like mine just by random chance is very unlikely if there was truly no relationship.

Now that the test is initialized and the null distribution formed, you can calculate the p-value for your hypothesis test using the function get_p_value.

null_dist %>%
  get_p_value(obs_stat = obs_diff, direction = "two_sided")
## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1       0

This the standard workflow for performing hypothesis tests.

  1. Construct and record a confidence interval for the difference between the weights of those who exercise at least three times a week and those who don’t, and interpret this interval in context of the data.
# Observed difference in means
obs_diff_means <- yrbss %>%
  drop_na(physical_3plus) %>%
  specify(weight ~ physical_3plus) %>%
  calculate(stat = "diff in means", order = c("yes", "no"))

# Bootstrap distribution
boot_dist_means <- yrbss %>%
  drop_na(physical_3plus) %>%
  specify(weight ~ physical_3plus) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "diff in means", order = c("yes", "no"))

# 95% CI for the mean difference
ci_diff_means <- boot_dist_means %>%
  get_ci(level = 0.95)

obs_diff_means
## Response: weight (numeric)
## Explanatory: physical_3plus (factor)
## # A tibble: 1 × 1
##    stat
##   <dbl>
## 1  1.77
ci_diff_means
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1     1.12     2.44
obs_val  <- as.numeric(obs_diff_means$stat)
ci_lower <- ci_diff_means$lower_ci
ci_upper <- ci_diff_means$upper_ci

c(obs_val = obs_val, ci_lower = ci_lower, ci_upper = ci_upper)
##  obs_val ci_lower ci_upper 
## 1.774584 1.115769 2.436856

This confidence interval suggests that students who are physically active at least 3 days a week tend to weigh about 1.1 to 2.4 kg more on average than those who are less active. Since the entire interval is above 0, this supports the idea that the difference is real and not due to chance. In simple terms, more active students tend to weigh slightly more, and we’re 95% confident the true difference falls within this range.


More Practice

  1. Calculate a 95% confidence interval for the average height in meters (height) and interpret it in context.
library(dplyr)
library(infer)

# Bootstrap CI for the population mean height

height_boot <- yrbss %>%
drop_na(height) %>%
specify(response = height) %>%
generate(reps = 2000, type = "bootstrap") %>%
calculate(stat = "mean")

ci_height_95 <- height_boot %>% get_ci(level = 0.95)

# Also show the sample mean for context

height_mean <- yrbss %>% summarize(mean_height = mean(height, na.rm = TRUE))

list(
sample_mean_m = height_mean$mean_height,
ci_lower_95 = ci_height_95$lower_ci,
ci_upper_95 = ci_height_95$upper_ci
)
## $sample_mean_m
## [1] 1.691241
## 
## $ci_lower_95
## [1] 1.689452
## 
## $ci_upper_95
## [1] 1.693023

The 95% CI for average height is roughly Lower 1.689491 and upper 1.693033 meters, centered around a sample mean of about 1.691241 meters. In plain English, we’re pretty confident the true average height of students in this survey lives in that range.

  1. Calculate a new confidence interval for the same parameter at the 90% confidence level. Comment on the width of this interval versus the one obtained in the previous exercise.
ci_height_90 <- height_boot %>% get_ci(level = 0.90)

list(
ci_lower_90 = ci_height_90$lower_ci,
ci_upper_90 = ci_height_90$upper_ci
)
## $ci_lower_90
## [1] 1.689734
## 
## $ci_upper_90
## [1] 1.692731

The 90% CI is narrower than the 95% CI (that’s expected). Less confidence = tighter range. So the 90% interval gives a more precise estimate, but with a slightly higher chance of missing the true mean.

  1. Conduct a hypothesis test evaluating whether the average height is different for those who exercise at least three times a week and those who don’t.
# Observed difference in means (yes - no)

obs_diff_height <- yrbss %>%
drop_na(height, physical_3plus) %>%
specify(height ~ physical_3plus) %>%
calculate(stat = "diff in means", order = c("yes", "no"))

# Null distribution via permutation under independence

null_height <- yrbss %>%
drop_na(height, physical_3plus) %>%
specify(height ~ physical_3plus) %>%
hypothesize(null = "independence") %>%
generate(reps = 2000, type = "permute") %>%
calculate(stat = "diff in means", order = c("yes", "no"))

# Two-sided p-value

pval_height <- null_height %>%
get_p_value(obs_stat = obs_diff_height, direction = "two_sided")

list(
observed_diff_m = as.numeric(obs_diff_height$stat),
p_value = pval_height$p_value
)
## $observed_diff_m
## [1] 0.03762589
## 
## $p_value
## [1] 0

There is strong evidence that average height is different between students who exercise at least three times a week and those who don’t. The observed difference in average height was about 0.04 meters (around 4 cm), with the more active students being slightly taller on average. The p-value was reported as 0 (effectively < 0.001), which is well below my 0.05 significance level. This means the difference we’re seeing is very unlikely to be due to random chance.

So, I would reject the null hypothesis and conclude that students who exercise more tend to be a bit taller than those who don’t — although the difference is small in real-world terms.

  1. Now, a non-inference task: Determine the number of different options there are in the dataset for the hours_tv_per_school_day there are.
tv_levels <- yrbss %>%
distinct(hours_tv_per_school_day) %>%
arrange(hours_tv_per_school_day)

n_tv_levels <- tv_levels %>% filter(!is.na(hours_tv_per_school_day)) %>% nrow()

list(
num_levels = n_tv_levels,
levels = tv_levels$hours_tv_per_school_day
)
## $num_levels
## [1] 7
## 
## $levels
## [1] "1"            "2"            "3"            "4"            "5+"          
## [6] "<1"           "do not watch" NA

There are 7 different options in the dataset for hours_tv_per_school_day.

  1. Come up with a research question evaluating the relationship between height or weight and sleep. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval. Report the statistical results, and also provide an explanation in plain language. Be sure to check all assumptions, state your \(\alpha\) level, and conclude in context.
# Create a sleep grouping variable: 8+ hours vs. < 8 hours

yrbss_q12 <- yrbss %>%
mutate(
sleep8 = ifelse(!is.na(school_night_hours_sleep) & as.numeric(school_night_hours_sleep) >= 8, "8+ hrs", "<8 hrs"),
weight_kg = weight
) %>%
filter(!is.na(sleep8), !is.na(weight_kg))

# Quick sample sizes (condition check)

yrbss_q12 %>% count(sleep8)
## # A tibble: 2 × 2
##   sleep8     n
##   <chr>  <int>
## 1 8+ hrs  3210
## 2 <8 hrs  8255
# Observed difference in mean weight (8+ minus <8)

obs_sleep_weight <- yrbss_q12 %>%
specify(weight_kg ~ sleep8) %>%
calculate(stat = "diff in means", order = c("8+ hrs", "<8 hrs"))

# Bootstrap distribution

boot_sleep_weight <- yrbss_q12 %>%
specify(weight_kg ~ sleep8) %>%
generate(reps = 2000, type = "bootstrap") %>%
calculate(stat = "diff in means", order = c("8+ hrs", "<8 hrs"))

# 95% Confidence Interval for the difference

ci_sleep_weight <- boot_sleep_weight %>%
get_ci(level = 0.95)

obs_sleep_weight
## Response: weight_kg (numeric)
## Explanatory: sleep8 (factor)
## # A tibble: 1 × 1
##     stat
##    <dbl>
## 1 -0.911
ci_sleep_weight
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1    -1.59   -0.240

Research Question: Do students who get at least 8 hours of sleep on a school night weigh more or less, on average, than students who sleep under 8 hours?

Hypotheses (in plain words): H0 (No difference): Sleep and weight are not related. Students who sleep 8+ hours weigh the same on average as those who sleep less.

HA (Difference exists): Students who sleep 8+ hours weigh a different amount on average than students who sleep less.

Check conditions: The sample is large enough in both sleep groups, and the data points represent individual students, so we can treat them as independent.

The sample sizes for both groups are well above 30, so the conditions for running this test look fine.

Results: The data shows that students who sleep 8+ hours weigh about 0.9 kg less on average than those who sleep under 8 hours. The 95% confidence interval for the difference is from −1.58 kg to −0.260 kg, and it does not include 0. That means this difference is not just due to chance.

Conclusion: Based on this sample, students who get 8+ hours of sleep actually weigh a bit less than students who sleep less. Since the entire confidence interval was negative, it backs up the idea that there’s a real difference here—not just random luck. So in this case, more sleep is linked with slightly lower weight for high school students.