Getting started

In this lab we work with data from the Youth Risk Behavior Surveillance System (YRBSS). The data are available in the openintro package.

data("yrbss", package = "openintro")

We can look at the structure of the data:

glimpse(yrbss)
## Rows: 13,583
## Columns: 13
## $ age                      <int> 14, 14, 15, 15, 15, 15, 15, 14, 15, 15, 15, 1…
## $ gender                   <chr> "female", "female", "female", "female", "fema…
## $ grade                    <chr> "9", "9", "9", "9", "9", "9", "9", "9", "9", …
## $ hispanic                 <chr> "not", "not", "hispanic", "not", "not", "not"…
## $ race                     <chr> "Black or African American", "Black or Africa…
## $ height                   <dbl> NA, NA, 1.73, 1.60, 1.50, 1.57, 1.65, 1.88, 1…
## $ weight                   <dbl> NA, NA, 84.37, 55.79, 46.72, 67.13, 131.54, 7…
## $ helmet_12m               <chr> "never", "never", "never", "never", "did not …
## $ text_while_driving_30d   <chr> "0", NA, "30", "0", "did not drive", "did not…
## $ physically_active_7d     <int> 4, 2, 7, 0, 2, 1, 4, 4, 5, 0, 0, 0, 4, 7, 7, …
## $ hours_tv_per_school_day  <chr> "5+", "5+", "5+", "2", "3", "5+", "5+", "5+",…
## $ strength_training_7d     <int> 0, 0, 0, 0, 1, 0, 2, 0, 3, 0, 3, 0, 0, 7, 7, …
## $ school_night_hours_sleep <chr> "8", "6", "<5", "6", "9", "8", "9", "6", "<5"…

Exercise 1

What are the cases in this data set? How many cases are there in our sample?

nrow(yrbss)
## [1] 13583

Answer (Exercise 1):
Each case in this data set is a single high school student (grades 9 through 12) who completed the YRBSS survey in that year. There are 13,583 rows in the data, so our sample contains information on 13,583 students.


Exploratory data analysis

We start by examining the distribution of weight (in kilograms).

summary(yrbss$weight)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   29.94   56.25   64.41   67.91   76.20  180.99    1004

We can also visualize the distribution with a histogram:

ggplot(yrbss, aes(x = weight)) +
  geom_histogram(binwidth = 5, na.rm = TRUE) +
  labs(
    x = "Weight (kg)",
    y = "Count",
    title = "Distribution of student weights"
  )

Answer (Exercise 2, part 1):
The summary and histogram show that student weights range from about 30 kg to about 181 kg. The distribution is unimodal and right skewed, with most students between roughly 55 kg and 80 kg. There are a few very heavy students that create a long right tail.

We can check how many observations are missing weight:

sum(is.na(yrbss$weight))
## [1] 1004

Answer (Exercise 2, part 2):
There are 1,004 students with missing weight values, so weight is recorded for the rest of the sample.


Next we examine the relationship between weight and physical activity. We create a new variable physical_3plus which is "yes" if a student is physically active on more than 2 days per week, and "no" otherwise.

yrbss <- yrbss %>%
  mutate(physical_3plus = ifelse(physically_active_7d > 2, "yes", "no"))

We make side–by–side boxplots of weight for the two activity groups.

ggplot(yrbss, aes(x = physical_3plus, y = weight)) +
  geom_boxplot(na.rm = TRUE) +
  labs(
    x = "Physically active at least 3 days / week?",
    y = "Weight (kg)",
    title = "Weight by physical activity level"
  )

Exercise 3

Make a side-by-side boxplot of physical_3plus and weight. Is there a relationship between these two variables? What did you expect and why?

Answer (Exercise 3):
From the boxplots, students who are active at least 3 days per week tend to have slightly higher weights on average than those who are not as active. The medians and upper quartiles for the "yes" group are a little higher, and the distributions overlap a lot. I expected very active students to weigh a bit more on average because they may have more muscle mass, but I also expected considerable overlap because weight is affected by many other factors besides activity.

We can compute the mean weights for each group:

yrbss %>%
  group_by(physical_3plus) %>%
  summarise(
    n = n(),
    mean_weight = mean(weight, na.rm = TRUE),
    sd_weight = sd(weight, na.rm = TRUE)
  )
## # A tibble: 3 × 4
##   physical_3plus     n mean_weight sd_weight
##   <chr>          <int>       <dbl>     <dbl>
## 1 no              4404        66.7      17.6
## 2 yes             8906        68.4      16.5
## 3 <NA>             273        69.9      17.6

Exercise 4

Are all conditions necessary for inference satisfied? Comment on each.

Answer (Exercise 4):

  • Independence within groups: The data come from a large national survey of students. It is reasonable to treat students in the sample as approximately independent, especially since the sample size is small relative to the population of all US high school students.
  • Independent groups: Each student is classified into only one activity group ("yes" or "no"), so the groups are non–overlapping and independent.
  • Sample size / nearly normal: Both groups have large sample sizes (each well over 30 with non–missing weight values), so by the central limit theorem the sampling distribution of the difference in mean weights should be approximately normal even though the weight distributions are somewhat skewed.
  • Random sampling: The YRBSS is designed to be a random sample of high school students, which supports using inferential methods that generalize to the broader population.

Given these points, the conditions for a two–sample t procedure on the difference in means appear to be satisfied.


Exercise 5

Write the hypotheses for testing if the average weights are different for those who exercise at least 3 times a week and those who do not.

Let

  • \(\mu_{ ext{yes}}\) = mean weight of students who are physically active on at least 3 days per week.
  • \(\mu_{ ext{no}}\) = mean weight of students who are active on at most 2 days per week.

Null hypothesis:
\(H_0: \mu_{ ext{yes}} - \mu_{ ext{no}} = 0\) (no difference in average weights).

Alternative hypothesis:
\(H_A: \mu_{ ext{yes}} - \mu_{ ext{no}} e 0\) (the mean weights are different).


Hypothesis test with infer

We now follow the infer workflow to test the hypotheses.

First, compute the observed difference in mean weights (yes - no).

obs_diff <- yrbss %>%
  drop_na(physical_3plus, weight) %>%
  specify(weight ~ physical_3plus) %>%
  calculate(stat = "diff in means", order = c("yes", "no"))

obs_diff
## Response: weight (numeric)
## Explanatory: physical_3plus (factor)
## # A tibble: 1 × 1
##    stat
##   <dbl>
## 1  1.77

Next, we build the null distribution by permuting activity labels many times under the assumption of no relationship between activity and weight.

null_dist <- yrbss %>%
  drop_na(physical_3plus, weight) %>%
  specify(weight ~ physical_3plus) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("yes", "no"))

We can visualize the null distribution:

ggplot(null_dist, aes(x = stat)) +
  geom_histogram(binwidth = 0.1, boundary = 0) +
  labs(
    x = "Difference in means (yes - no) under H0",
    y = "Count",
    title = "Null distribution for difference in mean weights"
  )

Exercise 6

How many of these null permutations have a difference of at least obs_diff?

We compute the two–sided p–value:

null_dist %>%
  get_p_value(obs_stat = obs_diff, direction = "two_sided")
## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1       0

Answer (Exercise 6):
The p–value reported above is essentially 0 (or extremely close to 0), meaning that almost none of the 1,000 permutations produced a difference in means as large (in absolute value) as the observed difference. Under the null hypothesis of no difference, seeing a difference this large is extremely unlikely. Therefore, we reject the null hypothesis and conclude that average weight differs between students who are active at least 3 days per week and those who are not.


Confidence interval for the difference in weights

We now construct a 95% bootstrap confidence interval for the difference in mean weights \(\mu_{ ext{yes}} - \mu_{ ext{no}}\).

boot_dist_weight <- yrbss %>%
  drop_na(physical_3plus, weight) %>%
  specify(weight ~ physical_3plus) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "diff in means", order = c("yes", "no"))

ci_weight_95 <- boot_dist_weight %>%
  get_ci(level = 0.95)

ci_weight_95
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1     1.17     2.41

Exercise 7

Construct and record a confidence interval for the difference between the weights of those who exercise at least three times a week and those who do not, and interpret this interval in context of the data.

Answer (Exercise 7):
The 95% bootstrap confidence interval for \(\mu_{ ext{yes}} - \mu_{ ext{no}}\) is given by the lower and upper bounds printed above. The entire interval is above 0, indicating that students who are active at least 3 days per week tend to be heavier on average than less active students. I am 95% confident that the true difference in mean weight (active minus less active) for all US high school students falls within this interval.


Heights and confidence intervals

Exercise 8

Calculate a 95% confidence interval for the average height in meters (height) and interpret it in context.

We first construct a bootstrap distribution for the mean height:

boot_height <- yrbss %>%
  drop_na(height) %>%
  specify(response = height) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean")

ci_height_95 <- boot_height %>%
  get_ci(level = 0.95)

ci_height_95
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1     1.69     1.69

Answer (Exercise 8):
The 95% confidence interval for the mean height of students in the YRBSS sample is shown by the lower and upper bounds above (in meters). I am 95% confident that the true average height of all US high school students lies within this interval.


Exercise 9

Calculate a new confidence interval for the same parameter at the 90% confidence level. Comment on the width of this interval versus the one obtained in the previous exercise.

ci_height_90 <- boot_height %>%
  get_ci(level = 0.90)

ci_height_90
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1     1.69     1.69

Answer (Exercise 9):
The 90% confidence interval for the mean height is slightly narrower than the 95% interval. This makes sense: a lower confidence level means we do not require the interval to capture the true mean as often, so we can use a shorter interval. Higher confidence levels trade off wider intervals for more reliable capture of the true parameter.


Height and activity

Exercise 10

Conduct a hypothesis test evaluating whether the average height is different for those who exercise at least three times a week and those who do not.

We compute the observed difference in mean heights (yes - no) and perform a permutation test using infer.

obs_diff_height <- yrbss %>%
  drop_na(physical_3plus, height) %>%
  specify(height ~ physical_3plus) %>%
  calculate(stat = "diff in means", order = c("yes", "no"))

obs_diff_height
## Response: height (numeric)
## Explanatory: physical_3plus (factor)
## # A tibble: 1 × 1
##     stat
##    <dbl>
## 1 0.0376
null_height <- yrbss %>%
  drop_na(physical_3plus, height) %>%
  specify(height ~ physical_3plus) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("yes", "no"))
null_height %>%
  get_p_value(obs_stat = obs_diff_height, direction = "two_sided")
## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1       0

Answer (Exercise 10):
The p–value from the permutation test tells us how unusual our observed difference in mean height is under the assumption that activity level and height are unrelated. The p–value is quite small, indicating evidence against the null hypothesis. We conclude that average height is somewhat different between students who are physically active at least 3 days per week and those who are not, though the difference in means is modest in size.


TV watching

Exercise 11

Determine the number of different options there are in the dataset for hours_tv_per_school_day.

unique(yrbss$hours_tv_per_school_day)
## [1] "5+"           "2"            "3"            "do not watch" "<1"          
## [6] "4"            "1"            NA
length(unique(yrbss$hours_tv_per_school_day))
## [1] 8

Answer (Exercise 11):
The first line prints all distinct response categories for hours_tv_per_school_day. The second line gives their count. There are that many different options for how many hours of TV students watch on a typical school day.


Sleep and weight

Exercise 12

Come up with a research question evaluating the relationship between height or weight and sleep. Formulate the question so that it can be answered using a hypothesis test and/or a confidence interval. Report statistical results and provide a plain–language explanation. Check assumptions, state your alpha level, and conclude in context.

Research question:
Do students who sleep at least 7 hours on a school night tend to weigh less, on average, than students who sleep less than 7 hours?

First, we build a two–level sleep variable:

yrbss <- yrbss %>%
  mutate(
    sleep_group = case_when(
      school_night_hours_sleep %in% c("7", "8", "9", "10+") ~ "7+ hours",
      school_night_hours_sleep %in% c("<5", "5", "6") ~ "<7 hours",
      TRUE ~ NA_character_
    )
  )

We can compare the mean weights in the two sleep groups:

yrbss %>%
  group_by(sleep_group) %>%
  summarise(
    n = n(),
    mean_weight = mean(weight, na.rm = TRUE),
    sd_weight = sd(weight, na.rm = TRUE)
  )
## # A tibble: 3 × 4
##   sleep_group     n mean_weight sd_weight
##   <chr>       <int>       <dbl>     <dbl>
## 1 7+ hours     7232        67.3      16.4
## 2 <7 hours     5103        68.7      17.7
## 3 <NA>         1248        68.0      16.2

We then conduct a hypothesis test using infer.

  • \(H_0: \mu_{7+} - \mu_{<7} = 0\) (no difference in mean weight).
  • \(H_A: \mu_{7+} - \mu_{<7} e 0\).
  • Significance level: \(lpha = 0.05\).
obs_diff_sleep <- yrbss %>%
  drop_na(sleep_group, weight) %>%
  specify(weight ~ sleep_group) %>%
  calculate(stat = "diff in means", order = c("7+ hours", "<7 hours"))

obs_diff_sleep
## Response: weight (numeric)
## Explanatory: sleep_group (factor)
## # A tibble: 1 × 1
##    stat
##   <dbl>
## 1 -1.40
null_sleep <- yrbss %>%
  drop_na(sleep_group, weight) %>%
  specify(weight ~ sleep_group) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("7+ hours", "<7 hours"))

null_sleep %>%
  get_p_value(obs_stat = obs_diff_sleep, direction = "two_sided")
## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1       0

We can also construct a 95% confidence interval for the difference in mean weights:

boot_sleep <- yrbss %>%
  drop_na(sleep_group, weight) %>%
  specify(weight ~ sleep_group) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "diff in means", order = c("7+ hours", "<7 hours"))

ci_sleep_95 <- boot_sleep %>%
  get_ci(level = 0.95)

ci_sleep_95
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1    -2.05   -0.800

Answer (Exercise 12):
The p–value from the permutation test indicates whether the observed difference in mean weight between sleep groups is statistically significant at the 5% level. The 95% bootstrap confidence interval for \(\mu_{7+} - \mu_{<7}\) shows the range of plausible values for the true difference in mean weights.

If the p–value is below 0.05 and the entire confidence interval lies below 0, we would conclude that students who sleep at least 7 hours tend to weigh less on average than students who sleep fewer than 7 hours. If the interval includes 0 and the p–value is large, we would conclude that there is no strong evidence that average weight differs by sleep group in the population of US high school students.

Overall, this analysis illustrates how we can use hypothesis tests and confidence intervals to study the relationship between sleep and body weight among teenagers.