In this lab we work with data from the Youth Risk Behavior
Surveillance System (YRBSS). The data are available in the
openintro package.
data("yrbss", package = "openintro")
We can look at the structure of the data:
glimpse(yrbss)
## Rows: 13,583
## Columns: 13
## $ age <int> 14, 14, 15, 15, 15, 15, 15, 14, 15, 15, 15, 1…
## $ gender <chr> "female", "female", "female", "female", "fema…
## $ grade <chr> "9", "9", "9", "9", "9", "9", "9", "9", "9", …
## $ hispanic <chr> "not", "not", "hispanic", "not", "not", "not"…
## $ race <chr> "Black or African American", "Black or Africa…
## $ height <dbl> NA, NA, 1.73, 1.60, 1.50, 1.57, 1.65, 1.88, 1…
## $ weight <dbl> NA, NA, 84.37, 55.79, 46.72, 67.13, 131.54, 7…
## $ helmet_12m <chr> "never", "never", "never", "never", "did not …
## $ text_while_driving_30d <chr> "0", NA, "30", "0", "did not drive", "did not…
## $ physically_active_7d <int> 4, 2, 7, 0, 2, 1, 4, 4, 5, 0, 0, 0, 4, 7, 7, …
## $ hours_tv_per_school_day <chr> "5+", "5+", "5+", "2", "3", "5+", "5+", "5+",…
## $ strength_training_7d <int> 0, 0, 0, 0, 1, 0, 2, 0, 3, 0, 3, 0, 0, 7, 7, …
## $ school_night_hours_sleep <chr> "8", "6", "<5", "6", "9", "8", "9", "6", "<5"…
What are the cases in this data set? How many cases are there in our sample?
nrow(yrbss)
## [1] 13583
Answer (Exercise 1):
Each case in this data set is a single high school
student (grades 9 through 12) who completed the YRBSS survey in that
year. There are 13,583 rows in the data, so our sample contains
information on 13,583 students.
We start by examining the distribution of weight (in kilograms).
summary(yrbss$weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 29.94 56.25 64.41 67.91 76.20 180.99 1004
We can also visualize the distribution with a histogram:
ggplot(yrbss, aes(x = weight)) +
geom_histogram(binwidth = 5, na.rm = TRUE) +
labs(
x = "Weight (kg)",
y = "Count",
title = "Distribution of student weights"
)
Answer (Exercise 2, part 1):
The summary and histogram show that student weights range from about 30
kg to about 181 kg. The distribution is unimodal and right skewed, with
most students between roughly 55 kg and 80 kg. There are a few very
heavy students that create a long right tail.
We can check how many observations are missing weight:
sum(is.na(yrbss$weight))
## [1] 1004
Answer (Exercise 2, part 2):
There are 1,004 students with missing weight values, so weight is
recorded for the rest of the sample.
Next we examine the relationship between weight and
physical activity. We create a new variable
physical_3plus which is "yes" if a student is
physically active on more than 2 days per week, and "no"
otherwise.
yrbss <- yrbss %>%
mutate(physical_3plus = ifelse(physically_active_7d > 2, "yes", "no"))
We make side–by–side boxplots of weight for the two activity groups.
ggplot(yrbss, aes(x = physical_3plus, y = weight)) +
geom_boxplot(na.rm = TRUE) +
labs(
x = "Physically active at least 3 days / week?",
y = "Weight (kg)",
title = "Weight by physical activity level"
)
Make a side-by-side boxplot of
physical_3plusandweight. Is there a relationship between these two variables? What did you expect and why?
Answer (Exercise 3):
From the boxplots, students who are active at least 3 days per week tend
to have slightly higher weights on average than those
who are not as active. The medians and upper quartiles for the
"yes" group are a little higher, and the distributions
overlap a lot. I expected very active students to weigh a bit more on
average because they may have more muscle mass, but I also expected
considerable overlap because weight is affected by many other factors
besides activity.
We can compute the mean weights for each group:
yrbss %>%
group_by(physical_3plus) %>%
summarise(
n = n(),
mean_weight = mean(weight, na.rm = TRUE),
sd_weight = sd(weight, na.rm = TRUE)
)
## # A tibble: 3 × 4
## physical_3plus n mean_weight sd_weight
## <chr> <int> <dbl> <dbl>
## 1 no 4404 66.7 17.6
## 2 yes 8906 68.4 16.5
## 3 <NA> 273 69.9 17.6
Are all conditions necessary for inference satisfied? Comment on each.
Answer (Exercise 4):
"yes" or "no"), so
the groups are non–overlapping and independent.Given these points, the conditions for a two–sample t procedure on the difference in means appear to be satisfied.
Write the hypotheses for testing if the average weights are different for those who exercise at least 3 times a week and those who do not.
Let
Null hypothesis:
\(H_0: \mu_{ ext{yes}} - \mu_{ ext{no}} =
0\) (no difference in average weights).
Alternative hypothesis:
\(H_A: \mu_{ ext{yes}} - \mu_{ ext{no}}
e 0\) (the mean weights are different).
inferWe now follow the infer workflow to test the
hypotheses.
First, compute the observed difference in mean weights
(yes - no).
obs_diff <- yrbss %>%
drop_na(physical_3plus, weight) %>%
specify(weight ~ physical_3plus) %>%
calculate(stat = "diff in means", order = c("yes", "no"))
obs_diff
## Response: weight (numeric)
## Explanatory: physical_3plus (factor)
## # A tibble: 1 × 1
## stat
## <dbl>
## 1 1.77
Next, we build the null distribution by permuting activity labels many times under the assumption of no relationship between activity and weight.
null_dist <- yrbss %>%
drop_na(physical_3plus, weight) %>%
specify(weight ~ physical_3plus) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in means", order = c("yes", "no"))
We can visualize the null distribution:
ggplot(null_dist, aes(x = stat)) +
geom_histogram(binwidth = 0.1, boundary = 0) +
labs(
x = "Difference in means (yes - no) under H0",
y = "Count",
title = "Null distribution for difference in mean weights"
)
How many of these null permutations have a difference of at least
obs_diff?
We compute the two–sided p–value:
null_dist %>%
get_p_value(obs_stat = obs_diff, direction = "two_sided")
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0
Answer (Exercise 6):
The p–value reported above is essentially 0 (or extremely close to 0),
meaning that almost none of the 1,000 permutations
produced a difference in means as large (in absolute value) as the
observed difference. Under the null hypothesis of no difference, seeing
a difference this large is extremely unlikely. Therefore, we reject the
null hypothesis and conclude that average weight differs between
students who are active at least 3 days per week and those who are
not.
We now construct a 95% bootstrap confidence interval for the difference in mean weights \(\mu_{ ext{yes}} - \mu_{ ext{no}}\).
boot_dist_weight <- yrbss %>%
drop_na(physical_3plus, weight) %>%
specify(weight ~ physical_3plus) %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "diff in means", order = c("yes", "no"))
ci_weight_95 <- boot_dist_weight %>%
get_ci(level = 0.95)
ci_weight_95
## # A tibble: 1 × 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 1.17 2.41
Construct and record a confidence interval for the difference between the weights of those who exercise at least three times a week and those who do not, and interpret this interval in context of the data.
Answer (Exercise 7):
The 95% bootstrap confidence interval for \(\mu_{ ext{yes}} - \mu_{ ext{no}}\) is
given by the lower and upper bounds printed above. The entire interval
is above 0, indicating that students who are active at
least 3 days per week tend to be heavier on average than less active
students. I am 95% confident that the true difference in mean weight
(active minus less active) for all US high school students falls within
this interval.
Calculate a 95% confidence interval for the average height in meters (
height) and interpret it in context.
We first construct a bootstrap distribution for the mean height:
boot_height <- yrbss %>%
drop_na(height) %>%
specify(response = height) %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "mean")
ci_height_95 <- boot_height %>%
get_ci(level = 0.95)
ci_height_95
## # A tibble: 1 × 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 1.69 1.69
Answer (Exercise 8):
The 95% confidence interval for the mean height of
students in the YRBSS sample is shown by the lower and upper bounds
above (in meters). I am 95% confident that the true average height of
all US high school students lies within this interval.
Calculate a new confidence interval for the same parameter at the 90% confidence level. Comment on the width of this interval versus the one obtained in the previous exercise.
ci_height_90 <- boot_height %>%
get_ci(level = 0.90)
ci_height_90
## # A tibble: 1 × 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 1.69 1.69
Answer (Exercise 9):
The 90% confidence interval for the mean height is slightly
narrower than the 95% interval. This makes sense: a
lower confidence level means we do not require the interval to capture
the true mean as often, so we can use a shorter interval. Higher
confidence levels trade off wider intervals for more reliable capture of
the true parameter.
Conduct a hypothesis test evaluating whether the average height is different for those who exercise at least three times a week and those who do not.
We compute the observed difference in mean heights
(yes - no) and perform a permutation test using
infer.
obs_diff_height <- yrbss %>%
drop_na(physical_3plus, height) %>%
specify(height ~ physical_3plus) %>%
calculate(stat = "diff in means", order = c("yes", "no"))
obs_diff_height
## Response: height (numeric)
## Explanatory: physical_3plus (factor)
## # A tibble: 1 × 1
## stat
## <dbl>
## 1 0.0376
null_height <- yrbss %>%
drop_na(physical_3plus, height) %>%
specify(height ~ physical_3plus) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in means", order = c("yes", "no"))
null_height %>%
get_p_value(obs_stat = obs_diff_height, direction = "two_sided")
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0
Answer (Exercise 10):
The p–value from the permutation test tells us how unusual our observed
difference in mean height is under the assumption that activity level
and height are unrelated. The p–value is quite small, indicating
evidence against the null hypothesis. We conclude that average height is
somewhat different between students who are physically active at least 3
days per week and those who are not, though the difference in means is
modest in size.
Determine the number of different options there are in the dataset for
hours_tv_per_school_day.
unique(yrbss$hours_tv_per_school_day)
## [1] "5+" "2" "3" "do not watch" "<1"
## [6] "4" "1" NA
length(unique(yrbss$hours_tv_per_school_day))
## [1] 8
Answer (Exercise 11):
The first line prints all distinct response categories for
hours_tv_per_school_day. The second line gives their count.
There are that many different options for how many hours of TV students
watch on a typical school day.
Come up with a research question evaluating the relationship between height or weight and sleep. Formulate the question so that it can be answered using a hypothesis test and/or a confidence interval. Report statistical results and provide a plain–language explanation. Check assumptions, state your alpha level, and conclude in context.
Research question:
Do students who sleep at least 7 hours on a school night tend to weigh
less, on average, than students who sleep less than 7 hours?
First, we build a two–level sleep variable:
yrbss <- yrbss %>%
mutate(
sleep_group = case_when(
school_night_hours_sleep %in% c("7", "8", "9", "10+") ~ "7+ hours",
school_night_hours_sleep %in% c("<5", "5", "6") ~ "<7 hours",
TRUE ~ NA_character_
)
)
We can compare the mean weights in the two sleep groups:
yrbss %>%
group_by(sleep_group) %>%
summarise(
n = n(),
mean_weight = mean(weight, na.rm = TRUE),
sd_weight = sd(weight, na.rm = TRUE)
)
## # A tibble: 3 × 4
## sleep_group n mean_weight sd_weight
## <chr> <int> <dbl> <dbl>
## 1 7+ hours 7232 67.3 16.4
## 2 <7 hours 5103 68.7 17.7
## 3 <NA> 1248 68.0 16.2
We then conduct a hypothesis test using infer.
obs_diff_sleep <- yrbss %>%
drop_na(sleep_group, weight) %>%
specify(weight ~ sleep_group) %>%
calculate(stat = "diff in means", order = c("7+ hours", "<7 hours"))
obs_diff_sleep
## Response: weight (numeric)
## Explanatory: sleep_group (factor)
## # A tibble: 1 × 1
## stat
## <dbl>
## 1 -1.40
null_sleep <- yrbss %>%
drop_na(sleep_group, weight) %>%
specify(weight ~ sleep_group) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in means", order = c("7+ hours", "<7 hours"))
null_sleep %>%
get_p_value(obs_stat = obs_diff_sleep, direction = "two_sided")
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0
We can also construct a 95% confidence interval for the difference in mean weights:
boot_sleep <- yrbss %>%
drop_na(sleep_group, weight) %>%
specify(weight ~ sleep_group) %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "diff in means", order = c("7+ hours", "<7 hours"))
ci_sleep_95 <- boot_sleep %>%
get_ci(level = 0.95)
ci_sleep_95
## # A tibble: 1 × 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 -2.05 -0.800
Answer (Exercise 12):
The p–value from the permutation test indicates whether the observed
difference in mean weight between sleep groups is statistically
significant at the 5% level. The 95% bootstrap confidence interval for
\(\mu_{7+} - \mu_{<7}\) shows the
range of plausible values for the true difference in mean weights.
If the p–value is below 0.05 and the entire confidence interval lies below 0, we would conclude that students who sleep at least 7 hours tend to weigh less on average than students who sleep fewer than 7 hours. If the interval includes 0 and the p–value is large, we would conclude that there is no strong evidence that average weight differs by sleep group in the population of US high school students.
Overall, this analysis illustrates how we can use hypothesis tests and confidence intervals to study the relationship between sleep and body weight among teenagers.