Loading packages and data

library(tidyverse)
library(openintro)
library(infer)

data('yrbss', package='openintro')

?yrbss

glimpse(yrbss)
## Rows: 13,583
## Columns: 13
## $ age                      <int> 14, 14, 15, 15, 15, 15, 15, 14, 15, 15, 15, 1…
## $ gender                   <chr> "female", "female", "female", "female", "fema…
## $ grade                    <chr> "9", "9", "9", "9", "9", "9", "9", "9", "9", …
## $ hispanic                 <chr> "not", "not", "hispanic", "not", "not", "not"…
## $ race                     <chr> "Black or African American", "Black or Africa…
## $ height                   <dbl> NA, NA, 1.73, 1.60, 1.50, 1.57, 1.65, 1.88, 1…
## $ weight                   <dbl> NA, NA, 84.37, 55.79, 46.72, 67.13, 131.54, 7…
## $ helmet_12m               <chr> "never", "never", "never", "never", "did not …
## $ text_while_driving_30d   <chr> "0", NA, "30", "0", "did not drive", "did not…
## $ physically_active_7d     <int> 4, 2, 7, 0, 2, 1, 4, 4, 5, 0, 0, 0, 4, 7, 7, …
## $ hours_tv_per_school_day  <chr> "5+", "5+", "5+", "2", "3", "5+", "5+", "5+",…
## $ strength_training_7d     <int> 0, 0, 0, 0, 1, 0, 2, 0, 3, 0, 3, 0, 0, 7, 7, …
## $ school_night_hours_sleep <chr> "8", "6", "<5", "6", "9", "8", "9", "6", "<5"…

Exercise 1

What are the cases in this data set? How many cases are there in our sample?

Answer

There are 13,583 rows indicating we have that many respondents, or “cases” to work with.

Exercise 2

How many observations are we missing weights from?

summary(yrbss$weight)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   29.94   56.25   64.41   67.91   76.20  180.99    1004

Answer

Summary output indicates we have 1,004 “NA” (or missing) answers from our 13,583 cases.

Exercise 3

Make a side-by-side boxplot of physical_3plus and weight. Is there a relationship between these two variables? What did you expect and why?

yrbss <- yrbss %>% 
  mutate(physical_3plus = ifelse(yrbss$physically_active_7d > 2, "yes", "no"))

ggplot(yrbss, aes(x = physical_3plus, y = weight)) +
  geom_boxplot() +
  labs(
    x = "Physically Active (3+ days)",
    y = "Weight (kg)",
    title = "Weight Distribution by Physical Activity Level"
  )

Answer

I suppose I generally had the assumption in mind that more active cases would have a lower weight. The boxplot doesn’t really show us that however. It does seem to reinforce that high weight outliers exist in the non-active class, but there doesn’t appear to be a clear relationship between activity and weight.In fact we see a wide range of weightsincluding both categories having many above 150kg’s.

yrbss %>%
  group_by(physical_3plus) %>%
  summarise(mean_weight = mean(weight, na.rm = TRUE))
## # A tibble: 3 × 2
##   physical_3plus mean_weight
##   <chr>                <dbl>
## 1 no                    66.7
## 2 yes                   68.4
## 3 <NA>                  69.9

Exercise 4

Are all conditions necessary for inference satisfied? Comment on each. You can compute the group sizes with the summarize command above by defining a new variable with the definition n().

Answer

yrbss %>%
  group_by(physical_3plus) %>%
  summarise(
    mean_weight = mean(weight, na.rm = TRUE),
    n = n()
  )
## # A tibble: 3 × 3
##   physical_3plus mean_weight     n
##   <chr>                <dbl> <int>
## 1 no                    66.7  4404
## 2 yes                   68.4  8906
## 3 <NA>                  69.9   273

Independence: Each row is a different respondent, so this is satisfied. Randomization: YRBSS is a national survey designed to be a sample size of the demographic, so this is satisfied. Normality: The sample size is sufficiently large to satisfy inference.

Exercise 5

Write the hypotheses for testing if the average weights are different for those who exercise at least times a week and those who don’t.

obs_diff <- yrbss %>%
  drop_na(physical_3plus) %>%
  specify(weight ~ physical_3plus) %>%
  calculate(stat = "diff in means", order = c("yes", "no"))

null_dist <- yrbss %>%
  drop_na(physical_3plus) %>%
  specify(weight ~ physical_3plus) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("yes", "no"))

obs_diff
## Response: weight (numeric)
## Explanatory: physical_3plus (factor)
## # A tibble: 1 × 1
##    stat
##   <dbl>
## 1  1.77
ggplot(data = null_dist, aes(x = stat)) +
  geom_histogram()

Answer

Null Hypothesis (H₀): The mean weight for students who exercise at least 3 days a week is equal to the mean weight for those who do not.

Alternative Hypothesis (H₁): The mean weight for students who exercise at least 3 days a week is different from the mean weight for those who do not.

Exercise 6

How many of these null permutations have a difference of at least obs_stat?

null_dist %>%
  get_p_value(obs_stat = obs_diff, direction = "two_sided")
## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1       0

Answer

Looks like 0. Which is an interesting finding.

Exercise 7

Construct and record a confidence interval for the difference between the weights of those who exercise at least three times a week and those who don’t, and interpret this interval in context of the data.

yrbss <- yrbss %>% 
  mutate(physical_3plus = ifelse(physically_active_7d > 2, "yes", "no"))

ci <- yrbss %>%
  drop_na(physical_3plus, weight) %>%
  specify(weight ~ physical_3plus) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "diff in means", order = c("yes", "no")) %>%
  get_ci(level = 0.95, type = "percentile")

ci
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1     1.14     2.39

Answer

We are 95% confident that, on average, high school students who exercise at least three times a week weigh between 1.3 and 2.1 kilograms more than students who exercise less frequently. Since the entire interval is above 0, this supports the conclusion that there is a statistically significant difference in weight between the two groups.

Exercise 8

Calculate a 95% confidence interval for the average height in meters (height) and interpret it in context.

Answer

ci_height <- yrbss %>%
  drop_na(height) %>% 
  specify(response = height) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean") %>%
  get_ci(level = 0.95, type = "percentile")

ci_height
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1     1.69     1.69

We are 95% confident that our high school students are between 1.68 and 1.69 meters in height.

Exercise 9

Calculate a new confidence interval for the same parameter at the 90% confidence level. Comment on the width of this interval versus the one obtained in the previous exercise.

ci_height <- yrbss %>%
  drop_na(height) %>% 
  specify(response = height) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean") %>%
  get_ci(level = 0.90, type = "percentile")

ci_height
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1     1.69     1.69

Interestingly, for me at least, the band is narrower. But I suppose that holds. Because we have less confidence, we can achieve a more narrow range of parameters for height. So this band is less likely to have the true average height, but it is more likely give us a narrower band of average heights.

Exercise 10

Conduct a hypothesis test evaluating whether the average height is different for those who exercise at least three times a week and those who don’t.

Answer

yrbss <- yrbss %>% 
  mutate(physical_3plus = ifelse(physically_active_7d > 2, "yes", "no"))

obs_diff_height <- yrbss %>%
  drop_na(physical_3plus, height) %>%
  specify(height ~ physical_3plus) %>%
  calculate(stat = "diff in means", order = c("yes", "no"))
obs_diff_height
## Response: height (numeric)
## Explanatory: physical_3plus (factor)
## # A tibble: 1 × 1
##     stat
##    <dbl>
## 1 0.0376
null_dist_height <- yrbss %>%
  drop_na(physical_3plus, height) %>%
  specify(height ~ physical_3plus) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("yes", "no"))

p_value_height <- null_dist_height %>%
  get_p_value(obs_stat = obs_diff_height, direction = "two_sided")
p_value_height
## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1       0
library(ggplot2)
ggplot(yrbss, aes(x = physical_3plus, y = height)) +
  geom_boxplot() +
  labs(x = "Physical Activity (3+ days)", 
       y = "Height (m)", 
       title = "Boxplot of Height by Physical Activity Level")

The null hypothesis states that there is no difference in average height between the two groups.

By calculating the a small p-value of .037, it is indicating that the observed difference is unlikely to have occurred by chance, suggesting a statistically significant difference in average height between the two groups.

##Exercise 11

Now, a non-inference task: Determine the number of different options there are in the dataset for the hours_tv_per_school_day there are.

Answer

yrbss %>% 
  distinct(hours_tv_per_school_day) %>% 
  nrow()
## [1] 8

It appears there are 8 options that respondents had to select from.

##Exercise 12

Come up with a research question evaluating the relationship between height or weight and sleep. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval. Report the statistical results, and also provide an explanation in plain language. Be sure to check all assumptions, state your α level, and conclude in context.

Answer

We will start with the research question: Do high school students who get at least 8 hours of sleep on school nights have a different average weight than those who get less than 8 hours of sleep?

Hypothesis:

Null Hypothesis (H₀): There is no difference in average weight between students with sufficient sleep and those with insufficient sleep. Alternative Hypothesis (H₁): The average weights differ between the two groups.

We will use a two-sample hypothesis test at a significance level of α = 0.05.

##Data Prep

unique(yrbss$school_night_hours_sleep)
## [1] "8"   "6"   "<5"  "9"   "10+" "7"   "5"   NA
yrbss <- yrbss %>%
  mutate(sleep_category = case_when(
    school_night_hours_sleep %in% c("8", "9", "10+") ~ "sufficient",
    TRUE ~ "insufficient"
  ))

yrbss$sleep_category <- factor(yrbss$sleep_category, levels = c("sufficient", "insufficient"))

Testing Hypothesis

obs_diff_weight <- yrbss %>%
  drop_na(sleep_category, weight) %>%
  specify(weight ~ sleep_category) %>%
  calculate(stat = "diff in means", order = c("sufficient", "insufficient"))
obs_diff_weight
## Response: weight (numeric)
## Explanatory: sleep_category (factor)
## # A tibble: 1 × 1
##     stat
##    <dbl>
## 1 -0.967
null_dist_weight <- yrbss %>%
  drop_na(sleep_category, weight) %>%
  specify(weight ~ sleep_category) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("sufficient", "insufficient"))

p_value_weight <- null_dist_weight %>%
  get_p_value(obs_stat = obs_diff_weight, direction = "two_sided")
p_value_weight
## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1   0.002

Oh interesting. Results indicate that there is an observed difference in average weight between those who sleep 8+ hours and those who do not. Students who sleep 8+ hours (sufficicent) weigh .97 kgs less than those who have insufficient sleep. Moreover, the p-value is .002 which shows strong evidence against the null hypothesis of no difference between groups.