Getting Started

Load packages

In this lab, we will explore and visualize the data using the tidyverse suite of packages, and perform statistical inference using infer. The data can be found in the companion package for OpenIntro resources, openintro.

Let’s load the packages.

library(tidyverse)
library(openintro)
library(infer)

The data

Every two years, the Centers for Disease Control and Prevention conduct the Youth Risk Behavior Surveillance System (YRBSS) survey, where it takes data from high schoolers (9th through 12th grade), to analyze health patterns. You will work with a selected group of variables from a random sample of observations during one of the years the YRBSS was conducted.

Load the yrbss data set into your workspace.

data('yrbss', package='openintro')

There are observations on 13 different variables, some categorical and some numerical. The meaning of each variable can be found by bringing up the help file:

?yrbss
  1. What are the cases in this data set? How many cases are there in our sample?

Remember that you can answer this question by viewing the data in the data viewer or by using the following command:

glimpse(yrbss)
## Rows: 13,583
## Columns: 13
## $ age                      <int> 14, 14, 15, 15, 15, 15, 15, 14, 15, 15, 15, 1…
## $ gender                   <chr> "female", "female", "female", "female", "fema…
## $ grade                    <chr> "9", "9", "9", "9", "9", "9", "9", "9", "9", …
## $ hispanic                 <chr> "not", "not", "hispanic", "not", "not", "not"…
## $ race                     <chr> "Black or African American", "Black or Africa…
## $ height                   <dbl> NA, NA, 1.73, 1.60, 1.50, 1.57, 1.65, 1.88, 1…
## $ weight                   <dbl> NA, NA, 84.37, 55.79, 46.72, 67.13, 131.54, 7…
## $ helmet_12m               <chr> "never", "never", "never", "never", "did not …
## $ text_while_driving_30d   <chr> "0", NA, "30", "0", "did not drive", "did not…
## $ physically_active_7d     <int> 4, 2, 7, 0, 2, 1, 4, 4, 5, 0, 0, 0, 4, 7, 7, …
## $ hours_tv_per_school_day  <chr> "5+", "5+", "5+", "2", "3", "5+", "5+", "5+",…
## $ strength_training_7d     <int> 0, 0, 0, 0, 1, 0, 2, 0, 3, 0, 3, 0, 0, 7, 7, …
## $ school_night_hours_sleep <chr> "8", "6", "<5", "6", "9", "8", "9", "6", "<5"…
colnames(yrbss)
##  [1] "age"                      "gender"                  
##  [3] "grade"                    "hispanic"                
##  [5] "race"                     "height"                  
##  [7] "weight"                   "helmet_12m"              
##  [9] "text_while_driving_30d"   "physically_active_7d"    
## [11] "hours_tv_per_school_day"  "strength_training_7d"    
## [13] "school_night_hours_sleep"

There are 13,583 observational cases with 13 different variables.

Exploratory data analysis

You will first start with analyzing the weight of the participants in kilograms: weight.

Using visualization and summary statistics, describe the distribution of weights. The summary function can be useful.

summary(yrbss$weight)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   29.94   56.25   64.41   67.91   76.20  180.99    1004
  1. How many observations are we missing weights from?
sum(is.na(yrbss$weight))
## [1] 1004

There are 1,004 observations that are missing weight values.

Next, consider the possible relationship between a high schooler’s weight and their physical activity. Plotting the data is a useful first step because it helps us quickly visualize trends, identify strong associations, and develop research questions.

First, let’s create a new variable physical_3plus, which will be coded as either “yes” if they are physically active for at least 3 days a week, and “no” if not.

yrbss <- yrbss %>% 
  mutate(physical_3plus = ifelse(yrbss$physically_active_7d > 2, "yes", "no")) 
  1. Make a side-by-side boxplot of physical_3plus and weight. Is there a relationship between these two variables? What did you expect and why?
yrbss_no_na <- yrbss %>%
  filter(!is.na(physical_3plus),!is.na(weight))


ggplot(yrbss_no_na, aes(x=physical_3plus, y=weight)) +
  geom_boxplot() + 
  ggtitle("Comparing Two Variables") +
  xlab("Physically Active At Least 3 days") +
  ylab("Weight")

It appears that each boxplot is almost the same. The IQR and median are slightly higher for people who are physically active at least 3 days a week. With the exception of some outliers, the weight of people who are physically active less than 3 days a week is slightly lower than people who are physically active at least 3 days a week.

The box plots show how the medians of the two distributions compare, but we can also compare the means of the distributions using the following to first group the data by the physical_3plus variable, and then calculate the mean weight in these groups using the mean function while ignoring missing values by setting the na.rm argument to TRUE.

yrbss %>%
  group_by(physical_3plus) %>%
  summarise(mean_weight = mean(weight, na.rm = TRUE))
## # A tibble: 3 × 2
##   physical_3plus mean_weight
##   <chr>                <dbl>
## 1 no                    66.7
## 2 yes                   68.4
## 3 <NA>                  69.9

There is an observed difference, but is this difference statistically significant? In order to answer this question we will conduct a hypothesis test.

Inference

  1. Are all conditions necessary for inference satisfied? Comment on each. You can compute the group sizes with the summarize command above by defining a new variable with the definition n().

The 3 conditions that are necessary for inference is that the data comes from a random sample, the sampling distribution has to be approximately normal with at least 10 successes and 10 failures, and the observations need to be independent.

yrbss %>%
  filter(!(is.na(physical_3plus) | is.na(weight))) %>%
  group_by(physical_3plus) %>%
  summarise(count = n())
## # A tibble: 2 × 2
##   physical_3plus count
##   <chr>          <int>
## 1 no              4022
## 2 yes             8342

The inferences in this case are satisfied. The sample is random. There appears to be a normal distribution where there is more than 10 students who are physically active at least 3 days a week and more than 10 students who are physically active less than 3 days a week. Lastly, the sample size is less than 10% of the total high school student population, and therefore the observations are independent.

  1. Write the hypotheses for testing if the average weights are different for those who exercise at least 3 times a week and those who don’t.

Ho: There is no difference between the average weights of students who are physically active at least 3 times a week and those who are physically active less than 3 times a week.

M1 = M2

Ha: There is a difference between the average weights of students who are physically active at least 3 times a week and those who are physically active less than 3 times a week.

M1 != M2

Next, we will introduce a new function, hypothesize, that falls into the infer workflow. You will use this method for conducting hypothesis tests.

But first, we need to initialize the test, which we will save as obs_diff.

obs_diff <- yrbss_no_na %>%
  specify(weight ~ physical_3plus) %>%
  calculate(stat = "diff in means", order = c("yes", "no"))

Notice how you can use the functions specify and calculate again like you did for calculating confidence intervals. Here, though, the statistic you are searching for is the difference in means, with the order being yes - no != 0.

After you have initialized the test, you need to simulate the test on the null distribution, which we will save as null.

null_dist <- yrbss_no_na %>%
  specify(weight ~ physical_3plus) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("yes", "no"))

Here, hypothesize is used to set the null hypothesis as a test for independence. In one sample cases, the null argument can be set to “point” to test a hypothesis relative to a point estimate.

Also, note that the type argument within generate is set to permute, which is the argument when generating a null distribution for a hypothesis test.

We can visualize this null distribution with the following code:

ggplot(data = null_dist, aes(x = stat)) +
  geom_histogram()

  1. How many of these null permutations have a difference of at least obs_stat?
obs_diff[[1]]
## [1] 1.774584
(null_dist$stat >= obs_diff[[1]]) %>%
  table()
## .
## FALSE 
##  1000
sum(null_dist$stat >= obs_diff$stat)
## [1] 0

Now that the test is initialized and the null distribution formed, you can calculate the p-value for your hypothesis test using the function get_p_value.

null_dist %>%
  get_p_value(obs_stat = obs_diff, direction = "two_sided")
## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1       0

This is the standard workflow for performing hypothesis tests.

  1. Construct and record a confidence interval for the difference between the weights of those who exercise at least three times a week and those who don’t, and interpret this interval in context of the data.
weight_mean <- yrbss_no_na %>% 
  group_by(physical_3plus) %>% 
  summarise(mean_weight = mean(weight, na.rm = TRUE))
weight_mean
## # A tibble: 2 × 2
##   physical_3plus mean_weight
##   <chr>                <dbl>
## 1 no                    66.7
## 2 yes                   68.4

The mean of weights of students who are physically active for at least 3 days a week is 68.4, while the mean of students who are not physically active at least 3 days a week is 66.7.

weight_sd <- yrbss_no_na %>% 
  group_by(physical_3plus) %>% 
  summarise(sd_weight = sd(weight, na.rm = TRUE))
weight_sd
## # A tibble: 2 × 2
##   physical_3plus sd_weight
##   <chr>              <dbl>
## 1 no                  17.6
## 2 yes                 16.5

The standard deviation of weights of students who are physically active for at least 3 days a week is 16.5, while the mean of students who are physically active for less than 3 days a week is 17.6.

# Total for students who are physically active at least 3 days a week and less than 3 days a week
total_n <- yrbss %>%
  filter(!(is.na(physical_3plus) | is.na(weight))) %>%
  group_by(physical_3plus) %>%
  summarise(count = n())
total_n
## # A tibble: 2 × 2
##   physical_3plus count
##   <chr>          <int>
## 1 no              4022
## 2 yes             8342
# physically active at least 3 days a week
x1_mean <- 68.4
x1_sd <- 16.5
x1_n <- 8342
# not physically active at least 3 days a week
x2_mean <- 66.7
x2_sd <- 17.6
x2_n <- 4022
# Standard Error
SE <- x1_sd / sqrt(x1_n)
alpha = 0.05
degrees_of_freedom = x1_n - 1
t_score = qt(p=alpha/2, df=degrees_of_freedom,lower.tail=F)
# Margin of Error
margin_error <- t_score * SE
# physically active at least 3 days a week

lower_ci1 <- round(x1_mean - margin_error, 2)
upper_ci1 <- round(x1_mean + margin_error, 2)
lower_ci1
## [1] 68.05
upper_ci1
## [1] 68.75

We are 95% confident that the average weight of a student who is physically active at least 3 days a week will be between 68.05 kg and 68.75 kg.


SE2 <- x2_sd / sqrt(x2_n)
alpha2 = 0.05
degrees_of_freedom2 = x2_n - 1
t_score2 = qt(p=alpha2/2, df=degrees_of_freedom2,lower.tail=F)
margin_error2 <- t_score2 * SE2
# not physically active at least 3 days a week

lower_ci2 <- round(x2_mean - margin_error2, 2)
upper_ci2 <- round(x2_mean + margin_error2, 2)
lower_ci2
## [1] 66.16
upper_ci2
## [1] 67.24

We are 95% confident that the average weight of a student who is not physically active at least 3 days a week will be between 66.16 kg and 67.24 kg.


More Practice

  1. Calculate a 95% confidence interval for the average height in meters (height) and interpret it in context.
height_mean <- mean(yrbss$height, na.rm = TRUE)
height_mean
## [1] 1.691241
height_sd <- sd(yrbss$height, na.rm = TRUE)
height_sd
## [1] 0.1046973
total_height <- yrbss_no_na %>% 
  select(height) %>% 
  filter(!is.na(height)) %>%
  summarise(count = n())
total_height
## # A tibble: 1 × 1
##   count
##   <int>
## 1 12364
set.seed(1234)
SE <- height_sd / sqrt(total_height)
alpha <- 0.05
p <- alpha/2
df <- total_height - 1
# df = 12364-1 = 12363
t_score <- qt(p, df=12363, lower.tail = F)
margin_error <- t_score * SE
lower_ci <- height_mean - margin_error
upper_ci <- height_mean + margin_error
lower_ci
##      count
## 1 1.689395
upper_ci
##      count
## 1 1.693087

With a 95% confidence interval, the average height of students in meters will be between 1.689395 and 1.693087.

  1. Calculate a new confidence interval for the same parameter at the 90% confidence level. Comment on the width of this interval versus the one obtained in the previous exercise.
set.seed(1234)
new_SE <- height_sd / sqrt(total_height)
new_alpha <- 0.10
new_p <- new_alpha/2
new_df <- total_height - 1
# new_df = 12364-1 = 12363
new_t_score <- qt(new_p, df=12363, lower.tail = F)
new_margin_error <- new_t_score * new_SE
# 90% Confidence Level
new_lower_ci <- height_mean - new_margin_error
new_upper_ci <- height_mean + new_margin_error
new_lower_ci
##      count
## 1 1.689692
new_upper_ci
##     count
## 1 1.69279
# difference between interval ranges at 95% confidence level
conf95_difference <- upper_ci - lower_ci
conf95_difference
##         count
## 1 0.003691275
# difference between interval ranges at 95% confidence level
conf90_difference <- new_upper_ci - new_lower_ci
conf90_difference
##         count
## 1 0.003097744
overall_difference <- conf95_difference - conf90_difference
overall_difference
##          count
## 1 0.0005935305

With a 90% confidence interval, the average height in meters for students will be between 1.689692 and 1.69279. Compared with the confidence intervals at 95%, the lower confidence interval value is slightly higher, while the upper confidence interval is slightly lower. The interval range at with a 95% confidence level is slightly larger than the confidence interval range at a 90% confidence level, the difference being 0.0005935305.

  1. Conduct a hypothesis test evaluating whether the average height is different for those who exercise at least three times a week and those who don’t.

Ho: There is no difference in average height for students who exercise at least 3 times a week.

Ha: There is a difference in average height for students who exercise at least 3 times a week.

yrbss_height_no_na <- yrbss %>%
  filter(!is.na(physical_3plus), !is.na(height))
yrbss_height_no_na
## # A tibble: 12,364 × 14
##      age gender grade hispanic race        height weight helme…¹ text_…² physi…³
##    <int> <chr>  <chr> <chr>    <chr>        <dbl>  <dbl> <chr>   <chr>     <int>
##  1    15 female 9     hispanic Native Haw…   1.73   84.4 never   30            7
##  2    15 female 9     not      Black or A…   1.6    55.8 never   0             0
##  3    15 female 9     not      Black or A…   1.5    46.7 did no… did no…       2
##  4    15 female 9     not      Black or A…   1.57   67.1 did no… did no…       1
##  5    15 female 9     not      Black or A…   1.65  132.  did no… <NA>          4
##  6    14 male   9     not      Black or A…   1.88   71.2 never   <NA>          4
##  7    15 male   9     not      Black or A…   1.75   63.5 never   <NA>          5
##  8    15 male   10    not      Black or A…   1.37   97.1 did no… <NA>          0
##  9    15 female 9     not      Black or A…   1.68   69.8 did no… 0             0
## 10    15 female 9     not      Black or A…   1.65   66.7 did no… did no…       0
## # … with 12,354 more rows, 4 more variables: hours_tv_per_school_day <chr>,
## #   strength_training_7d <int>, school_night_hours_sleep <chr>,
## #   physical_3plus <chr>, and abbreviated variable names ¹​helmet_12m,
## #   ²​text_while_driving_30d, ³​physically_active_7d
height_obs_diff <- yrbss_height_no_na %>%
  specify(height ~ physical_3plus) %>%
  calculate(stat = "diff in means", order = c("yes", "no"))
height_null_dist <- yrbss_height_no_na %>%
  specify(height ~ physical_3plus) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("yes", "no"))
height_null_dist %>%
  get_p_value(obs_stat = height_obs_diff, direction = "two_sided")
## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1       0
mean_height <- yrbss_height_no_na %>%
  group_by(physical_3plus) %>%
  summarise(mean_height = mean(height, na.rm = TRUE))
mean_height
## # A tibble: 2 × 2
##   physical_3plus mean_height
##   <chr>                <dbl>
## 1 no                    1.67
## 2 yes                   1.70
sd_height <- yrbss_height_no_na %>% 
  group_by(physical_3plus) %>% 
  summarise(sd_height_3plus = sd(height, na.rm = TRUE))
sd_height
## # A tibble: 2 × 2
##   physical_3plus sd_height_3plus
##   <chr>                    <dbl>
## 1 no                       0.103
## 2 yes                      0.103
total_height_n <- yrbss_height_no_na %>%
  filter(!(is.na(physical_3plus) | is.na(height))) %>%
  group_by(physical_3plus) %>%
  summarise(count = n())
total_height_n
## # A tibble: 2 × 2
##   physical_3plus count
##   <chr>          <int>
## 1 no              4022
## 2 yes             8342
# physically active at least 3 days a week
height_mean <- 1.70
height_sd <- .103
height_n <- 8342
# not physically active at least 3 days a week

not_active_mean <- 1.67
not_active_sd <- .103
not_active_n <- 4022

Physically Active At Least 3 Days a Week

set.seed(1234)
active_SE <- height_sd / sqrt(height_n)
alpha = 0.05
active_degrees_of_freedom = height_n - 1
active_t_score = qt(p=alpha/2, df=active_degrees_of_freedom,lower.tail=F)
active_margin_error <- t_score * active_SE
lower_ci1 <- height_mean - active_margin_error
upper_ci1 <- height_mean + active_margin_error
lower_ci1
## [1] 1.697789
upper_ci1
## [1] 1.702211

With a 95% confidence level, the average height of students who are physically active at least 3 days a week will be between 1.697789 and 1.702211.

Not Physically Active 3 days a week

set.seed(1234)
not_active_SE <- not_active_sd / sqrt(not_active_n)
alpha = 0.05
not_active_degrees_of_freedom = not_active_n - 1
not_active_t_score = qt(p=alpha/2, df=not_active_degrees_of_freedom,lower.tail=F)
not_active_margin_error <- not_active_t_score * not_active_SE
not_active_t_score = qt(p=alpha/2, df=not_active_degrees_of_freedom,lower.tail=F)
not_active_t_score
## [1] 1.960554
not_active_lower_ci1 <- not_active_mean - not_active_margin_error
not_active_upper_ci1 <- not_active_mean + not_active_margin_error
not_active_lower_ci1
## [1] 1.666816
not_active_upper_ci1
## [1] 1.673184

With a 95% confidence level, the average height of students who are not physically active at least 3 days a week will be between 1.666816 and 1.673184.

p_value_active <- 2 * pt(active_t_score, active_degrees_of_freedom, lower.tail = FALSE)
p_value_active
## [1] 0.05

With a p-value of 0.05, we can reject the null hypothesis that there is no difference in the average of heights for students who are physically active at least 3 days a week.

  1. Now, a non-inference task: Determine the number of different options there are in the dataset for the hours_tv_per_school_day there are.
diff_options <- yrbss %>%
  filter(!is.na(hours_tv_per_school_day)) %>%
  group_by(hours_tv_per_school_day)%>% 
  summarise(count = n())
diff_options
## # A tibble: 7 × 2
##   hours_tv_per_school_day count
##   <chr>                   <int>
## 1 1                        1750
## 2 2                        2705
## 3 3                        2139
## 4 4                        1048
## 5 5+                       1595
## 6 <1                       2168
## 7 do not watch             1840
  1. Come up with a research question evaluating the relationship between height or weight and sleep. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval. Report the statistical results, and also provide an explanation in plain language. Be sure to check all assumptions, state your \(\alpha\) level, and conclude in context.

Ho: There is no difference in the average of weights for people who sleep at least 8 hours a day.

Ha: There is a difference in the average of weights for people who sleep at least 8 hours a day.

yrbss <- yrbss %>% 
  mutate(sleep_8hours = ifelse(yrbss$school_night_hours_sleep > 7, "yes", "no")) 
yrbss_weight_no_na <- yrbss %>%
  filter(!is.na(sleep_8hours), !is.na(weight))
yrbss_weight_no_na
## # A tibble: 11,481 × 15
##      age gender grade hispanic race        height weight helme…¹ text_…² physi…³
##    <int> <chr>  <chr> <chr>    <chr>        <dbl>  <dbl> <chr>   <chr>     <int>
##  1    15 female 9     hispanic Native Haw…   1.73   84.4 never   30            7
##  2    15 female 9     not      Black or A…   1.6    55.8 never   0             0
##  3    15 female 9     not      Black or A…   1.5    46.7 did no… did no…       2
##  4    15 female 9     not      Black or A…   1.57   67.1 did no… did no…       1
##  5    15 female 9     not      Black or A…   1.65  132.  did no… <NA>          4
##  6    14 male   9     not      Black or A…   1.88   71.2 never   <NA>          4
##  7    15 male   9     not      Black or A…   1.75   63.5 never   <NA>          5
##  8    15 male   10    not      Black or A…   1.37   97.1 did no… <NA>          0
##  9    15 female 9     not      Black or A…   1.68   69.8 did no… 0             0
## 10    15 female 9     not      Black or A…   1.65   66.7 did no… did no…       0
## # … with 11,471 more rows, 5 more variables: hours_tv_per_school_day <chr>,
## #   strength_training_7d <int>, school_night_hours_sleep <chr>,
## #   physical_3plus <chr>, sleep_8hours <chr>, and abbreviated variable names
## #   ¹​helmet_12m, ²​text_while_driving_30d, ³​physically_active_7d
weight_obs_diff <- yrbss_weight_no_na %>%
  specify(weight ~ sleep_8hours) %>%
  calculate(stat = "diff in means", order = c("yes", "no"))
weight_obs_diff
## Response: weight (numeric)
## Explanatory: sleep_8hours (factor)
## # A tibble: 1 × 1
##    stat
##   <dbl>
## 1 -1.18
weight_null_dist <- yrbss_weight_no_na %>%
  specify(weight ~ sleep_8hours) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("yes", "no"))
weight_null_dist
## Response: weight (numeric)
## Explanatory: sleep_8hours (factor)
## Null Hypothesis: independence
## # A tibble: 1,000 × 2
##    replicate    stat
##        <int>   <dbl>
##  1         1  0.635 
##  2         2 -0.225 
##  3         3 -0.328 
##  4         4 -0.246 
##  5         5  0.0539
##  6         6  0.138 
##  7         7 -0.0689
##  8         8 -0.540 
##  9         9 -0.0175
## 10        10 -0.139 
## # … with 990 more rows
weight_null_dist %>%
  get_p_value(obs_stat = weight_obs_diff, direction = "two_sided")
## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1   0.002
mean_weight <- yrbss_weight_no_na %>%
  group_by(sleep_8hours) %>%
  summarise(mean_weight_sleep = mean(weight, na.rm = TRUE))
mean_weight
## # A tibble: 2 × 2
##   sleep_8hours mean_weight_sleep
##   <chr>                    <dbl>
## 1 no                        68.2
## 2 yes                       67.0
sd_weight <- yrbss_weight_no_na %>% 
  group_by(sleep_8hours) %>% 
  summarise(sd_weight_sleep = sd(weight, na.rm = TRUE))
sd_weight
## # A tibble: 2 × 2
##   sleep_8hours sd_weight_sleep
##   <chr>                  <dbl>
## 1 no                      17.2
## 2 yes                     16.4
total_weight_n <- yrbss_weight_no_na %>%
  filter(!(is.na(sleep_8hours) | is.na(weight))) %>%
  group_by(sleep_8hours) %>%
  summarise(count = n())
total_weight_n
## # A tibble: 2 × 2
##   sleep_8hours count
##   <chr>        <int>
## 1 no            8271
## 2 yes           3210
# sleep at least 8 hours
sleep_mean <- 67.0
sleep_sd <- 16.4
sleep_n <- 3210
# no sleep for at least 8 hours

no_sleep_mean <- 68.2
no_sleep_sd <- 17.2
no_sleep_n <- 8271

Sleep At Least 8 Hours

set.seed(1234)
sleep_SE <- sleep_sd / sqrt(sleep_n)
alpha = 0.05
sleep_degrees_of_freedom = sleep_n - 1
sleep_t_score = qt(p=alpha/2, df=degrees_of_freedom,lower.tail=F)
sleep_margin_error <- sleep_t_score * sleep_SE
sleep_lower_ci1 <- sleep_mean - sleep_margin_error
sleep_upper_ci1 <- sleep_mean + sleep_margin_error
sleep_lower_ci1
## [1] 66.43258
sleep_upper_ci1
## [1] 67.56742

With a 95% confidence level, the average weight of students who sleep at least 8 hours will be between 66.43258 kg and 67.56742 kg.

No Sleeping At Least 8 Hours

set.seed(1234)
no_sleep_SE <- no_sleep_sd / sqrt(no_sleep_n)
alpha = 0.05
no_sleep_degrees_of_freedom = no_sleep_n - 1
no_sleep_t_score = qt(p=alpha/2, df=no_sleep_degrees_of_freedom,lower.tail=F)
no_sleep_margin_error <- no_sleep_t_score * no_sleep_SE
no_sleep_lower_ci1 <- no_sleep_mean - no_sleep_margin_error
no_sleep_upper_ci1 <- no_sleep_mean + no_sleep_margin_error
no_sleep_lower_ci1
## [1] 67.82927
no_sleep_upper_ci1
## [1] 68.57073

With a 95% confidence level, the average weight of students who do not sleep at least 8 hours will be between 67.82927 kg and 68.57073 kg. With a p-value of 0.05, we can reject the null hypothesis.

# Interval range difference for students who slept at least 8 hours
diff_sleep <- sleep_upper_ci1 - sleep_lower_ci1
diff_sleep
## [1] 1.134834
# Interval Range difference for students who did not sleep at least 8 hours
diff_no_sleep <- no_sleep_upper_ci1 - no_sleep_lower_ci1
diff_no_sleep
## [1] 0.7414657
sleep_diff <- diff_sleep - diff_no_sleep
sleep_diff
## [1] 0.3933685

The confidence interval range difference of students who sleep at least 8 hours is 1.134834, which is larger than the range for students who do not sleep at least 8 hours, which is 0.7414657, a difference of 0.3933685.

# p-value
sleep_p_value <- 2 * pt(sleep_t_score, sleep_degrees_of_freedom, lower.tail = FALSE)
sleep_p_value
## [1] 0.05005317

Because the p-value is 0.05, we can reject the null hypothesis that there is no difference in the average weights of students who sleep at least 8 hours.