Inference for Numerical Data

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(infer)
library(dplyr)
library(ggplot2)

Exercise 1

What are the cases in this data set? How many cases are there in our sample?

Three are 13583 cases or rows in our sample. and 13 variables or columns.

# Check the dimensions of the yrbss dataset
data('yrbss', package='openintro')
dim(yrbss)
## [1] 13583    13

Exploratory data analysis

Exercise 2

How many observations are we missing weights from?

There are 9,476 missing values.Under the observations of weights we have 1004 missing values.

sum(is.na(yrbss))
## [1] 9476
summary(yrbss$weight)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   29.94   56.25   64.41   67.91   76.20  180.99    1004
head(yrbss)
## # A tibble: 6 × 13
##     age gender grade hispanic race                      height weight helmet_12m
##   <int> <chr>  <chr> <chr>    <chr>                      <dbl>  <dbl> <chr>     
## 1    14 female 9     not      Black or African American  NA      NA   never     
## 2    14 female 9     not      Black or African American  NA      NA   never     
## 3    15 female 9     hispanic Native Hawaiian or Other…   1.73   84.4 never     
## 4    15 female 9     not      Black or African American   1.6    55.8 never     
## 5    15 female 9     not      Black or African American   1.5    46.7 did not r…
## 6    15 female 9     not      Black or African American   1.57   67.1 did not r…
## # ℹ 5 more variables: text_while_driving_30d <chr>, physically_active_7d <int>,
## #   hours_tv_per_school_day <chr>, strength_training_7d <int>,
## #   school_night_hours_sleep <chr>

Exercise 3

Make a side-by-side boxplot of physical_3plus and weight. Is there a relationship between these two variables? What did you expect and why?

yrbss <- yrbss %>% 
  mutate(physical_3plus = ifelse(yrbss$physically_active_7d > 2, "yes", "no"))

yrbss %>%
  group_by(physical_3plus) %>%
  summarise(mean_weight = mean(weight, na.rm = TRUE))
## # A tibble: 3 × 2
##   physical_3plus mean_weight
##   <chr>                <dbl>
## 1 no                    66.7
## 2 yes                   68.4
## 3 <NA>                  69.9
# Create a side-by-side boxplot
ggplot(yrbss, aes(x = physical_3plus, y = weight)) +
  geom_boxplot() +
  labs(title = " Physical Activity vs. Weight", x = "Physical Activity (3+ Days)", y = "Weight")
## Warning: Removed 1004 rows containing non-finite values (`stat_boxplot()`).

In the boxplot, we can see see differences in the central tendency (medians) or the spread of weights between the “yes” and “no” groups. Compared to those who are not physically active, it may indicate a relationship between physical activity and weight.

Inference

Exercise 4

Are all conditions necessary for inference satisfied? Comment on each. You can compute the group sizes with the summarize command above by defining a new variable with the definition n().

If the data was collected through a random sampling process, and the sample can be considered representative of the population. If each individual in the dataset is a unique and independent observation, this condition is often met in observational datasets like surveys. BY conducting a hypothesis test (e.g., t-test) to determine the statistical significance of the observed difference in means between the “yes” and “no” groups. If data is sufficient Sample Size, the group sizes for “yes” and “no” can be calculated as:

 group_sizes <- yrbss %>%
  group_by(physical_3plus) %>%
  summarise(group_size = n())
group_sizes
## # A tibble: 3 × 2
##   physical_3plus group_size
##   <chr>               <int>
## 1 no                   4404
## 2 yes                  8906
## 3 <NA>                  273

Exercise 5

Write the hypotheses for testing if the average weights are different for those who exercise at least times a week and those who don’t.

The population mean weight for high schoolers who exercise at least three times a week is equal to the population mean weight for those who don’t exercise at least three times a week is known Null Hypothesis H0 that means there is no significant difference in mean weight between the two groups.

The population mean weight for high schoolers who exercise at least three times a week is not equal to the population mean weight for those who don’t exercise at least three times a week is known as alternative hypothesis H1 that there is a significant difference.

 obs_diff <- yrbss %>%
  drop_na(physical_3plus) %>%
  specify(weight ~ physical_3plus) %>%
  calculate(stat = "diff in means", order = c("yes", "no"))
## Warning: Removed 946 rows containing missing values.
null_dist <- yrbss %>%
  drop_na(physical_3plus) %>%
  specify(weight ~ physical_3plus) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("yes", "no"))
## Warning: Removed 946 rows containing missing values.
ggplot(data = null_dist, aes(x = stat)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

How many of these null permutations have a difference of at least obs_stat?

visualize(null_dist) + 
  shade_p_value(obs_stat = obs_diff, direction = "two_sided")

null_dist %>%
  get_p_value(obs_stat = obs_diff, direction = "two_sided")
## Warning: Please be cautious in reporting a p-value of 0. This result is an
## approximation based on the number of `reps` chosen in the `generate()` step.
## See `?get_p_value()` for more information.
## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1       0

The p-value epresents the probability of observing a difference as extreme as or more extreme than the observed statistic under the null hypothesis. If the p-value is less than the significance level (e.g. 0.05), the null hypothesis might be rejected in favor of the alternative.

Exercise 7

Construct and record a confidence interval for the difference between the weights of those who exercise at least three times a week and those who don’t, and interpret this interval in context of the data.

#Calculate and Summarize the Standard Deviation (sd_weight) within each group
yrbss %>% 
  group_by(physical_3plus) %>% 
  summarise(sd_weight = sd(weight, na.rm = TRUE))
## # A tibble: 3 × 2
##   physical_3plus sd_weight
##   <chr>              <dbl>
## 1 no                  17.6
## 2 yes                 16.5
## 3 <NA>                17.6
#Calculate and Summarize the Mean (mean_weight):
yrbss %>% 
  group_by(physical_3plus) %>% 
  summarise(mean_weight = mean(weight, na.rm = TRUE))
## # A tibble: 3 × 2
##   physical_3plus mean_weight
##   <chr>                <dbl>
## 1 no                    66.7
## 2 yes                   68.4
## 3 <NA>                  69.9
#calculate the frequency of weights for different levels of the "physical_3plus" variable
#sample size for each category
yrbss %>% 
  group_by(physical_3plus) %>% 
  summarise(freq = table(weight)) %>%
  summarise(n = sum(freq))
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
##   always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `summarise()` has grouped output by 'physical_3plus'. You can override using
## the `.groups` argument.
## # A tibble: 3 × 2
##   physical_3plus     n
##   <chr>          <int>
## 1 no              4022
## 2 yes             8342
## 3 <NA>             215
# Calculate the upper and lower confidence intervals for the "not active" group
mean_not_active <- 66.67389
n_not_active <- 4022
sd_not_active <- 17.63805
z <- 1.96  # Z-score for a 95% confidence interval
upper_ci_not_act <- mean_not_active + z * (sd_not_active / sqrt(n_not_active))
lower_ci_not_act <- mean_not_active - z * (sd_not_active / sqrt(n_not_active))
cat("Confidence Interval for 'not active' group: is ", upper_ci_not_act, "and" , lower_ci_not_act, "\n")
## Confidence Interval for 'not active' group: is  67.219 and 66.12878
# Given values for the "active" group
mean_active <- 68.44847
n_active <- 8342
sd_active <- 16.47832
z <- 1.96 # with 95% confidence level
# Calculate the upper and lower confidence intervals
upper_ci_act <- mean_active + z * (sd_active / sqrt(n_active))
lower_ci_act <- mean_active - z * (sd_active / sqrt(n_active))
cat("Confidence Interval for 'active' group: is ", upper_ci_act, "and" , lower_ci_act, "\n")
## Confidence Interval for 'active' group: is  68.80209 and 68.09485

More Practice

Exercise 8

Calculate a 95% confidence interval for the average height in meters (height) and interpret it in context.

z <- 1.96  # Z-score for a 95% confidence interval
height_df <- as.data.frame(table(yrbss$height))
freq <- sum(height_df$Freq)

mean_h <- mean(yrbss$height, na.rm = TRUE)
sd <- sd(yrbss$height, na.rm = TRUE)
samp <- yrbss %>% 
  summarise(freq = table(height)) %>%
  summarise(n = sum(freq, na.rm = TRUE))
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
##   always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
up_ci <- mean_h + z*(sd/sqrt(samp))
low_ci <- mean_h - z*(sd/sqrt(samp))
c(up_ci,low_ci)
## $n
## [1] 1.693071
## 
## $n
## [1] 1.689411

Exercise 9

Calculate a new confidence interval for the same parameter at the 90% confidence level. Comment on the width of this interval versus the one obtained in the previous exercise.

# Z-score for a 90% confidence interval
confidence_level <- 0.90
z1 <- qnorm((1 + confidence_level) / 2)
z1
## [1] 1.644854
up_ci <- mean_h + z1*(sd/sqrt(samp))
low_ci <- mean_h - z1*(sd/sqrt(samp))
c(up_ci,low_ci)
## $n
## [1] 1.692776
## 
## $n
## [1] 1.689705

Exercise 10

Conduct a hypothesis test evaluating whether the average height is different for those who exercise at least three times a week and those who don’t.

Because p-value is very small, smaller than 0.05, we should reject the null hypothesis and there is a statistically significant difference in average height between individuals who are physically active at least three days per week and those who are not.

# calculate the observed difference in means for height between the "yes" and "no" groups
obs_diff_height <- yrbss %>%
  drop_na(physical_3plus) %>%
  specify(height ~ physical_3plus) %>%
  calculate(stat = "diff in means", order = c("yes", "no"))
## Warning: Removed 946 rows containing missing values.
# A null distribution by simulating data under the assumption of independence (null hypothesis) through permutation
null_dist_height <- yrbss %>%
  drop_na(physical_3plus) %>%
  specify(height ~ physical_3plus) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("yes", "no"))
## Warning: Removed 946 rows containing missing values.
# visualize the null distribution and shaded the p-value region in both tails of the distribution
visualize(null_dist_height) + 
  shade_p_value(obs_stat = obs_diff_height, direction = "two_sided")

# calculate the p-value
null_dist_height %>%
  get_p_value(obs_stat = obs_diff_height, direction = "two_sided")
## Warning: Please be cautious in reporting a p-value of 0. This result is an
## approximation based on the number of `reps` chosen in the `generate()` step.
## See `?get_p_value()` for more information.
## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1       0

Exercise 11

Now, a non-inference task: Determine the number of different options there are in the dataset for the hours_tv_per_school_day there are.

yrbss %>%group_by(hours_tv_per_school_day)%>% summarise(n())
## # A tibble: 8 × 2
##   hours_tv_per_school_day `n()`
##   <chr>                   <int>
## 1 1                        1750
## 2 2                        2705
## 3 3                        2139
## 4 4                        1048
## 5 5+                       1595
## 6 <1                       2168
## 7 do not watch             1840
## 8 <NA>                      338

There are 7 different options and NA.

Exercise 12

Come up with a research question evaluating the relationship between height or weight and sleep. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval. Report the statistical results, and also provide an explanation in plain language. Be sure to check all assumptions, state your α level, and conclude in context.

# Create a new variable 'sleep_less' based on sleep duration
yrbss <- yrbss %>%
  mutate(sleep_less = ifelse(school_night_hours_sleep < 7, "yes", "no"))

# Select weight for students who sleep less
weight_less <- yrbss %>%
  filter(sleep_less == "yes") %>%
  select(weight) %>%
  na.omit()

# Select weight for students who sleep more
weight_more <- yrbss %>%
  filter(sleep_less == "no") %>%
  select(weight) %>%
  na.omit()

# Boxplot to visually compare weight distributions
boxplot(weight_less$weight, weight_more$weight,
        names = c("less_sleep", "more_sleep"))

# Summary statistics for weight of students who sleep less
mean_less <- mean(weight_less$weight)
sd_less <- sd(weight_less$weight)
n_less <- length(weight_less$weight)

# Summary statistics for weight of students who sleep more
mean_more <- mean(weight_more$weight)
sd_more <- sd(weight_more$weight)
n_more <- length(weight_more$weight)

# Calculate the difference in means
mean_diff <- mean_more - mean_less

# Calculate the standard error of the difference in means
se <- sqrt((sd_less^2 / n_less) + (sd_more^2 / n_more))

# Degrees of freedom
df <- n_less + n_more - 2

# Calculate the t-statistic
t_stat <- mean_diff / se

# Calculate the two-tailed p-value
p_value <- 2 * (1 - pt(abs(t_stat), df))

# Calculate the 95% confidence interval
alpha <- 0.05
t_critical <- qt(1 - alpha / 2, df)
margin_of_error <- t_critical * se
ci_lower <- mean_diff - margin_of_error
ci_upper <- mean_diff + margin_of_error

# Print summary statistics, p-value, and confidence interval

cat("P-Value:", p_value, "\n")
## P-Value: 3.238408e-06
cat("95% Confidence Interval:", ci_lower, "to", ci_upper, "\n")
## 95% Confidence Interval: -2.137164 to -0.8710759