In this lab, I will explore and visualize the data using the tidyverse suite of packages, and perform statistical inference using infer. The data can be found in the companion package for OpenIntro resources, openintro.
library(tidyverse)
library(openintro)
library(infer)Every two years, the Centers for Disease Control and Prevention conduct the Youth Risk Behavior Surveillance System (YRBSS) survey, where it takes data from high schoolers (9th through 12th grade), to analyze health patterns. You will work with a selected group of variables from a random sample of observations during one of the years the YRBSS was conducted.
Load the yrbss data set into your workspace.
What are the cases in this data set? How many cases are there in our sample?
data(yrbss)
glimpse(yrbss)## Rows: 13,583
## Columns: 13
## $ age <int> 14, 14, 15, 15, 15, 15, 15, 14, 15, 15, 15, 1…
## $ gender <chr> "female", "female", "female", "female", "fema…
## $ grade <chr> "9", "9", "9", "9", "9", "9", "9", "9", "9", …
## $ hispanic <chr> "not", "not", "hispanic", "not", "not", "not"…
## $ race <chr> "Black or African American", "Black or Africa…
## $ height <dbl> NA, NA, 1.73, 1.60, 1.50, 1.57, 1.65, 1.88, 1…
## $ weight <dbl> NA, NA, 84.37, 55.79, 46.72, 67.13, 131.54, 7…
## $ helmet_12m <chr> "never", "never", "never", "never", "did not …
## $ text_while_driving_30d <chr> "0", NA, "30", "0", "did not drive", "did not…
## $ physically_active_7d <int> 4, 2, 7, 0, 2, 1, 4, 4, 5, 0, 0, 0, 4, 7, 7, …
## $ hours_tv_per_school_day <chr> "5+", "5+", "5+", "2", "3", "5+", "5+", "5+",…
## $ strength_training_7d <int> 0, 0, 0, 0, 1, 0, 2, 0, 3, 0, 3, 0, 0, 7, 7, …
## $ school_night_hours_sleep <chr> "8", "6", "<5", "6", "9", "8", "9", "6", "<5"…
There are 13,583 cases in the dataset
How many observations are we missing weights from?
summary(yrbss$weight)## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 29.94 56.25 64.41 67.91 76.20 180.99 1004
1004 observation are missing weights
Make a side-by-side boxplot of physical_3plus and weight. Is there a relationship between these two variables? What did you expect and why?
yrbss<-yrbss %>%
mutate(physical_3plus = ifelse(yrbss$physically_active_7d > 2, "yes", "no"))%>%
filter(!is.na(physical_3plus))
ggplot(yrbss, aes(x=physical_3plus, y=weight))+
geom_boxplot()+
ggtitle('weight destribution by activities')## Warning: Removed 946 rows containing non-finite values (`stat_boxplot()`).
Box plot show the means are very similiar. I use the following table to further understand the number.
yrbss%>%
group_by(physical_3plus)%>%
summarise(mean_weight=mean(weight, na.rm=TRUE))## # A tibble: 2 × 2
## physical_3plus mean_weight
## <chr> <dbl>
## 1 no 66.7
## 2 yes 68.4
Are all conditions necessary for inference satisfied? Comment on each. You can compute the group sizes with the summarize command above by defining a new variable with the definition n().
yrbss%>%
group_by(physical_3plus) %>%
summarise(n=n()) %>%
mutate(freq=n/sum(n), np=n*freq)## # A tibble: 2 × 4
## physical_3plus n freq np
## <chr> <int> <dbl> <dbl>
## 1 no 4404 0.331 1457.
## 2 yes 8906 0.669 5959.
From the table, N and NP are large enough, all condition necessary for inference are satisfied.
Write the hypotheses for testing if the average weights are different for those who exercise at least times a week and those who don’t.
Null H0: There is no impact on weight between people who exercise 3 or more days a weeks compared to the weight of those who don’t
Alternate Ha: That there is a weight difference between people who exercise 3 or more days a weeks vs those who don’t
obs_diff <- yrbss %>%
specify(weight ~ physical_3plus) %>%
calculate(stat = "diff in means", order = c("yes", "no"))## Warning: Removed 946 rows containing missing values.
null_dist <- yrbss %>%
specify(weight ~ physical_3plus) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in means", order = c("yes", "no"))## Warning: Removed 946 rows containing missing values.
ggplot(data = null_dist, aes(x = stat)) +
geom_histogram()## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
How many of these null permutations have a difference of at least obs_stat?
visualize(null_dist) +
shade_p_value(obs_stat = obs_diff, direction = "two_sided")null_dist %>%
get_p_value(obs_stat = obs_diff, direction = "two_sided")## Warning: Please be cautious in reporting a p-value of 0. This result is an
## approximation based on the number of `reps` chosen in the `generate()` step. See
## `?get_p_value()` for more information.
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0
Construct and record a confidence interval for the difference between the weights of those who exercise at least three times a week and those who don’t, and interpret this interval in context of the data.
yrbss %>%
specify(weight ~ physical_3plus) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in means", order = c("yes", "no")) %>%
get_ci(level = 0.95)## Warning: Removed 946 rows containing missing values.
## # A tibble: 1 × 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 -0.627 0.634
The above shows that The CI interval is between -0.6 and 0.6. The true mean with 95% confidence under the null is that mean is between -0.6 and 0.6. Our TRUE observed difference is 1.7, way outside the CI we can reject the NULL.
Calculate a 95% confidence interval for the average height in meters (height) and interpret it in context.
height_mean <- mean(yrbss$height, na.rm = TRUE)
height_mean## [1] 1.690973
height_sd <- sd(yrbss$height, na.rm = TRUE)
height_sd## [1] 0.1046448
height_n <- yrbss %>%
summarise(freq = table(height)) %>%
summarise(n = sum(freq, na.rm = TRUE))
height_n## # A tibble: 1 × 1
## n
## <int>
## 1 12364
z_height <- 1.96
# confidence interval for height
upper_height<- height_mean + z_height * (height_sd / sqrt(height_n))
upper_height## n
## 1 1.692818
lower_height <- height_mean - z_height * (height_sd / sqrt(height_n))
lower_height## n
## 1 1.689128
Per calculation above, we are 95% confident that the mean height is between 1.689128 meters and 1.692818 meters.
Calculate a new confidence interval for the same parameter at the 90% confidence level. Comment on the width of this interval versus the one obtained in the previous exercise.
z_height <- 1.65
# confidence interval for height
upper_height_90<- height_mean + z_height * (height_sd / sqrt(height_n))
upper_height_90## n
## 1 1.692526
lower_height_90 <- height_mean - z_height * (height_sd / sqrt(height_n))
lower_height_90## n
## 1 1.68942
Per calculation above, we are 90% confident that the mean height is between 1.68942 meters and 1.692526. 95% and 90% confidence intervals are very similar. The former is slightly larger.
Conduct a hypothesis test evaluating whether the average height is different for those who exercise at least three times a week and those who don’t.
Ho: There is no difference in the average height of those who are physically active 3 or more days per week.
Ha: There is a difference in the average height of those who are physically active 3 or more days per week.
obs_diff_hgt <- yrbss %>%
specify(height ~ physical_3plus) %>%
calculate(stat = "diff in means", order = c("yes", "no"))## Warning: Removed 946 rows containing missing values.
set.seed(87)
null_dist_hgt <- yrbss %>%
specify(height ~ physical_3plus) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in means", order = c("yes", "no"))## Warning: Removed 946 rows containing missing values.
visualize(null_dist_hgt) +
shade_p_value(obs_stat = obs_diff_hgt, direction = "two_sided")null_dist_hgt %>%
get_p_value(obs_stat = obs_diff_hgt, direction = "two_sided")## Warning: Please be cautious in reporting a p-value of 0. This result is an
## approximation based on the number of `reps` chosen in the `generate()` step. See
## `?get_p_value()` for more information.
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0
height_not_active_mean <- 1.665
height_not_active_sd <- 0.1029
height_not_active_n <- 4022
# active
height_active_mean <- 1.7032
height_active_sd <- 0.1033
height_active_n <- 8342z_height <- 1.96
height_upper_not_active <- height_not_active_mean + z_height * (height_not_active_sd / sqrt(height_not_active_n))
height_upper_not_active## [1] 1.66818
height_lower_not_active <- height_not_active_mean - z_height * (height_not_active_sd / sqrt(height_not_active_n))
height_lower_not_active## [1] 1.66182
# confidence interval for active
height_upper_active <- height_active_mean + z_height * (height_active_sd / sqrt(height_active_n))
height_upper_active## [1] 1.705417
height_lower_active <- height_active_mean - z_height * (height_active_sd / sqrt(height_active_n))
height_lower_active## [1] 1.700983
With a 95% confidence interval, the average heights of those students who are not physically active 3 or more days per week is between 1.66 m and 1.67 m. While for those students who are physically active is between 1.701 m and 1.705 m.
Since the p-values is below 0.05, we reject the null hypothesis. There is a difference in the average height of the students who are physically active and those who are not.
Now, a non-inference task: Determine the number of different options there are in the dataset for the hours_tv_per_school_day there are.
yrbss %>%
group_by(hours_tv_per_school_day)%>%
summarise(n())## # A tibble: 8 × 2
## hours_tv_per_school_day `n()`
## <chr> <int>
## 1 <1 2166
## 2 1 1745
## 3 2 2701
## 4 3 2131
## 5 4 1044
## 6 5+ 1589
## 7 do not watch 1837
## 8 <NA> 97
There are 7 different options for the data set hours_tv_per_school_dayand 1 option for NA.
Come up with a research question evaluating the relationship between height or weight and sleep. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval. Report the statistical results, and also provide an explanation in plain language. Be sure to check all assumptions, state your α level, and conclude in context.
Ho: There is no relationship between the mean height and sleep of students.
Ha: There is a relationship between the mean height and sleep of students.
Confidence interval: 95%
Conditions: Independent sample: Yes, Normality: Yes
yrbss <- yrbss %>%
mutate(sleep_less = ifelse(yrbss$school_night_hours_sleep < 6, "yes", "no"))
height_less <- yrbss %>%
select(height, sleep_less) %>%
filter(sleep_less == "no") %>%
na.omit()
height_more <- yrbss %>%
select(height, sleep_less) %>%
filter(sleep_less == "yes") %>%
na.omit()boxplot(height_less$height, height_more$height,
names = c("less sleep", "more sleep"))# less sleep
less_sleep_mean <- mean(height_less$height)
less_sleep_mean## [1] 1.692186
less_sleep_sd <- sd(height_less$height)
less_sleep_sd## [1] 0.1042165
more_sleep_mean <- mean(height_more$height)
more_sleep_mean## [1] 1.685306
more_sleep_sd <- sd(height_more$height)
more_sleep_sd## [1] 0.1060691
diff_mean <- more_sleep_mean - less_sleep_mean
diff_mean## [1] -0.006879444
diff_sd <- sqrt(((more_sleep_mean^2) / nrow(height_more)) + ((less_sleep_mean^2) / nrow(height_less)))
diff_sd## [1] 0.03827245
sleep_df <- 2492-1
t_sleep <- qt(.05/2, sleep_df, lower.tail = FALSE)
# confidence interval
upper_sleep<- diff_mean + t_sleep * diff_sd
upper_sleep## [1] 0.06816964
lower_sleep<- diff_mean - t_sleep * diff_sd
lower_sleep## [1] -0.08192853
p_value_sleep <- 2 * pt(t_sleep, sleep_df, lower.tail = FALSE)
p_value_sleep## [1] 0.05
P value is 0.05, therefore reject the null. There is a relationship between the mean height and sleep of students.