In this lab, we will explore and visualize the data using the tidyverse suite of packages, and perform statistical inference using infer. The data can be found in the companion package for OpenIntro resources, openintro.
Let’s load the packages.
rm(list=ls())
library(tidyverse)
library(openintro)
library(infer)Every two years, the Centers for Disease Control and Prevention conduct the Youth Risk Behavior Surveillance System (YRBSS) survey, where it takes data from high schoolers (9th through 12th grade), to analyze health patterns. You will work with a selected group of variables from a random sample of observations during one of the years the YRBSS was conducted.
Load the yrbss data set into your workspace.
data('yrbss', package='openintro')There are observations on 13 different variables, some categorical and some numerical. The meaning of each variable can be found by bringing up the help file:
?yrbssANSWER 13,583 cases in the dataset
Remember that you can answer this question by viewing the data in the data viewer or by using the following command:
glimpse(yrbss)## Rows: 13,583
## Columns: 13
## $ age <int> 14, 14, 15, 15, 15, 15, 15, 14, 15, 15, 15, 1~
## $ gender <chr> "female", "female", "female", "female", "fema~
## $ grade <chr> "9", "9", "9", "9", "9", "9", "9", "9", "9", ~
## $ hispanic <chr> "not", "not", "hispanic", "not", "not", "not"~
## $ race <chr> "Black or African American", "Black or Africa~
## $ height <dbl> NA, NA, 1.73, 1.60, 1.50, 1.57, 1.65, 1.88, 1~
## $ weight <dbl> NA, NA, 84.37, 55.79, 46.72, 67.13, 131.54, 7~
## $ helmet_12m <chr> "never", "never", "never", "never", "did not ~
## $ text_while_driving_30d <chr> "0", NA, "30", "0", "did not drive", "did not~
## $ physically_active_7d <int> 4, 2, 7, 0, 2, 1, 4, 4, 5, 0, 0, 0, 4, 7, 7, ~
## $ hours_tv_per_school_day <chr> "5+", "5+", "5+", "2", "3", "5+", "5+", "5+",~
## $ strength_training_7d <int> 0, 0, 0, 0, 1, 0, 2, 0, 3, 0, 3, 0, 0, 7, 7, ~
## $ school_night_hours_sleep <chr> "8", "6", "<5", "6", "9", "8", "9", "6", "<5"~
You will first start with analyzing the weight of the participants in kilograms: weight.
Using visualization and summary statistics, describe the distribution of weights. The summary function can be useful.
summary(yrbss$weight)## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 29.94 56.25 64.41 67.91 76.20 180.99 1004
ANSWER As per summary, NA’s = 1,004 Below a few ways to get the same information.
yrbss %>%
select(weight) %>%
count(is.na(.))## # A tibble: 2 x 2
## `is.na(.)`[,"weight"] n
## <lgl> <int>
## 1 FALSE 12579
## 2 TRUE 1004
sum(is.na(yrbss$weight))## [1] 1004
yrbss %>%
group_by(age) %>%
summarise(nas = sum(is.na(weight)))## # A tibble: 8 x 2
## age nas
## <int> <int>
## 1 12 20
## 2 13 5
## 3 14 115
## 4 15 228
## 5 16 200
## 6 17 207
## 7 18 152
## 8 NA 77
Next, consider the possible relationship between a high schooler’s weight and their physical activity. Plotting the data is a useful first step because it helps us quickly visualize trends, identify strong associations, and develop research questions.
First, let’s create a new variable physical_3plus, which will be coded as either “yes” if they are physically active for at least 3 days a week, and “no” if not.
yrbss <- yrbss %>%
mutate(physical_3plus = ifelse(yrbss$physically_active_7d > 2, "yes", "no"))physical_3plus and weight. Is there a relationship between these two variables? What did you expect and why?#create vertical side-by-side boxplots
yrbss <- yrbss %>%
filter(!is.na(physical_3plus))
ggplot(yrbss, aes(x=physical_3plus, y=weight)) +
geom_boxplot() +
ggtitle('Weight Distribution by Activity') The box plots show how the medians of the two distributions compare, but we can also compare the means of the distributions using the following to first group the data by the physical_3plus variable, and then calculate the mean weight in these groups using the mean function while ignoring missing values by setting the na.rm argument to TRUE.
yrbss %>%
group_by(physical_3plus) %>%
summarise(mean_weight = mean(weight, na.rm = TRUE))## # A tibble: 2 x 2
## physical_3plus mean_weight
## <chr> <dbl>
## 1 no 66.7
## 2 yes 68.4
There is an observed difference, but is this difference statistically significant? In order to answer this question we will conduct a hypothesis test.
summarize command above by defining a new variable with the definition n().Answer Let check that np>10 and n*(1-p)>10
yrbss %>%
group_by(physical_3plus) %>%
filter(!is.na(physical_3plus)) %>%
summarise(n = n()) %>%
mutate(freq = n / sum(n), np = n*freq)## # A tibble: 2 x 4
## physical_3plus n freq np
## <chr> <int> <dbl> <dbl>
## 1 no 4404 0.331 1457.
## 2 yes 8906 0.669 5959.
We can see that the conditions for inference are metcsince np and n(1-p) are 5959 and 1457.
*ANSWER** Null H0 There is no impact on weight between people who exercise 3 or more days a weeks compared to the weight of those who don’t
M3plus - M2lower = 0
Alternate Ha That there is a weight difference between people who exercise 3 or more days a weeks vs those who don’t
M3plus - M2lower <> 0
Next, we will introduce a new function, hypothesize, that falls into the infer workflow. You will use this method for conducting hypothesis tests.
But first, we need to initialize the test, which we will save as obs_diff.
obs_diff <- yrbss %>%
specify(weight ~ physical_3plus) %>%
calculate(stat = "diff in means", order = c("yes", "no"))
obs_diff## Response: weight (numeric)
## Explanatory: physical_3plus (factor)
## # A tibble: 1 x 1
## stat
## <dbl>
## 1 1.77
Notice how you can use the functions specify and calculate again like you did for calculating confidence intervals. Here, though, the statistic you are searching for is the difference in means, with the order being yes - no != 0.
After you have initialized the test, you need to simulate the test on the null distribution, which we will save as null.
null_dist <- yrbss %>%
specify(weight ~ physical_3plus) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in means", order = c("yes", "no"))Here, hypothesize is used to set the null hypothesis as a test for independence. In one sample cases, the null argument can be set to “point” to test a hypothesis relative to a point estimate.
Also, note that the type argument within generate is set to permute, whichis the argument when generating a null distribution for a hypothesis test.
We can visualize this null distribution with the following code:
ggplot(data = null_dist, aes(x = stat)) +
geom_histogram()null permutations have a difference of at least obs_stat?ANSWER We will count all rows in null_dist where the stat is at least (equal or larger) than the obs_diff.
We didn’t see any rows equal or larger than obs_diff
null_dist %>%
summarise(n = sum(stat >= obs_diff$stat))## # A tibble: 1 x 1
## n
## <int>
## 1 0
Now that the test is initialized and the null distribution formed, you can calculate the p-value for your hypothesis test using the function get_p_value.
null_dist %>%
get_p_value(obs_stat = obs_diff, direction = "two_sided")## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0
This the standard workflow for performing hypothesis tests.
ANSWER
yrbss %>%
specify(weight ~ physical_3plus) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in means", order = c("yes", "no")) %>%
get_ci(level = 0.95)## # A tibble: 1 x 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 -0.614 0.672
The CI interval is between -0.6 and 0.6. This is the confidence interval on a distribution of the “shuffled” data assuming the null that exercising and weight are independent. The true mean with 95% confidence under the null is that mean is between -0.6 and 0.6. Since our TRUE observed difference is 1.7, way outside the CI we can reject the NULL.
height) and interpret it in context.ANSWER
height_result <- yrbss %>%
filter(!is.na(height)) %>%
summarise(mean_h = mean(height), sd_h = sd(height), nr = n()) %>%
mutate(ci_lower= mean_h-(1.96*(sd_h/sqrt(nr))), ci_upper = mean_h+(1.96*(sd_h/sqrt(nr))))
print.data.frame(height_result)## mean_h sd_h nr ci_lower ci_upper
## 1 1.690973 0.1046448 12364 1.689128 1.692818
ANSWER
height_result <- yrbss %>%
filter(!is.na(height)) %>%
summarise(mean_h = mean(height), sd_h = sd(height), nr = n()) %>%
mutate(ci_lower= mean_h-(1.645*(sd_h/sqrt(nr))), ci_upper = mean_h+(1.645*(sd_h/sqrt(nr))))
print.data.frame(height_result)## mean_h sd_h nr ci_lower ci_upper
## 1 1.690973 0.1046448 12364 1.689425 1.692521
ANSWER
#create vertical side-by-side boxplots
yrbss <- yrbss %>%
filter(!is.na(physical_3plus))
yrbss <- yrbss %>%
filter(!is.na(height))
ggplot(yrbss, aes(x=physical_3plus, y=height)) +
geom_boxplot() +
ggtitle('Height Distribution by Activity') yrbss %>%
group_by(physical_3plus) %>%
summarise(mean_height = mean(height, na.rm = TRUE))## # A tibble: 2 x 2
## physical_3plus mean_height
## <chr> <dbl>
## 1 no 1.67
## 2 yes 1.70
obs_diff2 <- yrbss %>%
specify(height ~ physical_3plus) %>%
calculate(stat = "diff in means", order = c("yes", "no"))
obs_diff2## Response: height (numeric)
## Explanatory: physical_3plus (factor)
## # A tibble: 1 x 1
## stat
## <dbl>
## 1 0.0376
null_dist2 <- yrbss %>%
specify(height ~ physical_3plus) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in means", order = c("yes", "no"))ggplot(data = null_dist2, aes(x = stat)) +
geom_histogram()p_value_df <- null_dist2 %>%
get_p_value(obs_stat = obs_diff2, direction = "two_sided")
print.data.frame(p_value_df)## p_value
## 1 0
yrbss %>%
specify(height ~ physical_3plus) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in means", order = c("yes", "no")) %>%
get_ci(level = 0.95)## # A tibble: 1 x 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 -0.00379 0.00390
Just as with weight, with height the CI goes from -0.004 to 0.0037. Since the observed value of 0.0376 is also way outside the CI from the permutation we can reject the NULL that working out 3+ a week doesn’t have an impact on Height, or rather reject the NULL that Height and Exercising are independent
hours_tv_per_school_day there are.ANSWER
yrbss %>%
group_by(hours_tv_per_school_day) %>%
summarise(count = n())## # A tibble: 8 x 2
## hours_tv_per_school_day count
## <chr> <int>
## 1 <1 2019
## 2 1 1663
## 3 2 2545
## 4 3 1988
## 5 4 972
## 6 5+ 1426
## 7 do not watch 1669
## 8 <NA> 82
ANSWER
Does sleeping more hours have an impact on weight? Ho would be they are independent and has no impact on weight. Ha is that sleeping more has impact on weight.
yrbss %>%
group_by(school_night_hours_sleep) %>%
summarise(count = n())## # A tibble: 8 x 2
## school_night_hours_sleep count
## <chr> <int>
## 1 <5 856
## 2 10+ 252
## 3 5 1372
## 4 6 2487
## 5 7 3273
## 6 8 2497
## 7 9 705
## 8 <NA> 922
We will define all cases intro two populations. Sleeps 8hrs or more (8, 9 and 10+). This will be “yes”, vs other will be “no”
yrbss <- yrbss %>%
mutate(sleeps_8plus = ifelse(yrbss$school_night_hours_sleep >= "8" | yrbss$school_night_hours_sleep == "10+" , "yes", "no"))#create vertical side-by-side boxplots
yrbss <- yrbss %>%
filter(!is.na(sleeps_8plus))
ggplot(yrbss, aes(x=sleeps_8plus, y=weight)) +
geom_boxplot() +
ggtitle('Weight Distribution by Sleep') yrbss %>%
group_by(sleeps_8plus) %>%
summarise(mean_weight = mean(weight, na.rm = TRUE))## # A tibble: 2 x 2
## sleeps_8plus mean_weight
## <chr> <dbl>
## 1 no 68.2
## 2 yes 67.2
obs_diff3 <- yrbss %>%
specify(weight ~ sleeps_8plus) %>%
calculate(stat = "diff in means", order = c("yes", "no"))
obs_diff3## Response: weight (numeric)
## Explanatory: sleeps_8plus (factor)
## # A tibble: 1 x 1
## stat
## <dbl>
## 1 -0.997
null_dist3 <- yrbss %>%
specify(weight ~ sleeps_8plus) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in means", order = c("yes", "no"))ggplot(data = null_dist3, aes(x = stat)) +
geom_histogram()p_value_df <- null_dist3 %>%
get_p_value(obs_stat = obs_diff3, direction = "two_sided")
print.data.frame(p_value_df)## p_value
## 1 0.008
yrbss %>%
specify(weight ~ sleeps_8plus) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in means", order = c("yes", "no")) %>%
get_ci(level = 0.95)## # A tibble: 1 x 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 -0.640 0.701
ANSWER As with the other hypothesis reviewed, we found that we can reject the null that Sleeping a lot has no impact on weight. In this case long sleepers have lower weight than light sleepers.