In this lab, we will explore and visualize the data using the tidyverse suite of packages, and perform statistical inference using infer. The data can be found in the companion package for OpenIntro resources, openintro.
Let’s load the packages.
library(tidyverse)
library(openintro)
library(infer)Every two years, the Centers for Disease Control and Prevention conduct the Youth Risk Behavior Surveillance System (YRBSS) survey, where it takes data from high schoolers (9th through 12th grade), to analyze health patterns. You will work with a selected group of variables from a random sample of observations during one of the years the YRBSS was conducted.
Load the yrbss data set into your workspace.
data('yrbss', package='openintro')There are observations on 13 different variables, some categorical and some numerical. The meaning of each variable can be found by bringing up the help file:
?yrbssRemember that you can answer this question by viewing the data in the data viewer or by using the following command:
glimpse(yrbss)## Rows: 13,583
## Columns: 13
## $ age <int> 14, 14, 15, 15, 15, 15, 15, 14, 15, 15, 15, 1…
## $ gender <chr> "female", "female", "female", "female", "fema…
## $ grade <chr> "9", "9", "9", "9", "9", "9", "9", "9", "9", …
## $ hispanic <chr> "not", "not", "hispanic", "not", "not", "not"…
## $ race <chr> "Black or African American", "Black or Africa…
## $ height <dbl> NA, NA, 1.73, 1.60, 1.50, 1.57, 1.65, 1.88, 1…
## $ weight <dbl> NA, NA, 84.37, 55.79, 46.72, 67.13, 131.54, 7…
## $ helmet_12m <chr> "never", "never", "never", "never", "did not …
## $ text_while_driving_30d <chr> "0", NA, "30", "0", "did not drive", "did not…
## $ physically_active_7d <int> 4, 2, 7, 0, 2, 1, 4, 4, 5, 0, 0, 0, 4, 7, 7, …
## $ hours_tv_per_school_day <chr> "5+", "5+", "5+", "2", "3", "5+", "5+", "5+",…
## $ strength_training_7d <int> 0, 0, 0, 0, 1, 0, 2, 0, 3, 0, 3, 0, 0, 7, 7, …
## $ school_night_hours_sleep <chr> "8", "6", "<5", "6", "9", "8", "9", "6", "<5"…
colnames(yrbss)## [1] "age" "gender"
## [3] "grade" "hispanic"
## [5] "race" "height"
## [7] "weight" "helmet_12m"
## [9] "text_while_driving_30d" "physically_active_7d"
## [11] "hours_tv_per_school_day" "strength_training_7d"
## [13] "school_night_hours_sleep"
There are 13,583 observational cases with 13 different variables.
You will first start with analyzing the weight of the participants in
kilograms: weight.
Using visualization and summary statistics, describe the distribution
of weights. The summary function can be useful.
summary(yrbss$weight)## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 29.94 56.25 64.41 67.91 76.20 180.99 1004
sum(is.na(yrbss$weight))## [1] 1004
There are 1,004 observations that are missing weight values.
Next, consider the possible relationship between a high schooler’s weight and their physical activity. Plotting the data is a useful first step because it helps us quickly visualize trends, identify strong associations, and develop research questions.
First, let’s create a new variable physical_3plus, which
will be coded as either “yes” if they are physically active for at least
3 days a week, and “no” if not.
yrbss <- yrbss %>%
mutate(physical_3plus = ifelse(yrbss$physically_active_7d > 2, "yes", "no")) physical_3plus and
weight. Is there a relationship between these two
variables? What did you expect and why?yrbss_no_na <- yrbss %>%
filter(!is.na(physical_3plus),!is.na(weight))
ggplot(yrbss_no_na, aes(x=physical_3plus, y=weight)) +
geom_boxplot() +
ggtitle("Comparing Two Variables") +
xlab("Physically Active At Least 3 days") +
ylab("Weight")It appears that each boxplot is almost the same. The IQR and median are slightly higher for people who are physically active at least 3 days a week. With the exception of some outliers, the weight of people who are physically active less than 3 days a week is slightly lower than people who are physically active at least 3 days a week.
The box plots show how the medians of the two distributions compare,
but we can also compare the means of the distributions using the
following to first group the data by the physical_3plus
variable, and then calculate the mean weight in these
groups using the mean function while ignoring missing
values by setting the na.rm argument to
TRUE.
yrbss %>%
group_by(physical_3plus) %>%
summarise(mean_weight = mean(weight, na.rm = TRUE))## # A tibble: 3 × 2
## physical_3plus mean_weight
## <chr> <dbl>
## 1 no 66.7
## 2 yes 68.4
## 3 <NA> 69.9
There is an observed difference, but is this difference statistically significant? In order to answer this question we will conduct a hypothesis test.
summarize
command above by defining a new variable with the definition
n().The 3 conditions that are necessary for inference is that the data comes from a random sample, the sampling distribution has to be approximately normal with at least 10 successes and 10 failures, and the observations need to be independent.
yrbss %>%
filter(!(is.na(physical_3plus) | is.na(weight))) %>%
group_by(physical_3plus) %>%
summarise(count = n())## # A tibble: 2 × 2
## physical_3plus count
## <chr> <int>
## 1 no 4022
## 2 yes 8342
The inferences in this case are satisfied. The sample is random. There appears to be a normal distribution where there is more than 10 students who are physically active at least 3 days a week and more than 10 students who are physically active less than 3 days a week. Lastly, the sample size is less than 10% of the total high school student population, and therefore the observations are independent.
Ho: There is no difference between the average weights of students who are physically active at least 3 times a week and those who are physically active less than 3 times a week.
M1 = M2
Ha: There is a difference between the average weights of students who are physically active at least 3 times a week and those who are physically active less than 3 times a week.
M1 != M2
Next, we will introduce a new function, hypothesize,
that falls into the infer workflow. You will use this
method for conducting hypothesis tests.
But first, we need to initialize the test, which we will save as
obs_diff.
obs_diff <- yrbss_no_na %>%
specify(weight ~ physical_3plus) %>%
calculate(stat = "diff in means", order = c("yes", "no"))Notice how you can use the functions specify and
calculate again like you did for calculating confidence
intervals. Here, though, the statistic you are searching for is the
difference in means, with the order being
yes - no != 0.
After you have initialized the test, you need to simulate the test on
the null distribution, which we will save as null.
null_dist <- yrbss_no_na %>%
specify(weight ~ physical_3plus) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in means", order = c("yes", "no"))Here, hypothesize is used to set the null hypothesis as
a test for independence. In one sample cases, the null
argument can be set to “point” to test a hypothesis relative to a point
estimate.
Also, note that the type argument within
generate is set to permute, which is the
argument when generating a null distribution for a hypothesis test.
We can visualize this null distribution with the following code:
ggplot(data = null_dist, aes(x = stat)) +
geom_histogram()null permutations have a difference
of at least obs_stat?obs_diff[[1]]## [1] 1.774584
(null_dist$stat >= obs_diff[[1]]) %>%
table()## .
## FALSE
## 1000
sum(null_dist$stat >= obs_diff$stat)## [1] 0
Now that the test is initialized and the null distribution formed,
you can calculate the p-value for your hypothesis test using the
function get_p_value.
null_dist %>%
get_p_value(obs_stat = obs_diff, direction = "two_sided")## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0
This is the standard workflow for performing hypothesis tests.
weight_mean <- yrbss_no_na %>%
group_by(physical_3plus) %>%
summarise(mean_weight = mean(weight, na.rm = TRUE))
weight_mean## # A tibble: 2 × 2
## physical_3plus mean_weight
## <chr> <dbl>
## 1 no 66.7
## 2 yes 68.4
The mean of weights of students who are physically active for at least 3 days a week is 68.4, while the mean of students who are not physically active at least 3 days a week is 66.7.
weight_sd <- yrbss_no_na %>%
group_by(physical_3plus) %>%
summarise(sd_weight = sd(weight, na.rm = TRUE))
weight_sd## # A tibble: 2 × 2
## physical_3plus sd_weight
## <chr> <dbl>
## 1 no 17.6
## 2 yes 16.5
The standard deviation of weights of students who are physically active for at least 3 days a week is 16.5, while the mean of students who are physically active for less than 3 days a week is 17.6.
# Total for students who are physically active at least 3 days a week and less than 3 days a week
total_n <- yrbss %>%
filter(!(is.na(physical_3plus) | is.na(weight))) %>%
group_by(physical_3plus) %>%
summarise(count = n())
total_n## # A tibble: 2 × 2
## physical_3plus count
## <chr> <int>
## 1 no 4022
## 2 yes 8342
# physically active at least 3 days a week
x1_mean <- 68.4
x1_sd <- 16.5
x1_n <- 8342# not physically active at least 3 days a week
x2_mean <- 66.7
x2_sd <- 17.6
x2_n <- 4022# Standard Error
SE <- x1_sd / sqrt(x1_n)
alpha = 0.05
degrees_of_freedom = x1_n - 1
t_score = qt(p=alpha/2, df=degrees_of_freedom,lower.tail=F)
# Margin of Error
margin_error <- t_score * SE# physically active at least 3 days a week
lower_ci1 <- round(x1_mean - margin_error, 2)
upper_ci1 <- round(x1_mean + margin_error, 2)
lower_ci1## [1] 68.05
upper_ci1## [1] 68.75
We are 95% confident that the average weight of a student who is physically active at least 3 days a week will be between 68.05 kg and 68.75 kg.
SE2 <- x2_sd / sqrt(x2_n)
alpha2 = 0.05
degrees_of_freedom2 = x2_n - 1
t_score2 = qt(p=alpha2/2, df=degrees_of_freedom2,lower.tail=F)
margin_error2 <- t_score2 * SE2# not physically active at least 3 days a week
lower_ci2 <- round(x2_mean - margin_error2, 2)
upper_ci2 <- round(x2_mean + margin_error2, 2)
lower_ci2## [1] 66.16
upper_ci2## [1] 67.24
We are 95% confident that the average weight of a student who is not physically active at least 3 days a week will be between 66.16 kg and 67.24 kg.
height) and interpret it in context.height_mean <- mean(yrbss$height, na.rm = TRUE)
height_mean## [1] 1.691241
height_sd <- sd(yrbss$height, na.rm = TRUE)
height_sd## [1] 0.1046973
total_height <- yrbss_no_na %>%
select(height) %>%
filter(!is.na(height)) %>%
summarise(count = n())
total_height## # A tibble: 1 × 1
## count
## <int>
## 1 12364
set.seed(1234)
SE <- height_sd / sqrt(total_height)
alpha <- 0.05
p <- alpha/2
df <- total_height - 1
# df = 12364-1 = 12363
t_score <- qt(p, df=12363, lower.tail = F)
margin_error <- t_score * SElower_ci <- height_mean - margin_error
upper_ci <- height_mean + margin_error
lower_ci## count
## 1 1.689395
upper_ci## count
## 1 1.693087
With a 95% confidence interval, the average height of students in meters will be between 1.689395 and 1.693087.
set.seed(1234)
new_SE <- height_sd / sqrt(total_height)
new_alpha <- 0.10
new_p <- new_alpha/2
new_df <- total_height - 1
# new_df = 12364-1 = 12363
new_t_score <- qt(new_p, df=12363, lower.tail = F)
new_margin_error <- new_t_score * new_SE# 90% Confidence Level
new_lower_ci <- height_mean - new_margin_error
new_upper_ci <- height_mean + new_margin_error
new_lower_ci## count
## 1 1.689692
new_upper_ci## count
## 1 1.69279
# difference between interval ranges at 95% confidence level
conf95_difference <- upper_ci - lower_ci
conf95_difference## count
## 1 0.003691275
# difference between interval ranges at 95% confidence level
conf90_difference <- new_upper_ci - new_lower_ci
conf90_difference## count
## 1 0.003097744
overall_difference <- conf95_difference - conf90_difference
overall_difference## count
## 1 0.0005935305
With a 90% confidence interval, the average height in meters for students will be between 1.689692 and 1.69279. Compared with the confidence intervals at 95%, the lower confidence interval value is slightly higher, while the upper confidence interval is slightly lower. The interval range at with a 95% confidence level is slightly larger than the confidence interval range at a 90% confidence level, the difference being 0.0005935305.
Ho: There is no difference in average height for students who exercise at least 3 times a week.
Ha: There is a difference in average height for students who exercise at least 3 times a week.
yrbss_height_no_na <- yrbss %>%
filter(!is.na(physical_3plus), !is.na(height))
yrbss_height_no_na## # A tibble: 12,364 × 14
## age gender grade hispanic race height weight helme…¹ text_…² physi…³
## <int> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <int>
## 1 15 female 9 hispanic Native Haw… 1.73 84.4 never 30 7
## 2 15 female 9 not Black or A… 1.6 55.8 never 0 0
## 3 15 female 9 not Black or A… 1.5 46.7 did no… did no… 2
## 4 15 female 9 not Black or A… 1.57 67.1 did no… did no… 1
## 5 15 female 9 not Black or A… 1.65 132. did no… <NA> 4
## 6 14 male 9 not Black or A… 1.88 71.2 never <NA> 4
## 7 15 male 9 not Black or A… 1.75 63.5 never <NA> 5
## 8 15 male 10 not Black or A… 1.37 97.1 did no… <NA> 0
## 9 15 female 9 not Black or A… 1.68 69.8 did no… 0 0
## 10 15 female 9 not Black or A… 1.65 66.7 did no… did no… 0
## # … with 12,354 more rows, 4 more variables: hours_tv_per_school_day <chr>,
## # strength_training_7d <int>, school_night_hours_sleep <chr>,
## # physical_3plus <chr>, and abbreviated variable names ¹helmet_12m,
## # ²text_while_driving_30d, ³physically_active_7d
height_obs_diff <- yrbss_height_no_na %>%
specify(height ~ physical_3plus) %>%
calculate(stat = "diff in means", order = c("yes", "no"))height_null_dist <- yrbss_height_no_na %>%
specify(height ~ physical_3plus) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in means", order = c("yes", "no"))height_null_dist %>%
get_p_value(obs_stat = height_obs_diff, direction = "two_sided")## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0
mean_height <- yrbss_height_no_na %>%
group_by(physical_3plus) %>%
summarise(mean_height = mean(height, na.rm = TRUE))
mean_height## # A tibble: 2 × 2
## physical_3plus mean_height
## <chr> <dbl>
## 1 no 1.67
## 2 yes 1.70
sd_height <- yrbss_height_no_na %>%
group_by(physical_3plus) %>%
summarise(sd_height_3plus = sd(height, na.rm = TRUE))
sd_height## # A tibble: 2 × 2
## physical_3plus sd_height_3plus
## <chr> <dbl>
## 1 no 0.103
## 2 yes 0.103
total_height_n <- yrbss_height_no_na %>%
filter(!(is.na(physical_3plus) | is.na(height))) %>%
group_by(physical_3plus) %>%
summarise(count = n())
total_height_n## # A tibble: 2 × 2
## physical_3plus count
## <chr> <int>
## 1 no 4022
## 2 yes 8342
# physically active at least 3 days a week
height_mean <- 1.70
height_sd <- .103
height_n <- 8342# not physically active at least 3 days a week
not_active_mean <- 1.67
not_active_sd <- .103
not_active_n <- 4022set.seed(1234)
active_SE <- height_sd / sqrt(height_n)
alpha = 0.05
active_degrees_of_freedom = height_n - 1
active_t_score = qt(p=alpha/2, df=active_degrees_of_freedom,lower.tail=F)
active_margin_error <- t_score * active_SElower_ci1 <- height_mean - active_margin_error
upper_ci1 <- height_mean + active_margin_error
lower_ci1## [1] 1.697789
upper_ci1## [1] 1.702211
With a 95% confidence level, the average height of students who are physically active at least 3 days a week will be between 1.697789 and 1.702211.
set.seed(1234)
not_active_SE <- not_active_sd / sqrt(not_active_n)
alpha = 0.05
not_active_degrees_of_freedom = not_active_n - 1
not_active_t_score = qt(p=alpha/2, df=not_active_degrees_of_freedom,lower.tail=F)
not_active_margin_error <- not_active_t_score * not_active_SEnot_active_t_score = qt(p=alpha/2, df=not_active_degrees_of_freedom,lower.tail=F)
not_active_t_score## [1] 1.960554
not_active_lower_ci1 <- not_active_mean - not_active_margin_error
not_active_upper_ci1 <- not_active_mean + not_active_margin_error
not_active_lower_ci1## [1] 1.666816
not_active_upper_ci1## [1] 1.673184
With a 95% confidence level, the average height of students who are not physically active at least 3 days a week will be between 1.666816 and 1.673184.
p_value_active <- 2 * pt(active_t_score, active_degrees_of_freedom, lower.tail = FALSE)
p_value_active## [1] 0.05
With a p-value of 0.05, we can reject the null hypothesis that there is no difference in the average of heights for students who are physically active at least 3 days a week.
hours_tv_per_school_day
there are.diff_options <- yrbss %>%
filter(!is.na(hours_tv_per_school_day)) %>%
group_by(hours_tv_per_school_day)%>%
summarise(count = n())
diff_options## # A tibble: 7 × 2
## hours_tv_per_school_day count
## <chr> <int>
## 1 1 1750
## 2 2 2705
## 3 3 2139
## 4 4 1048
## 5 5+ 1595
## 6 <1 2168
## 7 do not watch 1840
Ho: There is no difference in the average of weights for people who sleep at least 8 hours a day.
Ha: There is a difference in the average of weights for people who sleep at least 8 hours a day.
yrbss <- yrbss %>%
mutate(sleep_8hours = ifelse(yrbss$school_night_hours_sleep > 7, "yes", "no")) yrbss_weight_no_na <- yrbss %>%
filter(!is.na(sleep_8hours), !is.na(weight))
yrbss_weight_no_na## # A tibble: 11,481 × 15
## age gender grade hispanic race height weight helme…¹ text_…² physi…³
## <int> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <int>
## 1 15 female 9 hispanic Native Haw… 1.73 84.4 never 30 7
## 2 15 female 9 not Black or A… 1.6 55.8 never 0 0
## 3 15 female 9 not Black or A… 1.5 46.7 did no… did no… 2
## 4 15 female 9 not Black or A… 1.57 67.1 did no… did no… 1
## 5 15 female 9 not Black or A… 1.65 132. did no… <NA> 4
## 6 14 male 9 not Black or A… 1.88 71.2 never <NA> 4
## 7 15 male 9 not Black or A… 1.75 63.5 never <NA> 5
## 8 15 male 10 not Black or A… 1.37 97.1 did no… <NA> 0
## 9 15 female 9 not Black or A… 1.68 69.8 did no… 0 0
## 10 15 female 9 not Black or A… 1.65 66.7 did no… did no… 0
## # … with 11,471 more rows, 5 more variables: hours_tv_per_school_day <chr>,
## # strength_training_7d <int>, school_night_hours_sleep <chr>,
## # physical_3plus <chr>, sleep_8hours <chr>, and abbreviated variable names
## # ¹helmet_12m, ²text_while_driving_30d, ³physically_active_7d
weight_obs_diff <- yrbss_weight_no_na %>%
specify(weight ~ sleep_8hours) %>%
calculate(stat = "diff in means", order = c("yes", "no"))
weight_obs_diff## Response: weight (numeric)
## Explanatory: sleep_8hours (factor)
## # A tibble: 1 × 1
## stat
## <dbl>
## 1 -1.18
weight_null_dist <- yrbss_weight_no_na %>%
specify(weight ~ sleep_8hours) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in means", order = c("yes", "no"))
weight_null_dist## Response: weight (numeric)
## Explanatory: sleep_8hours (factor)
## Null Hypothesis: independence
## # A tibble: 1,000 × 2
## replicate stat
## <int> <dbl>
## 1 1 0.635
## 2 2 -0.225
## 3 3 -0.328
## 4 4 -0.246
## 5 5 0.0539
## 6 6 0.138
## 7 7 -0.0689
## 8 8 -0.540
## 9 9 -0.0175
## 10 10 -0.139
## # … with 990 more rows
weight_null_dist %>%
get_p_value(obs_stat = weight_obs_diff, direction = "two_sided")## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0.002
mean_weight <- yrbss_weight_no_na %>%
group_by(sleep_8hours) %>%
summarise(mean_weight_sleep = mean(weight, na.rm = TRUE))
mean_weight## # A tibble: 2 × 2
## sleep_8hours mean_weight_sleep
## <chr> <dbl>
## 1 no 68.2
## 2 yes 67.0
sd_weight <- yrbss_weight_no_na %>%
group_by(sleep_8hours) %>%
summarise(sd_weight_sleep = sd(weight, na.rm = TRUE))
sd_weight## # A tibble: 2 × 2
## sleep_8hours sd_weight_sleep
## <chr> <dbl>
## 1 no 17.2
## 2 yes 16.4
total_weight_n <- yrbss_weight_no_na %>%
filter(!(is.na(sleep_8hours) | is.na(weight))) %>%
group_by(sleep_8hours) %>%
summarise(count = n())
total_weight_n## # A tibble: 2 × 2
## sleep_8hours count
## <chr> <int>
## 1 no 8271
## 2 yes 3210
# sleep at least 8 hours
sleep_mean <- 67.0
sleep_sd <- 16.4
sleep_n <- 3210# no sleep for at least 8 hours
no_sleep_mean <- 68.2
no_sleep_sd <- 17.2
no_sleep_n <- 8271set.seed(1234)
sleep_SE <- sleep_sd / sqrt(sleep_n)
alpha = 0.05
sleep_degrees_of_freedom = sleep_n - 1
sleep_t_score = qt(p=alpha/2, df=degrees_of_freedom,lower.tail=F)
sleep_margin_error <- sleep_t_score * sleep_SEsleep_lower_ci1 <- sleep_mean - sleep_margin_error
sleep_upper_ci1 <- sleep_mean + sleep_margin_error
sleep_lower_ci1## [1] 66.43258
sleep_upper_ci1## [1] 67.56742
With a 95% confidence level, the average weight of students who sleep at least 8 hours will be between 66.43258 kg and 67.56742 kg.
set.seed(1234)
no_sleep_SE <- no_sleep_sd / sqrt(no_sleep_n)
alpha = 0.05
no_sleep_degrees_of_freedom = no_sleep_n - 1
no_sleep_t_score = qt(p=alpha/2, df=no_sleep_degrees_of_freedom,lower.tail=F)
no_sleep_margin_error <- no_sleep_t_score * no_sleep_SEno_sleep_lower_ci1 <- no_sleep_mean - no_sleep_margin_error
no_sleep_upper_ci1 <- no_sleep_mean + no_sleep_margin_error
no_sleep_lower_ci1## [1] 67.82927
no_sleep_upper_ci1## [1] 68.57073
With a 95% confidence level, the average weight of students who do not sleep at least 8 hours will be between 67.82927 kg and 68.57073 kg. With a p-value of 0.05, we can reject the null hypothesis.
# Interval range difference for students who slept at least 8 hours
diff_sleep <- sleep_upper_ci1 - sleep_lower_ci1
diff_sleep## [1] 1.134834
# Interval Range difference for students who did not sleep at least 8 hours
diff_no_sleep <- no_sleep_upper_ci1 - no_sleep_lower_ci1
diff_no_sleep## [1] 0.7414657
sleep_diff <- diff_sleep - diff_no_sleep
sleep_diff## [1] 0.3933685
The confidence interval range difference of students who sleep at least 8 hours is 1.134834, which is larger than the range for students who do not sleep at least 8 hours, which is 0.7414657, a difference of 0.3933685.
# p-value
sleep_p_value <- 2 * pt(sleep_t_score, sleep_degrees_of_freedom, lower.tail = FALSE)
sleep_p_value## [1] 0.05005317
Because the p-value is 0.05, we can reject the null hypothesis that there is no difference in the average weights of students who sleep at least 8 hours.