library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.3 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts ---------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(infer)
data(yrbss)
?yrbss
## starting httpd help server ... done
What are the cases in this data set? How many cases are there in our sample?
Each case is one high school student who responded to the survey on health-risk behaviors and experiences of youth. There are 13,583 cases and 13 variables.
glimpse(yrbss)
## Rows: 13,583
## Columns: 13
## $ age <int> 14, 14, 15, 15, 15, 15, 15, 14, 15, 15, 15...
## $ gender <chr> "female", "female", "female", "female", "f...
## $ grade <chr> "9", "9", "9", "9", "9", "9", "9", "9", "9...
## $ hispanic <chr> "not", "not", "hispanic", "not", "not", "n...
## $ race <chr> "Black or African American", "Black or Afr...
## $ height <dbl> NA, NA, 1.73, 1.60, 1.50, 1.57, 1.65, 1.88...
## $ weight <dbl> NA, NA, 84.37, 55.79, 46.72, 67.13, 131.54...
## $ helmet_12m <chr> "never", "never", "never", "never", "did n...
## $ text_while_driving_30d <chr> "0", NA, "30", "0", "did not drive", "did ...
## $ physically_active_7d <int> 4, 2, 7, 0, 2, 1, 4, 4, 5, 0, 0, 0, 4, 7, ...
## $ hours_tv_per_school_day <chr> "5+", "5+", "5+", "2", "3", "5+", "5+", "5...
## $ strength_training_7d <int> 0, 0, 0, 0, 1, 0, 2, 0, 3, 0, 3, 0, 0, 7, ...
## $ school_night_hours_sleep <chr> "8", "6", "<5", "6", "9", "8", "9", "6", "...
summary(yrbss$weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 29.94 56.25 64.41 67.91 76.20 180.99 1004
How many observations are we missing weights from?
Weight is missing for 1,004 observations.
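The missing-value count reported by `summary()` can also be computed directly. The following is a minimal sketch on a made-up vector (the name `weights` and its values are illustrative assumptions, not the lab data):

```r
# Toy weights in kg; NA marks a missing response (values are made up).
weights <- c(84.4, NA, 55.8, 46.7, NA, 67.1)

n_missing <- sum(is.na(weights))      # count of missing values
prop_missing <- mean(is.na(weights))  # share of values that are missing

n_missing
prop_missing
```

On the lab data, `sum(is.na(yrbss$weight))` should agree with the `NA's` column of the summary above.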
yrbss <- yrbss %>%
  mutate(physical_3plus = ifelse(physically_active_7d > 2, "yes", "no"))
Make a side-by-side boxplot of physical_3plus and weight. Is there a relationship between these two variables? What did you expect and why?
The medians of the two groups are close, with the median weight slightly lower for students who are not active at least 3 days a week. The spreads are also similar, though slightly smaller for the “no” group. The “no” group has more outliers, but overall both groups appear roughly normal. I expected the two groups to look similar because the observations are teens, whose weight depends on more than exercise: many binge-watch TV and eat a lot whether or not they exercise, and still weigh about the same.
boxplot(yrbss$weight~yrbss$physical_3plus, ylab="Weight", xlab="physical_3plus")
yrbss %>%
  group_by(physical_3plus) %>%
  summarise(mean_weight = mean(weight, na.rm = TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 2
## physical_3plus mean_weight
## <chr> <dbl>
## 1 no 66.7
## 2 yes 68.4
## 3 <NA> 69.9
Are all conditions necessary for inference satisfied? Comment on each. You can compute the group sizes with the summarize command above by defining a new variable with the definition n().
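Independence: the students are assumed to be sampled independently of one another, and the sample is far less than 10% of all high school students. Sample size: each group has more than enough observations for the sampling distribution of the difference in means to be approximately normal, even with some outliers. All conditions for inference appear to be satisfied.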
summary(yrbss$physical_3plus)
## Length Class Mode
## 13583 character character
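Calling `summary()` on a character column only reports its length, so it does not give the group sizes the question asks for. A minimal sketch of the `n()` approach the prompt suggests, run on a made-up stand-in for `yrbss` (the tibble `toy` and its values are assumptions for illustration):

```r
library(dplyr)

# Toy stand-in for yrbss; the values are made up for illustration.
toy <- tibble(
  weight = c(60, 72, NA, 55, 80, 67),
  physical_3plus = c("yes", "no", "yes", "yes", NA, "no")
)

group_sizes <- toy %>%
  group_by(physical_3plus) %>%
  summarise(
    n = n(),                         # rows in the group
    n_weight = sum(!is.na(weight)),  # rows with a recorded weight
    .groups = "drop"
  )

group_sizes
```

Swapping `toy` for `yrbss` should give the actual group sizes, including the `NA` group that appears in the mean-weight table above.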
Write the hypotheses for testing if the average weights are different for those who exercise at least 3 times a week and those who don’t.
H0: The average weight is the same for students who exercise at least 3 times a week and those who don’t.
HA: The average weight is different for students who exercise at least 3 times a week and those who don’t.
obs_diff <- yrbss %>%
  specify(weight ~ physical_3plus) %>%
  calculate(stat = "diff in means", order = c("yes", "no"))
## Warning: Removed 1219 rows containing missing values.
null_dist <- yrbss %>%
  specify(weight ~ physical_3plus) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("yes", "no"))
## Warning: Removed 1219 rows containing missing values.
How many of these null permutations have a difference of at least obs_stat?
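None of the 1,000 null permutations has a difference at least as large as obs_diff, which is why the p-value is reported as 0. Because that 0 is an approximation that depends on the number of reps, it is better reported as a p-value less than 0.001.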
null_dist %>%
  get_p_value(obs_stat = obs_diff, direction = "two_sided")
## Warning: Please be cautious in reporting a p-value of 0. This result is an
## approximation based on the number of `reps` chosen in the `generate()` step. See
## `?get_p_value()` for more information.
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0
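To make the link between the permutation count and the p-value concrete, here is a minimal base R sketch of a permutation test on made-up data. The vectors, the helper `diff_in_means`, and the seed are illustrative assumptions; this is not how infer is implemented internally, only the same idea written out by hand.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Toy data (made up): a numeric response and a two-level group label.
weight <- c(70, 65, 80, 72, 68, 60, 58, 66, 63, 61)
group  <- rep(c("yes", "no"), each = 5)

# Hypothetical helper: mean of the "yes" group minus mean of the "no" group.
diff_in_means <- function(y, g) mean(y[g == "yes"]) - mean(y[g == "no"])

obs <- diff_in_means(weight, group)

# Shuffle the labels to simulate the null hypothesis of independence.
reps <- 1000
null_stats <- replicate(reps, diff_in_means(weight, sample(group)))

# Count permutations at least as extreme as the observed difference.
n_extreme <- sum(abs(null_stats) >= abs(obs))
p_value <- n_extreme / reps

n_extreme
p_value
```

For the objects in this lab, `sum(null_dist$stat >= obs_diff$stat)` would give the count the question asks about. When that count is 0, the estimate can only say the p-value is below 1/reps.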
Conduct a hypothesis test evaluating whether the average height is different for those who exercise at least three times a week and those who don’t.
The data show a difference in average height between students who exercise at least three times a week and those who don’t, so we reject the null hypothesis that the average heights are equal.
obs_diff <- yrbss %>%
  specify(height ~ physical_3plus) %>%
  calculate(stat = "diff in means", order = c("yes", "no"))
## Warning: Removed 1219 rows containing missing values.
null_dist <- yrbss %>%
  specify(height ~ physical_3plus) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("yes", "no"))
## Warning: Removed 1219 rows containing missing values.
ggplot(data = null_dist, aes(x = stat)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
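The height comparison above stops at the null distribution. One way to also quantify the size of the difference is a bootstrap confidence interval. The sketch below uses infer on simulated data; the data frame `toy`, the simulated heights, the seed, and the 95% level are all assumptions for illustration, not results from the lab data.

```r
library(dplyr)
library(infer)

set.seed(1)  # arbitrary seed for reproducibility

# Toy stand-in for yrbss: simulated heights in metres, 20 per group.
toy <- data.frame(
  height = rnorm(40, mean = 1.70, sd = 0.10),
  physical_3plus = rep(c("yes", "no"), times = 20)
)

# Resample rows with replacement and recompute the difference each time.
boot_dist <- toy %>%
  specify(height ~ physical_3plus) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "diff in means", order = c("yes", "no"))

# Percentile interval for the difference in mean height (yes minus no).
ci <- boot_dist %>% get_ci(level = 0.95)
ci
```

An interval that excludes 0 would be consistent with rejecting the null hypothesis at the matching α level; one that includes 0 would not.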
Now, a non-inference task: Determine the number of different options there are in the dataset for hours_tv_per_school_day.
There are 7 response options in the data set (<1, 1, 2, 3, 4, 5+, and do not watch), or 8 distinct values if NA is counted for those who did not answer.
yrbss %>% group_by(hours_tv_per_school_day) %>% summarise(n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 8 x 2
## hours_tv_per_school_day `n()`
## <chr> <int>
## 1 <1 2168
## 2 1 1750
## 3 2 2705
## 4 3 2139
## 5 4 1048
## 6 5+ 1595
## 7 do not watch 1840
## 8 <NA> 338
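The number of options can also be computed rather than read off the table. A minimal base R sketch on a made-up vector (the name `hours_tv` and its values are illustrative assumptions):

```r
# Toy responses (made up); NA marks a student who did not answer.
hours_tv <- c("<1", "2", "5+", NA, "2", "do not watch", "<1", NA)

n_options <- length(unique(na.omit(hours_tv)))  # response options only
n_values  <- length(unique(hours_tv))           # NA counted as a value

n_options
n_values
```

Keeping the two counts separate makes it explicit whether "no answer" is being treated as an option.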
Come up with a research question evaluating the relationship between height or weight and sleep. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval. Report the statistical results, and also provide an explanation in plain language. Be sure to check all assumptions, state your α level, and conclude in context.
Is there a difference in average height between students who sleep 7 hours or more on school nights and students who sleep less?
At the conventional α = 0.05 level, the p-value of 0.13 is greater than α, so I fail to reject the null hypothesis. The data do not provide convincing evidence that average height differs between students who sleep 7 hours or more and those who sleep less.
yrbss <- yrbss %>%
  mutate(sleep_7plus = ifelse(school_night_hours_sleep > 7, "yes", "no"))
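One caveat about the comparison above: `school_night_hours_sleep` is a character column, so `> 7` compares strings rather than numbers, and a response of exactly "7" lands in the "no" group. A minimal sketch of an explicit recode on made-up responses follows. The vector `sleep` is illustrative, and "10+" is assumed to be one of the survey's levels. Applying a recode like this to the lab data would change the grouping, so the results printed below would no longer apply.

```r
# Toy responses (made up). "10+" is assumed to be one of the survey's levels.
sleep <- c("8", "6", "<5", "7", "10+", NA)

# Character comparison is lexicographic, so "10+" is not treated as > 7.
"10+" > 7

# Explicit recode: list the levels that mean "7 hours or more".
long_sleep <- c("7", "8", "9", "10+")
sleep_7plus <- ifelse(is.na(sleep), NA_character_,
                      ifelse(sleep %in% long_sleep, "yes", "no"))

sleep_7plus
```

Listing the levels by hand is more verbose than a numeric comparison, but it keeps the grouping independent of how strings happen to sort.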
obs_diff <- yrbss %>%
  specify(height ~ sleep_7plus) %>%
  calculate(stat = "diff in means", order = c("yes", "no"))
## Warning: Removed 2102 rows containing missing values.
null_dist <- yrbss %>%
  specify(height ~ sleep_7plus) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("yes", "no"))
## Warning: Removed 2102 rows containing missing values.
null_dist %>%
  get_p_value(obs_stat = obs_diff, direction = "two_sided")
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0.13
ggplot(data = null_dist, aes(x = stat)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.