library(tidyverse)
## -- Attaching packages ------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.3 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts ---------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(infer)
data(yrbss)
?yrbss
What are the cases in this data set? How many cases are there in our sample?
Each case is one high school student who responded to the survey. There are 13,583 cases in the sample, and each case is described by 13 variables.
glimpse(yrbss)
## Rows: 13,583
## Columns: 13
## $ age <int> 14, 14, 15, 15, 15, 15, 15, 14, 15, 15, 15...
## $ gender <chr> "female", "female", "female", "female", "f...
## $ grade <chr> "9", "9", "9", "9", "9", "9", "9", "9", "9...
## $ hispanic <chr> "not", "not", "hispanic", "not", "not", "n...
## $ race <chr> "Black or African American", "Black or Afr...
## $ height <dbl> NA, NA, 1.73, 1.60, 1.50, 1.57, 1.65, 1.88...
## $ weight <dbl> NA, NA, 84.37, 55.79, 46.72, 67.13, 131.54...
## $ helmet_12m <chr> "never", "never", "never", "never", "did n...
## $ text_while_driving_30d <chr> "0", NA, "30", "0", "did not drive", "did ...
## $ physically_active_7d <int> 4, 2, 7, 0, 2, 1, 4, 4, 5, 0, 0, 0, 4, 7, ...
## $ hours_tv_per_school_day <chr> "5+", "5+", "5+", "2", "3", "5+", "5+", "5...
## $ strength_training_7d <int> 0, 0, 0, 0, 1, 0, 2, 0, 3, 0, 3, 0, 0, 7, ...
## $ school_night_hours_sleep <chr> "8", "6", "<5", "6", "9", "8", "9", "6", "...
summary(yrbss$weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 29.94 56.25 64.41 67.91 76.20 180.99 1004
How many observations are we missing weights from?
We are missing weights from 1004 observations.
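The count can also be read off directly rather than from the `summary()` output. This is a minimal sketch, assuming the `yrbss` data from `openintro` as loaded above; the name `n_missing_weight` is introduced here for illustration only.

```r
library(openintro)
data(yrbss)

# Count the observations with no recorded weight
n_missing_weight <- sum(is.na(yrbss$weight))
n_missing_weight

# Share of the sample that is missing a weight
mean(is.na(yrbss$weight))
```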
yrbss <- yrbss %>%
  mutate(physical_3plus = ifelse(physically_active_7d > 2, "yes", "no"))
Make a side-by-side boxplot of physical_3plus and weight. Is there a relationship between these two variables? What did you expect and why?
boxplot(weight ~ physical_3plus, data = yrbss)
The weights in the two groups are distributed very similarly. There still seems to be a relationship between the two variables, which I expected because weight and physical activity affect each other. Both groups have many high outliers.
yrbss %>%
  group_by(physical_3plus) %>%
  summarise(mean_weight = mean(weight, na.rm = TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 2
## physical_3plus mean_weight
## <chr> <dbl>
## 1 no 66.7
## 2 yes 68.4
## 3 <NA> 69.9
Are all conditions necessary for inference satisfied? Comment on each. You can compute the group sizes with the summarize command above by defining a new variable with the definition n().
summary(yrbss$physical_3plus)
## Length Class Mode
## 13583 character character
The sample is large enough for the normal condition to apply. Each value comes from a different student, so the observations are independent, and the sample is random. The conditions for inference are met.
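The group sizes that the prompt asks for can be computed by adding `n()` to the earlier `summarise()` call. The sketch below assumes the same `physical_3plus` definition as above and drops rows with a missing group or weight first; the names `group_sizes` and `sd_weight` are introduced here for illustration.

```r
library(dplyr)
library(openintro)
data(yrbss)

group_sizes <- yrbss %>%
  mutate(physical_3plus = ifelse(physically_active_7d > 2, "yes", "no")) %>%
  # Keep only rows that can enter the comparison of mean weights
  filter(!is.na(physical_3plus), !is.na(weight)) %>%
  group_by(physical_3plus) %>%
  summarise(
    n = n(),                    # group size
    mean_weight = mean(weight),
    sd_weight = sd(weight)
  )

group_sizes
```

Both values of `n` should be well above 30 for the sample-size condition to be comfortable.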
Write the hypotheses for testing if the average weights are different for those who exercise at least three times a week and those who don’t.
H0: mu_yes = mu_no
HA: mu_yes ≠ mu_no
obs_diff <- yrbss %>%
  specify(weight ~ physical_3plus) %>%
  calculate(stat = "diff in means", order = c("yes", "no"))
## Warning: Removed 1219 rows containing missing values.
null_dist <- yrbss %>%
  specify(weight ~ physical_3plus) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("yes", "no"))
## Warning: Removed 1219 rows containing missing values.
ggplot(data = null_dist, aes(x = stat)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
How many of these null permutations have a difference of at least obs_stat?
None of the null permutations appear to have a difference at least as large as the observed difference.
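Instead of judging this from the histogram, the permutations can be counted. The following is a sketch that rebuilds the observed statistic and the null distribution so that it runs on its own. The seed value, the helper data frame `yrbss_pa`, and the name `n_extreme` are assumptions added here, and the exact count will vary with the seed.

```r
library(dplyr)
library(infer)
library(openintro)
data(yrbss)

set.seed(1234)  # arbitrary seed, only so the permutations are reproducible

# Hypothetical helper data frame: complete cases for this comparison
yrbss_pa <- yrbss %>%
  mutate(physical_3plus = ifelse(physically_active_7d > 2, "yes", "no")) %>%
  filter(!is.na(weight), !is.na(physical_3plus))

obs_diff <- yrbss_pa %>%
  specify(weight ~ physical_3plus) %>%
  calculate(stat = "diff in means", order = c("yes", "no"))

null_dist <- yrbss_pa %>%
  specify(weight ~ physical_3plus) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("yes", "no"))

# Number of permuted differences at least as extreme as the observed one
n_extreme <- sum(abs(null_dist$stat) >= abs(obs_diff$stat))
n_extreme
```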
glimpse(obs_diff)
## Rows: 1
## Columns: 1
## $ stat <dbl> 1.774584
null_dist %>%
  get_p_value(obs_stat = obs_diff, direction = "two_sided")
## Warning: Please be cautious in reporting a p-value of 0. This result is an
## approximation based on the number of `reps` chosen in the `generate()` step. See
## `?get_p_value()` for more information.
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0
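As the warning says, a p-value of 0 here only means that none of the 1,000 permutations was as extreme as the observed difference. One common way to report it is as a bound of 1 / reps. The helper below is a hypothetical illustration in base R, not part of `infer`.

```r
# Hypothetical helper: report a simulation-based p-value with a floor of 1 / reps
format_sim_p <- function(p_value, reps) {
  floor_p <- 1 / reps
  if (p_value < floor_p) {
    paste0("p < ", format(floor_p, scientific = FALSE))
  } else {
    paste0("p = ", p_value)
  }
}

format_sim_p(0, reps = 1000)
format_sim_p(0.01, reps = 1000)
```

With 1,000 permutations the first call reports the result as "p < 0.001" rather than as an exact zero.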
Conduct a hypothesis test evaluating whether the average height is different for those who exercise at least three times a week and those who don’t.
H0: mu_exercise = mu_no
HA: mu_exercise ≠ mu_no
obs_diff2 <- yrbss %>%
  specify(height ~ physical_3plus) %>%
  calculate(stat = "diff in means", order = c("yes", "no"))
## Warning: Removed 1219 rows containing missing values.
null_dist2 <- yrbss %>%
  specify(height ~ physical_3plus) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("yes", "no"))
## Warning: Removed 1219 rows containing missing values.
ggplot(data = null_dist2, aes(x = stat)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
null_dist2 %>%
  get_p_value(obs_stat = obs_diff2, direction = "two_sided")
## Warning: Please be cautious in reporting a p-value of 0. This result is an
## approximation based on the number of `reps` chosen in the `generate()` step. See
## `?get_p_value()` for more information.
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0
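None of the 1,000 permutations produced a difference as extreme as the observed one, so the p-value is below 0.001. This is very strong evidence against the null hypothesis, so we reject it and conclude that average height differs between students who exercise at least three times a week and those who do not.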
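As a cross-check on the permutation result, the same comparison can be run as a Welch two-sample t-test. This is a different technique from the one used above, sketched with base R's `t.test()`; the name `tt_height` is introduced here for illustration.

```r
library(dplyr)
library(openintro)
data(yrbss)

yrbss <- yrbss %>%
  mutate(physical_3plus = ifelse(physically_active_7d > 2, "yes", "no"))

# Welch two-sample t-test; rows with a missing height or group are dropped
tt_height <- t.test(height ~ physical_3plus, data = yrbss)
tt_height
```

Note that `t.test()` reports the difference as "no" minus "yes" (alphabetical group order), the opposite sign convention from `order = c("yes", "no")` above.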
Now, a non-inference task: Determine the number of different options there are in the dataset for hours_tv_per_school_day.
glimpse(yrbss$hours_tv_per_school_day)
## chr [1:13583] "5+" "5+" "5+" "2" "3" "5+" "5+" "5+" "5+" "do not watch" ...
There are seven options: "do not watch", "<1", "1", "2", "3", "4", and "5+". Missing responses are recorded as NA.
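`glimpse()` only shows the first few values, so the options are easier to verify by tabulating the column. A minimal sketch, assuming the `yrbss` data as loaded above; `tv_counts` and `n_options` are names introduced here.

```r
library(dplyr)
library(openintro)
data(yrbss)

# One row per distinct response, with its frequency (NA gets its own row)
tv_counts <- yrbss %>%
  count(hours_tv_per_school_day)
tv_counts

# Number of distinct non-missing options
n_options <- n_distinct(yrbss$hours_tv_per_school_day, na.rm = TRUE)
n_options
```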
Come up with a research question evaluating the relationship between height or weight and sleep. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval. Report the statistical results, and also provide an explanation in plain language. Be sure to check all assumptions, state your α level, and conclude in context.
Conduct a hypothesis test evaluating whether the average height is different for those who sleep more than 6 hours on a school night and those who don’t.
H0: mu_sleep = mu_no
HA: mu_sleep ≠ mu_no
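# school_night_hours_sleep is character, so `> 6` compares strings:
# "7", "8", "9" give "yes", but "10+" sorts before "6" and gives "no".
yrbss <- yrbss %>%
  mutate(sleep = ifelse(school_night_hours_sleep > 6, "yes", "no"))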
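Because the sleep variable is stored as text, a safer way to build the grouping is to list the categories that count as more than 6 hours. The sketch below uses a small made-up vector. It assumes the response categories are "<5", "5", "6", "7", "8", "9", and "10+", and the function name `recode_sleep` is hypothetical. Using it in place of the comparison above could change the reported results, which is why it is shown separately.

```r
# Assumed categories that mean "more than 6 hours"
more_than_6 <- c("7", "8", "9", "10+")

# Hypothetical helper: explicit recode that keeps missing responses missing
recode_sleep <- function(x) {
  out <- ifelse(x %in% more_than_6, "yes", "no")
  out[is.na(x)] <- NA_character_
  out
}

# Toy input covering both ends of the scale and a missing response
recode_sleep(c("<5", "6", "8", "10+", NA))
```

Here "10+" is labeled "yes" and "<5" is labeled "no" regardless of how the strings happen to sort.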
obs_diff3 <- yrbss %>%
  specify(height ~ sleep) %>%
  calculate(stat = "diff in means", order = c("yes", "no"))
## Warning: Removed 2102 rows containing missing values.
null_dist3 <- yrbss %>%
  specify(height ~ sleep) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("yes", "no"))
## Warning: Removed 2102 rows containing missing values.
ggplot(data = null_dist3, aes(x = stat)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
null_dist3 %>%
  get_p_value(obs_stat = obs_diff3, direction = "two_sided")
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0.01
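Using α = 0.05, the p-value of 0.01 is below α, so the result is significant at the 5% level and there is strong evidence to reject the null hypothesis. The average height of students who sleep more than 6 hours on a school night differs from that of students who do not.
The sample was randomly selected, each observation comes from a different student so the data are independent, and the sample size is large enough, so the assumptions are met.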
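The question also allows a confidence interval, which shows the size and direction of the difference as well as its significance. This is a sketch of a 95% bootstrap percentile interval with `infer`, using the same `sleep` grouping as the test above. The seed and the names `yrbss_sleep`, `boot_dist`, and `ci_95` are assumptions added here.

```r
library(dplyr)
library(infer)
library(openintro)
data(yrbss)

set.seed(1234)  # arbitrary seed, only for reproducibility

# Hypothetical helper data frame: same grouping as the test, complete cases only
yrbss_sleep <- yrbss %>%
  mutate(sleep = ifelse(school_night_hours_sleep > 6, "yes", "no")) %>%
  filter(!is.na(height), !is.na(sleep))

# Bootstrap distribution of the difference in mean height (yes minus no)
boot_dist <- yrbss_sleep %>%
  specify(height ~ sleep) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "diff in means", order = c("yes", "no"))

# 95% percentile interval
ci_95 <- boot_dist %>%
  get_confidence_interval(level = 0.95, type = "percentile")
ci_95
```

If the interval excludes 0, it agrees with rejecting the null hypothesis at α = 0.05.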