library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.8
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(infer)
library(ggplot2)
library(dplyr)
data('yrbss', package = 'openintro' )
What are the cases in this data set? How many cases are there in our sample?
There are 13 cases in this data set -- Age, Gender, Grade, Hispanic, Race, Height, Weight, Helmet_12m, Text_while_driving_30d, Physically_Active, Hours_tv_per_school_day, Strength_training_7d, School_night_hours_sleep
glimpse(yrbss)
## Rows: 13,583
## Columns: 13
## $ age <int> 14, 14, 15, 15, 15, 15, 15, 14, 15, 15, 15, 1~
## $ gender <chr> "female", "female", "female", "female", "fema~
## $ grade <chr> "9", "9", "9", "9", "9", "9", "9", "9", "9", ~
## $ hispanic <chr> "not", "not", "hispanic", "not", "not", "not"~
## $ race <chr> "Black or African American", "Black or Africa~
## $ height <dbl> NA, NA, 1.73, 1.60, 1.50, 1.57, 1.65, 1.88, 1~
## $ weight <dbl> NA, NA, 84.37, 55.79, 46.72, 67.13, 131.54, 7~
## $ helmet_12m <chr> "never", "never", "never", "never", "did not ~
## $ text_while_driving_30d <chr> "0", NA, "30", "0", "did not drive", "did not~
## $ physically_active_7d <int> 4, 2, 7, 0, 2, 1, 4, 4, 5, 0, 0, 0, 4, 7, 7, ~
## $ hours_tv_per_school_day <chr> "5+", "5+", "5+", "2", "3", "5+", "5+", "5+",~
## $ strength_training_7d <int> 0, 0, 0, 0, 1, 0, 2, 0, 3, 0, 3, 0, 0, 7, 7, ~
## $ school_night_hours_sleep <chr> "8", "6", "<5", "6", "9", "8", "9", "6", "<5"~
How many observations are we missing weights from?
We are missing 1004 observations, represented by NA's
summary(yrbss$weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 29.94 56.25 64.41 67.91 76.20 180.99 1004
Make a side-by-side boxplot of physical_3plus and weight. Is there a relationship between these two variables? What did you expect and why?
There is a relationship between the two variables. Without looking at the data, I expected people with a lower weight to be physically active for more than 3 days. It makes sense that No physical active for more than 3 days has larger outliers than Yes.
yrbss <- yrbss %>%
mutate(physical_3plus = ifelse(yrbss$physically_active_7d > 2, "yes", "no"))
ggplot(yrbss, aes(x=weight, y=physical_3plus)) + geom_boxplot()
## Warning: Removed 1004 rows containing non-finite values (stat_boxplot).
Are all conditions necessary for inference satisfied? Comment on each. You can compute the group sizes with the summarize command above by defining a new variable with the definition n().
With the box plot, we can see that everything is almost symmetrical.
yrbss %>%
group_by(physical_3plus) %>%
summarise(mean_weight = mean(weight, na.rm = TRUE))
## # A tibble: 3 x 2
## physical_3plus mean_weight
## <chr> <dbl>
## 1 no 66.7
## 2 yes 68.4
## 3 <NA> 69.9
It turns out that numerically, it isn't symmetrical, All possible types of physical_3plus are greater than 30, are reasonably random and not including NA, constitues for at least 10% of the total.
yrbss %>%
group_by(physical_3plus) %>%
summarise(n = n())
## # A tibble: 3 x 2
## physical_3plus n
## <chr> <int>
## 1 no 4404
## 2 yes 8906
## 3 <NA> 273
Write the hypotheses for testing if the average weights are different for those who exercise at least times a week and those who don't.
Null hypothesis - H0: There is a difference in weight between individuals with a three day workout regimen, versus the individuals that don't. Alternate hypothesis - HA: The opposite of above (There is no difference in weight between both sets of idividuals)
How many of these null permutations have a difference of at least obs_stat?
Our obs_stat (obs_diff) is about 1.77, and there is at least one record in the distribution of our permutation that matches that.
obs_diff <- yrbss %>%
specify(weight ~ physical_3plus) %>%
calculate(stat = "diff in means", order = c("yes", "no"))
## Warning: Removed 1219 rows containing missing values.
obs_diff
## Response: weight (numeric)
## Explanatory: physical_3plus (factor)
## # A tibble: 1 x 1
## stat
## <dbl>
## 1 1.77
null_dist <- yrbss %>%
specify(weight ~ physical_3plus) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in means", order = c("yes", "no"))
## Warning: Removed 1219 rows containing missing values.
ggplot(data = null_dist, aes(x = stat)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Construct and record a confidence interval for the difference between the weights of those who exercise at least three times a week and those who don't, and interpret this interval in context of the data.
As explained here: https://cran.r-project.org/web/packages/infer/infer.pdf
I wasn't sure which calculation to use here, so I went with diff in means since the others gave errors:
calculate( x, stat = c("mean", "median", "sum", "sd", "prop", "count", "diff in means", "diff in medians", "diff in props", "Chisq", "F", "slope", "correlation", "t", "z", "ratio of props", "odds ratio"), order = NULL, ... )
yrbss %>%
specify(weight ~ physical_3plus) %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "diff in means") %>%
get_ci(level = 0.95)
## Warning: Removed 1219 rows containing missing values.
## Warning: The statistic is based on a difference or ratio; by default, for
## difference-based statistics, the explanatory variable is subtracted in the
## order "no" - "yes", or divided in the order "no" / "yes" for ratio-based
## statistics. To specify this order yourself, supply `order = c("no", "yes")` to
## the calculate() function.
## # A tibble: 1 x 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 -2.37 -1.13
inference(y = nc\(weight, x = nc\)habit, est = "mean", type = "ci", null = 0, alternative = "twosided", method = "theoretical", order = c("smoker","nonsmoker"))
Calculate a 95% confidence interval for the average height in meters (height) and interpret it in context.
The confidence interval range is 1.689427 to 1.693034 in this example without a seed (subject to change if re-run)
yrbss %>%
specify(response = height) %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "mean") %>%
get_ci(level = 0.95)
## Warning: Removed 1004 rows containing missing values.
## # A tibble: 1 x 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 1.69 1.69
Calculate a new confidence interval for the same parameter at the 90% confidence level. Comment on the width of this interval versus the one obtained in the previous exercise.
In this example without setting a seed, the range for the confidence interval is 1.689718 to 1.692773 which is about the same as before. This is subject to change, however.
yrbss %>%
specify(response = height) %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "mean") %>%
get_ci(level = 0.90)
## Warning: Removed 1004 rows containing missing values.
## # A tibble: 1 x 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 1.69 1.69
Conduct a hypothesis test evaluating whether the average height is different for those who exercise at least three times a week and those who don't.
Null Hypothesis: H0 - There is a difference in height between individuals who work out three times a week, versus individuals who don't. Alterantive Hypothesis: HA - Opposite of above (There is no difference in average height between individuals who work out three times a week and those who don't.)
obs_diff <- yrbss %>%
specify(height ~ physical_3plus) %>%
calculate(stat = "diff in means", order = c("yes", "no"))
## Warning: Removed 1219 rows containing missing values.
obs_diff
## Response: height (numeric)
## Explanatory: physical_3plus (factor)
## # A tibble: 1 x 1
## stat
## <dbl>
## 1 0.0376
null_dist <- yrbss %>%
specify(height ~ physical_3plus) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in means", order = c("yes", "no"))
## Warning: Removed 1219 rows containing missing values.
ggplot(data = null_dist, aes(x = stat)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The obs_diff is about 0.037, which isn't represented in the histogram above.
Now, a non-inference task: Determine the number of different options there are in the dataset for the hours_tv_per_school_day there are.
yrbss %>%
group_by(hours_tv_per_school_day) %>%
summarise(n = n())
## # A tibble: 8 x 2
## hours_tv_per_school_day n
## <chr> <int>
## 1 <1 2168
## 2 1 1750
## 3 2 2705
## 4 3 2139
## 5 4 1048
## 6 5+ 1595
## 7 do not watch 1840
## 8 <NA> 338
Come up with a research question evaluating the relationship between height or weight and sleep. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval. Report the statistical results, and also provide an explanation in plain language. Be sure to check all assumptions, state your ?? level, and conclude in context.
Is there a difference in height between individuals who have a higher average hours of sleep?
yrbss %>%
group_by(school_night_hours_sleep) %>%
summarise(n = n())
## # A tibble: 8 x 2
## school_night_hours_sleep n
## <chr> <int>
## 1 <5 965
## 2 10+ 316
## 3 5 1480
## 4 6 2658
## 5 7 3461
## 6 8 2692
## 7 9 763
## 8 <NA> 1248
Couldn't get this chunk to work, which prevents the rest of the code from running as well. Not sure which calculate to use for this hypothesis.
obs_diff <- yrbss %>% specify(height ~ school_night_hours_sleep) %>% calculate(stat = "diff in means", order = c("yes", "no"))
null_dist <- yrbss %>% specify(height ~ school_night_hours_sleep) %>% hypothesize(null = "independence") %>% generate(reps = 1000, type = "permute") %>% calculate(stat = "diff in means", order = c("yes", "no"))
ggplot(data = null_dist, aes(x = stat)) + geom_histogram()
yrbss %>% specify(weight ~ school_night_hours_sleep) %>% generate(reps = 1000, type = "bootstrap") %>% calculate(stat = "diff in means") %>% get_ci(level = 0.95)