library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(infer)
library(ggplot2)
library(dplyr)
data('yrbss', package = 'openintro' )

What are the cases in this data set? How many cases are there in our sample?

There are 13 cases in this data set -- Age, Gender, Grade, Hispanic, Race, Height, Weight, Helmet_12m, Text_while_driving_30d, Physically_Active, Hours_tv_per_school_day, Strength_training_7d, School_night_hours_sleep

glimpse(yrbss)
## Rows: 13,583
## Columns: 13
## $ age                      <int> 14, 14, 15, 15, 15, 15, 15, 14, 15, 15, 15, 1~
## $ gender                   <chr> "female", "female", "female", "female", "fema~
## $ grade                    <chr> "9", "9", "9", "9", "9", "9", "9", "9", "9", ~
## $ hispanic                 <chr> "not", "not", "hispanic", "not", "not", "not"~
## $ race                     <chr> "Black or African American", "Black or Africa~
## $ height                   <dbl> NA, NA, 1.73, 1.60, 1.50, 1.57, 1.65, 1.88, 1~
## $ weight                   <dbl> NA, NA, 84.37, 55.79, 46.72, 67.13, 131.54, 7~
## $ helmet_12m               <chr> "never", "never", "never", "never", "did not ~
## $ text_while_driving_30d   <chr> "0", NA, "30", "0", "did not drive", "did not~
## $ physically_active_7d     <int> 4, 2, 7, 0, 2, 1, 4, 4, 5, 0, 0, 0, 4, 7, 7, ~
## $ hours_tv_per_school_day  <chr> "5+", "5+", "5+", "2", "3", "5+", "5+", "5+",~
## $ strength_training_7d     <int> 0, 0, 0, 0, 1, 0, 2, 0, 3, 0, 3, 0, 0, 7, 7, ~
## $ school_night_hours_sleep <chr> "8", "6", "<5", "6", "9", "8", "9", "6", "<5"~

How many observations are we missing weights from?

We are missing 1004 observations, represented by NA's

summary(yrbss$weight)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   29.94   56.25   64.41   67.91   76.20  180.99    1004

Make a side-by-side boxplot of physical_3plus and weight. Is there a relationship between these two variables? What did you expect and why?

There is a relationship between the two variables. Without looking at the data, I expected people with a lower weight to be physically active for more than 3 days. It makes sense that No physical active for more than 3 days has larger outliers than Yes.

yrbss <- yrbss %>% 
  mutate(physical_3plus = ifelse(yrbss$physically_active_7d > 2, "yes", "no"))

ggplot(yrbss, aes(x=weight, y=physical_3plus)) + geom_boxplot()
## Warning: Removed 1004 rows containing non-finite values (stat_boxplot).

Are all conditions necessary for inference satisfied? Comment on each. You can compute the group sizes with the summarize command above by defining a new variable with the definition n().

With the box plot, we can see that everything is almost symmetrical.

yrbss %>%
  group_by(physical_3plus) %>%
  summarise(mean_weight = mean(weight, na.rm = TRUE))
## # A tibble: 3 x 2
##   physical_3plus mean_weight
##   <chr>                <dbl>
## 1 no                    66.7
## 2 yes                   68.4
## 3 <NA>                  69.9

It turns out that numerically, it isn't symmetrical, All possible types of physical_3plus are greater than 30, are reasonably random and not including NA, constitues for at least 10% of the total.

yrbss %>%
  group_by(physical_3plus) %>%
  summarise(n = n())
## # A tibble: 3 x 2
##   physical_3plus     n
##   <chr>          <int>
## 1 no              4404
## 2 yes             8906
## 3 <NA>             273

Write the hypotheses for testing if the average weights are different for those who exercise at least times a week and those who don't.

Null hypothesis - H0: There is a difference in weight between individuals with a three day workout regimen, versus the individuals that don't. Alternate hypothesis - HA: The opposite of above (There is no difference in weight between both sets of idividuals)

How many of these null permutations have a difference of at least obs_stat?

Our obs_stat (obs_diff) is about 1.77, and there is at least one record in the distribution of our permutation that matches that.

obs_diff <- yrbss %>%
  specify(weight ~ physical_3plus) %>%
  calculate(stat = "diff in means", order = c("yes", "no"))
## Warning: Removed 1219 rows containing missing values.
obs_diff
## Response: weight (numeric)
## Explanatory: physical_3plus (factor)
## # A tibble: 1 x 1
##    stat
##   <dbl>
## 1  1.77
null_dist <- yrbss %>%
  specify(weight ~ physical_3plus) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("yes", "no"))
## Warning: Removed 1219 rows containing missing values.
ggplot(data = null_dist, aes(x = stat)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Construct and record a confidence interval for the difference between the weights of those who exercise at least three times a week and those who don't, and interpret this interval in context of the data.

As explained here: https://cran.r-project.org/web/packages/infer/infer.pdf

I wasn't sure which calculation to use here, so I went with diff in means since the others gave errors:

calculate( x, stat = c("mean", "median", "sum", "sd", "prop", "count", "diff in means", "diff in medians", "diff in props", "Chisq", "F", "slope", "correlation", "t", "z", "ratio of props", "odds ratio"), order = NULL, ... )

yrbss %>%
  specify(weight ~ physical_3plus) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "diff in means") %>%
  get_ci(level = 0.95)
## Warning: Removed 1219 rows containing missing values.
## Warning: The statistic is based on a difference or ratio; by default, for
## difference-based statistics, the explanatory variable is subtracted in the
## order "no" - "yes", or divided in the order "no" / "yes" for ratio-based
## statistics. To specify this order yourself, supply `order = c("no", "yes")` to
## the calculate() function.
## # A tibble: 1 x 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1    -2.37    -1.13

inference(y = nc\(weight, x = nc\)habit, est = "mean", type = "ci", null = 0, alternative = "twosided", method = "theoretical", order = c("smoker","nonsmoker"))

Calculate a 95% confidence interval for the average height in meters (height) and interpret it in context.

The confidence interval range is 1.689427 to 1.693034 in this example without a seed (subject to change if re-run)

yrbss %>%
  specify(response = height) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean") %>%
  get_ci(level = 0.95)
## Warning: Removed 1004 rows containing missing values.
## # A tibble: 1 x 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1     1.69     1.69

Calculate a new confidence interval for the same parameter at the 90% confidence level. Comment on the width of this interval versus the one obtained in the previous exercise.

In this example without setting a seed, the range for the confidence interval is 1.689718 to 1.692773 which is about the same as before. This is subject to change, however.

yrbss %>%
  specify(response = height) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean") %>%
  get_ci(level = 0.90)
## Warning: Removed 1004 rows containing missing values.
## # A tibble: 1 x 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1     1.69     1.69

Conduct a hypothesis test evaluating whether the average height is different for those who exercise at least three times a week and those who don't.

Null Hypothesis: H0 - There is a difference in height between individuals who work out three times a week, versus individuals who don't. Alterantive Hypothesis: HA - Opposite of above (There is no difference in average height between individuals who work out three times a week and those who don't.)

obs_diff <- yrbss %>%
  specify(height ~ physical_3plus) %>%
  calculate(stat = "diff in means", order = c("yes", "no"))
## Warning: Removed 1219 rows containing missing values.
obs_diff
## Response: height (numeric)
## Explanatory: physical_3plus (factor)
## # A tibble: 1 x 1
##     stat
##    <dbl>
## 1 0.0376
null_dist <- yrbss %>%
  specify(height ~ physical_3plus) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("yes", "no"))
## Warning: Removed 1219 rows containing missing values.
ggplot(data = null_dist, aes(x = stat)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The obs_diff is about 0.037, which isn't represented in the histogram above.

Now, a non-inference task: Determine the number of different options there are in the dataset for the hours_tv_per_school_day there are.

yrbss %>%
  group_by(hours_tv_per_school_day) %>%
  summarise(n = n())
## # A tibble: 8 x 2
##   hours_tv_per_school_day     n
##   <chr>                   <int>
## 1 <1                       2168
## 2 1                        1750
## 3 2                        2705
## 4 3                        2139
## 5 4                        1048
## 6 5+                       1595
## 7 do not watch             1840
## 8 <NA>                      338

Come up with a research question evaluating the relationship between height or weight and sleep. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval. Report the statistical results, and also provide an explanation in plain language. Be sure to check all assumptions, state your ?? level, and conclude in context.

Is there a difference in height between individuals who have a higher average hours of sleep?

yrbss %>%
  group_by(school_night_hours_sleep) %>%
  summarise(n = n())
## # A tibble: 8 x 2
##   school_night_hours_sleep     n
##   <chr>                    <int>
## 1 <5                         965
## 2 10+                        316
## 3 5                         1480
## 4 6                         2658
## 5 7                         3461
## 6 8                         2692
## 7 9                          763
## 8 <NA>                      1248

Couldn't get this chunk to work, which prevents the rest of the code from running as well. Not sure which calculate to use for this hypothesis.

obs_diff <- yrbss %>% specify(height ~ school_night_hours_sleep) %>% calculate(stat = "diff in means", order = c("yes", "no"))

null_dist <- yrbss %>% specify(height ~ school_night_hours_sleep) %>% hypothesize(null = "independence") %>% generate(reps = 1000, type = "permute") %>% calculate(stat = "diff in means", order = c("yes", "no"))

ggplot(data = null_dist, aes(x = stat)) + geom_histogram()

yrbss %>% specify(weight ~ school_night_hours_sleep) %>% generate(reps = 1000, type = "bootstrap") %>% calculate(stat = "diff in means") %>% get_ci(level = 0.95)