Inference for Numerical Data

library(tidyverse)

## -- Attaching packages ------------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.3     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0

## -- Conflicts ---------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(openintro)

## Loading required package: airports

## Loading required package: cherryblossom

## Loading required package: usdata

library(infer)

The data

data(yrbss)
?yrbss

Exercise 1

What are the cases in this data set? How many cases are there in our sample?

Each case is one high school student who responded to the survey. There are 13,583 cases in the sample, and 13 variables are recorded for each.

glimpse(yrbss)

## Rows: 13,583
## Columns: 13
## $ age                      <int> 14, 14, 15, 15, 15, 15, 15, 14, 15, 15, 15...
## $ gender                   <chr> "female", "female", "female", "female", "f...
## $ grade                    <chr> "9", "9", "9", "9", "9", "9", "9", "9", "9...
## $ hispanic                 <chr> "not", "not", "hispanic", "not", "not", "n...
## $ race                     <chr> "Black or African American", "Black or Afr...
## $ height                   <dbl> NA, NA, 1.73, 1.60, 1.50, 1.57, 1.65, 1.88...
## $ weight                   <dbl> NA, NA, 84.37, 55.79, 46.72, 67.13, 131.54...
## $ helmet_12m               <chr> "never", "never", "never", "never", "did n...
## $ text_while_driving_30d   <chr> "0", NA, "30", "0", "did not drive", "did ...
## $ physically_active_7d     <int> 4, 2, 7, 0, 2, 1, 4, 4, 5, 0, 0, 0, 4, 7, ...
## $ hours_tv_per_school_day  <chr> "5+", "5+", "5+", "2", "3", "5+", "5+", "5...
## $ strength_training_7d     <int> 0, 0, 0, 0, 1, 0, 2, 0, 3, 0, 3, 0, 0, 7, ...
## $ school_night_hours_sleep <chr> "8", "6", "<5", "6", "9", "8", "9", "6", "...

Exploratory Variable Analysis

summary(yrbss$weight)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   29.94   56.25   64.41   67.91   76.20  180.99    1004

Exercise 2

How many observations are we missing weights from?

We are missing weights from 1004 observations.
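The `NA's` entry in the `summary()` output is a count of missing values, and the same count can be obtained with `is.na()`. A minimal sketch, assuming a short made-up vector of weights rather than the real column:

```r
# Made-up weights (kg) with two missing values; not taken from yrbss.
weight <- c(NA, 84.37, 55.79, NA, 67.13)

# is.na() returns TRUE/FALSE for each element; sum() counts the TRUEs.
n_missing  <- sum(is.na(weight))
n_observed <- sum(!is.na(weight))

# mean() returns NA unless the missing values are dropped explicitly.
mean_weight <- mean(weight, na.rm = TRUE)

cat("missing:", n_missing, " observed:", n_observed, "\n")
```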

yrbss <- yrbss %>% 
  mutate(physical_3plus = ifelse(physically_active_7d > 2, "yes", "no"))

Exercise 3

Make a side-by-side boxplot of physical_3plus and weight. Is there a relationship between these two variables? What did you expect and why?

boxplot(weight ~ physical_3plus, data = yrbss)

The weight distributions of the two groups are very similar, and both have many high outliers. A relationship between the two variables is expected because weight and physical activity affect each other, but the boxplot suggests that any difference between the groups is small.

yrbss %>%
  group_by(physical_3plus) %>%
  summarise(mean_weight = mean(weight, na.rm = TRUE))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 3 x 2
##   physical_3plus mean_weight
##   <chr>                <dbl>
## 1 no                    66.7
## 2 yes                   68.4
## 3 <NA>                  69.9
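The `<NA>` row in the table arises because `ifelse()` returns `NA` whenever its condition is `NA`, so students with a missing `physically_active_7d` form a group of their own. A small base-R sketch with made-up values (the vectors below are assumptions for illustration only):

```r
# Made-up data: days active in the last week and weight in kg.
active_days <- c(4L, 2L, 7L, NA, 0L)
weight      <- c(60, 70, 65, 72, NA)

# A missing condition yields a missing label.
physical_3plus <- ifelse(active_days > 2, "yes", "no")
# physical_3plus is "yes" "no" "yes" NA "no"

# addNA() turns NA into an explicit factor level so it is kept as a group.
group <- addNA(factor(physical_3plus))

# Mean weight per group, dropping missing weights within each group.
means <- tapply(weight, group, mean, na.rm = TRUE)
print(means)
```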

Inference

Exercise 4

Are all conditions necessary for inference satisfied? Comment on each. You can compute the group sizes with the summarize command above by defining a new variable with the definition n().

summary(yrbss$physical_3plus)

##    Length     Class      Mode 
##     13583 character character

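Independence: each observation comes from a different student and the sample is random, so the observations are independent within and between the two groups. Normality: the sample is large (13,583 students in total), so the normal condition can be applied even though the weight distribution has many high outliers. The conditions for inference are therefore met.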
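Calling `summary()` on a character vector reports only its length, class, and mode, so it does not give the group sizes needed for the sample-size check. The sketch below counts group sizes in base R on made-up data; in dplyr, adding `n = n()` inside `summarise()` after `group_by()` plays the same role. The vectors are assumptions for illustration.

```r
# Made-up group labels and weights (kg); not taken from yrbss.
physical_3plus <- c("yes", "no", "yes", NA, "yes", "no")
weight         <- c(60, 70, NA, 72, 65, 58)

# Size of each group, with missing labels counted as their own category.
group_sizes <- table(physical_3plus, useNA = "ifany")

# Sizes that matter for the test: rows where both variables are observed.
complete     <- !is.na(physical_3plus) & !is.na(weight)
usable_sizes <- table(physical_3plus[complete])

print(group_sizes)
print(usable_sizes)
```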

Exercise 5

Write the hypotheses for testing if the average weights are different for those who exercise at least three times a week and those who don’t.

$H_0: \mu_{yes} = \mu_{no}$
$H_A: \mu_{yes} \neq \mu_{no}$

obs_diff <- yrbss %>%
  specify(weight ~ physical_3plus) %>%
  calculate(stat = "diff in means", order = c("yes", "no"))

## Warning: Removed 1219 rows containing missing values.

null_dist <- yrbss %>%
  specify(weight ~ physical_3plus) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("yes", "no"))

## Warning: Removed 1219 rows containing missing values.

ggplot(data = null_dist, aes(x = stat)) +
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Exercise 6

How many of these null permutations have a difference of at least obs_stat?

None of the 1,000 null permutations appears to have a difference at least as large as the observed difference of about 1.77.

glimpse(obs_diff)

## Rows: 1
## Columns: 1
## $ stat <dbl> 1.774584

null_dist %>%
  get_p_value(obs_stat = obs_diff, direction = "two_sided")

## Warning: Please be cautious in reporting a p-value of 0. This result is an
## approximation based on the number of `reps` chosen in the `generate()` step. See
## `?get_p_value()` for more information.

## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1       0
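The permutation test can be unpacked into a few lines of base R. The sketch below is an illustration on simulated data, not a re-analysis of `yrbss`: group labels are shuffled, the difference in means is recomputed for each shuffle, and the two-sided p-value is taken as the proportion of shuffled differences at least as extreme as the observed one. All object names and simulated values are assumptions, and infer's own two-sided calculation may differ in detail from the absolute-value rule used here.

```r
set.seed(1)  # make the shuffles reproducible

# Simulated data: two groups drawn from the same distribution.
group <- rep(c("yes", "no"), times = c(60, 40))
y     <- rnorm(100, mean = 68, sd = 15)

# Difference in means, ordered as "yes" minus "no".
diff_in_means <- function(y, group) {
  mean(y[group == "yes"]) - mean(y[group == "no"])
}

obs_stat <- diff_in_means(y, group)

# Null distribution: shuffle the labels, which breaks any link to y.
reps       <- 1000
null_stats <- replicate(reps, diff_in_means(y, sample(group)))

# Two-sided p-value: share of shuffles at least as extreme as observed.
p_value <- mean(abs(null_stats) >= abs(obs_stat))

# With a finite number of shuffles, the smallest nonzero value is 1 / reps.
min_nonzero_p <- 1 / reps

cat("observed:", obs_stat, " p-value:", p_value, "\n")
```

A reported p-value of 0 from 1,000 permutations is therefore better read as "smaller than about 1 in 1,000", which is what the warning from `get_p_value()` points at.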

Exercise 10

Conduct a hypothesis test evaluating whether the average height is different for those who exercise at least three times a week and those who don’t.

$H_0: \mu_{exercise} = \mu_{no}$
$H_A: \mu_{exercise} \neq \mu_{no}$

obs_diff2 <- yrbss %>%
  specify(height ~ physical_3plus) %>%
  calculate(stat = "diff in means", order = c("yes", "no"))

## Warning: Removed 1219 rows containing missing values.

null_dist2 <- yrbss %>%
  specify(height ~ physical_3plus) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("yes", "no"))

## Warning: Removed 1219 rows containing missing values.

ggplot(data = null_dist2, aes(x = stat)) +
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

null_dist2 %>%
  get_p_value(obs_stat = obs_diff2, direction = "two_sided")

## Warning: Please be cautious in reporting a p-value of 0. This result is an
## approximation based on the number of `reps` chosen in the `generate()` step. See
## `?get_p_value()` for more information.

## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1       0

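The p-value is approximately 0 (none of the 1,000 permutations was as extreme as the observed difference), so there is very strong evidence to reject the null hypothesis. The average height differs between students who exercise at least three times a week and those who do not.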

Exercise 11

Now, a non-inference task: Determine the number of different options there are in the dataset for hours_tv_per_school_day.

glimpse(yrbss$hours_tv_per_school_day)

##  chr [1:13583] "5+" "5+" "5+" "2" "3" "5+" "5+" "5+" "5+" "do not watch" ...

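The values visible in the output include "2", "3", "5+", and "do not watch", so the options are categories rather than the numbers 0 through 5+. Because glimpse() prints only the first few values, the exact number of options has to come from a count of the distinct values in the column.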
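Counting the distinct values of a categorical column can be sketched as follows. The vector is made up in the style of `hours_tv_per_school_day` and is an assumption for illustration; its set of categories is not claimed to match the real data.

```r
# Made-up responses; NA marks a missing answer.
hours_tv <- c("5+", "2", "3", "do not watch", "5+", NA, "<1", "2")

# Distinct non-missing values and how many there are.
tv_options <- unique(hours_tv[!is.na(hours_tv)])
n_options  <- length(tv_options)

# Frequency of each option, with missing values counted separately.
tv_counts <- table(hours_tv, useNA = "ifany")

print(sort(tv_options))
print(tv_counts)
```

The same two calls applied to `yrbss$hours_tv_per_school_day` would give the number of options and how often each one occurs.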

Exercise 12

Come up with a research question evaluating the relationship between height or weight and sleep. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval. Report the statistical results, and also provide an explanation in plain language. Be sure to check all assumptions, state your α level, and conclude in context.

Research question: Is the average height different for students who sleep more than 6 hours on a school night and those who do not? The grouping is defined by the condition `school_night_hours_sleep > 6` in the code that follows.

$H_0: \mu_{sleep6} = \mu_{no}$
$H_A: \mu_{sleep6} \neq \mu_{no}$

# school_night_hours_sleep is a character column, so `> 6` is a string comparison
yrbss <- yrbss %>% 
  mutate(sleep = ifelse(school_night_hours_sleep > 6, "yes", "no"))
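Because `school_night_hours_sleep` is stored as text, a comparison such as `> 6` is carried out on strings rather than numbers. The sketch below illustrates the pitfall and one explicit alternative on made-up responses. The labels listed in `six_or_more` are assumptions for the example and should be checked against the distinct values actually present in the data.

```r
# Made-up responses; the category labels are assumptions for illustration.
sleep_hours <- c("8", "6", "<5", "10+", "7", NA, "5")

# String comparison: 6 is converted to "6" before comparing, so "10+" sorts
# before "6" (its first character is "1") and "6" is not greater than itself.
naive <- sleep_hours > 6

# Explicit recode: list the categories that count as "6 hours or more".
six_or_more <- c("6", "7", "8", "9", "10+")
sleep <- ifelse(sleep_hours %in% six_or_more, "yes", "no")

# %in% never returns NA, so restore the missing values afterwards.
sleep[is.na(sleep_hours)] <- NA

print(data.frame(sleep_hours, naive, sleep))
```

Listing the categories explicitly makes the grouping independent of how strings happen to sort, at the cost of having to keep the list in step with the data.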

obs_diff3 <- yrbss %>%
  specify(height ~ sleep) %>%
  calculate(stat = "diff in means", order = c("yes", "no"))

## Warning: Removed 2102 rows containing missing values.

null_dist3 <- yrbss %>%
  specify(height ~ sleep) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("yes", "no"))

## Warning: Removed 2102 rows containing missing values.

ggplot(data = null_dist3, aes(x = stat)) +
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

null_dist3 %>%
  get_p_value(obs_stat = obs_diff3, direction = "two_sided")

## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1    0.01

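The p-value of 0.01 is below the significance level of α = 0.05, so there is strong evidence to reject the null hypothesis.

In plain language: the average height of students who sleep more than 6 hours on a school night differs from that of students who do not, and a difference this large would be unlikely if sleep and height were unrelated.

Assumptions: the sample is random, the observations are independent because each comes from a different student, and the sample is large enough, so the conditions for inference are met.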