library(tidyverse)
library(openintro)
library(infer)
library(ggplot2)
library(dplyr)
data('yrbss', package='openintro')
There are 13,583 cases and 13 variables in the data set.
glimpse(yrbss)
## Rows: 13,583
## Columns: 13
## $ age <int> 14, 14, 15, 15, 15, 15, 15, 14, 15, 15, 15, 1~
## $ gender <chr> "female", "female", "female", "female", "fema~
## $ grade <chr> "9", "9", "9", "9", "9", "9", "9", "9", "9", ~
## $ hispanic <chr> "not", "not", "hispanic", "not", "not", "not"~
## $ race <chr> "Black or African American", "Black or Africa~
## $ height <dbl> NA, NA, 1.73, 1.60, 1.50, 1.57, 1.65, 1.88, 1~
## $ weight <dbl> NA, NA, 84.37, 55.79, 46.72, 67.13, 131.54, 7~
## $ helmet_12m <chr> "never", "never", "never", "never", "did not ~
## $ text_while_driving_30d <chr> "0", NA, "30", "0", "did not drive", "did not~
## $ physically_active_7d <int> 4, 2, 7, 0, 2, 1, 4, 4, 5, 0, 0, 0, 4, 7, 7, ~
## $ hours_tv_per_school_day <chr> "5+", "5+", "5+", "2", "3", "5+", "5+", "5+",~
## $ strength_training_7d <int> 0, 0, 0, 0, 1, 0, 2, 0, 3, 0, 3, 0, 0, 7, 7, ~
## $ school_night_hours_sleep <chr> "8", "6", "<5", "6", "9", "8", "9", "6", "<5"~
There are 1,004 missing weight values in this data set.
summary(yrbss$weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 29.94 56.25 64.41 67.91 76.20 180.99 1004
The median weight for students who are physically active at least three days a week is higher than for those who are not. The quartiles for the “yes” group are spread more evenly than for the “no” group. Also, there are fewer outliers in the “yes” group than in the “no” group.
yrbss <- yrbss %>%
  mutate(physical_3plus = ifelse(physically_active_7d > 2, "yes", "no"))
ggplot(yrbss, aes(x = weight, y = physical_3plus)) + geom_boxplot()
The conditions for inference appear to be satisfied: physical_3plus has three categories (yes, no, and NA, which we ignore), both remaining groups are large, and the box plots look roughly symmetric.
yrbss %>%
  group_by(physical_3plus) %>%
  summarise(mean_weight = mean(weight, na.rm = TRUE))
## # A tibble: 3 x 2
## physical_3plus mean_weight
## <chr> <dbl>
## 1 no 66.7
## 2 yes 68.4
## 3 <NA> 69.9
count <- yrbss %>% count(physical_3plus)
print(count)
## # A tibble: 3 x 2
## physical_3plus n
## <chr> <int>
## 1 no 4404
## 2 yes 8906
## 3 <NA> 273
H0: There is no difference in mean weight between students who are physically active 3+ days a week and those who are not.
HA: There is a difference in mean weight between students who are physically active 3+ days a week and those who are not.
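None. Not one of the 1,000 permuted differences is at least as large as the observed difference (obs_diff, about 1.7 kg based on the group means 68.4 and 66.7). Under the null hypothesis the permuted differences cluster around zero.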
obs_diff <- yrbss %>%
  specify(weight ~ physical_3plus) %>%
  calculate(stat = "diff in means", order = c("yes", "no"))
null_dist <- yrbss %>%
  specify(weight ~ physical_3plus) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("yes", "no"))
ggplot(data = null_dist, aes(x = stat)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
null_dist %>% filter(stat >= obs_diff)
## Response: weight (numeric)
## Explanatory: physical_3plus (factor)
## Null Hypothesis: independence
## # A tibble: 0 x 2
## # ... with 2 variables: replicate <int>, stat <dbl>
null_dist %>%
  get_p_value(obs_stat = obs_diff, direction = "two_sided")
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0
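To make the permutation steps above more concrete, here is a minimal base-R sketch of the same idea. The weights are made-up toy values, not taken from yrbss, and the helper `diff_in_means()` is a hypothetical name introduced only for this illustration. The p-value uses the (count + 1) / (reps + 1) convention, which is one common way to avoid reporting exactly 0 from a finite number of permutations; infer's own two-sided calculation may differ in detail.

```r
# Toy data: made-up weights (kg) for two hypothetical groups, NOT taken from yrbss
set.seed(1000)
toy <- data.frame(
  weight = c(rnorm(40, mean = 68, sd = 10), rnorm(40, mean = 66, sd = 10)),
  group  = rep(c("yes", "no"), each = 40)
)

# Hypothetical helper: difference in group means, ordered "yes" minus "no"
diff_in_means <- function(weight, group) {
  mean(weight[group == "yes"]) - mean(weight[group == "no"])
}

# Observed statistic on the unshuffled data
obs_stat <- diff_in_means(toy$weight, toy$group)

# Null distribution: shuffle the group labels, which breaks any link
# between group and weight, and recompute the statistic each time
reps <- 1000
null_stats <- replicate(reps, diff_in_means(toy$weight, sample(toy$group)))

# Simple two-sided p-value: share of shuffled statistics at least as
# extreme as the observed one, with +1 so the result is never exactly 0
n_extreme <- sum(abs(null_stats) >= abs(obs_stat))
p_value <- (n_extreme + 1) / (reps + 1)
p_value
```

With 1,000 permutations, the smallest p-value this convention can report is 1/1001, so a printed p-value of 0 is better read as "smaller than about 0.001".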
Let’s compute a confidence interval. The 95% confidence interval for the proportion of students who are physically active at least three days a week is (0.661, 0.678).
yrbss %>%
  specify(response = physical_3plus, success = "yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.95)
## # A tibble: 1 x 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 0.661 0.678
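As a rough cross-check of the bootstrap interval above, the same percentile bootstrap can be sketched in base R. The vector below is rebuilt from the group counts reported earlier (8,906 "yes" and 4,404 "no", leaving out the 273 NA rows); the variable names are illustrative only, and the exact endpoints depend on the random seed.

```r
set.seed(1000)

# Rebuild the non-missing responses from the counts shown earlier
active <- rep(c("yes", "no"), times = c(8906, 4404))

# Sample proportion of "yes"
p_hat <- mean(active == "yes")

# Resample with replacement 1,000 times and record the proportion each time
boot_props <- replicate(1000, mean(sample(active, replace = TRUE) == "yes"))

# Percentile interval: the middle 95% of the bootstrap proportions
ci <- quantile(boot_props, probs = c(0.025, 0.975))
ci
```

The interval should land close to the one infer reports, since both take the middle 95% of a bootstrap distribution of the sample proportion.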
The sample mean height is 1.697054 m. To compute a confidence interval for the average height, let's use infer. The 95% confidence interval is (1.69, 1.70).
# Note: na.omit() drops every row that has an NA in any column, not only rows missing height
yrbss <- yrbss %>% na.omit()
mean(yrbss$height)
## [1] 1.697054
set.seed(1000)
yrbss %>%
  specify(response = height) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean") %>%
  get_ci(level = 0.95)
## # A tibble: 1 x 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 1.69 1.70
At the 90% level the confidence interval gets narrower: (1.695216, 1.698853), which prints as (1.70, 1.70) after rounding.
yrbss %>%
  specify(response = height) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean") %>%
  get_ci(level = 0.90)
## # A tibble: 1 x 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 1.70 1.70
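One way to see why the 90% interval is narrower is to compare the normal-approximation multipliers for the two levels. This is only an approximation added for intuition: the bootstrap percentile intervals above do not use these multipliers directly, and the variable names are illustrative.

```r
# Critical values for a two-sided interval under a normal approximation
z_95 <- qnorm(0.975)  # about 1.96
z_90 <- qnorm(0.95)   # about 1.64

# Ratio of the 90% interval width to the 95% interval width
width_ratio <- z_90 / z_95
width_ratio
```

A ratio below 1 means the 90% interval covers less of the sampling distribution, so it is shorter but captures the true mean less often.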
H0: There is no difference in average height between students who exercise at least three times a week and those who don’t.
HA: There is a difference in average height between students who exercise at least three times a week and those who don’t.
obs_diff <- yrbss %>%
  specify(height ~ physical_3plus) %>%
  calculate(stat = "diff in means", order = c("yes", "no"))
null_dist <- yrbss %>%
  specify(height ~ physical_3plus) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("yes", "no"))
ggplot(data = null_dist, aes(x = stat)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
There are seven response options in the hours_tv_per_school_day column.
count <- yrbss %>% count(hours_tv_per_school_day)
print(count)
## # A tibble: 7 x 2
## hours_tv_per_school_day n
## <chr> <int>
## 1 <1 1407
## 2 1 1172
## 3 2 1738
## 4 3 1309
## 5 4 627
## 6 5+ 966
## 7 do not watch 1132
H0: Mean height is the same across all categories of hours of sleep on school nights.
HA: Mean height differs for at least one category of hours of sleep on school nights.
yrbss %>%
  group_by(school_night_hours_sleep) %>%
  summarise(mean_height = mean(height, na.rm = TRUE))
## # A tibble: 7 x 2
## school_night_hours_sleep mean_height
## <chr> <dbl>
## 1 <5 1.69
## 2 10+ 1.69
## 3 5 1.69
## 4 6 1.70
## 5 7 1.70
## 6 8 1.70
## 7 9 1.69
ggplot(yrbss, aes(x = height, y = school_night_hours_sleep)) + geom_boxplot()
obs_diff <- yrbss %>%
  specify(height ~ physical_3plus) %>%
  calculate(stat = "diff in means", order = c("yes", "no"))
null_dist %>%
  get_p_value(obs_stat = obs_diff, direction = "two_sided")
## Warning: Please be cautious in reporting a p-value of 0. This result is an
## approximation based on the number of `reps` chosen in the `generate()` step. See
## `?get_p_value()` for more information.
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0
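The sleep hypotheses above compare mean height across seven categories, whereas a difference in means compares only two groups. A permutation test on the F statistic is one possible way to test them, sketched below with infer. This is an illustration and not part of the original analysis: the object names `sleep_height`, `obs_f`, and `null_f` are made up for the example, and rows are filtered only on the two columns used, so the sample differs from the na.omit() version above.

```r
library(dplyr)
library(infer)
data("yrbss", package = "openintro")

# Keep rows where both height and the sleep category are recorded
sleep_height <- yrbss %>%
  filter(!is.na(height), !is.na(school_night_hours_sleep))

# Observed F statistic for height across the sleep categories
obs_f <- sleep_height %>%
  specify(height ~ school_night_hours_sleep) %>%
  calculate(stat = "F")

# Null distribution: permute the sleep labels and recompute F each time
set.seed(1000)
null_f <- sleep_height %>%
  specify(height ~ school_night_hours_sleep) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "F")

# Only large F values count as evidence against H0, so the test is one-sided
null_f %>%
  get_p_value(obs_stat = obs_f, direction = "greater")
```

A small p-value here would suggest that at least one sleep category has a different mean height. It would not say which one, so a follow-up comparison would still be needed.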