library(tidyverse)
library(openintro)
library(infer)
library(skimr)data(yrbss)
?yrbss## starting httpd help server ... done
What are the cases in this data set? How many cases are there in our sample?
Ans. There are 13583 cases in our dataset.
glimpse(yrbss)## Rows: 13,583
## Columns: 13
## $ age <int> 14, 14, 15, 15, 15, 15, 15, 14, 15, 15, 15, 1~
## $ gender <chr> "female", "female", "female", "female", "fema~
## $ grade <chr> "9", "9", "9", "9", "9", "9", "9", "9", "9", ~
## $ hispanic <chr> "not", "not", "hispanic", "not", "not", "not"~
## $ race <chr> "Black or African American", "Black or Africa~
## $ height <dbl> NA, NA, 1.73, 1.60, 1.50, 1.57, 1.65, 1.88, 1~
## $ weight <dbl> NA, NA, 84.37, 55.79, 46.72, 67.13, 131.54, 7~
## $ helmet_12m <chr> "never", "never", "never", "never", "did not ~
## $ text_while_driving_30d <chr> "0", NA, "30", "0", "did not drive", "did not~
## $ physically_active_7d <int> 4, 2, 7, 0, 2, 1, 4, 4, 5, 0, 0, 0, 4, 7, 7, ~
## $ hours_tv_per_school_day <chr> "5+", "5+", "5+", "2", "3", "5+", "5+", "5+",~
## $ strength_training_7d <int> 0, 0, 0, 0, 1, 0, 2, 0, 3, 0, 3, 0, 0, 7, 7, ~
## $ school_night_hours_sleep <chr> "8", "6", "<5", "6", "9", "8", "9", "6", "<5"~
How many observations are we missing weights from?
Ans. We are missing weight from 1004 onbersavtions.
yrbss%>%
skim()| Name | Piped data |
| Number of rows | 13583 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| character | 8 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| gender | 12 | 1.00 | 4 | 6 | 0 | 2 | 0 |
| grade | 79 | 0.99 | 1 | 5 | 0 | 5 | 0 |
| hispanic | 231 | 0.98 | 3 | 8 | 0 | 2 | 0 |
| race | 2805 | 0.79 | 5 | 41 | 0 | 5 | 0 |
| helmet_12m | 311 | 0.98 | 5 | 12 | 0 | 6 | 0 |
| text_while_driving_30d | 918 | 0.93 | 1 | 13 | 0 | 8 | 0 |
| hours_tv_per_school_day | 338 | 0.98 | 1 | 12 | 0 | 7 | 0 |
| school_night_hours_sleep | 1248 | 0.91 | 1 | 3 | 0 | 7 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age | 77 | 0.99 | 16.16 | 1.26 | 12.00 | 15.00 | 16.00 | 17.00 | 18.00 | ▁▂▅▅▇ |
| height | 1004 | 0.93 | 1.69 | 0.10 | 1.27 | 1.60 | 1.68 | 1.78 | 2.11 | ▁▅▇▃▁ |
| weight | 1004 | 0.93 | 67.91 | 16.90 | 29.94 | 56.25 | 64.41 | 76.20 | 180.99 | ▆▇▂▁▁ |
| physically_active_7d | 273 | 0.98 | 3.90 | 2.56 | 0.00 | 2.00 | 4.00 | 7.00 | 7.00 | ▆▂▅▃▇ |
| strength_training_7d | 1176 | 0.91 | 2.95 | 2.58 | 0.00 | 0.00 | 3.00 | 5.00 | 7.00 | ▇▂▅▂▅ |
Make a side-by-side violin plots of physical_3plus and weight. Is there a relationship between these two variables? What did you expect and why?
Ans. We can see a relation between physical activity and weight. There are more outliers for the those who don’t exercise than those who do. This indicates a certain relation despite the clusters being fairly close to each other.
yrbss <- yrbss %>%
mutate(physical_3plus = if_else(physically_active_7d > 2, "yes", "no"))
ggplot(yrbss, aes(x=weight, y=physical_3plus)) + geom_boxplot() + theme_bw()## Warning: Removed 1004 rows containing non-finite values (stat_boxplot).
yrbss %>%
group_by(physical_3plus) %>%
summarise(mean_weight = mean(weight, na.rm = TRUE))## # A tibble: 3 x 2
## physical_3plus mean_weight
## <chr> <dbl>
## 1 no 66.7
## 2 yes 68.4
## 3 <NA> 69.9
Are all conditions necessary for inference satisfied? Comment on each. You can compute the group sizes with the summarize command above by defining a new variable with the definition n().
Ans. The conditions for inference are independence and normality. Since the data is gathered from individuals with no connection between them, the data is independent; and through the violin plots above, we know that the data is normally distributed.
yrbss %>%
group_by(physical_3plus)%>%
summarise(n=sum(table(weight)))## # A tibble: 3 x 2
## physical_3plus n
## <chr> <int>
## 1 no 4022
## 2 yes 8342
## 3 <NA> 215
Write the hypotheses for testing if the average weights are different for those who exercise at least times a week and those who don’t.
Ans. H_o = people who exercise three or more times a week have the same average weight as those who do not exercise three or more times a week. H_1 = people who exercise three or more times a week do not have the same average weight as those who do not exercise three or more times a week.
obs_diff <- yrbss %>%
specify(weight ~ physical_3plus) %>%
calculate(stat = "diff in means", order = c("yes", "no"))## Warning: Removed 1219 rows containing missing values.
null_dist <- yrbss %>%
specify(weight ~ physical_3plus) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in means", order = c("yes", "no"))## Warning: Removed 1219 rows containing missing values.
ggplot(data = null_dist, aes(x = stat)) +
geom_histogram()## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Add a vertical red line to the plot above, demonstrating where the observed difference in means (obs_diff) falls on the distribution.
visualize(null_dist) +
shade_p_value(obs_stat = obs_diff, direction = "two_sided")## Warning: F usually corresponds to right-tailed tests. Proceed with caution.
How many of these null_dist permutations have a difference at least as large (or larger) as obs_diff?
Ans. The red line is too far off from the measurements, therefore the answer is none. It can be seen through the code given below.
null_dist %>%
get_p_value(obs_stat = obs_diff, direction = "two_sided")## Warning: Please be cautious in reporting a p-value of 0. This result is an
## approximation based on the number of `reps` chosen in the `generate()` step. See
## `?get_p_value()` for more information.
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0
What warning message do you get? Why do you think you get this warning message?
Ans. The warning message highlights that the p-value might not be completely accurate, but it is small enough; however, one must be careful while using and reporting the given p value.
Construct and record a confidence interval for the difference between the weights of those who exercise at least three times a week and those who don’t, and interpret this interval in context of the data.
yrbss %>%
group_by(physical_3plus) %>%
summarise(sd_weight = sd(weight, na.rm = TRUE))## # A tibble: 3 x 2
## physical_3plus sd_weight
## <chr> <dbl>
## 1 no 17.6
## 2 yes 16.5
## 3 <NA> 17.6
yrbss %>%
group_by(physical_3plus) %>%
summarise(mean_weight = mean(weight, na.rm = TRUE))## # A tibble: 3 x 2
## physical_3plus mean_weight
## <chr> <dbl>
## 1 no 66.7
## 2 yes 68.4
## 3 <NA> 69.9
yrbss %>%
group_by(physical_3plus) %>%
summarise(n = sum(table(weight)))## # A tibble: 3 x 2
## physical_3plus n
## <chr> <int>
## 1 no 4022
## 2 yes 8342
## 3 <NA> 215
meanNA<-66.67389
meanA<-68.44847
sdNA<-17.63805
sdA<-16.47832
nNA<-4022
nA<-8342
z=1.96
upperCINA<-meanNA+z*(sdNA/sqrt(nNA))
lowerCINA<-meanNA-z*(sdNA/sqrt(nNA))
upperCIA<-meanA+z*(sdA/sqrt(nA))
lowerCIA<-meanA-z*(sdA/sqrt(nA))Through these calculations, the confidence interval comes to be: For those who are active: (68.0949,68.8021) For those who aren’t active: (66.1288, 67.2190)