library(tidyverse)
library(openintro)
library(infer)

Exercise 1

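What are the cases in this data set? How many cases are there in our sample?

Each case is one high school student who answered the Youth Risk Behavior Surveillance System (YRBSS) survey about health-related behaviors. The sample has 13583 cases.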

data('yrbss', package='openintro')
?yrbss
glimpse(yrbss)
## Rows: 13,583
## Columns: 13
## $ age                      <int> 14, 14, 15, 15, 15, 15, 15, 14, 15, 15, 15, 1…
## $ gender                   <chr> "female", "female", "female", "female", "fema…
## $ grade                    <chr> "9", "9", "9", "9", "9", "9", "9", "9", "9", …
## $ hispanic                 <chr> "not", "not", "hispanic", "not", "not", "not"…
## $ race                     <chr> "Black or African American", "Black or Africa…
## $ height                   <dbl> NA, NA, 1.73, 1.60, 1.50, 1.57, 1.65, 1.88, 1…
## $ weight                   <dbl> NA, NA, 84.37, 55.79, 46.72, 67.13, 131.54, 7…
## $ helmet_12m               <chr> "never", "never", "never", "never", "did not …
## $ text_while_driving_30d   <chr> "0", NA, "30", "0", "did not drive", "did not…
## $ physically_active_7d     <int> 4, 2, 7, 0, 2, 1, 4, 4, 5, 0, 0, 0, 4, 7, 7, …
## $ hours_tv_per_school_day  <chr> "5+", "5+", "5+", "2", "3", "5+", "5+", "5+",…
## $ strength_training_7d     <int> 0, 0, 0, 0, 1, 0, 2, 0, 3, 0, 3, 0, 0, 7, 7, …
## $ school_night_hours_sleep <chr> "8", "6", "<5", "6", "9", "8", "9", "6", "<5"…
nrow(yrbss)
## [1] 13583
head(yrbss)
## # A tibble: 6 × 13
##     age gender grade hispanic race                      height weight helmet_12m
##   <int> <chr>  <chr> <chr>    <chr>                      <dbl>  <dbl> <chr>     
## 1    14 female 9     not      Black or African American  NA      NA   never     
## 2    14 female 9     not      Black or African American  NA      NA   never     
## 3    15 female 9     hispanic Native Hawaiian or Other…   1.73   84.4 never     
## 4    15 female 9     not      Black or African American   1.6    55.8 never     
## 5    15 female 9     not      Black or African American   1.5    46.7 did not r…
## 6    15 female 9     not      Black or African American   1.57   67.1 did not r…
## # ℹ 5 more variables: text_while_driving_30d <chr>, physically_active_7d <int>,
## #   hours_tv_per_school_day <chr>, strength_training_7d <int>,
## #   school_night_hours_sleep <chr>

Exercise 2

How many observations are we missing weights from?

Weight is missing for 1004 observations.

summary(yrbss$weight)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   29.94   56.25   64.41   67.91   76.20  180.99    1004
sum(is.na(yrbss)) # total missing values across all columns
## [1] 9476
sum(is.na(yrbss$weight)) # missing values in weight only
## [1] 1004

Exercise 3

Make a side-by-side boxplot of physical_3plus and weight. Is there a relationship between these two variables? What did you expect and why?

Yes, there appears to be a small relationship. Students who are physically active at least 3 days a week have a mean weight of 68.45 kg, compared with 66.67 kg for students who are not.

yrbss <- yrbss %>% 
  mutate(physical_3plus = ifelse(physically_active_7d > 2, "yes", "no"))

sum(is.na(yrbss$physical_3plus)) # number of missing values in physical_3plus
## [1] 273
ggplot(yrbss, aes(x=weight, y=physical_3plus)) + geom_boxplot() + theme_bw()
## Warning: Removed 1004 rows containing non-finite values (`stat_boxplot()`).

yrbss %>%
  group_by(physical_3plus) %>%
  summarise(mean_weight = mean(weight, na.rm = TRUE))
## # A tibble: 3 × 2
##   physical_3plus mean_weight
##   <chr>                <dbl>
## 1 no                    66.7
## 2 yes                   68.4
## 3 <NA>                  69.9
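# na.exclude() drops every row with a missing value in any column,
# not only rows missing weight or physical_3plus
yrbss2 <- yrbss %>% 
  mutate(physical_3plus = ifelse(physically_active_7d > 2, "yes", "no")) %>%
  na.exclude()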
ggplot(yrbss2, aes(x=weight, y=physical_3plus)) + geom_boxplot() + theme_bw()

Exercise 4

Are all conditions necessary for inference satisfied? Comment on each. You can compute the group sizes with the summarize command above by defining a new variable with the definition n().

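Yes, the conditions necessary for inference appear to be satisfied. Independence: the observations are assumed to be independent of one another. Normality: each group is large (4022 and 8342 students with a recorded weight), so the sample means can be treated as approximately normal, and no outliers are extreme enough to undermine that.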

# sample size per physical_3plus group, counting only students with a recorded weight
yrbss %>% 
  filter(!is.na(weight)) %>% 
  group_by(physical_3plus) %>% 
  summarise(n = n())
## # A tibble: 3 × 2
##   physical_3plus     n
##   <chr>          <int>
## 1 no              4022
## 2 yes             8342
## 3 <NA>             215

Exercise 5

Write the hypotheses for testing if the average weights are different for those who exercise at least three times a week and those who don’t.

Null hypothesis: students who are physically active 3 or more days a week have the same average weight as those who are not.

Alternative hypothesis: students who are physically active 3 or more days a week have a different average weight from those who are not.
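
The same hypotheses can be written in symbols. The notation here is introduced only for illustration: \mu_{yes} and \mu_{no} stand for the population mean weights of students who are and are not active at least 3 days a week.

```latex
H_0: \mu_{\text{yes}} - \mu_{\text{no}} = 0
\qquad \text{versus} \qquad
H_A: \mu_{\text{yes}} - \mu_{\text{no}} \neq 0
```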

Exercise 6

How many of these null permutations have a difference of at least obs_stat?

set.seed(10000)
yrbss%>%
  observe(weight ~ NULL, stat = "mean")
## Warning: Removed 1004 rows containing missing values.
## Response: weight (numeric)
## # A tibble: 1 × 1
##    stat
##   <dbl>
## 1  67.9
# overall mean weight (a one-sample statistic, not a difference between groups)
obs_mean <- yrbss %>%
  specify(response = weight) %>%
  calculate(stat = "mean")
## Warning: Removed 1004 rows containing missing values.
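
The code above computes the overall mean weight, whereas the question concerns the difference in means between the two groups. A minimal sketch of the permutation approach with infer follows, assuming the yrbss data from openintro are available; the names yrbss_complete, null_dist, and n_extreme and the choice of 1000 permutations are illustrative assumptions, not part of the original work.

```r
library(dplyr)
library(tidyr)
library(infer)

data("yrbss", package = "openintro")

# Recreate the grouping variable and keep rows where both variables are observed
yrbss_complete <- yrbss %>%
  mutate(physical_3plus = ifelse(physically_active_7d > 2, "yes", "no")) %>%
  drop_na(physical_3plus, weight)

# Observed difference in mean weight (active minus not active)
obs_diff <- yrbss_complete %>%
  specify(weight ~ physical_3plus) %>%
  calculate(stat = "diff in means", order = c("yes", "no"))

# Null distribution: shuffle the group labels 1000 times (assumed number of reps)
set.seed(10000)
null_dist <- yrbss_complete %>%
  specify(weight ~ physical_3plus) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("yes", "no"))

# Number of permuted differences at least as extreme as the observed one
n_extreme <- null_dist %>%
  filter(abs(stat) >= abs(obs_diff$stat)) %>%
  nrow()
n_extreme

# Two-sided p-value from the same null distribution
null_dist %>%
  get_p_value(obs_stat = obs_diff, direction = "two_sided")
```

The count in n_extreme answers the question directly, and dividing it by the number of permutations gives the proportion behind the reported p-value.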

Exercise 7

Construct and record a confidence interval for the difference between the weights of those who exercise at least three times a week and those who don’t, and interpret this interval in context of the data.

# standard deviation
yrbss %>% 
  group_by(physical_3plus) %>% 
  summarise(sd_weight = sd(weight, na.rm = TRUE))
## # A tibble: 3 × 2
##   physical_3plus sd_weight
##   <chr>              <dbl>
## 1 no                  17.6
## 2 yes                 16.5
## 3 <NA>                17.6
# mean
yrbss %>% 
  group_by(physical_3plus) %>% 
  summarise(mean_weight = mean(weight, na.rm = TRUE))
## # A tibble: 3 × 2
##   physical_3plus mean_weight
##   <chr>                <dbl>
## 1 no                    66.7
## 2 yes                   68.4
## 3 <NA>                  69.9
# sample size: n per group, counting only students with a recorded weight
yrbss %>% 
  filter(!is.na(weight)) %>% 
  group_by(physical_3plus) %>% 
  summarise(n = n())
## # A tibble: 3 × 2
##   physical_3plus     n
##   <chr>          <int>
## 1 no              4022
## 2 yes             8342
## 3 <NA>             215
mean_not_active <- 66.67389
n_not_active <- 4022
sd_not_active <- 17.63805
mean_active <- 68.44847
n_active <- 8342
sd_active <- 16.47832
z <- 1.96 # critical value for a 95% interval under the normal approximation
upper_ci_not_act <- mean_not_active + z*(sd_not_active/sqrt(n_not_active))
lower_ci_not_act <- mean_not_active - z*(sd_not_active/sqrt(n_not_active))
upper_ci_act <- mean_active + z*(sd_active/sqrt(n_active))
lower_ci_act <- mean_active - z*(sd_active/sqrt(n_active))
c("Those not active:", lower_ci_not_act, upper_ci_not_act)
## [1] "Those not active:" "66.1287781694363"  "67.2190018305637"
c("Those active:", lower_ci_act, upper_ci_act)
## [1] "Those active:"    "68.0948523684916" "68.8020876315084"
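
The intervals above describe each group separately, while the exercise asks for an interval for the difference between the two means. A minimal sketch of that calculation, assuming a normal approximation with an unpooled standard error and reusing the summary values recorded above (the names diff_means, se_diff, and ci_diff are illustrative):

```r
# Summary statistics copied from the output above
mean_active     <- 68.44847
sd_active       <- 16.47832
n_active        <- 8342
mean_not_active <- 66.67389
sd_not_active   <- 17.63805
n_not_active    <- 4022

# Point estimate: active minus not active
diff_means <- mean_active - mean_not_active

# Standard error of the difference between two independent means (unpooled)
se_diff <- sqrt(sd_active^2 / n_active + sd_not_active^2 / n_not_active)

# 95% interval using the normal critical value
z <- qnorm(0.975)
ci_diff <- c(lower = diff_means - z * se_diff,
             upper = diff_means + z * se_diff)
ci_diff
```

An interval that lies entirely above zero would be consistent with active students weighing more on average than students who are not active.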

Exercise 8

Calculate a 95% confidence interval for the average height in meters (height) and interpret it in context.

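The 95% confidence interval for the average height is 1.6894 m to 1.6931 m. We are 95% confident that the mean height of the high school students this sample represents falls in that range.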

summary(yrbss$height)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.270   1.600   1.680   1.691   1.780   2.110    1004
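# keep only the observed heights, as a numeric vector
height <- yrbss %>%
  filter(!is.na(height)) %>%
  pull(height)

n <- length(height)
df <- n - 1

x_bar <- mean(height)
s <- sd(height)               # sample standard deviation
SE <- s / sqrt(n)             # standard error of the mean
t_star <- qt(0.975, df = df)  # critical value for a 95% interval

bot <- x_bar - t_star * SE
top <- x_bar + t_star * SE

c("the 95% confidence interval is ", bot, " to ", top)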
## [1] "the 95% confidence interval is " "1.68941116354256"               
## [3] " to "                            "1.69307075075905"
yrbss<-as.data.frame(yrbss)
ggplot(yrbss, aes(x=height))+
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1004 rows containing non-finite values (`stat_bin()`).

Exercise 10

Conduct a hypothesis test evaluating whether the average height is different for those who exercise at least three times a week and those who don’t.

# standard deviation
yrbss %>% 
  group_by(physical_3plus) %>% 
  summarise(sd_height = sd(height, na.rm = TRUE))
## # A tibble: 3 × 2
##   physical_3plus sd_height
##   <chr>              <dbl>
## 1 no                 0.103
## 2 yes                0.103
## 3 <NA>               0.107
# mean
yrbss %>% 
  group_by(physical_3plus) %>% 
  summarise(mean_height = mean(height, na.rm = TRUE))
## # A tibble: 3 × 2
##   physical_3plus mean_height
##   <chr>                <dbl>
## 1 no                    1.67
## 2 yes                   1.70
## 3 <NA>                  1.71
# sample size: n per group, counting only students with a recorded height
yrbss %>% 
  filter(!is.na(height)) %>% 
  group_by(physical_3plus) %>% 
  summarise(n = n())
## # A tibble: 3 × 2
##   physical_3plus     n
##   <chr>          <int>
## 1 no              4022
## 2 yes             8342
## 3 <NA>             215
mean_not_active <- 1.6665
n_not_active <- 4022
sd_not_active <- 0.1029
mean_active <- 1.7032
n_active <- 8342
sd_active <- 0.1033
z <- 1.96 # critical value for a 95% interval under the normal approximation
upper_ci_not_act <- mean_not_active + z*(sd_not_active/sqrt(n_not_active))
lower_ci_not_act <- mean_not_active - z*(sd_not_active/sqrt(n_not_active))
upper_ci_act <- mean_active + z*(sd_active/sqrt(n_active))
lower_ci_act <- mean_active - z*(sd_active/sqrt(n_active))
c("Those not active:", lower_ci_not_act, upper_ci_not_act)
## [1] "Those not active:" "1.66331982943891"  "1.66968017056109"
c("Those active:", lower_ci_act, upper_ci_act)
## [1] "Those active:"    "1.70098322660715" "1.70541677339285"

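The 95% confidence interval for the mean height of students who are not active is 1.6633 m to 1.6697 m, and for active students it is 1.7010 m to 1.7054 m. The two intervals do not overlap, which suggests that the average height differs between the groups.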
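
The exercise asks for a hypothesis test, so the group summaries can also be turned into a test statistic. The following is a sketch of a Welch two-sample t-test computed from the rounded summary values above; it is a swapped-in illustration, not the calculation performed in the original work, and the names t_stat, df_welch, and p_value are assumptions.

```r
# Summary statistics copied from the output above
mean_active     <- 1.7032
sd_active       <- 0.1033
n_active        <- 8342
mean_not_active <- 1.6665
sd_not_active   <- 0.1029
n_not_active    <- 4022

# Per-group variance of the sample mean
v_active     <- sd_active^2 / n_active
v_not_active <- sd_not_active^2 / n_not_active

# Test statistic for H0: the two population mean heights are equal
t_stat <- (mean_active - mean_not_active) / sqrt(v_active + v_not_active)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (v_active + v_not_active)^2 /
  (v_active^2 / (n_active - 1) + v_not_active^2 / (n_not_active - 1))

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

c(t = t_stat, df = df_welch, p = p_value)
```

A p-value below the chosen α level would lead to rejecting the null hypothesis of equal average heights.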

Exercise 11

Now, a non-inference task: Determine the number of different options there are in the dataset for hours_tv_per_school_day.

yrbss%>%
  filter(!is.na(hours_tv_per_school_day))%>%
  select(hours_tv_per_school_day)%>%
  unique()
##    hours_tv_per_school_day
## 1                       5+
## 4                        2
## 5                        3
## 10            do not watch
## 12                      <1
## 14                       4
## 19                       1
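
The listing above shows the distinct options; the count itself can be obtained directly. This sketch uses a small stand-in vector (hypothetical, built from the options listed above plus a missing value and two repeats) so that it runs without the full data set.

```r
library(dplyr)

# Hypothetical stand-in for yrbss$hours_tv_per_school_day
tv <- c("5+", "2", "3", "do not watch", "<1", "4", "1", NA, "5+", "2")

# Count distinct non-missing options
n_options <- n_distinct(tv, na.rm = TRUE)
n_options
```

On the real data the same call would be n_distinct(yrbss$hours_tv_per_school_day, na.rm = TRUE).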

Exercise 12

Come up with a research question evaluating the relationship between height or weight and sleep. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval. Report the statistical results, and also provide an explanation in plain language. Be sure to check all assumptions, state your α level, and conclude in context.

library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
unique(yrbss$school_night_hours_sleep)
## [1] "8"   "6"   "<5"  "9"   "10+" "7"   "5"   NA
ggplot(yrbss, aes(x = weight, y = school_night_hours_sleep)) + geom_boxplot()
## Warning: Removed 1004 rows containing non-finite values (`stat_boxplot()`).

desc <- describeBy(yrbss$weight, yrbss$school_night_hours_sleep, mat=TRUE)[,c(2,4,5,6)]
desc$Var <- desc$sd^2
print(desc, row.names=FALSE)
##  group1    n     mean       sd      Var
##      <5  859 70.29700 19.47970 379.4586
##     10+  255 69.29251 19.92961 397.1895
##       5 1378 68.41806 17.47753 305.4639
##       6 2496 68.33318 17.12553 293.2838
##       7 3283 67.43457 16.12185 259.9140
##       8 2505 67.45745 16.52393 273.0401
##       9  705 65.55898 15.87743 252.0929
# note: this model groups weight by hours_tv_per_school_day, not by school_night_hours_sleep
aov.out <- aov(weight ~ hours_tv_per_school_day, data = yrbss)
summary(aov.out)
##                            Df  Sum Sq Mean Sq F value  Pr(>F)    
## hours_tv_per_school_day     6   20160    3360   11.83 2.8e-13 ***
## Residuals               12301 3492964     284                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1275 observations deleted due to missingness

The boxplot shows that the medians appear similar across the sleep groups, with some subtle variation. The groups also have similar IQRs.

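The ANOVA gives F = 11.83 with a p-value of 2.8e-13, far below any conventional α level such as 0.05. This is strong evidence against the null hypothesis that the mean weight is the same in every group, in favor of the alternative that at least one group mean differs. Note that the fitted model groups students by hours_tv_per_school_day, so the result says that average weight differs across TV-watching groups; it does not test the sleep groups summarized in the table above. Because the data are observational, this is an association and does not show that one variable causes the other.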
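
Because the fitted ANOVA uses hours_tv_per_school_day, a one-way ANOVA for the sleep groups can be sketched from the per-group summary table printed above. This is an illustration under the usual ANOVA assumptions (independent observations, roughly equal variances); the data frame sleep_summary and the other names are assumptions, and the values are copied from the describeBy output.

```r
# Per-group n, mean, and sd of weight by school_night_hours_sleep (from the table above)
sleep_summary <- data.frame(
  group       = c("<5", "10+", "5", "6", "7", "8", "9"),
  n           = c(859, 255, 1378, 2496, 3283, 2505, 705),
  mean_weight = c(70.29700, 69.29251, 68.41806, 68.33318, 67.43457, 67.45745, 65.55898),
  sd_weight   = c(19.47970, 19.92961, 17.47753, 17.12553, 16.12185, 16.52393, 15.87743)
)

k <- nrow(sleep_summary)    # number of groups
N <- sum(sleep_summary$n)   # total number of students with a recorded weight and sleep group

# Weighted grand mean of weight
grand_mean <- sum(sleep_summary$n * sleep_summary$mean_weight) / N

# Between-group and within-group sums of squares
ss_between <- sum(sleep_summary$n * (sleep_summary$mean_weight - grand_mean)^2)
ss_within  <- sum((sleep_summary$n - 1) * sleep_summary$sd_weight^2)

# F statistic and its upper-tail p-value
f_stat  <- (ss_between / (k - 1)) / (ss_within / (N - k))
p_value <- pf(f_stat, df1 = k - 1, df2 = N - k, lower.tail = FALSE)

c(F = f_stat, df1 = k - 1, df2 = N - k, p = p_value)
```

Fitting aov(weight ~ school_night_hours_sleep, data = yrbss) on the full data should give essentially the same F statistic, up to rounding in the summary values.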

