What are the cases in this data set? How many cases are there in our sample?
## Rows: 13,583
## Columns: 13
## $ age <int> 14, 14, 15, 15, 15, 15, 15, 14, 15, 15, 15...
## $ gender <chr> "female", "female", "female", "female", "f...
## $ grade <chr> "9", "9", "9", "9", "9", "9", "9", "9", "9...
## $ hispanic <chr> "not", "not", "hispanic", "not", "not", "n...
## $ race <chr> "Black or African American", "Black or Afr...
## $ height <dbl> NA, NA, 1.73, 1.60, 1.50, 1.57, 1.65, 1.88...
## $ weight <dbl> NA, NA, 84.37, 55.79, 46.72, 67.13, 131.54...
## $ helmet_12m <chr> "never", "never", "never", "never", "did n...
## $ text_while_driving_30d <chr> "0", NA, "30", "0", "did not drive", "did ...
## $ physically_active_7d <int> 4, 2, 7, 0, 2, 1, 4, 4, 5, 0, 0, 0, 4, 7, ...
## $ hours_tv_per_school_day <chr> "5+", "5+", "5+", "2", "3", "5+", "5+", "5...
## $ strength_training_7d <int> 0, 0, 0, 0, 1, 0, 2, 0, 3, 0, 3, 0, 0, 7, ...
## $ school_night_hours_sleep <chr> "8", "6", "<5", "6", "9", "8", "9", "6", "...
| Name | Piped data |
| Number of rows | 13583 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| character | 8 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| gender | 12 | 1.00 | 4 | 6 | 0 | 2 | 0 |
| grade | 79 | 0.99 | 1 | 5 | 0 | 5 | 0 |
| hispanic | 231 | 0.98 | 3 | 8 | 0 | 2 | 0 |
| race | 2805 | 0.79 | 5 | 41 | 0 | 5 | 0 |
| helmet_12m | 311 | 0.98 | 5 | 12 | 0 | 6 | 0 |
| text_while_driving_30d | 918 | 0.93 | 1 | 13 | 0 | 8 | 0 |
| hours_tv_per_school_day | 338 | 0.98 | 1 | 12 | 0 | 7 | 0 |
| school_night_hours_sleep | 1248 | 0.91 | 1 | 3 | 0 | 7 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age | 77 | 0.99 | 16.16 | 1.26 | 12.00 | 15.00 | 16.00 | 17.00 | 18.00 | ▁▂▅▅▇ |
| height | 1004 | 0.93 | 1.69 | 0.10 | 1.27 | 1.60 | 1.68 | 1.78 | 2.11 | ▁▅▇▃▁ |
| weight | 1004 | 0.93 | 67.91 | 16.90 | 29.94 | 56.25 | 64.41 | 76.20 | 180.99 | ▆▇▂▁▁ |
| physically_active_7d | 273 | 0.98 | 3.90 | 2.56 | 0.00 | 2.00 | 4.00 | 7.00 | 7.00 | ▆▂▅▃▇ |
| strength_training_7d | 1176 | 0.91 | 2.95 | 2.58 | 0.00 | 0.00 | 3.00 | 5.00 | 7.00 | ▇▂▅▂▅ |
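Each case (row) is one student who responded to the survey, and there are 13,583 cases in the sample. The 13 columns are the variables recorded for each student: age, gender, grade, whether the student is Hispanic, race, height, weight, how often they wore a helmet when biking in the past 12 months, how many days they texted while driving in the past 30 days, how many days they were physically active in the past 7 days, hours of TV watched per school day, how many days they did strength training in the past 7 days, and hours of sleep on a school night.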
How many observations are we missing weights from?
## [1] 9476
According to the n_missing column of the skim() summary, 1,004 observations are missing a weight.
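The same count can be checked directly rather than read off a summary table. The sketch below uses a short, made-up `weight` vector (not the survey data) to show how `is.na()` gives both the number of missing values and the complete rate that skim() reports.

```r
# Hypothetical weights in kg; NA marks a missing response (not the real data)
weight <- c(NA, NA, 84.37, 55.79, 46.72, NA, 67.13)

# is.na() returns TRUE/FALSE, so sum() counts the missing values
n_missing <- sum(is.na(weight))

# The share of non-missing values corresponds to skim()'s complete_rate
complete_rate <- mean(!is.na(weight))

print(n_missing)
print(complete_rate)
```

On the real data the equivalent call would be `sum(is.na(yrbss$weight))`, assuming the data frame is loaded under that name.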
Make side-by-side violin plots of physical_3plus and weight. Is there a relationship between these two variables? What did you expect and why?
# Insert code for Exercise 3 here
yrbss2 <- yrbss %>%
  mutate(physical_3plus = ifelse(physically_active_7d >= 3, "yes", "no")) %>%
  na.exclude()  # drops every row that has an NA in any column
ggplot(yrbss2, aes(x = weight, y = physical_3plus)) + geom_violin()
Students who were active at least three days a week appear to weigh slightly more on average than those who were less active. I expected the opposite, since people who are more active tend to be leaner. However, it might make sense: more active students may carry more muscle, and muscle is denser than fat.
Are all conditions necessary for inference satisfied? Comment on each. You can compute the group sizes with the summarize command above by defining a new variable with the definition n().
# Insert code for Exercise 4 here
yrbss <- yrbss %>%
  mutate(physical_3plus = if_else(physically_active_7d > 2, "yes", "no"))
# Group sizes: number of non-missing weights in each group
yrbss %>%
  group_by(physical_3plus) %>%
  summarise(n = sum(!is.na(weight)))
## # A tibble: 3 x 2
## physical_3plus n
## <chr> <int>
## 1 no 4022
## 2 yes 8342
## 3 <NA> 215
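Independence: the students appear to be a random sample, so the observations can reasonably be treated as independent of one another, both within and between the two groups. Sample size: with 4,022 and 8,342 non-missing weights in the two groups, the samples are large enough for the sampling distribution of each group mean to be approximately normal, even though the violin plots show long right tails.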
Write the hypotheses for testing if the average weights are different for those who exercise at least three times a week and those who don’t.
# Insert code for Exercise 5 here
obs_diff <- yrbss %>%
  specify(weight ~ physical_3plus) %>%
  calculate(stat = "diff in means", order = c("yes", "no"))
## Warning: Removed 1219 rows containing missing values.
null_dist <- yrbss %>%
  specify(weight ~ physical_3plus) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("yes", "no"))
## Warning: Removed 1219 rows containing missing values.
H0: Students who are physically active 3 or more days per week have the same average weight as students who are not physically active 3 or more days a week.
HA: Students who are physically active 3 or more days per week have a different average weight than those who are not physically active 3 or more days a week.
Add a vertical red line to the plot above, demonstrating where the observed difference in means (obs_diff) falls on the distribution.
# Insert code for Exercise 6 here
yrbss <- yrbss %>%
mutate(physical_3plus = if_else(physically_active_7d > 2, "yes", "no"))
yrbss %>%
group_by(physical_3plus) %>%
summarise(mean_weight = mean(weight, na.rm = TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 2
## physical_3plus mean_weight
## <chr> <dbl>
## 1 no 66.7
## 2 yes 68.4
## 3 <NA> 69.9
obs_diff <- yrbss %>%
  specify(weight ~ physical_3plus) %>%
  calculate(stat = "diff in means", order = c("yes", "no"))
## Warning: Removed 1219 rows containing missing values.
null_dist <- yrbss %>%
  specify(weight ~ physical_3plus) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("yes", "no"))
## Warning: Removed 1219 rows containing missing values.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
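To see where an observed difference falls on a permutation null distribution, the idea can be sketched in base R. This is an illustration on simulated data, not the code that produced the plot for this exercise: the `weight` and `group` vectors and the helper `diff_in_means()` are hypothetical, and the histogram is only drawn in an interactive session.

```r
set.seed(1)  # make the sketch reproducible

# Hypothetical data: weights (kg) for two activity groups, 40 students each
weight <- c(rnorm(40, mean = 68, sd = 15), rnorm(40, mean = 66, sd = 15))
group  <- rep(c("yes", "no"), each = 40)

# Difference in means, ordered "yes" minus "no"
diff_in_means <- function(w, g) mean(w[g == "yes"]) - mean(w[g == "no"])
obs <- diff_in_means(weight, group)

# Null distribution: shuffle the group labels and recompute the statistic
null_stats <- replicate(1000, diff_in_means(weight, sample(group)))

# Two-sided p-value: share of permuted differences at least as extreme as obs
p_value <- mean(abs(null_stats) >= abs(obs))
print(p_value)

# Histogram of the null distribution with the observed difference in red
if (interactive()) {
  hist(null_stats, breaks = 30, main = "Permutation null distribution",
       xlab = "Difference in mean weight (yes - no)")
  abline(v = obs, col = "red", lwd = 2)
}
```

A red line far out in the tail of the histogram corresponds to a small p-value; a line near the center corresponds to a large one.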
Calculate a 95% confidence interval for the average height in meters (height) and interpret it in context.
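# Insert code for Exercise 10 here
# mu only sets the null value for the test statistic; the confidence interval does not depend on it
t.test(yrbss$height, mu = mean(yrbss$height, na.rm = TRUE), conf.level = 0.95)
##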
## One Sample t-test
##
## data: yrbss$height
## t = 0, df = 12578, p-value = 1
## alternative hypothesis: true mean is not equal to 1.691241
## 95 percent confidence interval:
## 1.689411 1.693071
## sample estimates:
## mean of x
## 1.691241
We are 95% confident that the average height of the population of students this sample represents is between 1.6894 and 1.6931 meters.
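The interval reported by t.test() is the sample mean plus or minus a t critical value times the standard error. As a minimal sketch, the code below rebuilds that interval by hand on a short, made-up vector of heights `x` (not the survey data) and compares it with the t.test() result.

```r
# Hypothetical heights in meters, with one missing value (not the real data)
x <- c(1.73, 1.60, 1.50, 1.57, 1.65, 1.88, NA, 1.75, 1.37, 1.68)
x <- x[!is.na(x)]  # t.test() also drops missing values

n      <- length(x)
se     <- sd(x) / sqrt(n)          # standard error of the mean
t_star <- qt(0.975, df = n - 1)    # critical value for a 95% interval

# Manual interval: mean plus or minus t* times the standard error
ci_manual <- mean(x) + c(-1, 1) * t_star * se

# Interval from t.test(), stripped of its attributes for comparison
ci_ttest <- as.numeric(t.test(x, conf.level = 0.95)$conf.int)

print(ci_manual)
print(ci_ttest)
```

The two printed intervals agree, which shows that the `mu` argument plays no role in the confidence interval.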
Calculate a new confidence interval for the same parameter at the 90% confidence level. Comment on the width of this interval versus the one obtained in the previous exercise.
# Insert code for Exercise 11 here
t.test(yrbss$height, mu = mean(yrbss$height, na.rm = TRUE), conf.level = 0.9)
##
## One Sample t-test
##
## data: yrbss$height
## t = 0, df = 12578, p-value = 1
## alternative hypothesis: true mean is not equal to 1.691241
## 90 percent confidence interval:
## 1.689705 1.692777
## sample estimates:
## mean of x
## 1.691241
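We are 90% confident that the average height of the population of students this sample represents is between 1.6897 and 1.6928 meters. This interval is narrower than the 95% interval (width about 0.0031 m versus 0.0037 m): lowering the confidence level shrinks the interval, and raising it widens the interval.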
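The link between confidence level and width can be checked on any sample. The sketch below uses a made-up vector of heights `x` and computes the interval width at three confidence levels; the names `conf_levels` and `widths` are illustrative only.

```r
# Hypothetical heights in meters (not the real data)
x <- c(1.73, 1.60, 1.50, 1.57, 1.65, 1.88, 1.75, 1.37, 1.68, 1.62)

conf_levels <- c(0.90, 0.95, 0.99)

# Width of the t interval (upper bound minus lower bound) at each level
widths <- sapply(conf_levels, function(l) {
  diff(t.test(x, conf.level = l)$conf.int)
})
names(widths) <- conf_levels

print(widths)
```

Because the sample mean and standard error are fixed, the width changes only through the critical value, which grows with the confidence level.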
Conduct a hypothesis test evaluating whether the average height is different for those who exercise at least three times a week and those who don’t.
# Insert code for Exercise 12 here
yrbss$physical_3plus <- ifelse(yrbss$physically_active_7d >= 3, "yes", "no")
t.test(height ~ physical_3plus, data = yrbss)
##
## Welch Two Sample t-test
##
## data: height by physical_3plus
## t = -19.029, df = 7973.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.04150183 -0.03374994
## sample estimates:
## mean in group no mean in group yes
## 1.665587 1.703213
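Null: The average height is the same for students who exercise at least three times a week and those who don’t.
Alternative: The average height is different for students who exercise at least three times a week and those who don’t.
We reject the null hypothesis because the p-value (< 2.2e-16) is far below 0.05. Students who exercise at least three times a week are taller on average (1.703 m versus 1.666 m), and the 95% confidence interval for the difference in means (no minus yes) runs from -0.0415 to -0.0337 m.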
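The Welch statistic and its degrees of freedom can be reproduced by hand, which makes the t.test() output easier to read. This is a sketch on two small, made-up groups of heights (`no` and `yes` are hypothetical vectors, not the survey data), using the same "no minus yes" order as the output above.

```r
# Hypothetical heights in meters for the two activity groups (not the real data)
yes <- c(1.70, 1.75, 1.68, 1.80, 1.73, 1.77)
no  <- c(1.62, 1.66, 1.71, 1.60, 1.65)

# Per-group variance of the mean
v_no  <- var(no) / length(no)
v_yes <- var(yes) / length(yes)

# Welch t statistic: difference in means over the unpooled standard error
t_manual <- (mean(no) - mean(yes)) / sqrt(v_no + v_yes)

# Welch-Satterthwaite approximation for the degrees of freedom
df_manual <- (v_no + v_yes)^2 /
  (v_no^2 / (length(no) - 1) + v_yes^2 / (length(yes) - 1))

# t.test() runs the Welch test by default (var.equal = FALSE)
res <- t.test(no, yes)

print(c(t_manual, unname(res$statistic)))
print(c(df_manual, unname(res$parameter)))
```

The fractional degrees of freedom in the output above (7973.3) come from this approximation rather than from a simple count of observations.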
Now, a non-inference task: Determine the number of different options there are in the dataset for hours_tv_per_school_day.
##
## <1 1 2 3 4 5+
## 2168 1750 2705 2139 1048 1595
## do not watch
## 1840
There are seven options.
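A frequency table like the one above can be produced with table(), and the number of options is the length of that table. The sketch below uses a short, made-up vector `tv` in place of the real column.

```r
# Hypothetical responses, including one missing value (not the real data)
tv <- c("5+", "5+", "2", "3", "<1", "do not watch", NA, "2", "1", "4")

# table() counts each distinct value and leaves out NA by default
counts <- table(tv)

# Number of distinct non-missing options
n_options <- length(counts)

print(counts)
print(n_options)
```

On the real data the equivalent call would be `table(yrbss$hours_tv_per_school_day)`, assuming the data frame is loaded under that name.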
Come up with a research question evaluating the relationship between height or weight and sleep. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval. Report the statistical results, and also provide an explanation in plain language. Be sure to check all assumptions, state your α level, and conclude in context.
# Insert code for Exercise 14 here
# Recode the sleep categories as numbers; "<5" is coded as 4 and "10+" as 10
yrbss1 <- yrbss %>%
  mutate(school_night_hours_sleep1 = case_when(
    school_night_hours_sleep == "<5" ~ 4,
    school_night_hours_sleep == "5" ~ 5,
    school_night_hours_sleep == "6" ~ 6,
    school_night_hours_sleep == "7" ~ 7,
    school_night_hours_sleep == "8" ~ 8,
    school_night_hours_sleep == "9" ~ 9,
    school_night_hours_sleep == "10+" ~ 10,
    TRUE ~ NA_real_))
# Regress weight, then height, on hours of sleep
summary(lm(weight ~ school_night_hours_sleep1, data = yrbss1))
summary(lm(height ~ school_night_hours_sleep1, data = yrbss1))
##
## Call:
## lm(formula = weight ~ school_night_hours_sleep1, data = yrbss1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.806 -12.032 -3.175 8.251 111.637
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 71.4964 0.7764 92.091 < 2e-16 ***
## school_night_hours_sleep1 -0.5357 0.1130 -4.741 2.16e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.95 on 11479 degrees of freedom
## (2102 observations deleted due to missingness)
## Multiple R-squared: 0.001954, Adjusted R-squared: 0.001867
## F-statistic: 22.47 on 1 and 11479 DF, p-value: 2.157e-06
##
## Call:
## lm(formula = height ~ school_night_hours_sleep1, data = yrbss1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.42185 -0.08920 -0.00831 0.08727 0.42169
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.6847801 0.0047919 351.590 <2e-16 ***
## school_night_hours_sleep1 0.0008833 0.0006975 1.266 0.205
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1046 on 11479 degrees of freedom
## (2102 observations deleted due to missingness)
## Multiple R-squared: 0.0001397, Adjusted R-squared: 5.259e-05
## F-statistic: 1.604 on 1 and 11479 DF, p-value: 0.2054
Conditions: the students are assumed to be a random sample, so the observations are independent, and the sample is large (11,481 complete observations). The α level is 0.05.
My research question: is the number of hours a student sleeps on a school night associated with their weight?
H0: A student’s weight is not associated with the number of hours they sleep per night (the slope is zero). HA: A student’s weight is associated with the number of hours they sleep per night (the slope is not zero).
Based on the p-value of 2.157e-06, which is below our α level of 0.05, we reject the null hypothesis and conclude that weight is associated with hours of sleep. The slope of -0.5357 means that each additional hour of sleep per school night is associated with an average weight about 0.54 kg lower. The relationship is weak, however: R-squared is 0.001954, so sleep explains only about 0.2% of the variation in weight. The same regression for height gives a p-value of 0.2054, so there is no evidence of a linear relationship between height and sleep.
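The recode-then-regress workflow can be sketched end to end in base R. Everything here is illustrative: `sleep_raw` and `weight` are made-up vectors, and the named vector `lookup` is one possible way to map the survey categories to numbers (with "<5" treated as 4 and "10+" as 10, which is an assumption about how to score the open-ended categories).

```r
# Hypothetical survey responses and weights in kg (not the real data)
sleep_raw <- c("8", "6", "<5", "6", "9", "8", "10+", NA, "7", "5")
weight    <- c(60.1, 72.5, 80.3, 68.0, 58.2, 63.4, 55.0, 70.0, 66.1, 75.9)

# Named lookup vector: category label -> numeric hours (assumed scoring)
lookup <- c("<5" = 4, "5" = 5, "6" = 6, "7" = 7, "8" = 8, "9" = 9, "10+" = 10)
sleep_hours <- unname(lookup[sleep_raw])  # a missing label stays NA

# Simple linear regression; lm() drops rows with missing values
fit <- lm(weight ~ sleep_hours)

# Pull out the quantities used in the write-up
slope     <- coef(fit)[["sleep_hours"]]
p_value   <- summary(fit)$coefficients["sleep_hours", "Pr(>|t|)"]
r_squared <- summary(fit)$r.squared

print(c(slope = slope, p_value = p_value, r_squared = r_squared))
```

Reading the slope together with R-squared guards against over-interpreting a result that is statistically significant but explains little of the variation.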