library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.7
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(infer)
What are the cases in this data set? How many cases are there in our sample?
Cases are the rows (observations) of a data set and variables are the columns, so each case here is one surveyed student. The yrbss dataset has 13,583 cases and 13 variables.
glimpse(yrbss)
## Rows: 13,583
## Columns: 13
## $ age <int> 14, 14, 15, 15, 15, 15, 15, 14, 15, 15, 15, 1~
## $ gender <chr> "female", "female", "female", "female", "fema~
## $ grade <chr> "9", "9", "9", "9", "9", "9", "9", "9", "9", ~
## $ hispanic <chr> "not", "not", "hispanic", "not", "not", "not"~
## $ race <chr> "Black or African American", "Black or Africa~
## $ height <dbl> NA, NA, 1.73, 1.60, 1.50, 1.57, 1.65, 1.88, 1~
## $ weight <dbl> NA, NA, 84.37, 55.79, 46.72, 67.13, 131.54, 7~
## $ helmet_12m <chr> "never", "never", "never", "never", "did not ~
## $ text_while_driving_30d <chr> "0", NA, "30", "0", "did not drive", "did not~
## $ physically_active_7d <int> 4, 2, 7, 0, 2, 1, 4, 4, 5, 0, 0, 0, 4, 7, 7, ~
## $ hours_tv_per_school_day <chr> "5+", "5+", "5+", "2", "3", "5+", "5+", "5+",~
## $ strength_training_7d <int> 0, 0, 0, 0, 1, 0, 2, 0, 3, 0, 3, 0, 0, 7, 7, ~
## $ school_night_hours_sleep <chr> "8", "6", "<5", "6", "9", "8", "9", "6", "<5"~
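The row and column counts that glimpse() reports can also be pulled out directly. This is a minimal sketch on a made-up data frame (toy is hypothetical, not the yrbss data), using base R's nrow() and ncol().

```r
# Hypothetical toy data frame: each row is one case, each column one variable
toy <- data.frame(
  age    = c(14, 15, 15),
  gender = c("female", "male", "female")
)

n_cases     <- nrow(toy)  # number of rows = number of cases
n_variables <- ncol(toy)  # number of columns = number of variables

n_cases
n_variables
```

On the real data the same calls would be nrow(yrbss) and ncol(yrbss).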
How many observations are we missing weights from?
Using the summary() function we can see that weight has 1,004 NA's, so 1,004 observations are missing a weight.
summary(yrbss$weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 29.94 56.25 64.41 67.91 76.20 180.99 1004
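As a cross-check on reading the NA's column off summary(), the missing values can be counted directly. The sketch below uses a made-up vector (toy_weight is hypothetical, not the yrbss data) and base R's is.na().

```r
# Hypothetical toy vector standing in for a column with missing values
toy_weight <- c(84.4, NA, 55.8, 46.7, NA, 67.1)

# is.na() flags each missing entry; sum() counts the TRUEs
n_missing <- sum(is.na(toy_weight))
n_present <- sum(!is.na(toy_weight))

n_missing
n_present
```

Applied to the real data this would be sum(is.na(yrbss$weight)).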
Make a side-by-side boxplot of physical_3plus and weight. Is there a relationship between these two variables? What did you expect and why?
After filtering out the NAs I was surprised that the two groups looked so similar. There is a relationship between weight and physical activity: usually the more someone works out, the less they weigh, so I expected those who were active at least 3 days a week to weigh less than those who weren't. Surprisingly, those who were active weigh a bit more, as seen in their quartiles and median.
yrbss <- yrbss %>%
mutate(physical_3plus = ifelse(physically_active_7d > 2, "yes", "no"))
yr_plot <- yrbss %>%
filter(!is.na(physical_3plus),!is.na(weight))
ggplot(yr_plot,aes(x=physical_3plus, y=weight)) +
geom_boxplot()
Are all conditions necessary for inference satisfied? Comment on each. You can compute the group sizes with the summarize command above by defining a new variable with the definition n().
The conditions necessary for inference are that the observations are independent, both within and between the two groups, and that each group is large enough for the sampling distribution of its mean to be approximately normal. For independence, we can assume the students were sampled at random, so one student's answers about exercise or weight do not affect another's. For sample size, even excluding the NAs each physical_3plus group has far more than 30 students, so we can use the sample for inference on the means.
yrbss %>%
group_by(physical_3plus) %>%
summarise(mean_weight = mean(weight, na.rm = TRUE))
## # A tibble: 3 x 2
## physical_3plus mean_weight
## <chr> <dbl>
## 1 no 66.7
## 2 yes 68.4
## 3 <NA> 69.9
yrbss %>%
group_by(physical_3plus) %>%
count(physical_3plus)
## # A tibble: 3 x 2
## # Groups: physical_3plus [3]
## physical_3plus n
## <chr> <int>
## 1 no 4404
## 2 yes 8906
## 3 <NA> 273
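The counts above include students whose weight is missing. A minimal sketch of getting both the group size and the number of usable weights in one summarise() call is shown below. The toy data frame is hypothetical, and the column names are only chosen to mirror the ones used here.

```r
library(dplyr)

# Hypothetical toy data: a grouping column and a response with some NAs
toy <- data.frame(
  physical_3plus = c("yes", "yes", "yes", "no", "no", NA),
  weight         = c(70.1, NA, 65.3, 80.2, 59.9, 61.0)
)

group_sizes <- toy %>%
  filter(!is.na(physical_3plus)) %>%
  group_by(physical_3plus) %>%
  summarise(
    n        = n(),                 # all rows in the group
    n_weight = sum(!is.na(weight))  # rows that actually have a weight
  )

group_sizes
```

The n_weight column is the one that matters for the sample size condition, since rows without a weight do not enter the mean.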
Write the hypotheses for testing if the average weights are different for those who exercise at least three times a week and those who don’t.
The null hypothesis is that the average weights are the same for those who exercise at least 3 times a week and those who don’t. The alternative hypothesis is that the average weights are different for those who exercise at least 3 times a week and those who don’t.
How many of these null permutations have a difference of at least obs_stat?
None of the null permutations have a difference of at least obs_stat.
## conducting the hypothesis tests
obs_diff <- yrbss %>%
specify(weight ~ physical_3plus) %>%
calculate(stat = "diff in means", order = c("yes", "no"))
## Warning: Removed 1219 rows containing missing values.
## The null distribution
null_dist <- yrbss %>%
specify(weight ~ physical_3plus) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in means", order = c("yes", "no"))
## Warning: Removed 1219 rows containing missing values.
## 1.774584 is the value of obs_diff
## compare against the number itself (not a quoted string) so the test is numeric
null_dist %>%
filter(stat >= 1.774584)
## Response: weight (numeric)
## Explanatory: physical_3plus (factor)
## Null Hypothesis: independence
## # A tibble: 0 x 2
## # ... with 2 variables: replicate <int>, stat <dbl>
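To make the idea of counting null permutations concrete, here is a minimal base R sketch of a permutation test on made-up data. The vectors response and group and the helper diff_in_means() are hypothetical and only illustrate the mechanics. This is not the infer pipeline used above.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical toy data: a numeric response and a two-level group label
response <- c(rnorm(30, mean = 68), rnorm(30, mean = 66))
group    <- rep(c("yes", "no"), each = 30)

# Difference in means, ordered "yes" minus "no"
diff_in_means <- function(y, g) mean(y[g == "yes"]) - mean(y[g == "no"])

obs_stat <- diff_in_means(response, group)

# Null distribution: shuffle the labels and recompute the statistic each time
null_stats <- replicate(1000, diff_in_means(response, sample(group)))

# Two-sided p-value: share of shuffles at least as extreme as the observed one
p_value <- mean(abs(null_stats) >= abs(obs_stat))
p_value
```

With infer, the same proportion can be obtained with get_p_value(null_dist, obs_stat = obs_diff, direction = "two_sided").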
Construct and record a confidence interval for the difference between the weights of those who exercise at least three times a week and those who don’t, and interpret this interval in context of the data.
I constructed a 95% confidence interval: we are 95% confident that the average weight of those who exercise at least three times a week is between 1.151 and 2.398 kg higher than the average weight of those who don’t.
## avg value of x1
mean_1<-yrbss %>%
group_by(physical_3plus) %>%
summarise(avg = mean(weight, na.rm = TRUE))%>%
filter(physical_3plus=="yes") %>%
select(avg)
## value of s1
sigma_1<-yrbss %>%
group_by(physical_3plus) %>%
summarise(sd = sd(weight, na.rm = TRUE))%>%
filter(physical_3plus=="yes") %>%
select(sd)
##avg value of x2
mean_2<-yrbss %>%
group_by(physical_3plus) %>%
summarise(avg = mean(weight, na.rm = TRUE))%>%
filter(physical_3plus=="no") %>%
select(avg)
## value of s2
sigma_2<-yrbss %>%
group_by(physical_3plus) %>%
summarise(sd = sd(weight, na.rm = TRUE))%>%
filter(physical_3plus=="no") %>%
select(sd)
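## Constructing a 95% confidence interval we get:
## group sizes taken from the count() table above (they include rows with a missing weight)
n_1 <- 8906
n_2 <- 4404
SE <- ((sigma_1^2/n_1)+(sigma_2^2/n_2))^(1/2)
## critical value for 95% confidence, using df = min(n_1, n_2) - 1
T_score <- qt(0.975, df = 4403)
## formula is point estimate +- margin of error: (x1 - x2) +- t_df * SE
top <- (mean_1-mean_2)+T_score*SE
bottom <-(mean_1-mean_2)-T_score*SE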
sprintf("The confidence interval is between %.3f and %.3f",bottom,top)
## [1] "The confidence interval is between 1.151 and 2.398"
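The interval above uses the smaller group size minus one as the degrees of freedom. As a sketch of how that compares with what base R's t.test() does by default (Welch's approximation), the block below computes both on simulated data. The vectors x and y are made up and only stand in for the two exercise groups.

```r
set.seed(2)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical simulated groups
x <- rnorm(40, mean = 68, sd = 17)  # stands in for the "yes" group
y <- rnorm(25, mean = 66, sd = 18)  # stands in for the "no" group

# Standard error of the difference in means
v_x <- var(x) / length(x)
v_y <- var(y) / length(y)
se  <- sqrt(v_x + v_y)

# Conservative df: smaller group size minus one
df_cons <- min(length(x), length(y)) - 1
ci_cons <- (mean(x) - mean(y)) + c(-1, 1) * qt(0.975, df_cons) * se

# Welch-Satterthwaite df, which t.test() uses when var.equal = FALSE (the default)
df_welch <- (v_x + v_y)^2 / (v_x^2 / (length(x) - 1) + v_y^2 / (length(y) - 1))
ci_welch <- (mean(x) - mean(y)) + c(-1, 1) * qt(0.975, df_welch) * se

ci_cons
ci_welch
t.test(x, y)$conf.int
```

The conservative interval is never narrower than the Welch one. With groups as large as the ones here, the two are nearly identical.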
Calculate a 95% confidence interval for the average height in meters (height) and interpret it in context.
We are 95% confident that the average height of students is between 1.689 and 1.693 meters.
heigh_t <- yrbss %>%
select(height) %>%
filter(!is.na(height)) %>%
summarise(avg=mean(height))
stddev <- yrbss %>%
select(height) %>%
filter(!is.na(height)) %>%
summarise(std_dev = sd(height))
size <- yrbss %>%
select(height) %>%
filter(!is.na(height)) %>%
count()
df <- size - 1
## upper-tail critical value, so T_val is positive and topp is the upper bound
T_val <- qt(0.975,df=12578)
Se <- stddev/(size^(1/2))
## Constructing the confidence interval
topp <- heigh_t + T_val * Se
bott <- heigh_t - T_val * Se
sprintf("The confidence interval is between %f and %f", bott, topp)
## [1] "The confidence interval is between 1.689411 and 1.693071"
Calculate a new confidence interval for the same parameter at the 90% confidence level. Comment on the width of this interval versus the one obtained in the previous exercise.
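We are 90% confident that the average height is between 1.6897 and 1.6928 meters. This interval is slightly narrower than the 95% interval (1.6894 to 1.6931) because a 90% level leaves 5% in each tail instead of 2.5%, which gives a smaller critical value (about 1.645 instead of 1.96). The difference is minor because the sample is so large that the standard error is very small.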
heigh_t <- yrbss %>%
select(height) %>%
filter(!is.na(height)) %>%
summarise(avg=mean(height))
stddev <- yrbss %>%
select(height) %>%
filter(!is.na(height)) %>%
summarise(std_dev = sd(height))
size <- yrbss %>%
select(height) %>%
filter(!is.na(height)) %>%
count()
df <- size - 1
## upper-tail critical value for 90% confidence (5% in each tail)
T_val <- qt(0.95,df=12578)
Se <- stddev/(size^(1/2))
toppp <- heigh_t + T_val * Se
bottt <- heigh_t - T_val * Se
sprintf("The 90%% confidence interval is between %.4f and %.4f", bottt, toppp)
## [1] "The 90% confidence interval is between 1.6897 and 1.6928"
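The only thing that changes between the 95% and 90% intervals is the critical value. A small sketch comparing the two, using the same degrees of freedom as above:

```r
df <- 12578  # degrees of freedom used in the text (n - 1)

t_95 <- qt(0.975, df)  # leaves 2.5% in each tail
t_90 <- qt(0.95, df)   # leaves 5% in each tail

# For a fixed standard error, interval width is proportional to the critical value
width_ratio <- t_90 / t_95

c(t_95 = t_95, t_90 = t_90, width_ratio = width_ratio)
```

Since the standard error is the same in both cases, the 90% interval's width is width_ratio times the 95% interval's width.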
Our null hypothesis is that the difference in average height between those who exercise and those who don’t is zero. Our alternative hypothesis is that the difference in average height between those who exercise and those who don’t is not zero.
After calculating everything we get a t score of 4.55, which is well beyond the two-sided critical value of about 1.96 at the 0.05 level. Thus we can reject the null hypothesis and conclude that there is a difference in average height between those who work out and those who don’t.
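## Note: this chunk regroups students with a cutoff of more than 3 active days (not > 2 as above),
## and the mean() and sd() calls below summarise the weight column for students with a recorded height.
## To compare heights, height would go inside mean() and sd() instead.
yrbss1 <- yrbss %>%
mutate(physical_3plus = ifelse(physically_active_7d > 3, "yes", "no"))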
yes_avg <- yrbss1 %>%
filter(!is.na(height)) %>%
filter(physical_3plus == "yes") %>%
summarise(averg = mean(weight))
no_avg <- yrbss1 %>%
filter(!is.na(height)) %>%
filter(physical_3plus == "no") %>%
summarise(averg = mean(weight))
yes_sd <- yrbss1 %>%
filter(!is.na(height)) %>%
filter(physical_3plus == "yes") %>%
summarise(sd = sd(weight))
no_sd <- yrbss1 %>%
filter(!is.na(height)) %>%
filter(physical_3plus == "no") %>%
summarise(sd = sd(weight))
## Calculate the T statistic: and SE
yes_size <- yrbss1 %>%
filter(!is.na(height)) %>%
filter(physical_3plus == "yes") %>%
count()
no_size <- yrbss1 %>%
filter(!is.na(height)) %>%
filter(physical_3plus == "no") %>%
count()
stderr <- ((yes_sd)^2/(yes_size) + (no_sd)^2/(no_size)) ^ (1/2)
t <- ((yes_avg-no_avg)-0)/(stderr)
print("The test statistics is: 4.548727")
## [1] "The test statistics is: 4.548727"
Now, a non-inference task: Determine the number of different options there are in the dataset for hours_tv_per_school_day.
There are 7 different options for how many hours of TV students watch per school day, not counting NA.
yrbss %>%
select(hours_tv_per_school_day) %>%
filter(!is.na(hours_tv_per_school_day)) %>%
group_by(hours_tv_per_school_day) %>%
count()
## # A tibble: 7 x 2
## # Groups: hours_tv_per_school_day [7]
## hours_tv_per_school_day n
## <chr> <int>
## 1 <1 2168
## 2 1 1750
## 3 2 2705
## 4 3 2139
## 5 4 1048
## 6 5+ 1595
## 7 do not watch 1840
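The number of options can also be returned as a single number instead of being read off the table. A minimal base R sketch on a made-up vector (hours_tv is hypothetical, not the real column):

```r
# Hypothetical toy column with repeated categories and a missing value
hours_tv <- c("5+", "2", "<1", "2", NA, "do not watch", "5+")

# Drop NAs, then count the distinct values
options_seen <- unique(hours_tv[!is.na(hours_tv)])
n_options    <- length(options_seen)

n_options

# table() gives per-category counts, similar to group_by() + count()
table(hours_tv)
```

By default table() leaves NA out of its counts, so it matches the filtered result.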
Come up with a research question evaluating the relationship between height or weight and sleep. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval. Report the statistical results, and also provide an explanation in plain language. Be sure to check all assumptions, state your α level, and conclude in context.
Is there a difference in average height between students who sleep at least 8 hours on school nights and those who sleep less?
Ho: There is no difference in average height between students who sleep at least 8 hours and those who sleep less.
Ha: There is a difference in average height between students who sleep at least 8 hours and those who sleep less.
After stating the hypotheses and calculating the t statistic, I use a significance level of α = 0.05. The upper-tail area beyond the t statistic is about 0.015, so the two-sided p-value is about 0.03. Since this is less than 0.05 we can reject the null hypothesis and conclude that there is a difference in average height between students who sleep at least 8 hours and those who sleep less.
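## school_night_hours_sleep is a character column, so >= 8 compares text:
## "8" and "9" become "yes", "10+" sorts before "8" and becomes "no",
## and where "<5" lands depends on the locale's sort order
rest <-yrbss %>%
mutate(well_rest = ifelse(school_night_hours_sleep >= 8,"yes","no"))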
## calculate yes statistics
## avg_height
## (n1 counts the "no" rows here, the same filter as n2 below)
n1 <-rest %>%
select(well_rest) %>%
filter(well_rest=="no") %>%
count()
ye_avg <-rest %>%
select(height,well_rest) %>%
filter(well_rest=="yes",!is.na(height),!is.na(well_rest)) %>%
summarise(yee_av = mean(height))
ye_sd <-rest %>%
select(height,well_rest) %>%
filter(well_rest == "yes",!is.na(height),!is.na(well_rest)) %>%
summarise(sd = sd(height))
## calculate no statistics
n2 <-rest %>%
select(well_rest) %>%
filter(well_rest=="no") %>%
count()
nope_avg <-rest %>%
select(height,well_rest) %>%
filter(well_rest=="no",!is.na(height),!is.na(well_rest)) %>%
summarise(yee_av = mean(height))
nope_sd <- rest %>%
select(height,well_rest) %>%
filter(well_rest=="no",!is.na(height),!is.na(well_rest)) %>%
summarise(yee_av = sd(height))
## calculate standard error and the t statistics
## the alternative is two-sided, so we double the upper-tail area (the complement of pt)
std_err <- ((ye_sd)^2/(n1) + (nope_sd)^2/(n2)) ^ (1/2)
tstat <- ((ye_avg-nope_avg)-0)/(std_err)
sprintf("The t statistics is: %.5f",tstat)
## [1] "The t statistics is: 2.16382"
sprintf("The two-sided p-value is: %.2f", 2 * (1 - pt(2.16382, 3454)))
## [1] "The two-sided p-value is: 0.03"
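Because the alternative hypothesis is two-sided, the p-value should cover both tails of the t distribution. The sketch below shows the one-sided and two-sided areas side by side, using the t statistic and degrees of freedom reported above. The variable names are only illustrative.

```r
# Values reported in the text above
t_stat <- 2.16382
df     <- 3454

# Upper-tail area only (a one-sided p-value)
p_one_sided <- pt(t_stat, df, lower.tail = FALSE)

# Two-sided p-value: both tails, using |t| so the sign of t does not matter
p_two_sided <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)

c(one_sided = p_one_sided, two_sided = p_two_sided)
```

Both values are below 0.05 here, so the conclusion does not change, but the two-sided value is the one that matches Ha.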