library(tidyverse)
library(openintro)
library(infer)
library(ggplot2)
Every two years, the Centers for Disease Control and Prevention conduct the Youth Risk Behavior Surveillance System (YRBSS) survey, where it takes data from high schoolers (9th through 12th grade), to analyze health patterns. You will work with a selected group of variables from a random sample of observations during one of the years the YRBSS was conducted.
Load the yrbss
data set into your workspace.
data('yrbss', package='openintro')
There are observations on 13 different variables, some categorical and some numerical. The meaning of each variable can be found by bringing up the help file:
?yrbss
yrbss_original <- yrbss
glimpse(yrbss)
## Rows: 13,583
## Columns: 13
## $ age <int> 14, 14, 15, 15, 15, 15, 15, 14, 15, 15, 15, 1…
## $ gender <chr> "female", "female", "female", "female", "fema…
## $ grade <chr> "9", "9", "9", "9", "9", "9", "9", "9", "9", …
## $ hispanic <chr> "not", "not", "hispanic", "not", "not", "not"…
## $ race <chr> "Black or African American", "Black or Africa…
## $ height <dbl> NA, NA, 1.73, 1.60, 1.50, 1.57, 1.65, 1.88, 1…
## $ weight <dbl> NA, NA, 84.37, 55.79, 46.72, 67.13, 131.54, 7…
## $ helmet_12m <chr> "never", "never", "never", "never", "did not …
## $ text_while_driving_30d <chr> "0", NA, "30", "0", "did not drive", "did not…
## $ physically_active_7d <int> 4, 2, 7, 0, 2, 1, 4, 4, 5, 0, 0, 0, 4, 7, 7, …
## $ hours_tv_per_school_day <chr> "5+", "5+", "5+", "2", "3", "5+", "5+", "5+",…
## $ strength_training_7d <int> 0, 0, 0, 0, 1, 0, 2, 0, 3, 0, 3, 0, 0, 7, 7, …
## $ school_night_hours_sleep <chr> "8", "6", "<5", "6", "9", "8", "9", "6", "<5"…
You will first start with analyzing the weight of the participants in
kilograms: weight
.
Using visualization and summary statistics, describe the distribution
of weights. The summary
function can be useful.
summary(yrbss$weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 29.94 56.25 64.41 67.91 76.20 180.99 1004
Next, consider the possible relationship between a high schooler’s weight and their physical activity. Plotting the data is a useful first step because it helps us quickly visualize trends, identify strong associations, and develop research questions.
First, let’s create a new variable physical_3plus
, which
will be coded as either “yes” if they are physically active for at least
3 days a week, and “no” if not.
yrbss <- yrbss %>%
mutate(physical_3plus = ifelse(yrbss$physically_active_7d > 2, "yes", "no"))
physical_3plus
and
weight
. Is there a relationship between these two
variables? What did you expect and why? #Answer 3 : They look very
similar. I was expecting people who were physical 3 times a week to
weigh less on average but the medians look similar as well as the
IQR.yrbss$physical_3plus<-as.factor(yrbss$physical_3plus)
yrb_plot<-yrbss%>%
filter(!is.na(weight),!is.na(physical_3plus))
p<-ggplot(yrb_plot, aes(x=physical_3plus,y=weight))+
geom_boxplot()
p
The box plots show how the medians of the two distributions compare,
but we can also compare the means of the distributions using the
following to first group the data by the physical_3plus
variable, and then calculate the mean weight
in these
groups using the mean
function while ignoring missing
values by setting the na.rm
argument to
TRUE
.
yrbss %>%
group_by(physical_3plus) %>%
summarise(mean_weight = mean(weight, na.rm = TRUE))
## # A tibble: 3 × 2
## physical_3plus mean_weight
## <fct> <dbl>
## 1 no 66.7
## 2 yes 68.4
## 3 <NA> 69.9
There is an observed difference, but is this difference statistically significant? In order to answer this question we will conduct a hypothesis test.
summarize
command above by defining a new variable with the definition
n()
.#Answer 4 #CONDITIONS FOR INFERENCE: #Independence: the two groups are not linked, necessarily one who excersizes 3 times a week plus cannot be in the group that doesn’t. #Size: the groups are less than 10% of the total population of kids who excersize and those who don’t in the study…
yrbss<- yrbss %>%
mutate(n = ifelse(yrbss$physically_active_7d > 2, "yes", "no"))
yrbss$n<-as.factor(yrbss$n)
view(yrbss)
yrbss%>%
count(n)
## Storing counts in `nn`, as `n` already present in input
## ℹ Use `name = "new_name"` to pick a new name.
## # A tibble: 3 × 2
## n nn
## <fct> <int>
## 1 no 4404
## 2 yes 8906
## 3 <NA> 273
#Answer 5 # H0 - there is no difference in weight between those who exercise 3 times a week and those who do not exercise at all ( i.e. the means are the same ) # H1 - there is a difference in weight between those who exercise 3 times a week and those who do not exercise at all ( i.e. the means are different )
Next, we will introduce a new function, hypothesize
,
that falls into the infer
workflow. You will use this
method for conducting hypothesis tests.
But first, we need to initialize the test, which we will save as
obs_diff
.
yrbss <- yrbss %>% na.omit # remove rows with NA
obs_diff <- yrbss %>%
specify(weight ~ physical_3plus) %>%
calculate(stat = "diff in means", order = c("yes", "no"))
Notice how you can use the functions specify
and
calculate
again like you did for calculating confidence
intervals. Here, though, the statistic you are searching for is the
difference in means, with the order being
yes - no != 0
.
After you have initialized the test, you need to simulate the test on
the null distribution, which we will save as null
.
null_dist <- yrbss %>%
specify(weight ~ physical_3plus) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in means", order = c("yes", "no"))
Here, hypothesize
is used to set the null hypothesis as
a test for independence. In one sample cases, the null
argument can be set to “point” to test a hypothesis relative to a point
estimate.
Also, note that the type
argument within
generate
is set to permute
, whichis the
argument when generating a null distribution for a hypothesis test.
We can visualize this null distribution with the following code:
ggplot(data = null_dist, aes(x = stat)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
null
permutations have a difference
of at least obs_stat
? #Answer 6 none of the values are
greater than the obs_diff_valobs_diff_val<-obs_diff$stat[1]
null_list<-as.list(null_dist$stat)
null_abs<-lapply(null_list, FUN=function(x){abs(x)})
null_dist%>%
summarise(mean= mean(stat, na.rm=TRUE))
## # A tibble: 1 × 1
## mean
## <dbl>
## 1 0.00110
null_dist%>%
filter(stat>obs_diff_val)
## Response: weight (numeric)
## Explanatory: physical_3plus (factor)
## Null Hypothesis: independence
## # A tibble: 1 × 2
## replicate stat
## <int> <dbl>
## 1 314 1.56
Now that the test is initialized and the null distribution formed,
you can calculate the p-value for your hypothesis test using the
function get_p_value
.
null_dist %>%
get_p_value(obs_stat = obs_diff, direction = "two_sided")
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0.002
This the standard workflow for performing hypothesis tests.
#Answer #7 since H_0: there is NO difference bewteen the two groups falls OUTSIDE our 95% confidence interval, we can REJECT the null hypothesis.
# number of groups
n_1<-8406
n_2<-4408
x_bar_diff<-1.78
T_score<-pt(.025,4407,lower.tail = FALSE)*2
#get sigmas of samples
sigma_1<-yrbss %>%
group_by(physical_3plus) %>%
summarise(sd = sd(weight, na.rm = TRUE))%>%
filter(physical_3plus=="yes")%>%
select(sd)%>%
as.double()
sigma_2<-yrbss %>%
group_by(physical_3plus) %>%
summarise(sd = sd(weight, na.rm = TRUE))%>%
filter(physical_3plus=="no")%>%
select(sd)%>%
as.double()
SE<-sqrt((sigma_1^2 /n_1)+(sigma_2^2/n_2))
bot<-x_bar_diff-T_score*SE
top<-x_bar_diff+T_score*SE
cat("the 95% confidence interval for comparing the differnece between the means of these two independent samples is ",bot,"to",top)
## the 95% confidence interval for comparing the differnece between the means of these two independent samples is 1.461695 to 2.098305
height
) and interpret it in context.height<-yrbss%>%
filter(!is.na(height))%>%
select(height)
n<-nrow(height)
df<-n-1
height<-height$height
x_bar<-mean(height)
sigma<-sd(height)
SE<-sigma/sqrt(n)
t_star<-qt(.025,df=df)
#confidence interval
bot<-x_bar-abs(t_star*SE)
top<-x_bar+abs(t_star*SE)
cat("the 95% confidence interval is ",bot," to ",top, "we are 95% confident that the sample mean falls within the confidence interval")
## the 95% confidence interval is 1.69481 to 1.699298 we are 95% confident that the sample mean falls within the confidence interval
t_star<-abs(qt(.05,df=df))
# the rest is the same
x_bar<-mean(height)
sigma<-sd(height)
SE<-sigma/sqrt(n)
bot_1<-x_bar-abs(t_star*SE)
top_1<-x_bar+abs(t_star*SE)
cat("the 90% confidence interval is ",bot_1," to ",top_1)
## the 90% confidence interval is 1.695171 to 1.698937
obs_diff <- yrbss %>%
specify(height ~ physical_3plus) %>%
calculate(stat = "diff in means", order = c("yes", "no"))
obs_diff <- yrbss %>%
specify(height ~ physical_3plus) %>%
calculate(stat = "diff in means", order = c("yes", "no"))
null<-0
#sd for physically active
sigma_1<-yrbss %>%
group_by(physical_3plus) %>%
summarise(sd = sd(height, na.rm = TRUE))%>%
filter(physical_3plus=="yes")%>%
select(sd)%>%
as.double()
#sd for not physically active
sigma_2<-yrbss %>%
group_by(physical_3plus) %>%
summarise(sd = sd(height, na.rm = TRUE))%>%
filter(physical_3plus=="no")%>%
select(sd)%>%
as.double()
#standard error
SE<-sqrt((sigma_1^2 /n_1)+(sigma_2^2/n_2))
T_score<-abs(qt(.025,4407))
bot<-x_bar_diff-T_score*SE
top<-x_bar_diff+T_score*SE
cat("the 95% confidence interval for the difference in mean height between physically active and non physically active is ",bot,"to",top)
## the 95% confidence interval for the difference in mean height between physically active and non physically active is 1.776257 to 1.783743
hours_tv_per_school_day
there are. #Answer 7 There are 7 options for this surveyyrbss%>%
filter(!is.na(hours_tv_per_school_day))%>%
select(hours_tv_per_school_day)%>%
unique()
## # A tibble: 7 × 1
## hours_tv_per_school_day
## <chr>
## 1 5+
## 2 2
## 3 3
## 4 do not watch
## 5 <1
## 6 4
## 7 1