Inference for numerical data Getting Started Load packages In this lab, we will explore and visualize the data using the tidyverse suite of packages, and perform statistical inference using infer. The data can be found in the companion package for OpenIntro resources, openintro.

Let’s load the packages.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(infer)
library(statsr)
## Loading required package: BayesFactor
## Loading required package: coda
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## ************
## Welcome to BayesFactor 0.9.12-4.4. If you have questions, please contact Richard Morey (richarddmorey@gmail.com).
## 
## Type BFManual() to open the manual.
## ************
## 
## Attaching package: 'statsr'
## 
## The following object is masked from 'package:infer':
## 
##     rep_sample_n
## 
## The following objects are masked from 'package:openintro':
## 
##     calc_streak, evals, nycflights, present
library(psych)
## 
## Attaching package: 'psych'
## 
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(cherryblossom)
library(usdata)
library(BayesFactor)
library(coda)
library(Matrix)
library(ggplot2)

The data Every two years, the Centers for Disease Control and Prevention conduct the Youth Risk Behavior Surveillance System (YRBSS) survey, where it takes data from high schoolers (9th through 12th grade), to analyze health patterns. You will work with a selected group of variables from a random sample of observations during one of the years the YRBSS was conducted.

Load the yrbss data set into your workspace.

data('yrbss', package='openintro')

There are observations on 13 different variables, some categorical and some numerical. The meaning of each variable can be found by bringing up the help file:

?yrbss
## starting httpd help server ... done
  1. What are the cases in this data set? How many cases are there in our sample?
ls(yrbss)
##  [1] "age"                      "gender"                  
##  [3] "grade"                    "height"                  
##  [5] "helmet_12m"               "hispanic"                
##  [7] "hours_tv_per_school_day"  "physically_active_7d"    
##  [9] "race"                     "school_night_hours_sleep"
## [11] "strength_training_7d"     "text_while_driving_30d"  
## [13] "weight"

The cases in this set are listed via ls(yrbss). There are 13,583 cases in this dataset

Remember that you can answer this question by viewing the data in the data viewer or by using the following command:

glimpse(yrbss)
## Rows: 13,583
## Columns: 13
## $ age                      <int> 14, 14, 15, 15, 15, 15, 15, 14, 15, 15, 15, 1…
## $ gender                   <chr> "female", "female", "female", "female", "fema…
## $ grade                    <chr> "9", "9", "9", "9", "9", "9", "9", "9", "9", …
## $ hispanic                 <chr> "not", "not", "hispanic", "not", "not", "not"…
## $ race                     <chr> "Black or African American", "Black or Africa…
## $ height                   <dbl> NA, NA, 1.73, 1.60, 1.50, 1.57, 1.65, 1.88, 1…
## $ weight                   <dbl> NA, NA, 84.37, 55.79, 46.72, 67.13, 131.54, 7…
## $ helmet_12m               <chr> "never", "never", "never", "never", "did not …
## $ text_while_driving_30d   <chr> "0", NA, "30", "0", "did not drive", "did not…
## $ physically_active_7d     <int> 4, 2, 7, 0, 2, 1, 4, 4, 5, 0, 0, 0, 4, 7, 7, …
## $ hours_tv_per_school_day  <chr> "5+", "5+", "5+", "2", "3", "5+", "5+", "5+",…
## $ strength_training_7d     <int> 0, 0, 0, 0, 1, 0, 2, 0, 3, 0, 3, 0, 0, 7, 7, …
## $ school_night_hours_sleep <chr> "8", "6", "<5", "6", "9", "8", "9", "6", "<5"…

There at 13,583 rows which is also the number of cases in this sample. There are 13 cases in this data set

Exploratory data analysis You will first start with analyzing the weight of the participants in kilograms: weight.

Using visualization and summary statistics, describe the distribution of weights. The summary function can be useful.

summary(yrbss$weight)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   29.94   56.25   64.41   67.91   76.20  180.99    1004

2.How many observations are we missing weights from?

sum(is.na(yrbss))
## [1] 9476

There are 9,476 missing values.

Under the observations of weights we have 1004 missing values.

sum(is.na(yrbss$weight))
## [1] 1004
summary(yrbss[,7])
##      weight      
##  Min.   : 29.94  
##  1st Qu.: 56.25  
##  Median : 64.41  
##  Mean   : 67.91  
##  3rd Qu.: 76.20  
##  Max.   :180.99  
##  NA's   :1004

Next, consider the possible relationship between a high schooler’s weight and their physical activity. Plotting the data is a useful first step because it helps us quickly visualize trends, identify strong associations, and develop research questions.

First, let’s create a new variable physical_3plus, which will be coded as either “yes” if they are physically active for at least 3 days a week, and “no” if not.

yrbss <- yrbss %>% 
  mutate(physical_3plus = ifelse(yrbss$physically_active_7d > 2, "yes", "no"))

3.Make a side-by-side boxplot of physical_3plus and weight. Is there a relationship between these two variables? What did you expect and why?

yrbss2 <- yrbss %>% 
  mutate(physical_3plus = ifelse(yrbss$physically_active_7d > 2, "yes", "no")) %>%
  na.exclude()
ggplot(yrbss2, aes(x=weight, y=physical_3plus)) + geom_boxplot() + theme_bw()

The box plots show how the medians of the two distributions compare, but we can also compare the means of the distributions using the following to first group the data by the physical_3plus variable, and then calculate the mean weight in these groups using the mean function while ignoring missing values by setting the na.rm argument to TRUE.

yrbss %>%
  group_by(physical_3plus) %>%
  summarise(mean_weight = mean(weight, na.rm = TRUE))
## # A tibble: 3 × 2
##   physical_3plus mean_weight
##   <chr>                <dbl>
## 1 no                    66.7
## 2 yes                   68.4
## 3 <NA>                  69.9
yrbss <- yrbss %>% 
  mutate(physical_3plus = ifelse(yrbss$physically_active_7d > 2, "yes", "no"))

yr_plot <- yrbss %>%
  filter(!is.na(physical_3plus),!is.na(weight))

ggplot(yr_plot,aes(x=physical_3plus, y=weight)) +
  geom_boxplot()

4.Are all conditions necessary for inference satisfied? Comment on each. You can compute the group sizes with the summarize command above by defining a new variable with the definition n().

yrbss %>%
  group_by(physical_3plus) %>%
  summarise(mean_weight = mean(weight, na.rm = TRUE))
## # A tibble: 3 × 2
##   physical_3plus mean_weight
##   <chr>                <dbl>
## 1 no                    66.7
## 2 yes                   68.4
## 3 <NA>                  69.9
yrbss %>%
  group_by(physical_3plus) %>%
  count(physical_3plus)
## # A tibble: 3 × 2
## # Groups:   physical_3plus [3]
##   physical_3plus     n
##   <chr>          <int>
## 1 no              4404
## 2 yes             8906
## 3 <NA>             273

5.Write the hypotheses for testing if the average weights are different for those who exercise at least times a week and those who don’t.

The null hypothesis is that there is no difference in average weights between people who exercise at least three times per week and people who don’t. The other theory is that people who exercise regularly and those who don’t have differing average weights.

6.How many of these null permutations have a difference of at least obs_stat?

The typical procedure for doing hypothesis tests looks like this.

7.Construct and record a confidence interval for the difference between the weights of those who exercise at least three times a week and those who don’t, and interpret this interval in context of the data.

yrbss %>% 
  group_by(physical_3plus) %>% 
  summarise(sd_weight = sd(weight, na.rm = TRUE))
## # A tibble: 3 × 2
##   physical_3plus sd_weight
##   <chr>              <dbl>
## 1 no                  17.6
## 2 yes                 16.5
## 3 <NA>                17.6

The sd is 17.638 for those who do are not physically active at least 3 days per week and 16.478 for those who are.

#means of the weights
yrbss %>% 
  group_by(physical_3plus) %>% 
  summarise(mean_weight = mean(weight, na.rm = TRUE))
## # A tibble: 3 × 2
##   physical_3plus mean_weight
##   <chr>                <dbl>
## 1 no                    66.7
## 2 yes                   68.4
## 3 <NA>                  69.9
#sample size for each category
yrbss %>% 
  group_by(physical_3plus) %>% 
  summarise(freq = table(weight)) %>%
  summarise(n = sum(freq))
## `summarise()` has grouped output by 'physical_3plus'. You can override using
## the `.groups` argument.
## # A tibble: 3 × 2
##   physical_3plus     n
##   <chr>          <int>
## 1 no              4022
## 2 yes             8342
## 3 <NA>             215
xnot3 <- 66.67389
nnot3 <- 4022
snot3 <- 17.63805
x3 <- 68.44847
n3 <- 8342
s3 <- 16.47832

z = 1.96

uci_not <- xnot3 + z*(snot3/sqrt(nnot3))
lci_not <- xnot3 - z*(snot3/sqrt(nnot3))
uci_not
## [1] 67.219
lci_not
## [1] 66.12878
u_ci <- x3 + z*(s3/sqrt(n3))
l_ci <- x3 - z*(s3/sqrt(n3))

u_ci
## [1] 68.80209
l_ci
## [1] 68.09485

With 95% confident that students who exercise at least three times a week have an average weight between 68.09 kg and 68.8 kg. Also those students who do not exercise at least three times a week have an average weight between 66.13 kg and 67.22 kg with 95% confident.

More Practice

8.Calculate a 95% confidence interval for the average height in meters (height) and interpret it in context.

tb <- as.data.frame(table(yrbss$height))
freq <- sum(tb$Freq)

x_h <- mean(yrbss$height, na.rm = TRUE)
sd_h <- sd(yrbss$height, na.rm = TRUE)
n_h <- yrbss %>% 
  summarise(freq = table(height)) %>%
  summarise(n = sum(freq, na.rm = TRUE))

u_h <- x_h + z*(sd_h/sqrt(n_h))
l_h <- x_h - z*(sd_h/sqrt(n_h))
u_h
##          n
## 1 1.693071
l_h
##          n
## 1 1.689411

With 95% confident that the average height of the students in this population is between 1.689m and 1.693m.

9.Calculate a new confidence interval for the same parameter at the 90% confidence level. Comment on the width of this interval versus the one obtained in the previous exercise.

t1 <- 1.645
uh1 <- x_h + t1*(sd_h/sqrt(n_h))
lh1 <- x_h - t1*(sd_h/sqrt(n_h))
uh1
##          n
## 1 1.692777
lh1
##          n
## 1 1.689705
r_95 <- (u_h - l_h)
r_90 <- (uh1 - lh1)
r_95
##             n
## 1 0.003659302
r_90
##           n
## 1 0.0030712

Also the 95% confidence interval has a slightly larger range than the confidence interval 90%.

10.Conduct a hypothesis test evaluating whether the average height is different for those who exercise at least three times a week and those who don’t.

yrbss1 <- yrbss %>% 
  mutate(physical_3plus = ifelse(yrbss$physically_active_7d > 3, "yes", "no"))

yes_avg <- yrbss1 %>%
  filter(!is.na(height)) %>%
  filter(physical_3plus == "yes") %>%
  summarise(averg = mean(weight))

no_avg <- yrbss1 %>%
  filter(!is.na(height)) %>%
  filter(physical_3plus == "no") %>%
  summarise(averg = mean(weight))

yes_sd <- yrbss1 %>%
  filter(!is.na(height)) %>%
  filter(physical_3plus == "yes") %>%
  summarise(sd = sd(weight))

no_sd <- yrbss1 %>%
  filter(!is.na(height)) %>%
  filter(physical_3plus == "no") %>%
  summarise(sd = sd(weight))
## Calculate the T statistic: and SE

yes_size <- yrbss1 %>%
  filter(!is.na(height)) %>%
  filter(physical_3plus == "yes") %>%
  count()

no_size <- yrbss1 %>%
  filter(!is.na(height)) %>%
  filter(physical_3plus == "no") %>%
  count()

stderr <- ((yes_sd)^2/(yes_size) + (no_sd)^2/(no_size)) ^ (1/2)

t <- ((yes_avg-no_avg)-0)/(stderr) 

print("The test statistics is: 4.548727")
## [1] "The test statistics is: 4.548727"

11.Now, a non-inference task: Determine the number of different options there are in the dataset for the hours_tv_per_school_day there are.

yrbss %>%group_by(hours_tv_per_school_day)%>% summarise(n())
## # A tibble: 8 × 2
##   hours_tv_per_school_day `n()`
##   <chr>                   <int>
## 1 <1                       2168
## 2 1                        1750
## 3 2                        2705
## 4 3                        2139
## 5 4                        1048
## 6 5+                       1595
## 7 do not watch             1840
## 8 <NA>                      338

There are 7 different options, plus NA

12.Come up with a research question evaluating the relationship between height or weight and sleep. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval. Report the statistical results, and also provide an explanation in plain language. Be sure to check all assumptions, state your α level, and conclude in context.

Q: Is there evidence that students who are shorter than the mean height sleep less than students who are taller than the mean height?

HO: There is no relationship between the mean height and sleep bewteen students.

95% confident level

Conditions:

-Independent sample-yes - Normality - yes

results - The P-value of > 0.05 the null hypothesis must be rejected but it could lead to Type 1 error.

yrbss <- yrbss %>%
  mutate(sleep_less = ifelse(yrbss$school_night_hours_sleep < 6, "yes", "no"))
 
height_less <- yrbss %>% 
  select(height, sleep_less) %>% 
  filter(sleep_less == "no") %>%
  na.omit()

height_more <- yrbss %>% 
  select(height, sleep_less) %>% 
  filter(sleep_less == "yes") %>%
  na.omit()
boxplot(height_less$height, height_more$height,
        names = c("less_sleep", "more_sleep"))

mn <- mean(height_less$height)
sd <- sd(height_less$height)
max <- max(height_less$height)

mn1 <- mean(height_more$height)
sd2 <- sd(height_more$height)
max2 <- max(height_more$height)
meandiff <- mn1 - mn
sd <- 
  sqrt(
  ((mn1^2) / nrow(height_more)) +
  ((mn^2) / nrow(height_less))
  )
df <- 2492-1
t <- qt(.05/2, df, lower.tail = FALSE)

u<- meandiff + t * sd
l<- meandiff - t * sd
u
## [1] 0.06780798
l
## [1] -0.08195098
pv <- 2*pt(t,df, lower.tail = FALSE)
pv
## [1] 0.05

he null hypothes can be rejected because p-value is 0.05