STA 214 Lab 3
Complete all Questions and submit your final PDF or html (either works!) under Assignments in Canvas.
The Goal
We have been working with linear mixed effects (LME) models. Today, we are going to put together everything we have learned to perform an analysis on a data set.
The Data
Today we will be working with data from a study on teen drinking behavior. The data may be found on Canvas.
Our data contain information on 82 teenagers. Each teenager was given the same survey 3 times: once at age 14, once at age 15, and once at age 16. The goal was to monitor \(Y=\) alcohol use in these teenagers and to explore what factors might be related to higher alcohol use in teens.
The data contain information on the following variables:
- id: numerical identifier for the teenager taking the survey.
- age: the age of the teenager when they took the survey: 14, 15, or 16
- coa: 1 if the teen is a child of an alcoholic parent; 0 otherwise.
- male: 1 if the teen identifies as male; 0 otherwise.
- peer: a measure of peer alcohol use, taken when each teen was 14. This is the square root of the sum of two 6-point items about the proportion of their friends who drink occasionally or regularly.
We also have information on the response variable, \(Y\), which is recorded in the alcuse column in the data set. This variable is a numeric measure of alcohol use. Four items were used to obtain this measure:
- how often did the teen drink beer or wine?
- how often did the teen drink hard liquor?
- how often did the teen drink 5 or more drinks in a row?
- how often did the teen get drunk?
Responses to each of the 4 questions above were scored on an 8-point scale, where 0=“not at all” and 7=“every day”. Our response variable \(Y\) is the square root of the sum of these four items.
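As a small illustration of how the response is constructed, the sketch below computes the measure for one made-up teen. The item names and scores are hypothetical and are not taken from the data.

```r
# Hypothetical scores on the four items for one teen (not real data)
items <- c(beer_wine = 1, hard_liquor = 0, five_plus = 1, drunk = 2)

# The response is the square root of the sum of the four item scores
alcuse_example <- sqrt(sum(items))
alcuse_example
```

A teen who answers “not at all” to all four items would have a value of 0.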
Researcher 1
Researcher 1 comes to you and asks you to use linear regression to answer the following research question:
- What is the relationship between the alcohol use of their peers (X = peer) and the alcohol use of the teen (Y = alcuse)?
They ask you to build a linear regression model using only these two variables, \(X\) and \(Y\).
Question 1
Build the model requested by Researcher 1 and write out the LSLR line using appropriate notation.
Question 2
Interpret the slope of the LSLR line in the context of the data to respond to the research question posed by Researcher 1.
Researcher 2
Researcher 2 is interested in the same research question as Researcher 1, but they believe that it is important to control for (1) whether the teen identifies as male, (2) whether or not a parent of the teenager is an alcoholic, and (3) the age of the teenager when looking at the relationship between peer drinking and \(Y\) = the drinking habits of a teen (alcuse).
Researcher 2 knows that male and coa are binary variables, but they are uncertain about how to treat age in the model. Age is a discrete numeric variable, but should they include it in the model as a numeric variable or a categorical variable?
Question 3
Treating age as a numeric variable, build a regression model to explore the claims of Researcher 2. Write down the fitted model using appropriate notation.
Question 4
Treating age as a categorical variable, build a regression model to explore the claims of Researcher 2. Write down the fitted model using appropriate notation.
Hint: To convert a variable in a model to categorical, use as.factor(nameOfVariable), where nameOfVariable is replaced with the variable you want to convert.
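To see what the hint does inside a model formula, here is a minimal sketch on made-up data. The data frame toy and its columns wave, x, and y are hypothetical and stand in for whatever variables you are working with.

```r
# Toy data, made up purely to illustrate the hint (not the lab data)
set.seed(1)
toy <- data.frame(
  wave = rep(c(1, 2, 3), times = 20),
  x    = rnorm(60)
)
toy$y <- 0.5 * toy$x + 0.3 * toy$wave + rnorm(60)

# wave treated as numeric: a single slope for wave
fit_numeric <- lm(y ~ x + wave, data = toy)

# wave treated as categorical: one indicator per non-baseline level
fit_factor <- lm(y ~ x + as.factor(wave), data = toy)

# Compare which coefficients each model estimates
names(coef(fit_numeric))
names(coef(fit_factor))
```

The numeric version estimates one coefficient for wave, while the categorical version estimates a separate coefficient for each level other than the baseline.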
Question 5
Which of the two models (your model in Question 3 or Question 4) is a better fit to the data? Justify your choice. The model that you choose will be the model we proceed with for Researcher 2’s research question.
Now that Researcher 2 has a model and Researcher 1 has a model, they ask you which of the two models you would recommend that they use to answer their research question.
Question 6
Would you recommend using the model suggested by Researcher 1 (in Question 1) or the model suggested by Researcher 2 (the model in your Question 5)? Clearly explain your reasoning.
Question 7
Based on the model that you have chosen, build and interpret a 95% confidence interval for the relationship between peer drinking and teen drinking habits.
Researcher 3
A third researcher enters this discussion, and they believe both Researcher 1 and Researcher 2 have failed to notice something important - this is a repeated measures study.
Question 8
What is a repeated measures study, and why is it important to notice that this study is a repeated measures study?
Question 9
Adapt the model you chose in Question 6 to reflect the fact that this is a repeated measures study. Write down the model. Be very clear about what each piece of notation in the model represents.
Question 10
Fit the model you suggested in Question 9. Interpret both standard deviation estimates for your model.
Hint: Remember you need to run library(lme4) before building an LME model in R. Remember also that you use summary(model) to see the output after you build a model in R.
Question 11
After accounting for age, peer alcohol use, identifying as male, and having an alcoholic parent, what percent of the variability in teen alcohol use in these data is related to traits of each teenager?
We now have three Researchers and three different models!
Question 12
The three Researchers ask you to clearly explain which of the three models you would recommend that they use to answer the research question of interest. They ask you to both explain your reasoning in words and provide a numeric measure of some kind (a number) to support your choice of model.
Inference with the model of Researcher 3
Researcher 3 has further questions about their model. They know that they will need to report a confidence interval rather than just a sample statistic as a result of their study. Their data is only a sample, and they want to be able to draw conclusions about teenage drinking as a whole. The model we used for Researcher 3 is an LME model. To build confidence intervals with an LME model, we often use a technique called bootstrapping. Let’s explore this technique to wrap up our lab.
Confidence Intervals: The Big Idea
When we use confidence intervals in LME models, it is generally because we have information about relationships in a sample (\(\hat{\beta}_1\)) and we want to be able to make conclusions about relationships in a population (\(\beta_1\)).
The value of a population parameter like \(\beta_1\) is likely different from the value of the sample statistic \(\hat{\beta}_1\), because a sample statistic is built using only the information from a sample, while a population parameter is built using all the data in the population.
This means that we cannot just report \(\beta_1 = \hat{\beta}_1\), because there is a good chance that this is not true! Instead, we want to report a confidence interval, which is a range of plausible values for the population parameter.
Ideally, we build a confidence interval by taking a lot of samples from the population, building our model on each sample, and looking at the sample statistics. If we take 100 samples from the population, and all the \(\hat{\beta}_1\) values are between 1 and 2, that gives us a pretty good idea that \(\beta_1\) is also between 1 and 2!
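To make this idea concrete, here is a minimal simulation sketch. It assumes a made-up population with a true slope of 1.5 and uses a simple linear regression rather than an LME model, so it illustrates the sampling idea only.

```r
# A made-up "population" in which the true slope is 1.5 (an assumption)
set.seed(214)
N <- 10000
population <- data.frame(x = rnorm(N))
population$y <- 2 + 1.5 * population$x + rnorm(N)

# Draw 100 samples of size 50 and record the estimated slope from each
slopes <- replicate(100, {
  rows <- sample(1:N, 50)
  coef(lm(y ~ x, data = population[rows, ]))[["x"]]
})

# The estimates vary from sample to sample but cluster around the true slope
range(slopes)
```

Each sample gives a slightly different estimate, and the spread of those estimates suggests a range of plausible values for the population slope.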
However, we can’t typically reach into the population and get new samples. One solution to this problem is what is called bootstrapping.
Question 13
Remind me - what is a bootstrap sample?
To create a bootstrap sample of a data set, we have to sample with replacement from the rows in the data set. For example, suppose I create a data set using the 1st 5 rows in the alcohol data set:
dataSmall <- alcohol[1:5, ]
This means the data set contains rows 1 - 5 from the original alcohol data set. A bootstrap sample of this small data set will have the same number of rows that we started with (5), but the rows in the data sets do not have to be the same. For example:
Original Data
- Rows 1, 2, 3, 4, 5
Bootstrap Sample
- Rows 2, 2, 4, 1, 4
To create a bootstrap sample using these rows, we run the code:
bootSample <- dataSmall[ c(2, 2, 4, 1, 4), ]
bootSample
If you run the code, you will notice a new data set called bootSample has appeared in your Environment. It has the same dimensions as dataSmall, but it has different rows.
Now that we know what a bootstrap sample looks like, let’s create a bootstrap sample of the entire data set. To do that, we use the following code. It is not important that you know how this code works, but please ask if you are curious!
# Makes sure everyone gets the same bootstrap sample
set.seed(271)
# Choose which rows are in the bootstrap sample
rowsChosen <- sample(1:246, 246, replace = TRUE)
# Select those rows to create the bootstrap sample
bootSample1 <- alcohol[ rowsChosen, ]
Once we have built our bootstrap sample, the next step is to build our LME model using these data and see what estimate we get for our coefficient of interest (in this case, the relationship between peer drinking and teen drinking).
Question 14
Build Researcher 3’s model using the bootstrap sample (bootSample1) created by running the code chunk above. Is the value for \(\hat{\beta}_1\) (the coefficient for peer drinking) larger or smaller than what you got on the original data set?
Okay, so now we see how the sample statistic could change with just one sample. However, we need a lot of samples to build a confidence interval! To create \(B = 100\) bootstrap estimates of our coefficient of interest, we can use the following code:
# Compute n, the number of rows
n <- nrow(alcohol)
# Set the Seed
set.seed(214)
B <- 100
betas <- data.frame("beta1" = rep(0,B))
for(i in 1:B){
  # Tell R to sample those rows with replacement
  bootSample <- alcohol[ sample(1:n, n, replace = TRUE), ]
  # Build Researcher 3's model
  model_boot <- lmer( alcuse ~ peer + coa + male + age + (1|id), data = bootSample)
  # Store the output
  betas$beta1[i] <- summary(model_boot)$coefficients[2,1]
}
summary(betas)
Question 15
The middle 50% of the estimates \(\hat{\beta}_1\) (peer drinking) are in what range? In other words, what is a 50% bootstrap confidence interval for \(\beta_1\)?
If we want a 95% confidence interval, we want the middle 95% of the \(\hat{\beta}_1\) estimates. To obtain this, we can use the following code:
# Find the lower bound
quantile(betas$beta1, .025)
# Find the upper bound
quantile(betas$beta1, .975 )
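If you would like to try the whole process without the lab data, the sketch below runs the same steps on a simulated toy sample. It assumes a simple linear regression in place of the LME model, and the data frame toy, the true slope of 0.8, and the choice of 200 bootstrap samples are all made up for illustration.

```r
# Toy sample, simulated for illustration only (not the lab data)
set.seed(214)
toy <- data.frame(x = rnorm(80))
toy$y <- 1 + 0.8 * toy$x + rnorm(80)

n <- nrow(toy)
B <- 200
slopes <- numeric(B)

for (i in 1:B) {
  # Resample rows with replacement and refit the model
  boot_rows <- sample(1:n, n, replace = TRUE)
  slopes[i] <- coef(lm(y ~ x, data = toy[boot_rows, ]))[["x"]]
}

# Both bounds of the 95% percentile interval in a single call
ci <- quantile(slopes, c(0.025, 0.975))
ci
```

Passing both probabilities to quantile() at once returns the lower and upper bounds together, which is equivalent to the two separate calls above.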
Question 16
Interpret the 95% bootstrap confidence interval for \(\beta_1\).
Turning in your assignment
When your Markdown document is complete, do a final knit and make sure everything compiles. You will then submit your document on Canvas. Make sure you look through the final document to make sure everything has knit correctly. Only html or PDF files will be accepted.
References
The data set used in this lab is part of the data provided as accompanying data sets for the online textbook Broadening Your Statistical Horizons. The data were accessed through the book GitHub repository.
This work was created by Nicole Dalzell and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2023 September 18.