STA 214 Lab 2
Complete all Questions and submit your final PDF or html (either works!) under Assignments in Canvas.
The Goal
In our last lab, we reviewed linear regression in R. Today, we are going to extend this into linear mixed effects models.
The Data
Today we will be working with a data set from airbnb, a company that focuses on renting houses and rooms for vacations. You can load the data using the following link.
<- read.csv("https://raw.githubusercontent.com/proback/BYSH/master/data/airbnb.csv") bnb
The data contain prices and other information for \(n =1561\) rentals in Chicago, Illinois. We have a client who is interested in the relationship between X = the number of bedrooms in the rental and Y = price in US dollars.
The variables we will be using are as follows:
- price: The price in US dollars of a one night rental.
- bedrooms: the number of bedrooms in the unit.
- district: the district in Chicago in which the rental is located.
EDA
We are interested in the relationship between \(X\) = the number of bedrooms and \(Y =\) the price of the rental. We know that before we begin any analysis we take a look at the data through EDA.
Question 1
Create a scatter plot to visualize the relationship between the number of bedrooms and price. Make sure to label your axes and title your plot Figure 1.
Hint: Look at Lab 1 or the R Help Site if you need the code for this!
In your scatter plot, you will notice that you seem to have a series of vertical lines of dots. This is because the number of bedrooms is a discrete numeric variable, meaning it can only take on whole number values like 1, 2, etc. When you have such a variable, you get vertical lines in your scatter plot because there are only very specific values of \(X\) that are possible.
Question 2
Based on Figure 1, does it looks like there is a relationship between the number of bedrooms and the price? Describe the relationship if so (strength, shape, direction).
Considering a Linear Regression Model
With a numeric \(X\) and a numeric \(Y\), our first thought for modeling the relationship between the number of bedrooms and the price is probably a least squares linear regression (LSLR) model.
However, before we try this, there is one more variable we need to consider: district. If there are multiple rentals per district, then we might need to consider incorporating district into the model.
Question 3
Create a table to check to how many rentals there are in each district. Which district has the most rentals? The least?
Hint: The following code will be helpful.
::kable( table( bnb$district ), col.names = c("District", "Number of Rentals")) knitr
As we can see in the table, there are multiple properties per district. This means that district is a potential grouping variable.
What we have to decide now is if we think the prices of the rentals differ by district. If they do, we need to incorporate district into our model, as district will be a source of variability in price. If it looks like price does not vary across districts, then we may not need to include district in the model.
To help us decide, we need to create a visualization that can look at the relationship between a categorical variable (district) and a numeric one (price). A good plot for such a purpose is a side-by-side boxplot.
ggplot( bnb, aes( x= district, y = price)) +
geom_boxplot( )
Question 4
Adapt the code from above this question to make the boxes in the box plots magenta and add appropriate labels and titles. Show the plot.
Question 5
With this plot, it is actually a little easier to compare the prices
if we flip the x and y axis. To do this, add +coord_flip()
to the end of the code from Question 4. Show the plot.
Question 6
Based on what you see in the side by side box plots, does it look like the average price of a rental varies from district to district?
Finding the average price per district
Another way that we can check for the group effect is by finding the average price per district. This means we need to (1) gather together all the rentals in each district and (2) find the mean rental price for each district.
One really helpful package for doing such multi-stage tasks in R is
dplyr
.
suppressMessages(library(dplyr))
This package allows us to use a tool called a pipe
(|>
) which allows us to do multi-stage tasks. Take a
look at the code below.
|>
bnb group_by(district) |>
summarise( mean(price))
Hmm. This looks confusing. To make this easier to read, I am going to
annotate the code. This means I will put in lines of
text (not code) that will explain what each line of code is doing. To
annotate in R, you just put a #
in front of anything you
want R to treat as text rather than code.
# Start with this data set AND THEN
|>
bnb # Gather together all rentals in a district AND THEN
group_by(district) |>
# Find the average price for each district
summarise( mean(price))
Based on the annotated code above, we can see that the pipe
|>
can be read as “and then”. So, this code means start
with the bnb data set (bnb
) and then (|>
)
collect all the rentals in the same district together
(group_by(district)
) and then (|<
) take the
mean of the price in each district.
Question 7
Using the code above as a template, find the average rentals prices
by type of room (variable name room_type
). Show your code
output as your answer to this question.
LME Model
Based on what we have seen so far, it looks like price of a rentals may vary with the number of bedrooms. It also looks like price may vary with district. Because there are multiple rentals per district, it makes sense to consider a linear mixed effects (LME) model for modeling the price of a rental.
Question 8
Identify (1) the treatment effect, (2) the group effect, and (3) the individual effect in our LME model.
Question 9
Write down the linear mixed effects model for looking at the relationship between number of bedroom and price, after controlling for district. You do not need to include the distributions for \(\epsilon_{ij}\) and \(u_j\) right now.
We want the population model right now, so no hats and no numbers! Remember that to write math notation, you need something like this in the white space of your Markdown file:
$$Price_{ij} = \beta_0 + ...$$
Fitting an LME model in R
Once we have chosen a model, we need to fit that model, which means estimating all the values of the parameters the model needs in order to be used.
In class, we learned that code in the lme4
package can
help us fit linear mixed effects (LME) regression models in R.
suppressMessages(library(lme4))
We then use a code structure that is a lot like what we used in the last lab to fit the model:
<- lmer( y ~ x + (1|group), data = ) m1
where
y
is the name of our y variable.x
is the name of the variable corresponding to the treatment effect.group
is the name of the variable corresponding to the grouping effect.data =
should have the name of our data set (bnb
) after the=
Question 10
Fit the LME model from Question 9 using R. Show the output as the answer to this question. However, we don’t want the raw output we get from R. We want it nicely formatted like we do for our projects. To do that, use the following:
# The top part of the output
::kable(summary(m1)$varcor) knitr
# The bottom part of the output
::kable(summary(m1)$coefficients) knitr
Using the output
Once we have built the model, we can explore the question of interest! What is the relationship between the number of bedrooms and the price, after accounting for district?
Question 11
Interpret \(\hat{\beta}_0\).
Question 12
Interpret \(\hat{\beta}_1\).
Question 13
Interpret \(\hat{\sigma}_{u}\).
Question 14
Interpret \(\hat{\sigma}_{\epsilon}\).
Question 15
Calculate and interpret the estimated intra-class correlation.
Okay, great!! We have built our model!! To finish up the lab, there is one more thing we want to try.
Simulation Study
Let’s switch away from our Chicago data set and let’s go back to the data from class.
Recall that we had data on test scores from students in different classes and with different teaching methods. We had \(n = 375\) students, 3 teaching methods, and 15 classes. The class served as a grouping variable, while the teaching method was our variable of interest.
The model that we used to reflect the sources of variability in test scores was the following LME model:
\[Score_{ij} = \beta_0 + \beta_1 SomeFlipped_i + \beta_2 FullyFlipped_i + u_j + \epsilon_{ij}\]
\[u_j \stackrel{iid}{\sim} N(0,\sigma_u)\]
\[\epsilon_{ij} \stackrel{iid}{\sim} N(0,\sigma_\epsilon)\]
We have used \(\beta\) terms plenty in STA 112, but the idea of the \(\sigma\) terms is probably pretty new. What do \(\sigma_{\epsilon}\) and \(\sigma_u\) really mean???
One way for us to explore the role of different parameters in our model is to simulate data. This means that we create data using the mathematical structure of our model. When we do this, we have the ability to change the values of certain parameters and see what happens to the patterns in the data.
To simulate data using our model, we can use the following function. You don’t need to understand how the code works (though please ask if you are interested!!) but you do need to copy and paste the following into a chunk and press play:
<- function(beta0, beta1, beta2, sigmaepsilon, sigmau){
simulate_data
set.seed(100)
# Assign students to classes
<- rep(c(1:15), each = 25)
classes
# Create the students
<- data.frame("score" = NA, "class" = classes, "style" = NA)
data
# Assign classes to teaching styles
<- sample(1:15,5, replace = FALSE)
fullflip <- sample(c(1:15)[-fullflip], 5, replace = F)
someflip <- c(1:15)[-c(fullflip,someflip)]
noflip
# Match up teaching styles and classes
$style[which(data$class %in% fullflip)] <- "full"
data$style[which(data$class %in% someflip)] <- "some"
data$style[which(data$class %in% noflip)] <- "none"
data
# Generate Test Scores: treatment effect
$score <- beta0 + beta1*(data$style=="some") + beta2*(data$style=="full")
data
# Generate test scores: group effect
<- rnorm(15, 0 , sigmau)
u $score <- data$score + u[data$class]
data
# Generate test scores: individual effect
<- rnorm(375, 0 , sigmaepsilon)
epsilon $score <- data$score + epsilon
data
# Keeping it in a possible range
$score[which(data$score > 100)] <- 100
data
$class <- as.factor(data$class)
data
# Output the data
data }
When you do this, it will seem like nothing has happened…but it has. We just taught R a new function. This function takes 5 inputs, and then simulates data based on the values we provide:
- \({\beta}_0\)
- \({\beta}_1\)
- \(\beta_2\)
- \({\sigma}_{\epsilon}\)
- \({\sigma}_u\)
Recall that we already know the estimated values of all of these parameters:
- \(\hat{\beta}_0 = 77.36\)
- \(\hat{\beta}_1 = 9.77\)
- \(\hat{\beta}_2 = 5.63\)
- \(\hat{\sigma}_{\epsilon} = 1.94\)
- \(\hat{\sigma}_u = 3.84\)
So, we can generate data from the model using these values with our new function:
<- simulate_data(beta0 = 77.36, beta1 =9.77,
data_original beta2 = 5.63, sigmaepsilon = 1.94, sigmau = 3.84)
Running this code creates a data set, data_original
,
with information on 375 simulated students. Each
student is in one of 15 simulated classes and each was
assigned a teaching method (one of three). The relationships in the data
are all based on our LME model, with the values of the parameters set to
match our sample statistics. Let’s see what the data we just built looks
like.
ggplot(data_original, aes(x = class, y = score)) +
geom_boxplot() +
facet_wrap(~style, scales='free_x') +
labs(x = "Class (1 -15)", y = "Student Scores") +
ylim(60,100)
Okay, great. We can simulate data!! Now, let’s see what happens if we change the values of some of the parameters.
Question 16
Using the same function (simulate_data
), simulate a new
data set called data_LargeSigmaU
, but this time change the
value of \(\sigma_u\) to be 10.
Create a side by side box plot of scores by classes and teaching
style (see the code above) and show the result. What do you notice is
different about these box plots versus the ones we made with
data_original
?
Question 17
Using the same function, simulate a new data set called
data_LargeSigmaEpsilon
. This time, put \(\sigma_u\) back at 3.84 but now change
\(\sigma_\epsilon\) to be 10.
Create a side by side box plot of scores by classes and teaching
style (see the code above). What do you notice is different about these
box plots versus the ones we made with data_original
?
Question 18
One of the \(\sigma\) terms (\(\sigma_u\) or \(\sigma_\epsilon\)) moves the centers of the boxes in the box plots up and down. The other \(\sigma\) term makes the spread within the boxes bigger or smaller. Which \(\sigma\) term does which based on Question 16 and Question 17?
Turning in your assignment
When your Markdown document is complete, do a final knit and make sure everything compiles. You will then submit your document on Canvas. Make sure you look through the final document to make sure everything has knit correctly.
References
The data set used in this lab is part of the data provided as accompanying data sets for the online textbook Broadening Your Statistical Horizons. The data were accessed through the book GitHub repository.
This
work was created by Nicole Dalzell and is licensed under a
Creative
Commons Attribution-NonCommercial 4.0 International License. Last
updated 2023 September 2.