STA 214 Lab 3

Complete all Questions and submit your final PDF or html (either works!) under Assignments in Canvas.

The Goal

We have been working on linear mixed effects models. This includes models with random intercepts and models with random slopes. Today, we are going to apply these concepts on a data set about vacation rentals.

The Data and EDA

Our data for today comes from airbnb, a company that focuses on renting houses and rooms for vacations. You can load the data using the following link.

# Load the data 
bnb <- read.csv("https://raw.githubusercontent.com/proback/BYSH/master/data/airbnb.csv")

# Keep only the columns we want for today
bnb <- bnb[, c(2,5,7,10)]

The data contain prices and other information for \(n =1561\) rentals in Chicago, Illinois. Our goal for today is to model Y = the price in US dollars of an overnight stay in a rental.

Exploring the response variable

Suppose we are considering the following model for price:

\[Y_i \stackrel{iid}{\sim} N(\mu,\sigma)\] This models assumes that the distribution of price is unimodal and symmetric.

Question 1

Create a histogram of price. Does this distribution have a unimodal, symmetric shape?

Hint: As always, your graph must have appropriate axis labels.

When we find ourselves dealing with a skewed distribution, but we would like to use a normal model, we generally consider taking the natural log of the variable.

Question 2

Make a histogram of the natural log of price. Use appropriate labels. Does this distribution have a unimodal, symmetric shape?

Based on what we seen explored so far, we will consider the model

\[log(Y_i) \stackrel{iid}{\sim} N(\mu,\sigma)\]

In this model, \(\mu\) represents the average log price of one night in an airbnb rental in Chicago. \(\sigma\) represents the standard deviation of log rental prices, or how much we expect the log price of an individual rental would differ from \(\mu\).

However, one of the “i”s in iid stands for independent. This suggests that the prices of each of the rentals can be treated as independent of one another. In other words, we essentially assume the rentals are a random sample from across Chicago and the same \(\mu\) and \(\sigma\) are a good description for all rentals.

If we look at our data set, however, we will notice something about the structure. The city of Chicago is divided into several districts, and district is a variable in a our data set. If we have multiple rentals per district in our data set, it is then not appropriate to consider each rental as independent.

Question 3

Create a table to check to how many rentals in our data set are from each district. Make sure to professionally format your table as part of your answer to this question. Which district has the most rentals? The least?

Hint: The following code will be helpful.

knitr::kable( table(  ), col.names = c("District", "Number of Rentals"))

As we can see in the table, there are multiple rentals per district! This means that we have a grouping variable and we will incorporate this into the model using a random effect.

Model 1: Price with Random Intercept

If we choose to include a random effect for district, this means we are breaking up the log price of a rental \(i\) into three components:

\[\underbrace{\beta_0}_{\text{Overall Average}} + \underbrace{u_j}_{\text{District Component}} + \underbrace{\epsilon_{ij}}_{\text{Individual Component}} \]

Essentially, this is assumes that:

\(\beta_0\): There is some overall average log price of a rental in Chicago.
\(u_j\): However, the average log price in a district \(j\) may be differ from the overall average log price in Chicago.
\(\epsilon_{ij}\): Furthermore, the log price of a specific rental in a district may differ from the district average.

We can also think of this as:

\[\underbrace{\beta_0 + u_j}_{\text{District Average}} + \underbrace{\epsilon_{ij}}_{\text{Individual Component}} \] Let’s try an example.

Question 4

Suppose we are told that the average log price of a rental in district 2 is about .22 higher than is typical in Chicago. Which parameter does this tell us about: \(\beta_0, u_j, \epsilon_{ij}\)?

Question 5

Suppose we are told that the overall average log price of a rental in Chicago is 4.6. Which parameter does this tell us about: \(\beta_0, u_j, \epsilon_{ij}\)?

Question 6

Suppose we are told that in District 2, a particular rental (rental 4) is high end and the log price is .14 a night more than is typical in District 2. Which parameter does this tell us about: \(\beta_0, u_j, \epsilon_{ij}\)?

Question 7

Based on the information in Questions 4 - 6, (a) what is the log average price in District 2 and (b) what is the actual log price AND actual price (not on the log scale) of rental 4 from Question 6?

EDA

When we are considering adding random effects, it is helpful to explore how the variable we are modeling differs from group to group. To see how the log prices changes from district to district, we need to create a visualization that can look at the relationship between a categorical variable (district) and a numeric one (log price). A good plot for such a purpose is a side-by-side boxplot.

ggplot( bnb, aes( x= district, y = log(price))) + 
  geom_boxplot( )

Question 8

Adapt the code from above this question to fill in the boxes in the box plots with the color magenta and add appropriate labels and titles. Show the plot.

Question 9

With this plot, it is actually a little easier to compare the prices if we flip the x and y axis. To do this, add +coord_flip() to the end of the code from Question 8. Show the plot.

These plots indicate that log prices vary from district to district. However, based on our model we will be modeling the average log price per district. Box plots show medians, not means.

To compute the district specific average log price, we want to (1) gather together all the rentals in each district and (2) find the mean log rental price for each district. One really helpful package for doing such multi-stage tasks in R is dplyr.

library(dplyr)

This package allows us to use a tool called a pipe (|>) which allows us to do multi-stage tasks. Take a look at the code below.

bnb |>
    group_by(district) |>
    summarise( mean( log(price) )) |>
    knitr::kable()

Hmm. This looks confusing. To make this easier to read, I am going to annotate the code. This means I will put in lines of text (not code) that will explain what each line of code is doing. To annotate in R, you just put a # in front of anything you want R to treat as text rather than code.

# Start with this data set AND THEN
bnb |>
  # Gather together all rentals in a district AND THEN 
  group_by(district) |>
  # Find the average log price for each district AND THEN
  summarise( mean(log(price) ) ) |> 
  # Format the output
  knitr::kable(col.names=c("District", "Avg. Log Price"))

District	Avg. Log Price
Central	5.122984
Far North	4.384163
Far Southeast	3.571292
Far Southwest	4.026767
North	4.512697
Northwest	4.352124
South	4.261810
Southwest	3.958855
West	4.591106

Based on the annotated code above, we can see that the pipe |> can be read as “and then”. So, this code means start with the bnb data set (bnb) and then (|>) collect all the rentals in the same district together (group_by(district)) and then (|<) take the mean of the price in each district (summarise( mean(log(price) ) )).

Question 10

Using the code above as a template, find the average log rental prices grouped by type of room (variable name room_type) instead of district. Show your code output as your answer to this question.

Fitting the LME Model

Based on the work we have done so far, the first model we will consider for price is:

\[log(Price_{ij}) {\sim} N( \mu_j , \sigma_\epsilon),\]

\[\mu_j \stackrel{iid}{\sim} N( 0 , \sigma_u)\]

\[\epsilon_{ij} \stackrel{iid}{\sim} N( 0 , \sigma_\epsilon)\]

where \(i = 1, \dots, 1561\) denotes the rental and \(j = 1, \dots, 9\) denotes the district.

To fit this model in R, we need the lme4 library.

suppressMessages(library(lme4))

We can then use the following code::

LME1 <- lmer(  log(y) ~  (1|group), data = )

where

y is the name of our y variable.
group is the name of the variable corresponding to the random effect.
data = should have the name of our data set after the =

Question 11

Adapt the code above to fit our LME model in R. Store your fitted model output under the name LME1. State the values of \(\hat{\beta}_0\), \(\hat{\sigma}_u\), and \(\hat{\sigma}_\epsilon\) as the answer to this question.

Question 12

What is the predicted average log price of a rental in Chicago?

Question 13

How much do we expect the average log price in each district to differ from that overall average?

Question 14

How much do we expect the actual log price to differ from the district average?

Question 15

What percent of the variability in log price is being captured with the random effect for district?

Considering \(u_j\)

Suppose we want to see the different values of \(\hat{\beta}_0 + \hat{u}_j\) that we have estimated in our model. We have assumed that average log price differs from district to district, so we might be interested to see how much those district specific averages differ from the overall average.

To find that information, we use the code:

coef(LME1)$district

This output provides the log average price in each district. Note that district is our grouping variable; if you have a data set with a different grouping variable, you need to change district to match that new grouping variable.

Question 16

Based on our fitted model, which district has the highest predicted average price? The lowest predicted average price?

Question 17

The average log price for the Northwest district is predicted to be how much higher than the overall average log price of a rental in Chicago?

Hint: The value you see in the code output is \(\hat{\beta}_0 + \hat{u}_6\)

Model 2 EDA: Bedrooms vs. Price

Okay, so we are able to model the log price and account for district. However, what if we want to include an explanatory variable into our model along with our random effect?

To explore this, we are going to change our goal. We now want to model the relationship between \(X\) = the number of bedrooms and the \(Y =\) the log price of a rental.

As both of these variables are numeric, a logical first place to start is to create a scatter plot.

Question 18

Create a scatter plot to visualize the relationship between the number of bedrooms and log price. Make sure to label your axes and title your plot.

Hint: Look at Lab 1 or the R Help Site if you need the code for this!

In your scatter plot, you will notice that you seem to have a series of vertical lines of dots. This is because the number of bedrooms is a discrete numeric variable, meaning it can only take on whole number values like 1, 2, etc. When you have such a variable, you get vertical lines in your scatter plot because there are only very specific values of \(X\) that are possible.

We already know that we need a random intercept for district. The next step is to determine how we handle the slope for the number of bedrooms. Do we need a random slope, or do we only need a fixed effect?

Option 1: No Random Slope

\[Y_{ij} = \beta_0 + u_j + \beta_1 X_{ij} + \epsilon_{ij}\]

Option 2: With a Random Slope

\[Y_{ij} = \beta_0 + u_j + (\beta_1 + b_j) X_{ij} + \epsilon_{ij}\]

Let’s start off by exploring option 1.

One note before we fit the models. In class, we discussed the need to center our X variable before building an LME model when \(X=0\) is not a meaningful value of \(X\). However, for these data, \(X=0\) means that we are dealing with a studio apartment, so there is actually no bedroom in the space that is being rented. That means we do not need to center the \(X\) before we begin modeling.

Option 1 in R:

Question 19

Fit the Option 1 LME model (no random slope, but yes random intercept) and call the model LME2. Show the output as the answer to this question. However, we don’t want the raw output we get from R. We want it nicely formatted like we do for our projects. To do that, use the following:

# The top part of the output
knitr::kable(summary(LME2)$varcor)

# The bottom part of the output
knitr::kable(summary(LME2)$coefficients)

To visualize the fitted model, we can use the following:

# Using x = bedrooms and y = log price
g1<- ggplot(bnb, aes(x = bedrooms, y = log(price), color = district)) +
  # Make a scatter plot 
  geom_point() +
  # Add labels 
  labs(y= "Bedrooms", x = "Log Price",
  title = "No Random Slope") +
  # Make the background of the plot clear
  theme_light() +
  # Remove the legend
  theme(legend.position = "none") +
  # Add in the lines 
  geom_abline(slope = .42 , intercept = coef(LME2)$district[,1])
  
# Show the plot 
g1

Question 20

How many lines are on the graph? Hint: You should not have to count.

Option 2 in R

Right now, our model assumes that as the number of bedrooms increases by 1, the change in the log price is the same for every neighborhood. However, what if we think the change in log price if we increase the number of bedrooms might also change with district? In this case, we consider adding a random slope.

Question 21

Fit the model with the random slope, and call the model LME3. Unlike Question 19, we don’t need to format the top part of the output, as there is an extra column that won’t format nicely.

There is one part of the output from the LME model with a random slope that we have discussed yet. In the top part of the LME output, you will see a column called Corr, which stands for correlation. Specifically, this column tells us the correlation between our random intercept for district and the random slope for district.

What does this mean?? Well, a negative correlation means districts with a lower district-average tend to have steeper slopes than districts with higher district-averages. In other words, if you are in a district with lower rents, the number of bedrooms makes more of a difference in the rent.

The magnitude of hte correlation indicates the strength of this relationship. The closer to 1 or -1, the stronger the correlation.

Question 22

Suppose the correlation between the random intercept and random slope were positive .85. What types of districts would have steeper slopes?

We can visualize the output of this model by:

g2 <- ggplot(bnb, aes(x = bedrooms, y = log(price), color = district))+
     # Make a scatter plot 
     geom_point() +
     # Add labels 
     labs(y= "Bedrooms", x = "Log Price", title = "Random Slope") +
     # Make the background of the plot clear
     theme_light() +
     # Remove the legend
     theme(legend.position = "none") +
     # Add in the lines 
     geom_abline(slope = coef(LME3)$district[,2] , 
     intercept = coef(LME3)$district[,1])
     
# Show the plot 
g2

Comparing the two options

At the moment, we have two different possibilities: a model with a random slope and a model with no random slope. Which should we choose??

We can look at plots of the two fitted models side by side using the following:

gridExtra::grid.arrange(g1,g2,ncol = 2)

Question 23

Do the two plots look: (a) very different, (b) a little different, (c) exactly the same?

If the plots look very different, this means that by including a random slope, we change the fitted model quite a bit. If the two plots look similar or only a little different, this means choosing to include a random slope results in a fitted model that is very similar to a model with no random slope.

In the case of little change, a random slope may not be needed. Random slopes drastically increase the difficulty in estimating the parameters in our LME model, so we don’t generally want to include them if they are not needed.

If we want to numerically compare the two options, we can use the AIC. This is another tool to help us decide whether to include the random slope.

AIC(LME2)
AIC(LME3)

Question 24

Based on the AIC, which model is a better fit to the data?

Using the Final Model

Now that we have selected a final model, we can use it!

Question 25

Interpret the slope for bedrooms in the model you have chosen.

Question 26

Interpret the intercept in the model you have chosen.

Turning in your assignment

When your Markdown document is complete, do a final knit and make sure everything compiles. You will then submit your document on Canvas. Make sure you look through the final document to make sure everything has knit correctly.

References

The data set used in this lab is part of the data provided as accompanying data sets for the online textbook Broadening Your Statistical Horizons. The data were accessed through the book GitHub repository.

This work was created by Nicole Dalzell and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2024 March 21.