STA 214 Lab 5

Complete all questions and submit your final HTML or PDF file on Canvas.

The Data

One popular toy that has stood the test of time is Legos. Lego sets are designed for a variety of ages, from toddlers to adults, and you can build things as diverse as a cactus garden and a house.

Inside your box of Legos you will typically find one or more instruction books that help you build the desired object. Today, we are going to build a model to predict how many instruction books will be needed to build a given Lego set.

The data set can be loaded from Canvas. Make sure that when you load it in, you copy the line of code that loads the data and put it in a chunk in your Markdown file. If you need a refresher on the steps, look here: https://bookdown.org/dalzelnm/bookdown-demo/intro.html#folks-using-rstudio-on-their-computers

Exploring Y

We have information on \(n=2552\) different Lego sets and our response variable is Num_Instructions, which counts the number of instruction books included with each set. This is a count random variable. This means that our first thought is probably Poisson Regression. However, there are a few things that we need to check to make sure that Poisson Regression is reasonable.

Question 1

Create a histogram to visualize the distribution of the response variable. Fill in the bars in gold (so color them gold) and outline the bars in black. Make sure you have added appropriate figure and axis labels.

Hint: Remember that you did this in Lab 1, and that the code is in our Code Dictionary online (https://bookdown.org/dalzelnm/bookdown-demo/graphing-formatting-graphs.html#histogram).
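If you want to see the general pattern before working with the Lego data, here is a minimal sketch on simulated counts. The data frame legoDemo and its values are made up for illustration, and the sketch assumes you are using ggplot2; follow the Code Dictionary if it does things differently.

```r
library(ggplot2)

# Simulated stand-in for the lab data; legoDemo and its values are made up.
set.seed(214)
legoDemo <- data.frame(Num_Instructions = rpois(200, lambda = 2))

# fill colors the inside of the bars; color draws the outline of the bars.
histDemo <- ggplot(legoDemo, aes(x = Num_Instructions)) +
  geom_histogram(binwidth = 1, fill = "gold", color = "black") +
  labs(x = "Number of instruction books",
       y = "Number of sets",
       title = "Distribution of instruction books (simulated data)")

histDemo
```

For a count variable, binwidth = 1 gives one bar per count value, which makes the shape easier to read.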

Question 2

So far, does it seem like using a Poisson distribution is reasonable based on this graph? Explain.

The final thing we need to check before we can assume \[Y_i \sim Poisson(\lambda_i)\] is the mean/variance assumption. This assumption says that it must be reasonable to assume that the population mean of \(Y\) and the population variance of \(Y\) are the same.

Question 3

Why is it important to verify that the population mean of \(Y\) and the population variance of \(Y\) are the same? In other words, what about the Poisson distribution means this assumption is something we need to check?

Since we do not have the population data (we only have a sample of all Lego sets, believe it or not!!), we can’t directly compute the population mean and variance. However, we can compute the sample mean and the sample variance.
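As a rough illustration of what these sample quantities look like, the sketch below computes them for simulated Poisson counts. The vector yDemo is made up and is not the Lego data; with your own data you would use data$variable in its place.

```r
# Simulated counts from a Poisson distribution (not the Lego data).
set.seed(214)
yDemo <- rpois(500, lambda = 3)

sampleMean <- mean(yDemo)  # sample mean
sampleVar  <- var(yDemo)   # sample variance (divides by n - 1)
sampleSD   <- sd(yDemo)    # sample standard deviation = sqrt(sample variance)

c(mean = sampleMean, variance = sampleVar, sd = sampleSD)
```

Because these counts really were generated from a Poisson distribution, the sample mean and sample variance should land fairly close to each other. They will rarely match exactly, since both are estimates from a sample.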

Question 4

What is the sample mean of \(Y\)?

Question 5

What is the sample variance of \(Y\)?

Question 6

What is the sample standard deviation of \(Y\)?

Question 7

Based on your computations on the sample, does it seem reasonable that the mean/variance condition is met?

Okay, at this point we have checked everything that needs to be checked about \(Y\). Time to get an \(X\) involved.

Building a Poisson Regression Model

We are interested in building a model to examine the relationship between \(X\) = the number of pieces in a Lego set and \(Y\) = the number of instruction books that we need to build it.

Question 8

Do you think we need an offset in this model? If so, what do we need to include as the offset?

Question 9

Write down an appropriate population model for the question of interest (including an offset if needed). Use proper notation.

The model you just wrote assumes that the relationship between \(X\) and the log average number of instruction books is linear. This is something that we have not yet checked, so we need to do that before we proceed with actually fitting the model.

Checking to see if using a line makes sense requires us to build a scatter plot with the number of pieces on the x-axis and the log average number of instruction books on the y-axis. To create the scatter plot, we will:

  • Divide our numeric X into bins with 5% of the data in each bin.
    • For example, 5% of the sets have between 4 and 36 pieces, so our first bin is 4 - 36 pieces.
  • Find the log average number of instruction books for the Lego sets in each bin.
    • In bin 1, the log average number of instruction books is .56.
  • Plot the result on a scatter plot.
    • For our bin 1, the dot on the scatter plot is placed at (16, .56), where 16 is the average number of pieces among the sets in bin 1.
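The three steps above can be sketched in a few lines of base R. This is only an illustration on simulated data: the names xDemo, yDemo, and binned are made up, and the binning here is a simplified version of the idea rather than the exact code used in this lab.

```r
# Made-up data: piece counts and instruction book counts.
set.seed(214)
n <- 400
xDemo <- round(runif(n, min = 4, max = 1000))
yDemo <- rpois(n, lambda = exp(-1 + 0.3 * log(xDemo)))

# Step 1: sort by x and put 5% of the observations (20 of 400) in each bin.
ord <- order(xDemo)
xSorted <- xDemo[ord]
ySorted <- yDemo[ord]
bin <- ceiling(seq_along(xSorted) / (n / 20))

# Step 2: log of the average y within each bin.
logAvg <- log(tapply(ySorted, bin, mean))

# Step 3: one point per bin, placed at the average x in that bin.
binned <- data.frame(x = as.numeric(tapply(xSorted, bin, mean)),
                     LogAverage = as.numeric(logAvg))
head(binned)
```

Each row of binned is one dot on the scatter plot. If a bin ever had an average of 0, log(0) would be -Inf, so a small constant would need to be added before taking the log.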

To teach R the function it needs to do this, copy and paste the following function into a chunk and press play:

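checkShape_Poisson <- function(x, y, xaxis_label, yaxis_label, figure_title, formulaToTry = NA){
  # Step 1: Sort the data by x so each bin holds sets with similar x values
  ord = order(x)
  x = x[ord]
  y = y[ord]

  # Step 2: Find where each bin starts (a) and ends (b); each bin holds about 5% of the data
  # The boundaries can be fractional; R drops the fractional part when indexing
  a = seq(1, length(x), by = .05*length(y))
  b = c(a[-1] - 1, length(x))

  # Step 3: For each bin, compute the average of y, the average of x, and the log average of y
  average = xmean = logaverage = rep(0, length(a))
  for (i in seq_along(a)){
    binRows = (a[i]):(b[i])
    average[i] = mean(y[binRows])
    xmean[i] = mean(x[binRows])
    # log(0) is -Inf, so add .01 before taking the log when a bin's average is 0
    logaverage[i] <- ifelse(average[i] > 0, log(average[i]), log(average[i] + .01))
  }

  suppressMessages(library(ggplot2))

  # Step 4: Plot one point per bin
  dataHere <- data.frame("x" = xmean, "LogAverage" = logaverage)
  g1 <- ggplot(dataHere, aes(x, LogAverage)) + geom_point() + labs(x = xaxis_label, y = yaxis_label, title = figure_title)

  # Step 5: Add the fitted shape only if a formula was supplied
  if (inherits(formulaToTry, "formula")){
    g1 + stat_smooth(formula = formulaToTry, method = "lm", se = FALSE)
  } else {
    g1
  }
}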

As when we taught R the function to make empirical log odds plots, it will look like nothing has happened - but it has. We have just taught R a new function, and now we can use it.

The function requires several inputs:

  • x: our X variable in the form data$variable
  • y: our Y variable in the form data$variable
  • xaxis_label: the label we want on the x axis
  • yaxis_label: the label we want on the y axis
  • figure_title: the title of the figure
  • formulaToTry: the shape of the relationship we want to explore. For a line, it is y ~ x.

Question 10

Fill in the missing pieces in the code below to create the graph needed to check the shape condition. Make sure you have appropriate axis labels.

checkShape_Poisson(x = , y = , xaxis_label = , yaxis_label = , figure_title = , formulaToTry  = y~x) 

Question 11

Based on what you see in the graph, does it look like a line is appropriate to capture the relationship between the number of pieces and the log average number of instruction books?

Hmm…it’s not a line. Sigh. This means that we need to transform the \(X\) variable in some way.

When we see a curve, it helps to ask the following question:

Question 12

Does it look like the relationship between the number of pieces and the log average number of books is a curve that…:

  1. Changes direction?
  2. Is consistently increasing?
  3. Is consistently decreasing?

Answering this question can help us decide what transformations might be appropriate to try. Once we have decided on what transformations might be appropriate, we can adapt the formulaToTry part of the code we used above to try them out.

  • Consistently Increasing: Log transformation: formulaToTry = y ~ log(x)
  • Changes Direction Once: 2nd order polynomial: formulaToTry = y ~ poly(x,2)
  • Changes Direction Twice: 3rd order polynomial: formulaToTry = y ~ poly(x,3)

Question 13

State which transformation for \(X\) you suggest and show the plot from Question 10 with the fitted model plotted on the scatter plot (which just means adapt the formulaToTry part of the code from Question 10).

Question 14

Write down the appropriate population model that fulfills the shape condition.

Using the Fitted Model

At this point, we have checked all the conditions, so we can actually fit the model!
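As a reminder of the general pattern, here is a minimal sketch of fitting a Poisson regression model with glm on simulated data. The names simData, pieces, books, and demoModel are made up for illustration and are not the names in the Lego data.

```r
# Made-up data: piece counts and instruction book counts.
set.seed(214)
simData <- data.frame(pieces = round(runif(500, min = 4, max = 1000)))
simData$books <- rpois(500, lambda = exp(0.2 + 0.001 * simData$pieces))

# family = poisson tells glm to fit a Poisson regression with a log link.
demoModel <- glm(books ~ pieces, family = poisson, data = simData)

summary(demoModel)
coef(demoModel)
```

A transformed predictor goes into the formula the same way it went into formulaToTry. For example, books ~ log(pieces) or books ~ poly(pieces, 2) would replace books ~ pieces.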

Question 15

Use R to build your Poisson Regression model and call your model modelCount. Write down the fitted model using appropriate notation.

Question 16

What is the percent drop in deviance between our model and the null model?

Hint: To find the values you need to compare, use the code summary(modelCount) and look towards the bottom of the output.
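If you would rather have R do the arithmetic, the two deviances shown in the summary output are also stored inside the fitted model. The sketch below shows one way to pull them out, using a model fit to simulated data; simData and demoModel are made-up names for illustration.

```r
# Made-up data and model, just to have something to compute with.
set.seed(214)
simData <- data.frame(pieces = round(runif(500, min = 4, max = 1000)))
simData$books <- rpois(500, lambda = exp(0.2 + 0.001 * simData$pieces))
demoModel <- glm(books ~ pieces, family = poisson, data = simData)

nullDev  <- demoModel$null.deviance  # deviance of the intercept-only model
residDev <- demoModel$deviance       # deviance of the fitted model

# Percent drop in deviance relative to the null model.
percentDrop <- 100 * (nullDev - residDev) / nullDev
percentDrop
```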

You may be looking at this model going hmm, that will not be fun to interpret with the X variable transformed! However, if you choose the log transformation for \(X\) (hint), you get a very cool trait. When you have a fitted model of the form

\[log(\hat{A}) = \hat{\beta}_0 + \hat{\beta}_1 log(X)\]

the interpretation of \(\hat{\beta}_1\) is that if X increases by 1%, we predict approximately a \(\hat{\beta}_1\) percent change in \(A\): an increase if \(\hat{\beta}_1\) is positive and a decrease if it is negative.

For example, if \(log(\hat{A}) = 15 + 2 log(X)\), then we say that if X increases by 1%, we predict a 2% increase in \(A\).
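If you are curious where this rule comes from, here is a sketch of the arithmetic. The labels \(\hat{A}_{new}\) and \(\hat{A}_{old}\) are introduced only for this illustration: \(\hat{A}_{old}\) is the prediction at \(X\), and \(\hat{A}_{new}\) is the prediction at \(1.01X\).

\[\frac{\hat{A}_{new}}{\hat{A}_{old}} = \frac{e^{\hat{\beta}_0}(1.01X)^{\hat{\beta}_1}}{e^{\hat{\beta}_0}X^{\hat{\beta}_1}} = 1.01^{\hat{\beta}_1} \approx 1 + 0.01\hat{\beta}_1\]

With \(\hat{\beta}_1 = 2\), we get \(1.01^2 = 1.0201\), which is a 2.01% increase and rounds to the 2% quoted above. The approximation works well when \(0.01\hat{\beta}_1\) is small; for larger percent changes in X, the exact multiplier is the percent change factor raised to the power \(\hat{\beta}_1\).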

Question 17

Interpret \(\hat{\beta}_1\) from your Poisson Regression model using the hint above.

Inference with Poisson Regression

As we learned in class, one of the nice things about Poisson regression, and likelihood-based methods in general, is that we can use the same inference procedures we learned for logistic regression.

Question 18

Build and interpret a 95% confidence interval for \(\beta_1\).

Hint: Remember the hint you used for interpreting \(\hat{\beta}_1\). It will help you with your interpretation here, too.
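For reference, here is a minimal sketch of one common way to get a 95% interval for a slope: a Wald interval, computed by hand and then with a built-in helper. This assumes a Wald interval is the method you want; if we used a different function in class, use that one. The names simData and demoModel are made up, and the model is fit to simulated data.

```r
# Made-up data and model, just to have something to compute with.
set.seed(214)
simData <- data.frame(pieces = round(runif(500, min = 4, max = 1000)))
simData$books <- rpois(500, lambda = exp(0.2 + 0.001 * simData$pieces))
demoModel <- glm(books ~ pieces, family = poisson, data = simData)

# Wald interval by hand: estimate +/- (z critical value) * (standard error).
est <- coef(summary(demoModel))["pieces", "Estimate"]
se  <- coef(summary(demoModel))["pieces", "Std. Error"]
waldCI <- est + c(-1, 1) * qnorm(0.975) * se
waldCI

# The same Wald interval from a built-in function.
builtIn <- confint.default(demoModel, parm = "pieces", level = 0.95)
builtIn
```

Note that calling confint() on a glm gives a profile likelihood interval instead, so its endpoints can differ slightly from the Wald interval.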

Question 19

Based on your CI, if the number of pieces increases 15%, what percent increase do you expect in the average number of instruction books?

References

Creative Commons License
This work was created by Nicole Dalzell and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2024 April 10.

The data set used in this lab was adapted from a data set from Kaggle. Please note that I used the sets data set and then ran sets <- na.omit(sets) to remove all missing data to arrive at the \(n = 2552\) rows used for this lab.

Citation: LEGO Sets and Prices Over Time, Version 1. Retrieved November 11, 2023 from https://www.kaggle.com/datasets/alexracape/lego-sets-and-prices-over-time.