STA 214 Lab 5
Complete all questions and submit your final HTML or PDF on Canvas.
The Data
One popular toy that has stood the test of time is Legos. Lego sets are designed for a variety of ages, from toddlers to adults, and you can build things as diverse as a cactus garden and a house.
Inside your box of Legos you will typically find one or more instruction books that help you build the desired object. Today, we are going to build a model to predict how many instruction books will be needed to build a given Lego set.
The data set can be loaded from Canvas. Make sure that when you load it in, you copy the line of code that loads the data and put it in a chunk in your Markdown file. If you need a refresher on the steps, look here: https://bookdown.org/dalzelnm/bookdown-demo/intro.html#folks-using-rstudio-on-their-computers
Exploring Y
We have information on \(n=2552\) different Lego sets and our response variable is Num_Instructions, which counts the number of instruction books included with each set. This is a count random variable. This means that our first thought is probably Poisson Regression. However, there are a few things that we need to check to make sure that Poisson Regression is reasonable.
Question 1
Create a histogram to visualize the distribution of the response variable. Fill the bars in gold and outline them in black. Make sure you have added an appropriate figure title and axis labels.
Hints: Remember that you have done this in Lab 1, and that the code is in our Code Dictionary online (https://bookdown.org/dalzelnm/bookdown-demo/graphing-formatting-graphs.html#histogram)
Question 2
So far, does it seem like using a Poisson distribution is reasonable based on this graph? Explain.
The final thing we need to check before we can assume \[Y_i \sim Poisson(\lambda_i)\] is the mean/variance assumption. This assumption says that it must be reasonable to assume that the population mean of \(Y\) and the population variance of \(Y\) are the same.
Question 3
Why is it important to verify that the population mean of \(Y\) and the population variance of \(Y\) are the same? In other words, what about the Poisson distribution means this assumption is something we need to check?
Since we do not have the population data (we only have a sample of all Lego sets, believe it or not!!), we can’t directly compute the population mean and variance. However, we can compute the sample mean and the sample variance.
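As a rough illustration of the sample quantities involved, here is a minimal sketch on simulated counts. The vector y_sim and the value lambda = 1.5 are made up for this example; they are not the lab data, where the response is Num_Instructions.

```r
# Simulated stand-in for a count response (NOT the lab data).
# lambda = 1.5 is an arbitrary value chosen only for illustration.
set.seed(214)
y_sim <- rpois(n = 2552, lambda = 1.5)

sample_mean <- mean(y_sim)  # estimates the population mean
sample_var  <- var(y_sim)   # estimates the population variance (divides by n - 1)
sample_sd   <- sd(y_sim)    # square root of the sample variance

# For counts that really are Poisson, the first two values should be close.
round(c(mean = sample_mean, variance = sample_var, sd = sample_sd), 3)
```

Because these counts were generated from a Poisson distribution, the sample mean and sample variance land near each other. With real data, a sample variance much larger than the sample mean would be a warning sign.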
Question 4
What is the sample mean of \(Y\)?
Question 5
What is the sample variance of \(Y\)?
Question 6
What is the sample standard deviation of \(Y\)?
Question 7
Based on your computations on the sample, does it seem reasonable that the mean/variance condition is met?
Okay, at this point we have checked everything that needs to be checked about \(Y\). Time to get an \(X\) involved.
Building a Poisson Regression Model
We are interested in building a model to examine the relationship between \(X\) = the number of pieces in a Lego set and \(Y\) = the number of instruction books that we need to build it.
Question 8
Do you think we need an offset in this model? If so, what do we need to include as the offset?
Question 9
Write down an appropriate population model for the question of interest (including an offset if needed). Use proper notation.
The model you just wrote assumes that the relationship between \(X\) and the log average number of instruction books is linear. This is something that we have not yet checked, so we need to do that before we proceed with actually fitting the model.
Checking to see if using a line makes sense requires us to build a scatter plot with the number of pieces on the x-axis and the log average number of instruction books on the y-axis. To create the scatter plot, we will:
- Divide our numeric X into bins with 5% of the data in each bin.
  - For example, 5% of the sets have between 4 and 36 pieces, so our first bin is 4 - 36 pieces.
- Find the log average number of instruction books for the Lego sets in each bin.
  - In bin 1, the log average number of instruction books is .56.
- Plot the result on a scatter plot.
  - For our bin 1, the dot on the scatter plot is placed at (16, .56). 16 is the average number of pieces among the sets in bin 1.
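The three steps above can be sketched in base R as follows. This is only an illustration on simulated data: pieces_sim and books_sim are hypothetical stand-ins for the lab variables, and the rank-based binning is one possible way to get 5% of the rows in each bin.

```r
# Hypothetical simulated data (NOT the lab data).
set.seed(214)
pieces_sim <- round(exp(rnorm(2000, mean = 5, sd = 1))) + 1
books_sim  <- rpois(2000, lambda = 0.3 * log(pieces_sim))

# Step 1: assign each set to one of 20 bins, each holding 5% of the rows.
n   <- length(pieces_sim)
bin <- ceiling(20 * rank(pieces_sim, ties.method = "first") / n)

# Step 2: log of the average number of books within each bin.
avg_books <- tapply(books_sim, bin, mean)
log_avg   <- log(avg_books)

# Step 3: one dot per bin, placed at the bin's average number of pieces.
x_center <- tapply(pieces_sim, bin, mean)
plot(x_center, log_avg,
     xlab = "Number of pieces (bin average)",
     ylab = "Log average number of instruction books",
     main = "Checking the shape condition on simulated data")
```

If a bin's average were exactly 0, log(0) would be -Inf, so a small constant would need to be added before taking the log.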
To teach R the function it needs to do this, copy and paste the following function into a chunk and press play:
checkShape_Poisson <- function(x , y, xaxis_label, yaxis_label, figure_title, formulaToTry = NA){
  # Step 1 : Create storage space needed for the loop
  sort = order(x)
  x = x[sort]
  y = y[sort]
  a = seq(1, length(x), by=.05*length(y))
  b = c(a[-1] - 1, length(x))

  average = xmean = logaverage = rep(0, length(a)) # ns is for CIs
  for (i in 1:length(a)){
    range = (a[i]):(b[i])
    average[i] = mean(y[range])
    xmean[i] = mean(x[range])
    logaverage[i] <- ifelse(average[i] > 0, log(average[i]), log(average[i]+.01))
  }
  suppressMessages(library(ggplot2))
  dataHere <- data.frame("x" = xmean, "LogAverage" = logaverage)

  g1 <- ggplot(dataHere, aes(x, LogAverage)) + geom_point() + labs(x = xaxis_label , y = yaxis_label, title = figure_title)

  if( class(formulaToTry)== "formula"){
    g1 + stat_smooth(formula = formulaToTry, method = "lm", se = FALSE)
  }else{
    g1
  }
}
As when we taught R the function to make empirical log odds plots, it will look like nothing has happened - but it has. We have just taught R a new function, and now we can use it.
The function requires several inputs:
- x: our X variable, in the form data$variable
- y: our Y variable, in the form data$variable
- xaxis_label: the label we want on the x axis
- yaxis_label: the label we want on the y axis
- figure_title: the title of the figure
- formulaToTry: the shape of the relationship we want to explore. For a line, it is y ~ x.
Question 10
Fill in the missing pieces in the code below to create the graph needed to check the shape condition. Make sure you have appropriate axis labels.
checkShape_Poisson(x = , y = , xaxis_label = , yaxis_label = , figure_title = , formulaToTry = y~x)
Question 11
Based on what you see in the graph, does it look like a line is appropriate to capture the relationship between the number of pieces and the log average number of instruction books?
Hmm…it’s not a line. Sigh. This means that we need to transform the \(X\) variable in some way.
When we see a curve, it helps to ask the following question:
Question 12
Does it look like the relationship between the number of pieces and the log average number of books is a curve that…:
- Changes direction?
- Is consistently increasing?
- Is consistently decreasing?
Answering this question can help us decide what transformations might be appropriate to try. Once we have decided on what transformations might be appropriate, we can adapt the formulaToTry part of the code we used above to try them out.
- Consistently Increasing: Log transformation: formulaToTry = y ~ log(x)
- Changes Direction Once: 2nd order polynomial: formulaToTry = y ~ poly(x,2)
- Changes Direction Twice: 3rd order polynomial: formulaToTry = y ~ poly(x,3)
Question 13
State which transformation for \(X\) you suggest and show the plot from Question 10 with the fitted model plotted on the scatter plot (which just means adapt the formulaToTry part of the code from Question 10).
Question 14
Write down the appropriate population model that fulfills the shape condition.
Using the Fitted Model
At this point, we have checked all the conditions, so we can actually fit the model!
Question 15
Use R to build your Poisson Regression model and call your model modelCount. Write down the fitted model using appropriate notation.
Question 16
What is the percent drop in deviance between our model and the null model?
Hint: To find the values you need to compare, use the code summary(modelCount) and look towards the bottom of the output.
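To show where those two deviance values live in a fitted model, here is a minimal sketch on simulated data. The names pieces_sim, books_sim, and fit_sim are hypothetical and the coefficients used to simulate are arbitrary; your own model and numbers will differ.

```r
# Hypothetical simulated data (NOT the lab data).
set.seed(214)
pieces_sim <- round(exp(rnorm(500, mean = 5, sd = 1))) + 1
books_sim  <- rpois(500, lambda = exp(-1 + 0.3 * log(pieces_sim)))

# Poisson regression with a log-transformed X.
fit_sim <- glm(books_sim ~ log(pieces_sim), family = poisson)

null_dev  <- fit_sim$null.deviance  # "Null deviance" line in summary()
resid_dev <- fit_sim$deviance       # "Residual deviance" line in summary()

# Percent drop in deviance relative to the null (intercept-only) model.
pct_drop <- 100 * (null_dev - resid_dev) / null_dev
pct_drop
```

The same two numbers are printed near the bottom of summary(fit_sim), so the calculation can also be done by hand from that output.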
You may be looking at this model going hmm, that will not be fun to interpret with the X variable transformed! However, if you choose the log transformation for \(X\) (hint), you get a very cool trait. When you have a fitted model of the form
\[log(\hat{A}) = \hat{\beta}_0 + \hat{\beta}_1 log(X)\]
the interpretation of \(\hat{\beta}_1\) is that if X increases by 1%, we predict approximately a \(\hat{\beta}_1\) percent increase/decrease in \(A\).
For example, if \(log(\hat{A}) = 15 + 2 log(X)\), then we say that if X increases by 1%, we predict approximately a 2% increase in \(A\).
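A short derivation shows where this rule comes from. The subscript "new" is notation introduced only for this sketch; it marks the prediction after X is replaced by 1.01X.

```latex
\[
\begin{aligned}
\log(\hat{A}_{new}) &= \hat{\beta}_0 + \hat{\beta}_1 \log(1.01\,X)
                     = \log(\hat{A}) + \hat{\beta}_1 \log(1.01) \\
\frac{\hat{A}_{new}}{\hat{A}} &= 1.01^{\hat{\beta}_1} \approx 1 + \frac{\hat{\beta}_1}{100}
\end{aligned}
\]
```

With \(\hat{\beta}_1 = 2\), the exact multiplier is \(1.01^2 = 1.0201\), a 2.01% increase, which is why "about 2%" is a fair summary. For a general c% increase in X the multiplier is \((1 + c/100)^{\hat{\beta}_1}\), and the shortcut gets less accurate as c grows: with \(\hat{\beta}_1 = 2\) and a 15% increase, \(1.15^2 = 1.3225\), a 32.25% increase rather than 30%.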
Question 17
Interpret \(\hat{\beta}_1\) from your Poisson Regression model using the hint above.
Inference with Poisson Regression
As we learned in class, one of the nice things about Poisson regression and likelihood-based methods in general is that we can use the same inference procedures we learned for logistic regression.
Question 18
Build and interpret a 95% confidence interval for \(\beta_1\).
Hint: Remember the hint you used for interpreting \(\hat{\beta}_1\). It will help you with your interpretation here, too.
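As one possible way to get an interval in R, here is a sketch on simulated data using confint.default(), which gives a Wald interval (estimate plus or minus a normal quantile times the standard error). The names pieces_sim, books_sim, and fit_sim are hypothetical, and this may not be the interval method used in class; confint() is an alternative that gives a profile-likelihood interval.

```r
# Hypothetical simulated data (NOT the lab data).
set.seed(214)
pieces_sim <- round(exp(rnorm(500, mean = 5, sd = 1))) + 1
books_sim  <- rpois(500, lambda = exp(-1 + 0.3 * log(pieces_sim)))

fit_sim <- glm(books_sim ~ log(pieces_sim), family = poisson)

# Point estimate of the slope on log(X).
est <- coef(fit_sim)[2]

# 95% Wald confidence interval for the slope (second row of the matrix).
ci_wald <- confint.default(fit_sim, level = 0.95)[2, ]
ci_wald
```

Each endpoint can then be read the same way as the estimate: for a 1% increase in X, the interval gives a range of plausible approximate percent changes in the average count.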
Question 19
Based on your CI, if the number of pieces increases by 15%, what percent increase do you expect in the average number of instruction books?
References
This work was created by Nicole Dalzell and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2024 April 10.
The data set used in this lab was adapted from a data set from Kaggle. Please note that I used the sets data set and then ran sets <- na.omit(sets) to remove all missing data to arrive at the \(n = 2552\) rows used for this lab.
Citation: LEGO Sets and Prices Over Time, Version 1. Retrieved November 11, 2023 from https://www.kaggle.com/datasets/alexracape/lego-sets-and-prices-over-time.