STA 363/663 Lab 5

Complete all Questions and submit your final PDF or html under Assignments in Canvas.

The Goal

In our last class together, we started to explore splines. These are a special type of regression model that we use when the relationship between a numeric feature \(X\) and a numeric response variable \(Y\) is a curve which is not well matched to a polynomial.

Today, we are going to work on splines. Our goal is to increase our understanding of spline models, to practice building spline models in R, and to try assessing the predictive accuracy of splines models.

The Data

The data you need can be found on Canvas. When you load the data in, it should have 50598 rows and 79 columns. The data set contains information on different countries, and there are multiple rows of information for each country. These variables include things like population and GDP for a country for a given year. We are interested in \(X=\) the number of people living in the country and \(Y\) = the amount of carbon dioxide (CO2) the country produces.

We are going to focus on just the United States today, so we can subset our data down to only the rows relating to the United States.

dataUS <- subset(owid.co2.data, country == "United States")

When you run the code above, you should end up with a data set of \(n=222\) rows. This is the data set we will work with for today, so you can remove the original data set if you wish.

rm(owid.co2.data)

EDA

Now that we have our data, the next step is to visualize the relationship that we are interested in. Recall that we want to build a model for \(Y\) = the amount of carbon dioxide (CO2) in kilotons the country produces using \(X=\) the number of people living in the country as a feature.

Question 1

Make a scatter plot to explore the relationship between population (X) and CO2 production (Y). Make sure you appropriately label your axes and title the plot Figure 1.

Question 2

What about the shape of the relationship suggests that a spline model might be an appropriate choice?

Choosing a knot

For our first model for these data, we are going to start off with a spline model with only one knot. Why? This choice has nothing to do with the graph and everything to do with us exploring splines. We want to make sure we are all comfortable with the equation for a spline model and with what the model is actually doing, and it is easiest to do this with only one knot.

To choose the single knot, we will use the cut_number function, which is in the ggplot2 package. This function will divide the domain of \(X\) into two regions such that roughly the same number of rows in the data end up in each region.

table(cut_number(dataUS$population,2))

When we do this, you will notice that the numbers are very large, since our feature \(X =\) population is recorded in numbers of people. The population of the US is many, many people, so we end up with large values in the domain.

To make our visualizations and splines a little easier to interpret, let’s convert \(X\) to a new scale: hundreds of thousands of people.

Question 3

Convert the variable population from the units “people” to the units “hundreds of thousands of people”. Update your graph from Question 1 using the newly scaled X variable and show the plot as your answer to this question.

Question 4

With your newly scaled \(X\), what is the value of the knot (in hundreds of thousands of people?)

Question 5

Add a vertical line at the knot to your plot from Question 3.

Hint: We can add a vertical line with the code +geom_vline(xintercept = )

Adding the vertical line at the knot is useful when building splines so we can clearly see the two different regions. Spline models allow a different cubic polynomial to be fit in each region, so it helps to see the pattern in each region.

Building the Spline

With our knot chosen, we are ready to progress to building our spline. To create a cubic basis spline model, or just spline model for short, we need the splines package in R. You will need to install the package, and once you do you can run the library command.

library(splines)

To build a spline in R, the code is very similar to linear regression:

spline_1knot <- lm( co2 ~ bs( population, knots = ),data = dataUS)

We use bs to let R know we want a cubic basis spline. There is also an input called knots which allows us to specify where we want the knot(s) to go. You will notice that I have left knots = blank, so you will need to fill in the value of the knot you chose in Question 4.

Once we have run the code to create the spline_1knot object, our model is trained. Just like in linear regression, to view the actual coefficients for the trained model we have to use the summary command.

Question 6

Write down the trained model using appropriate notation. Round to two decimal places. You should have the \(h\) function in your trained model.

Hint: Remember that mathematical notation in Markdown has to go in the white space, not in a chunk, and that the R code dictionary from class has information on how to make mathematical symbols using Latex in Markdown.

Question 7

What is the trained model for Region 1? Use appropriate notation.

Note: You must fully simplify / combine all terms until it looks like a 3rd order polynomial regression model (a cubic).

Question 8

What is the trained model for Region 2? Use appropriate notation.

Note: You must fully simplify / combine all terms until it looks like a 3rd order polynomial regression model (a cubic).

Note 2: Leave the knot as \(c_1\), rather than 909. Otherwise, your numbers will get larger very quickly!

Now that we have trained the model, it can be helpful to visualize the spline that we have built to see how it matches the patterns in the data. To do this, we need to adapt our scatter plot from Question 5.

Question 9

To add the trained model to the plot from Question 5, add + geom_line(aes(population, fitted(spline_1knot))). Show the plot and title it Figure 4.

Hint: If you wish, you can also make the spline a different color from the points using color = inside the geom_line command.

+ geom_line(col = "blue", aes(population, fitted(spline_1knot))

Assessing Predictive Accuracy

The goal is to use our model to predict the amount of carbon monoxide produced in the US for a specific population size, so we need to assess how well our model predicts!

With a data set this size, we can use LOOCV or 10-fold CV.

Question 10

Using 10-fold CV, find the test RMSE of your spline model using a seed of 363663. Show your code and state the test RMSE. Recall that:

compute_RMSE <- function(truth, predictions){
  
  # Part 1
  part1 <- truth - predictions
  
  #Part 2
  part2 <- part1^2
  
  #Part 3: RSS 
  part3 <- sum(part2)
  
  #Part4: MSE
  part4 <- 1/length(predictions)*part3
  
  # Part 5:RMSE
  sqrt(part4)
}

Note: You will notice two warnings when you run your code. They will both say: “Warning: some ‘x’ values beyond boundary knots may cause ill-conditioned bases”. We’ll talk about those later, but your code will still run so ignore them for now.

Improving the test RMSE

Right now, we sort of arbitrarily choose that we wanted to use one knot. However, we may be able to improve our test RMSE if we choose our knots differently.

If we want to change the number of knots in the spline, we want to allow the spline to easily update for the change in knots. Luckily, there is a way we can write our code to do just that.

# Option 1: Specify the knots

spline_2knots <- lm( co2 ~ bs( population, knots = ),data = dataUS)

# Option 2: Specify how many knots 

spline_2knots <- lm( co2 ~ bs( population, df =  ),data = dataUS)

In option 2, the degrees of freedom is the number of knots plus 3 (so the number of \(\beta\) terms in your spline, not including the intercept). This means that for our spline with one knot, we would use \(df = 4\).

This new option 2 will allow us to more easily change the number of knots.

Question 11

Using 10-fold CV, find the test RMSE if we increase the number of knots to 2, using a seed of 363663. State the test RMSE.

Question 12

We want to explore what happens if we allow up to 20 knots. Using 10-fold CV with a seed of 363663, find the test RMSE of a spline model for each of 1 - 20 knots. Show the 20 different test RMSE values.

Note: Because I haven’t told you how to get rid of the warnings yet (yes, we are getting there!) I would recommend you change your chunk header to be {r, warning = FALSE}. This will stop the warnings from showing up.

Question 13

Based on your results from Question 12, state how many knots you would recommend and what the test RMSE is for the spline with that many knots.

Question 14

Build the spline with the number of knots you chose in Question 13. Update your graph from Question 9 to plot your new trained model. Show the graph and call the plot Figure 5.

Excellent!! So we can now train the model, use the model to make predictions, plot the trained model, and assess predictive accuracy. But…what about those warnings?

UNDERGRADS: You don’t have to do the lab past this point. Grad students, you need to continue to the end.

Natural Splines

When you run the 10-fold CV code, we get warning messages. This message appears exactly twice: once for the smallest population value in the data set and once for largest population in the data set. In other words, we get the warnings with the the min and max values of the domain They define what are called boundary knots.

Okay, cool, but what do the warnings actually mean? What this warning message is telling us is that we will have trouble estimating \(\hat{y}_i\) at the boundary knots, i.e., min and max points in the domain. The reason is that there is no data beyond the boundary knots, and the spline models are so flexible that this can be a problem if we want to estimate beyond the boundary knots.

If we are interested in estimating at the boundary knots, or beyond the boundary knots, we generally make a very small adjustment to our spline. Instead of using a cubic basis spline, we use a natural cubic spline.

A natural cubic spline is an adaptation of the cubic spline that adjusts for the fact that cubic splines often have unstable estimates at the boundary knots. To correct this, natural splines require that the spline function be linear at the boundary, i.e., in the region smaller than the minimum \(x\) value and larger than the maximum \(x\) value. To do this, natural splines make the restriction that the second and third derivatives of the basis form of the spline model is 0 at the boundary knots.

We don’t really need to know this math, but we do need to know that we can use natural splines to stabilize estimation at the boundary knots.

To fit a natural spline in R, all we have to do is change the bs in our code to an ns.

# Option 1: Specify the knots

spline_2knots <- lm( co2 ~ ns( population, knots = ),data = dataUS)

# Option 2: Specify how many knots 

spline_2knots <- lm( co2 ~ ns( population, df =  ),data = dataUS)

Question 15

Build a natural spline with the number of knots you chose in Question 13. Update your graph from Question 9 to plot your new trained model. Show the graph and call the plot Figure 6.

Now, when we switch to a natural spline, we do unfortunately need to check again to see what the optimal number of knots is. Luckily, this is a very simple adaption to the code you have already done!

Question 16

We want to explore what happens if we allow up to 20 knots with natural splines. Using 10-fold CV with a seed of 363663, find the test RMSE of a natural spline model for each of 1 - 20 knots. Show the 20 different test RMSE values.

Question 17

Based on your results from Question 16, state how many knots you would recommend and what the test RMSE is for the spline with that many knots.

Question 18

Build the spline with the number of knots you chose in Question 17. Update your graph from Question 9 to plot your new trained model. Show the graph and call the plot Figure 7.

Question 19

Which spline model had the lowest test RMSE overall: the cubic spline or the natural cubic spline?

Turning in your assignment

When your Markdown document is complete, do a final knit and make sure everything compiles. You will then submit your document on Canvas. You must submit a PDF or html document to Canvas. No other formats will be accepted. Make sure you look through the final document to make sure everything has knit correctly.

This work was created by Nicole Dalzell is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2023 October 8.

The data set used in this lab is from Our World in Data: Hannah Ritchie, Max Roser and Pablo Rosado (2020) - “CO₂ and Greenhouse Gas Emissions”. Published online at OurWorldInData.org. Retrieved from: ‘https://ourworldindata.org/co2-and-greenhouse-gas-emissions’ [Online Resource] .