The true test of a model is not how it performs on the data you used to build it but, rather, how it performs on new data. Put another way, we judge our models based on how well they make predictions. These predictions don’t necessarily need to be about the future, but they do need to be about scenarios where we don’t know the outcomes.
For instance, I can use one set of mammals to build a model for how brain size, exposure to danger (and whatever else) predict the amount a mammal sleeps. This model might be good or it might be lousy. I can test it by taking data on a new set of mammals, predicting how much they will sleep, and then comparing these predictions to the amounts they actually sleep.
I’ll need a metric to judge the accuracy of these predictions. In this lab, we’ll use root mean square error, better known as RMSE, which takes all of our errors, squares them, averages them, and then takes the square root of that average. You can create your own RMSE function (we’ll need this later):
RMSE <- function(x, y){sqrt(mean((x-y)^2))}
and try it out:
my_guesses <- c(0, 1, 5, 3)
your_guesses <- c(3, 2, 1, 6)
actual <- c(1, 2, 4, 4)
RMSE(my_guesses, actual); RMSE(your_guesses, actual)
You can see that my guesses were more accurate (lower RMSE) than your guesses… as you may well have guessed.
Now, let’s load the packages we need and get back to the mammal sleep data.
library(openintro)
library(datarium)
# The openintro and datarium packages have datasets that we will use
library(dplyr)
library(ggplot2)
# These are packages for manipulating and plotting data
After loading the mammal data, I’ll also create logBrainWt and logBodyWt as in our previous lab, and drop any mammals that are missing TotalSleep.
data("mammals", package = "openintro")
mammals <- mammals %>%
  mutate(logBrainWt = log(BrainWt, base = 10),
         logBodyWt = log(BodyWt, base = 10)) %>%
  filter(!is.na(TotalSleep))
Next, I’ll split the data into training and test sets. We’ll build models using the training set and then check their accuracy on the test set. In this case, I’m going to put a random 70% of our mammal data in the training set and the other 30% in the test set.
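One note before sampling: sample() draws rows at random, so your split (and every RMSE value below) will differ a bit from run to run. If you want reproducible results, you can set a seed first; the value 123 below is arbitrary, and any fixed number works.
set.seed(123) # arbitrary seed; makes the random train/test split reproducible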
training_index <- sample(1:nrow(mammals), size=0.7*nrow(mammals), replace=FALSE)
training <- mammals[training_index, ]
test <- mammals[-training_index, ]
Now, let’s build three linear models to predict sleep. I won’t use LifeSpan or Gestation in any of these models because some mammals are missing those values. The first model uses all of the other variables (except the sleep times, because that would be cheating), and I call it the complex model. The simple model is limited to Log Brain Weight, Predation, and Danger, and the “super simple” model predicts the same amount of sleep for every mammal. This super simple model is essentially no model at all: it just predicts the average sleep time in the training set.
#Leaving out Lifespan and Gestation
m_complex <- lm(TotalSleep ~ logBrainWt + logBodyWt + Predation + Exposure + Danger, data=training)
m_simple <- lm(TotalSleep ~ logBrainWt + Predation + Danger, data=training)
m_super_simple <- lm(TotalSleep ~ 1, data=training)
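If you want to convince yourself that this intercept-only model just learns the average, compare its single coefficient to the training-set mean:
coef(m_super_simple) # the only coefficient is the intercept
mean(training$TotalSleep) # which matches the mean sleep time in the training data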
Now, we’ll use all three of these models to make predictions on the training set.
training$pred_sleep_complex <- predict(m_complex, training)
training$pred_sleep_simple <- predict(m_simple, training)
training$pred_sleep_super_simple <- predict(m_super_simple, training)
And now we’ll look at the RMSE of these models on the training set. More complex models will always perform at least as well “in-sample”, meaning on the training set; put another way, adding an additional variable can’t hurt… until we get to the test that matters, accuracy on the test set.
training %>% summarize(
  rmse_complex = RMSE(pred_sleep_complex, TotalSleep),
  rmse_simple = RMSE(pred_sleep_simple, TotalSleep),
  rmse_super_simple = RMSE(pred_sleep_super_simple, TotalSleep)
)
Let’s now use our models to make predictions on the test set.
test$pred_sleep_complex <- predict(m_complex, test)
test$pred_sleep_simple <- predict(m_simple, test)
test$pred_sleep_super_simple <- predict(m_super_simple, test)
And check out the RMSEs “out-of-sample”, meaning on the test set.
test %>% summarize(
  rmse_complex = RMSE(pred_sleep_complex, TotalSleep),
  rmse_simple = RMSE(pred_sleep_simple, TotalSleep),
  rmse_super_simple = RMSE(pred_sleep_super_simple, TotalSleep)
)
Occam’s razor, a preference for simpler models, is sometimes stated as “other things being equal, simpler explanations are generally better than more complex ones.” We might interpret this to mean that if two models perform similarly on the training set, we should prefer the simpler one.
data("stress", package = "datarium")
data("titanic.raw", package = "datarium")
titanic.raw <- titanic.raw %>% mutate(SurvivedTF = Survived=="Yes")
data("babies", package = "openintro")
Try to predict babies’ birth weights using the other variables and interpret your model as clearly as possible.
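One possible starting point is sketched below; the column names (bwt for birth weight, plus gestation, parity, age, height, weight, and smoke) are what I expect in the openintro babies data, but check names(babies) and adjust if yours differ.
# A possible first model; run names(babies) first to confirm the column names
m_babies <- lm(bwt ~ gestation + parity + age + height + weight + smoke, data = babies)
summary(m_babies)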
This data set contains Major League Baseball player hitting statistics for 2010.
data("mlbBat10", package = "openintro")
Try to predict Runs, “R”, from other batting statistics. Does this model make sense?
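Here is one way you might start; H (hits), HR (home runs), BB (walks), and SB (stolen bases) are columns I expect in mlbBat10, but check names(mlbBat10) before running.
# One possible model for runs; confirm the column names with names(mlbBat10)
m_runs <- lm(R ~ H + HR + BB + SB, data = mlbBat10)
summary(m_runs)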
This data set contains SAT and GPA data for 1000 students at an unnamed college.
data("satGPA", package = "openintro")
Try to predict four-year college GPA, “FYGPA”. Note that SAT verbal and SAT math add up to SATSum, so it doesn’t make sense to have all three of these variables in your model simultaneously (and if you do, R will return a coefficient of NA for one of them).
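For example, you might start with something like the model below; SATV, SATM, and HSGPA are the column names I expect for SAT verbal, SAT math, and high school GPA, so check names(satGPA) if this errors.
# One possible model for first-year GPA; confirm the column names with names(satGPA)
m_gpa <- lm(FYGPA ~ SATV + SATM + HSGPA, data = satGPA)
summary(m_gpa)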