The Big Question

Which of the decision trees we made in our last lab was the best? In some cases it may have been clear that our decision tree was too simple, while in other cases we may have clearly created too many branches, but how do we know where to draw the line?

In this lab, we’ll use out-of-sample prediction error to evaluate our models.

Data and Packages

First let’s load the data and the packages.

nfl <- read.csv('/home/rstudioshared/shared_files/data/NFLPlaybyPlay2015.csv')
View(nfl)

Once again, we’ll need dplyr and the decision tree packages.

library(dplyr); library(rpart); library(rpart.plot)
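
If you’re working on your own machine and any of these packages aren’t installed yet, you can install them once with:

install.packages(c("dplyr", "rpart", "rpart.plot"))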

Scoring Functions

Kaggle has a list of scoring functions used in its competitions. Two that we’ve used so far and will continue to use are root mean square error (RMSE) and logarithmic loss (log loss).

RMSE <- function(predictions, actuals){
  # square root of the average squared difference between predictions and actuals
  sqrt(mean((predictions-actuals)^2))
}

LogLoss <- function(predictions, actuals){
  # average negative log-likelihood of the actual 0/1 outcomes given the predicted probabilities
  (-1/length(predictions)) * sum(actuals * log(predictions) + (1-actuals)*log(1-predictions))
}
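
As a quick sanity check, you can try these functions on a few made-up predictions (the numbers below are purely illustrative):

preds <- c(0.9, 0.2, 0.7, 0.4)    # hypothetical predicted probabilities
actuals <- c(1, 0, 1, 1)          # hypothetical 0/1 outcomes
RMSE(preds, actuals)     # roughly 0.35
LogLoss(preds, actuals)  # roughly 0.40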

Going for It: Training and Test Sets

First, let’s recreate the “go.on.4th” data set with all the plays where a team went for it on fourth down. Once again, we’ll create a column “got.it” that takes on values of 0 or 1 depending on whether the offensive team got a first down.

nfl.run.or.pass <- nfl %>% filter(PlayType %in% c("Run", "Pass")) %>% mutate(run = ifelse(PlayType=="Run", 1, 0))
go.on.4th <- nfl.run.or.pass %>% filter(down==4 & ydstogo >0) %>% mutate(got.it = as.numeric(Yards.Gained >= ydstogo))

nrow(go.on.4th)
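
It’s also worth glancing at how often teams actually converted. A quick table of the got.it column shows the number of failures (0) and successes (1):

table(go.on.4th$got.it)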

This time, before creating our models to predict 4th down success, we’ll split the data into two subsets: a training set and a test set. The sampled rows below are just random row numbers between 1 and the total number of rows in our data set; we sample half of them, without replacement.

sampled.rows <- sample(1:nrow(go.on.4th), round(nrow(go.on.4th)/2), replace=FALSE)
sampled.rows
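
Since sample() is random, your split (and the RMSE values that follow) will be a little different from your neighbor’s and from run to run. If you’d like a reproducible split, you can set a random seed before sampling (the value 123 is arbitrary):

set.seed(123)  # any fixed number works; this makes the random split repeatable
sampled.rows <- sample(1:nrow(go.on.4th), round(nrow(go.on.4th)/2), replace=FALSE)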

Now, we’ll place all of the sampled rows into the training set and all of the rows that weren’t sampled into the test set.

train <- go.on.4th[sampled.rows, ]
test <- go.on.4th[-sampled.rows, ]
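
As a quick check, the two sets should each contain about half of the rows:

nrow(train)
nrow(test)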

Now, we’re finally ready to make our models.

A simple model:

Let’s start with a simple model using yards needed for a first down and whether the team ran or passed. This code displays the model and calculates its root mean square error in sample. That is, it calculates how well the model performs when making predictions on the same data set that was used to build it. The final line calculates the standard deviation of our predicted variable, which is the RMSE of a tree with no branches (one that simply predicts the overall mean for every play).

fit <- rpart(got.it ~ run+ydstogo,data=train, cp=0.01)
prp(fit)
RMSE(predict(fit, train), train$got.it)
sd(train$got.it)

What we really want to know, however, is how this model performs out of sample. Let’s calculate its RMSE when making predictions on the test set. We’ll also calculate the standard deviation in the test set, which tells us how well a branchless tree would perform on the test set.

RMSE(predict(fit, test), test$got.it)
sd(test$got.it)

A more complex model:

Let’s perform the same calculations with a more complex model. We’ll make our model more complex by (counterintuitively) lowering the complexity parameter and changing “minsplit”. minsplit is the minimum number of observations (data points) that must be in a node before rpart will attempt to split it further. It has a default value of 20, but we’ll change it to 10.

fit <- rpart(got.it ~ run+ydstogo,data=train, cp=0.001, minsplit=10)
prp(fit)
RMSE(predict(fit, train), train$got.it)
sd(train$got.it)
RMSE(predict(fit, test), test$got.it)
sd(test$got.it)

What do these results tell you?

A super complex model:

Let’s lower our complexity parameter by another factor of 10 and bring minsplit down to 3.

fit <- rpart(got.it ~ run+ydstogo,data=train, cp=0.0001, minsplit=3)
prp(fit)
RMSE(predict(fit, train), train$got.it)
sd(train$got.it)
RMSE(predict(fit, test), test$got.it)
sd(test$got.it)

What can you say about this complex model?

Going for It: k-fold Cross Validation

Splitting the data into two halves, a training set and a test set, has a major disadvantage – our models are only built on half the data. Data is valuable – the more information we have, the smarter our models can be. We’d really like to build our models using almost all of the data. That said, we can only fairly evaluate our models by judging them on out-of-sample data, and if we use almost all of the data to build the model, that leaves very little data to test it on, and our test results will be increasingly noisy.

One way out of this is to use k-fold cross validation. We’ll split the data into 10 “folds”. We’ll use 9 folds (or nine tenths of our data) to build our model and then test it on the 10th fold (just one tenth of the data). But we’re not done! Next, we can choose a different set of 9 folds to build our model and once again test it on the remaining (“left out”) fold. In fact, we can do this 10 times, each time leaving out a different fold. By doing this, we get to both build each model using almost all of the data and test our models on all of the data.

You can use as many folds as you want – up to the total number of data points. If you have as many folds as data points, this method is called “leave one out cross validation”, or just LOOCV, since each model is built using all of the data points except one. The more folds you use, the more data you use to build each model, but there are diminishing returns. With k = 10 (10 folds), you build your models with 90% of the data; with k = 20, you build your models with 95% of the data – but in order to use that additional 5%, the computation will take twice as long.
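
Before we use the more careful createFolds function below, here is a minimal sketch of the idea. This simple version just deals the row numbers of go.on.4th out into k groups at random; unlike createFolds, it makes no attempt to balance the folds by outcome:

k <- 10
fold.id <- sample(rep(1:k, length.out = nrow(go.on.4th)))  # a random fold number for each row
table(fold.id)  # the folds should be (nearly) equal in size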

createFolds.R

First, let’s create a function that partitions our data into k folds.

Save the following code into a createFolds.R file.

# createFolds splits the row indices 1:length(y) into k folds,
# keeping the distribution of y roughly balanced across the folds
createFolds <- function (y, k = 10, list = TRUE, returnTrain = FALSE) 
{
  if (class(y)[1] == "Surv") 
    y <- y[, "time"]
  if (is.numeric(y)) {
    cuts <- floor(length(y)/k)
    if (cuts < 2) 
      cuts <- 2
    if (cuts > 5) 
      cuts <- 5
    breaks <- unique(quantile(y, probs = seq(0, 1, length = cuts)))
    y <- cut(y, breaks, include.lowest = TRUE)
  }
  if (k < length(y)) {
    y <- factor(as.character(y))
    numInClass <- table(y)
    foldVector <- vector(mode = "integer", length(y))
    for (i in 1:length(numInClass)) {
      min_reps <- numInClass[i]%/%k
      if (min_reps > 0) {
        spares <- numInClass[i]%%k
        seqVector <- rep(1:k, min_reps)
        if (spares > 0) 
          seqVector <- c(seqVector, sample(1:k, spares))
        foldVector[which(y == names(numInClass)[i])] <- sample(seqVector)
      }
      else {
        foldVector[which(y == names(numInClass)[i])] <- sample(1:k, 
                                                               size = numInClass[i])
      }
    }
  }
  else foldVector <- seq(along = y)
  if (list) {
    out <- split(seq(along = y), foldVector)
    names(out) <- paste("Fold", gsub(" ", "0", format(seq(along = out))), 
                        sep = "")
    if (returnTrain) 
      out <- lapply(out, function(data, y) y[-data], y = seq(along = y))
  }
  else out <- foldVector
  out
}

You can either run this code in the console or tell R to run all of the code in your file using:

source('createFolds.R')

We can use this function as follows and view our folds:

folds <- createFolds(go.on.4th$got.it, k = 10)
folds

Now, let’s build our model k times, leaving out each fold exactly once and testing our model on that fold. Each time, we’ll calculate the RMSE and store the value in cv_results. We can look at our results for each fold and also average the RMSEs across folds.

cv_results <- lapply(folds, function(x) {
  train <- go.on.4th[-x, ]
  test <- go.on.4th[x, ]
  model <- rpart(got.it ~ run+ydstogo,data=train, cp=0.01)
  pred <- predict(model, test)
  actual <- test$got.it
  rmse <- RMSE(pred, actual)
  return(rmse)
})

cv_results

mean(unlist(cv_results))
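
To put these numbers in context, you can compare them to the same branchless-tree baseline we used earlier – the standard deviation of the outcome, this time over the full data set:

sd(go.on.4th$got.it)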

Let’s try the same thing, this time calculating logarithmic loss for each fold:

cv_results <- lapply(folds, function(x) {
  train <- go.on.4th[-x, ]
  test <- go.on.4th[x, ]
  model <- rpart(got.it ~ run+ydstogo,data=train, cp=0.01)
  pred <- predict(model, test)
  actual <- test$got.it
  logloss <- LogLoss(pred, actual)
  return(logloss)
})

cv_results

mean(unlist(cv_results))
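
One caution with log loss: if a tree leaf ever predicts a probability of exactly 0 or 1, the log(0) term makes the log loss infinite or NaN. If that happens with your more complex models, a common workaround (not part of the original LogLoss function) is to clip the predictions slightly away from 0 and 1 inside the lapply function, just before the LogLoss call:

  pred <- pmin(pmax(pred, 0.0001), 0.9999)  # keep predicted probabilities strictly between 0 and 1
  logloss <- LogLoss(pred, actual)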

Now check more complex models using cross-validation, scoring them with both root mean square error and log loss. Which model is the best?
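
One possible way to organize that comparison – the helper function below and its name are just a suggestion, not part of the lab – is to wrap the cross-validation loop in a function that takes cp and minsplit as arguments:

cv.rmse <- function(cp, minsplit){
  fold.results <- lapply(folds, function(x) {
    train <- go.on.4th[-x, ]
    test <- go.on.4th[x, ]
    model <- rpart(got.it ~ run+ydstogo, data=train, cp=cp, minsplit=minsplit)
    RMSE(predict(model, test), test$got.it)  # out-of-sample RMSE for this fold
  })
  mean(unlist(fold.results))  # average RMSE across the k folds
}

cv.rmse(cp=0.01, minsplit=20)    # the simple model
cv.rmse(cp=0.001, minsplit=10)   # the more complex model
cv.rmse(cp=0.0001, minsplit=3)   # the super complex model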

Going Further