The Data
Comparing Simple Models Graphically: Decision Tree v. Linear Regression
Comparing Simple Models with Cross Validation: Decision Tree v. Linear Regression
More Complex Models
- Comparing Complex Models with Cross Validation
Making the Best Model

In this lab, we’ll use linear regression to predict the probability of a successfuly shot. In the process, we’ll see some of the strengths and weaknesses of linear regression.

The Data

Once again, we’ll use data from December 2015 first quarter NBA jump shots.

shots <- read.csv('/home/rstudioshared/shared_files/data/nba_savant_jumpshots_dec2015_q1.csv')

Comparing Simple Models Graphically: Decision Tree v. Linear Regression

Recall that we used this data to make decision trees before:

library(rpart); library(rpart.plot)
tree_fit <- rpart(shot_made_flag ~ shot_clock,cp=0.001,data=shots)
prp(tree_fit, type=1, fallen.leaves=TRUE, extra=1, cex=0.7)

You can follow the paths along the tree to see the predicts for any shot clock time or you can use the predict function as follows:

shot_clock <- seq(1,24, 1)
predict(tree_fit, data.frame(shot_clock=shot_clock))

We could even make this into a graph:

plot(shot_clock,predict(tree_fit, data.frame(shot_clock=shot_clock)) ,ylab="predicted shooting %")

Let’s try using linear regression instead:

linear_fit <- lm(shot_made_flag ~ shot_clock,data=shots)

coef(linear_fit)

The linear model gives us the equation of the best fit line for predicting whether the shot was made by the amount of time left on the clock. In our case that equation is:

shot_made = 0.334 + 0.0048 * shot_clock

We could simply plug in shot clock times and calculate predicted probabilities of shots being made or we could let the predict funciton do the work for us:

predict(linear_fit, data.frame(shot_clock=shot_clock))
plot(shot_clock,predict(linear_fit, data.frame(shot_clock=shot_clock)) ,ylab="predicted shooting %")

Q1 How do the predictions of the linear model compare to the predictions of the decision tree model? Based on your understanding of basketball, which do you think is a better model of reality?

Comparing Simple Models with Cross Validation: Decision Tree v. Linear Regression

Of course, we can use more than our intuition to evaluated these methods. Let’s evaulate them using Cross Validation!

First, let’s bring our createFolds function back to life and make 10 folds from our shot data:

source('createFolds.R')
folds <- createFolds(shots$shot_made_flag, k = 10)

We’ll also need our RMSE and LogLoss functions:

RMSE <- function(predictions, actuals){
  sqrt(mean((predictions-actuals)^2))
}

LogLoss <- function(predictions, actuals){
  (-1/length(predictions)) * sum (actuals * log(predictions) + (1-actuals)*log(1-predictions))
}

cv_results <- lapply(folds, function(x) {
  train <- shots[-x, ]
  test <- shots[x, ]
  model <- rpart(shot_made_flag ~ shot_clock,cp=0.001,data=shots)
  pred <- predict(model, test)
  actual <- test$shot_made_flag
  rmse <- RMSE(actual, pred)
  return(rmse)
})


mean(unlist(cv_results))

cv_results <- lapply(folds, function(x) {
  train <- shots[-x, ]
  test <- shots[x, ]
  model <- lm(shot_made_flag ~ shot_clock,data=shots)
  pred <- predict(model, test)
  actual <- test$shot_made_flag
  rmse <- RMSE(actual, pred)
  return(rmse)
})

mean(unlist(cv_results))

More Complex Models

Let’s add in shot_distance and defender_distance and once again create tree and linear models:

tree_fit <- rpart(shot_made_flag ~ shot_clock+shot_distance+defender_distance,cp=0.001,data=shots)
prp(tree_fit, type=1, fallen.leaves=TRUE, extra=1, cex=0.7)

linear_fit <- lm(shot_made_flag ~ shot_clock+shot_distance+defender_distance,data=shots)
coef(linear_fit)

The linear model now takes the form:

\[shot\ made = \beta_0 + \beta_{shot\ clock} \cdot shot\ clock + \beta_{shot\ distance} \cdot shot\ distance + \beta_{defender\ distance} \cdot defender\ distance \] and the coefficient’s are the \(\beta's\).

Q2 Using your linear model, rredict the probability of making a shot with each of the following characteristics (use the R console as a calculator):

20 seconds left on the shot clock, 5 feet to the basket and the defender is 5 feet away.
5 seconds left on the shot clock, 20 feet to the basket and the defender is 1 foot away.

You can check your answers with:

predict(linear_fit, data.frame(shot_clock=c(20,5), shot_distance=c(5,20), defender_distance=c(5,1)))

Q3 What do you think are the shortcomings of the linear model? What are it’s strengths relative to the tree model?

Comparing Complex Models with Cross Validation

We can use more of less the same code we used to evaluate the simple models to evaluate the complex models. Note the changes to the fourth line of code where we create the models.

cv_results <- lapply(folds, function(x) {
  train <- shots[-x, ]
  test <- shots[x, ]
  model <- rpart(shot_made_flag ~ shot_clock+shot_distance+defender_distance,cp=0.001,data=shots)
  pred <- predict(model, test)
  actual <- test$shot_made_flag
  rmse <- RMSE(actual, pred)
  return(rmse)
})


mean(unlist(cv_results))

cv_results <- lapply(folds, function(x) {
  train <- shots[-x, ]
  test <- shots[x, ]
  model <- lm(shot_made_flag ~ shot_clock+shot_distance+defender_distance,data=shots)
  pred <- predict(model, test)
  actual <- test$shot_made_flag
  rmse <- RMSE(actual, pred)
  return(rmse)
})

mean(unlist(cv_results))

Making the Best Model

Challenge: Create the best model you can to predict shot_made_flag using either linear regression or decision trees. Calculate the out-of-sample RMSE of your model.

Shot Made Linear Regression

Sports Data Science

March 31, 2017