In this lab, we’ll use linear regression to predict the probability of a successfuly shot. In the process, we’ll see some of the strengths and weaknesses of linear regression.
Once again, we’ll use data from December 2015 first quarter NBA jump shots.
shots <- read.csv('/home/rstudioshared/shared_files/data/nba_savant_jumpshots_dec2015_q1.csv')
Recall that we used this data to make decision trees before:
library(rpart); library(rpart.plot)
tree_fit <- rpart(shot_made_flag ~ shot_clock,cp=0.001,data=shots)
prp(tree_fit, type=1, fallen.leaves=TRUE, extra=1, cex=0.7)
You can follow the paths along the tree to see the predicts for any shot clock time or you can use the predict function as follows:
shot_clock <- seq(1,24, 1)
predict(tree_fit, data.frame(shot_clock=shot_clock))
We could even make this into a graph:
plot(shot_clock,predict(tree_fit, data.frame(shot_clock=shot_clock)) ,ylab="predicted shooting %")
Let’s try using linear regression instead:
linear_fit <- lm(shot_made_flag ~ shot_clock,data=shots)
coef(linear_fit)
The linear model gives us the equation of the best fit line for predicting whether the shot was made by the amount of time left on the clock. In our case that equation is:
shot_made = 0.334 + 0.0048 * shot_clock
We could simply plug in shot clock times and calculate predicted probabilities of shots being made or we could let the predict funciton do the work for us:
predict(linear_fit, data.frame(shot_clock=shot_clock))
plot(shot_clock,predict(linear_fit, data.frame(shot_clock=shot_clock)) ,ylab="predicted shooting %")
Q1 How do the predictions of the linear model compare to the predictions of the decision tree model? Based on your understanding of basketball, which do you think is a better model of reality?
Of course, we can use more than our intuition to evaluated these methods. Let’s evaulate them using Cross Validation!
First, let’s bring our createFolds function back to life and make 10 folds from our shot data:
source('createFolds.R')
folds <- createFolds(shots$shot_made_flag, k = 10)
We’ll also need our RMSE and LogLoss functions:
RMSE <- function(predictions, actuals){
sqrt(mean((predictions-actuals)^2))
}
LogLoss <- function(predictions, actuals){
(-1/length(predictions)) * sum (actuals * log(predictions) + (1-actuals)*log(1-predictions))
}
cv_results <- lapply(folds, function(x) {
train <- shots[-x, ]
test <- shots[x, ]
model <- rpart(shot_made_flag ~ shot_clock,cp=0.001,data=shots)
pred <- predict(model, test)
actual <- test$shot_made_flag
rmse <- RMSE(actual, pred)
return(rmse)
})
mean(unlist(cv_results))
cv_results <- lapply(folds, function(x) {
train <- shots[-x, ]
test <- shots[x, ]
model <- lm(shot_made_flag ~ shot_clock,data=shots)
pred <- predict(model, test)
actual <- test$shot_made_flag
rmse <- RMSE(actual, pred)
return(rmse)
})
mean(unlist(cv_results))
Let’s add in shot_distance and defender_distance and once again create tree and linear models:
tree_fit <- rpart(shot_made_flag ~ shot_clock+shot_distance+defender_distance,cp=0.001,data=shots)
prp(tree_fit, type=1, fallen.leaves=TRUE, extra=1, cex=0.7)
linear_fit <- lm(shot_made_flag ~ shot_clock+shot_distance+defender_distance,data=shots)
coef(linear_fit)
The linear model now takes the form:
\[shot\ made = \beta_0 + \beta_{shot\ clock} \cdot shot\ clock + \beta_{shot\ distance} \cdot shot\ distance + \beta_{defender\ distance} \cdot defender\ distance \] and the coefficient’s are the \(\beta's\).
Q2 Using your linear model, rredict the probability of making a shot with each of the following characteristics (use the R console as a calculator):
You can check your answers with:
predict(linear_fit, data.frame(shot_clock=c(20,5), shot_distance=c(5,20), defender_distance=c(5,1)))
Q3 What do you think are the shortcomings of the linear model? What are it’s strengths relative to the tree model?
We can use more of less the same code we used to evaluate the simple models to evaluate the complex models. Note the changes to the fourth line of code where we create the models.
cv_results <- lapply(folds, function(x) {
train <- shots[-x, ]
test <- shots[x, ]
model <- rpart(shot_made_flag ~ shot_clock+shot_distance+defender_distance,cp=0.001,data=shots)
pred <- predict(model, test)
actual <- test$shot_made_flag
rmse <- RMSE(actual, pred)
return(rmse)
})
mean(unlist(cv_results))
cv_results <- lapply(folds, function(x) {
train <- shots[-x, ]
test <- shots[x, ]
model <- lm(shot_made_flag ~ shot_clock+shot_distance+defender_distance,data=shots)
pred <- predict(model, test)
actual <- test$shot_made_flag
rmse <- RMSE(actual, pred)
return(rmse)
})
mean(unlist(cv_results))
Challenge: Create the best model you can to predict shot_made_flag using either linear regression or decision trees. Calculate the out-of-sample RMSE of your model.