Scorecasting: Tiger Woods is Human

In this chapter, the authors make what I think is a rather surprising claim (and I’m not referring to the claim that Tiger Woods is human). They claim that we can’t treat sporting events as “path independent”. In other words, a 3-2 count in baseball is not just a 3-2 count. According to the authors, it matters how we got there: a pitcher will have more success if he started out ahead, reaching an 0-2 count, and is now at risk of losing an out that he thought he had secured. Likewise, the hitter will have more success if he started out 3-0 and is now at risk of losing something that had seemed likely: a trip to first.

The authors further claim that we see the same phenomenon in basketball games. A team will perform better in the final 12 minutes, they claim, if it was previously ahead and is now at risk of losing a game that once seemed to be in hand.

In this lab, we will investigate this claim using the scores after each quarter of over 26,000 basketball games from the 1995 through 2016 seasons.

The Data

nba <- read.csv('/home/rstudioshared/shared_files/data/nba_scores_by_quarter.csv')
head(nba)

Each row contains the season, the team, the line (remember that a negative line means that the team in question was favored), the date, the game.number (1 would mean that it’s the first game of the season) and the margins (leads) at the end of the 1st, 2nd and 3rd quarters (m1, m2 and m3) as well as at the end of the game (m4).

Notice that every game shows up twice. For instance, in the first two rows we see an opening day (November 3rd) 1995 game between the Bulls and Pelicans from the perspective of both teams. The Bulls were favored in this game by 13 points, but fell behind by 1 point after the first quarter and by 8 at the half. The Bulls rallied in the 3rd quarter, outscoring the Pelicans by 22 points to take a 14-point lead, which they maintained over the course of the 4th quarter.
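The two-rows-per-game layout can be checked with a quick sketch. The toy data frame below is made up for illustration: the column names line, m3 and m4 follow the lab's data, but the values and the game.id column are assumptions, as is the idea that the two rows of a game carry numbers of equal size and opposite sign.

```r
# Toy data: two games, each listed once per team (all values are made up).
toy <- data.frame(
  game.id = c(1, 1, 2, 2),        # hypothetical identifier, assumed for this sketch
  team    = c("A", "B", "C", "D"),
  line    = c(-13, 13, 4.5, -4.5),
  m3      = c(14, -14, -3, 3),
  m4      = c(14, -14, 2, -2)
)

# Within each game, the two rows should cancel out column by column.
sums <- aggregate(cbind(line, m3, m4) ~ game.id, data = toy, FUN = sum)
print(sums)

# TRUE when every game's two rows are exact mirror images of each other.
mirrored <- all(sums[, c("line", "m3", "m4")] == 0)
print(mirrored)
```

If the real data has a column that identifies a game (for example a combination of date and the two teams), the same check could be run on it.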

The relationship between lines and scores

The following code creates a plot of final margins versus the line for the game and then calculates the equation for the best fit line.

library(dplyr); library(ggplot2)

ggplot(nba, aes(line, m4)) + geom_point() +geom_smooth(method="lm")
lm(m4 ~ line, data=nba)

What is the equation of the best fit line?
How would you interpret the slope of this line?

Predicting the Outcome from the Scoring at the End of 3

To predict the final margin from the margin at the end of the 3rd quarter, we can use the same linear model function. We can then use the best fit line to make predictions: what margin would we expect a team to win or lose by, given its margin entering the 4th quarter? The last line of code below makes predictions for end-of-3rd-quarter margins of -10, -5, 0, 5 and 10 points.

model <- lm(m4 ~ m3, data=nba)

coef(model)
round(predict(model, list(m3=c(-10, -5, 0, 5, 10))),2)
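To see what predict is doing, here is a minimal sketch on made-up data (the toy data frame and its values are assumptions, not the NBA data). It rebuilds each prediction by hand as intercept + slope * m3 and checks that the result matches predict.

```r
# Toy data (made up): end-of-3rd margins and final margins,
# with each game listed from both teams' perspectives.
toy <- data.frame(
  m3 = c(-12, -6, -1, 1, 6, 12),
  m4 = c(-10, -8, 2, -2, 8, 10)
)
toy_model <- lm(m4 ~ m3, data = toy)

# A fitted line predicts: intercept + slope * m3.
b <- coef(toy_model)
new_m3 <- c(-10, -5, 0, 5, 10)
by_hand <- unname(b["(Intercept)"] + b["m3"] * new_m3)
by_predict <- unname(predict(toy_model, data.frame(m3 = new_m3)))

# TRUE when the manual calculation and predict() agree.
print(all.equal(by_hand, by_predict))
```

Because the toy data lists every game from both sides, its margins average to zero and the fitted intercept comes out at zero up to rounding. If the lab's data is mirrored in the same way, its intercept should be close to zero too.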

We can also use this model to predict the final margin of every one of our 26,493 games based on their end-of-3rd-quarter margins.

nba <- nba %>% mutate(pred.score = predict(model, nba))
head(nba)

Since we have predictions for the final margins, we can compare them to the actual end-of-game margins and calculate the difference between the actual margin and our predicted margin for each game.

nba <- nba %>% mutate(margin.over.predicted = m4-pred.score)
head(nba)

For each game, we have now calculated how much better or worse each team did in the 4th quarter than we would expect based on the margin entering the 4th.
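The margin.over.predicted column is what a linear model calls a residual. A minimal sketch on made-up data (the toy values are assumptions) suggests that the manual calculation agrees with R's residuals function, and that residuals from a model with an intercept average out to zero.

```r
# Toy data (made up): end-of-3rd margins and final margins.
toy <- data.frame(
  m3 = c(-12, -6, -1, 1, 6, 12),
  m4 = c(-10, -8, 2, -2, 8, 10)
)
toy_model <- lm(m4 ~ m3, data = toy)

# Same two steps as in the lab: predict, then subtract.
toy$pred.score <- predict(toy_model, toy)
toy$margin.over.predicted <- toy$m4 - toy$pred.score

# Compare with the residuals stored in the fitted model.
same_as_resid <- all.equal(toy$margin.over.predicted,
                           unname(residuals(toy_model)))
avg_resid <- mean(toy$margin.over.predicted)

print(same_as_resid)
print(avg_resid)  # zero up to floating-point rounding
```

So, averaged over all rows, teams neither beat nor miss the prediction. The interesting question is whether particular groups of teams do.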

Now (finally!) to check Scorecasting’s claims!

Do teams that were ahead and lost points in the 3rd quarter do better in the 4th quarter than we expect?

We can answer this by looking at the differences between reality and our predictions as a function of half-time margins. If the Scorecasting authors are correct, teams with larger half-time leads should do well, on average, relative to our predictions. More specifically, the slope of the margin.over.predicted versus m2 line should be positive (meaning that teams that previously led the game should do better in the 4th quarter than predicted).

ggplot(nba, aes(m2, margin.over.predicted))+geom_point()+geom_smooth(method="lm")
lm(margin.over.predicted ~ m2, data=nba)

Does this data provide support for Scorecasting’s claims?

Close Games

Let’s focus on close games (games that are within five points entering the fourth quarter) and calculate this same slope.

ggplot(nba %>% filter(abs(m3)<=5), aes(m2, margin.over.predicted))+geom_point()+geom_smooth(method="lm")
lm(margin.over.predicted ~ m2, data=nba %>% filter(abs(m3)<=5))

Does this data provide stronger evidence for Scorecasting’s claims?
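One property of the close-game filter is worth a quick sketch: because it uses the absolute value of m3, it should keep or drop both rows of a game together. The toy data below is made up, the game.id column is a hypothetical identifier, and base R's subset stands in for dplyr's filter so the example runs without extra packages.

```r
# Toy data (made up): three games, two rows each, with mirrored m3 values.
toy <- data.frame(
  game.id = c(1, 1, 2, 2, 3, 3),   # hypothetical identifier, assumed for this sketch
  m3      = c(4, -4, -11, 11, 0, 0)
)

# Same condition as filter(abs(m3) <= 5), written with base R.
close_toy <- subset(toy, abs(m3) <= 5)

# Count how many rows of each surviving game remain.
rows_per_game <- table(close_toy$game.id)
print(rows_per_game)
```

Games 1 and 3 survive with both of their rows, and game 2 is dropped entirely, so the filtered data still has two rows per game.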

Bootstrapping

The slopes we just found are based on a sample of reality - one basketball league for a number of years. We can use bootstrapping to understand how much uncertainty there is in our estimate of the true slope. Sticking with close games, our estimate was based on 9,285 games (18,570 rows in our dataset, which has two rows per game), so in each of our bootstrap samples we’ll draw 9,285 rows with replacement. This amounts to choosing 9,285 games with replacement, each seen from a randomly chosen team’s perspective.

slopes <- rep(NA, 100)
close_games <- nba %>% filter(abs(m3)<=5)
n <- nrow(close_games)/2  # number of games (each game has two rows)

for (i in 1:100){
  # draw n rows, with replacement, from all rows of close_games
  rows <- sample(1:nrow(close_games), n, replace=TRUE)
  # refit the line on the resampled rows and keep its slope
  slopes[i] <- as.numeric(coef(lm(margin.over.predicted ~ m2, data=close_games[rows, ]))[2])
}

Now, let’s take a look at our bootstrapped slopes:

hist(slopes)
quantile(slopes, c(0.01, 0.1, 0.5, 0.9, 0.99))

How much uncertainty is there in our estimate of the relationship between half-time margin and doing better than expected in the 4th quarter?
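One common way to put numbers on that uncertainty is to report the standard deviation of the bootstrapped slopes (a bootstrap standard error) together with a percentile interval. The sketch below does this on simulated data. The 0.05 slope, the noise levels and the sample size are arbitrary assumptions chosen for illustration, so the values it prints say nothing about the NBA.

```r
set.seed(1)  # makes this illustration reproducible

# Simulated stand-in for close_games (not real NBA data).
n_rows <- 200
toy <- data.frame(m2 = round(rnorm(n_rows, mean = 0, sd = 8)))
toy$margin.over.predicted <- 0.05 * toy$m2 + rnorm(n_rows, mean = 0, sd = 5)

# Refit the line on 500 resampled data sets and keep each slope.
toy_slopes <- replicate(500, {
  rows <- sample(nrow(toy), nrow(toy), replace = TRUE)
  coef(lm(margin.over.predicted ~ m2, data = toy[rows, ]))[["m2"]]
})

boot_se <- sd(toy_slopes)                         # bootstrap standard error
boot_ci <- quantile(toy_slopes, c(0.025, 0.975))  # 95% percentile interval
print(boot_se)
print(boot_ci)
```

If the interval for the real slopes sits clearly above zero, that would count in favor of Scorecasting's claim. An interval that straddles zero would mean the data cannot distinguish the claim from no effect.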

Going Further

Can you devise a different way to evaluate Scorecasting’s hypothesis using this data?

NBA Scores by Quarter

Sports Data Science

1/11/2017