How to Make a Forecast

While fans spend a great deal of time talking about how players have performed in the past, teams, fantasy players, gamblers and analysts care primarily about how a player is likely to perform going forward. In other words, they are mostly concerned with player forecasts. A forecast is, for the most part, an estimate of true talent or, in Bayesian terms, a posterior distribution. To find it, we need to know two things: the prior distribution (the range in abilities of this player’s peers) and the likelihood distribution (the probability of seeing stats like these given every possible level of ability). Determine who a player’s peers are and how they’ve performed isn’t always straight forward. Are a basketball player’s peers, all players from the same league? Only those players of the same age or who play the same position? Players who have the same built or the same vertical leap? In this lab, we’ll forecasting free throw shooting ability using a couple of different priors. The ideas used here apply to forecasting any number of abilities in any number of sports.

The Data

We’ll read in data from the 2015 Women’s NCAA season:

ncaa <- read.csv('/home/rstudioshared/shared_files/data/ncaaw_2015.csv')
View(ncaa)

Since, we’re focusing on free throw shooting, let’s making a smaller data frame with a few columns describing the player and then the information we need on their free throw shooting.

ncaa_small <- ncaa[, c("name", "school", "playerid", "pos", "FT", "FTA", "FTper")]

We’d like to know how much to regression to the mean to apply when projecting free throw shooting ability. We can figure this out by splitting the free throws taken by each player into two random samples with roughly the same numbers of attempts in each sample. The code below does just that. Try to understand as much of it as you can. The final result of this code is the addition of two columns, FTpercent1 and FTpercent2, each player’s success rate in the two random samples.

library(dplyr)

ncaa_small <- ncaa_small %>% filter(!is.na(FTA) & FTA>1) %>% mutate(FTmissed = FTA - FT,
                                    FTA1 = floor(FTA/2), FTA2 = ceiling(FTA/2))


for (i in 1:nrow(ncaa_small)){
  shots <- c(rep(0, ncaa_small$FTmissed[i]), rep(1, ncaa_small$FT[i]))
  ncaa_small$FT1[i] <- sum(sample(shots,ncaa_small$FTA1[i]))
  
}
ncaa_small <- ncaa_small %>% mutate(FT2 = FT-FT1, FTpercent1 = FT1/FTA1, FTpercent2 = FT2/FTA2)
View(ncaa_small)

Let’s take a look at how a player’s free throwing shooting in one random sample relates to her free throw shooting in the other sample. I’ll limit this plot to players who took at least 20 total free throws (and therefore have at least 10 free throws in each random sample).

library(ggplot2)
ggplot(ncaa_small %>% filter(FTA>=20), aes(FTpercent1, FTpercent2, size=FTA, col=FT/FTA)) + geom_point()

Now, let’s calculate the amount of regression to use. In the past we did this using the equation:

\[N_{regression} = N_{sample} \cdot \frac{1-R}{R}\] where \(N_{sample}\) is the number of attempts we observed in each sample, \(R\) is the correlation between samples and \(N_{regression}\) is the amount of regression observations to use.

There’s one hang-up. Each player has a different \(N_{sample}\)!

We can address this by calculating \(N_{regression}\) using a few different ranges of \(N_{sample}\) and, hopefully, getting similar answers.

Using players with 10-15 attempts in each sample:

ncaa_small %>% filter(FTA1>=10, FTA1<=15) %>% summarize(length(FT2), mean(FTA2), cor(FTpercent1, FTpercent2))
##   length(FT2) mean(FTA2) cor(FTpercent1, FTpercent2)
## 1        2075   12.82795                   0.3402894

we can calculate \(N_{regression}\) as:

12.8*(1-.315)/.315
## [1] 27.83492

Note that your numbers may well differ from the ones I used above since these samples are generated randomly.

Using players with 15-20 attempts in each sample:

ncaa_small %>% filter(FTA1>=15, FTA1<=20) %>% summarize(length(FT2), mean(FTA2), cor(FTpercent1, FTpercent2))
##   length(FT2) mean(FTA2) cor(FTpercent1, FTpercent2)
## 1        1594   17.89523                   0.4387148

we can calculate \(N_{regression}\) as:

17.9*(1-.430)/.430
## [1] 23.72791

Using players with 20-25 attempts in each sample:

ncaa_small %>% filter(FTA1>=20, FTA1<=25) %>% summarize(length(FT2), mean(FTA2), cor(FTpercent1, FTpercent2))
##   length(FT2) mean(FTA2) cor(FTpercent1, FTpercent2)
## 1        1312   22.91845                   0.4946384

we can calculate \(N_{regression}\) as:

22.9*(1-.489)/.489
## [1] 23.93027

While these numbers aren’t precisely the same, they give us a sense of how much regression we should use, namely 23-27 free attempts worth.

Next’s let’s calculate the league average shooting rate (the mean that we will regress towards):

ncaa_small %>% summarize(sum(FT)/sum(FTA))
##   sum(FT)/sum(FTA)
## 1        0.6795606

We can also relate these pieces of information (the league shooting percentage and the number of regression attempts) back to our beta priors with the equations:

\[\alpha + \beta = 25 (the\ number\ of\ regression\ attempts)\] \[\frac{\alpha}{\alpha + \beta} = 0.68 (the\ league\ average)\]

Rearrange these two equations to solve for \(\alpha\) and \(\beta\) and assign those values to variables:

alpha <- 
beta <- 

Now, we can plot our prior distribution:

prob.range <- seq(0, 1, .001)
prior <- dbeta(prob.range, alpha, beta)
plot(prob.range, prior, type="l", main="Prior Distribution for FT%")

Our posterior distribution will be different for each player, of course, but there’s a shortcut to using Bayes’ Formula for each player. It turns out that every player’s posterior is also a beta distribution and we can calculate the parameters of their distribution (the alpha and beta) as follows:

\[posterior\ alpha = prior\ alpha + FT\ made\] \[posterior\ beta = prior\ beta + FT\ missed\]

In R code, we’ll do:

ncaa_small <- ncaa_small %>% mutate(posterior_alpha = FT+alpha, posterior_beta = FTmissed+beta,
                      posterior_FTpercent = posterior_alpha/(posterior_alpha+posterior_beta))
View(ncaa_small)

How has the best posterior FT percent? Is she the player with the highest observed FT percent? If not, why not?

We can also plot any player’s posterior distribution using their posterior alpha and posterior beta. Look back at the code for plotting a beta distribution. You should be able to make a plot that looks similar to this:

The plot above, shows that while we think that Annie Armstrong is an \(89%\) shooter, there’s some uncertainty in our estimate of her ability.

Prior Distributions Based on Position

You may have noticed that almost all of the best free throw shooters are guards. We can use this information to inform our projections.

ncaa_small %>% group_by(pos) %>% summarize(sum(FT)/sum(FTA)) #0.68
## # A tibble: 4 × 2
##      pos `sum(FT)/sum(FTA)`
##   <fctr>              <dbl>
## 1      -          0.6539538
## 2      C          0.6303045
## 3      F          0.6565237
## 4      G          0.7018407
ncaa_small %>% group_by(pos) %>% filter(FTA1>=20, FTA1<=30) %>% summarize(length(FT2), mean(FTA2), cor(FTpercent1, FTpercent2))
## # A tibble: 4 × 4
##      pos `length(FT2)` `mean(FTA2)` `cor(FTpercent1, FTpercent2)`
##   <fctr>         <int>        <dbl>                         <dbl>
## 1      -            39     25.07692                     0.5448881
## 2      C           159     25.31447                     0.4069788
## 3      F           742     25.16577                     0.5462286
## 4      G          1216     25.00329                     0.4820727

Use the information immediately above to calculate alphas and betas for guards, forwards and center and assign those values to variables:

alphaG <- 
betaG <- 
  
alphaC <- 
betaC <- 
  
alphaF <- 
betaF <- 

Next, we’ll put this information into a data frame and join it with our NCAA shooting data:

alphabetas <- data.frame(pos = c("G", "C", "F","-"),
           alpha_pos = c(alphaG, alphaC, alphaF, alpha),
           beta_pos = c(betaG, betaC, betaF, beta)
             )
  
ncaa_small <- left_join(ncaa_small, alphabetas)
## Joining, by = "pos"
View(ncaa_small)

Finally, we can calculate new posterior alphas and betas for each player and a new estimate of their shooting percentage using the position based priors:

ncaa_small <- ncaa_small %>% mutate(posterior_alpha_pos = FT+alpha_pos, posterior_beta_pos = FTmissed+beta_pos,
                                    posterior_FTpercent_pos = posterior_alpha_pos/(posterior_alpha_pos+posterior_beta_pos))
View(ncaa_small)

How do these forecasts compare to your earlier forecasts? Whose forecasts are most affected by taking into account their position?