While fans spend a great deal of time talking about how players have performed in the past, teams, fantasy players, gamblers and analysts care primarily about how a player is likely to perform going forward. In other words, they are mostly concerned with player forecasts. A forecast is, for the most part, an estimate of true talent or, in Bayesian terms, a posterior distribution. To find it, we need to know two things: the prior distribution (the range in abilities of this player’s peers) and the likelihood distribution (the probability of seeing stats like these given every possible level of ability). Determine who a player’s peers are and how they’ve performed isn’t always straight forward. Are a basketball player’s peers, all players from the same league? Only those players of the same age or who play the same position? Players who have the same built or the same vertical leap? In this lab, we’ll forecasting free throw shooting ability using a couple of different priors. The ideas used here apply to forecasting any number of abilities in any number of sports.
We’ll read in data from the 2015 Women’s NCAA season:
ncaa <- read.csv('/home/rstudioshared/shared_files/data/ncaaw_2015.csv')
View(ncaa)
Since, we’re focusing on free throw shooting, let’s making a smaller data frame with a few columns describing the player and then the information we need on their free throw shooting.
ncaa_small <- ncaa[, c("name", "school", "playerid", "pos", "FT", "FTA", "FTper")]
We’d like to know how much to regression to the mean to apply when projecting free throw shooting ability. We can figure this out by splitting the free throws taken by each player into two random samples with roughly the same numbers of attempts in each sample. The code below does just that. Try to understand as much of it as you can. The final result of this code is the addition of two columns, FTpercent1 and FTpercent2, each player’s success rate in the two random samples.
library(dplyr)
ncaa_small <- ncaa_small %>% filter(!is.na(FTA) & FTA>1) %>% mutate(FTmissed = FTA - FT,
FTA1 = floor(FTA/2), FTA2 = ceiling(FTA/2))
for (i in 1:nrow(ncaa_small)){
shots <- c(rep(0, ncaa_small$FTmissed[i]), rep(1, ncaa_small$FT[i]))
ncaa_small$FT1[i] <- sum(sample(shots,ncaa_small$FTA1[i]))
}
ncaa_small <- ncaa_small %>% mutate(FT2 = FT-FT1, FTpercent1 = FT1/FTA1, FTpercent2 = FT2/FTA2)
View(ncaa_small)
Let’s take a look at how a player’s free throwing shooting in one random sample relates to her free throw shooting in the other sample. I’ll limit this plot to players who took at least 20 total free throws (and therefore have at least 10 free throws in each random sample).
library(ggplot2)
ggplot(ncaa_small %>% filter(FTA>=20), aes(FTpercent1, FTpercent2, size=FTA, col=FT/FTA)) + geom_point()
Now, let’s calculate the amount of regression to use. In the past we did this using the equation:
\[N_{regression} = N_{sample} \cdot \frac{1-R}{R}\] where \(N_{sample}\) is the number of attempts we observed in each sample, \(R\) is the correlation between samples and \(N_{regression}\) is the amount of regression observations to use.
There’s one hang-up. Each player has a different \(N_{sample}\)!
We can address this by calculating \(N_{regression}\) using a few different ranges of \(N_{sample}\) and, hopefully, getting similar answers.
Using players with 10-15 attempts in each sample:
ncaa_small %>% filter(FTA1>=10, FTA1<=15) %>% summarize(length(FT2), mean(FTA2), cor(FTpercent1, FTpercent2))
## length(FT2) mean(FTA2) cor(FTpercent1, FTpercent2)
## 1 2075 12.82795 0.3402894
we can calculate \(N_{regression}\) as:
12.8*(1-.315)/.315
## [1] 27.83492
Note that your numbers may well differ from the ones I used above since these samples are generated randomly.
Using players with 15-20 attempts in each sample:
ncaa_small %>% filter(FTA1>=15, FTA1<=20) %>% summarize(length(FT2), mean(FTA2), cor(FTpercent1, FTpercent2))
## length(FT2) mean(FTA2) cor(FTpercent1, FTpercent2)
## 1 1594 17.89523 0.4387148
we can calculate \(N_{regression}\) as:
17.9*(1-.430)/.430
## [1] 23.72791
Using players with 20-25 attempts in each sample:
ncaa_small %>% filter(FTA1>=20, FTA1<=25) %>% summarize(length(FT2), mean(FTA2), cor(FTpercent1, FTpercent2))
## length(FT2) mean(FTA2) cor(FTpercent1, FTpercent2)
## 1 1312 22.91845 0.4946384
we can calculate \(N_{regression}\) as:
22.9*(1-.489)/.489
## [1] 23.93027
While these numbers aren’t precisely the same, they give us a sense of how much regression we should use, namely 23-27 free attempts worth.
Next’s let’s calculate the league average shooting rate (the mean that we will regress towards):
ncaa_small %>% summarize(sum(FT)/sum(FTA))
## sum(FT)/sum(FTA)
## 1 0.6795606
We can also relate these pieces of information (the league shooting percentage and the number of regression attempts) back to our beta priors with the equations:
\[\alpha + \beta = 25 (the\ number\ of\ regression\ attempts)\] \[\frac{\alpha}{\alpha + \beta} = 0.68 (the\ league\ average)\]
Rearrange these two equations to solve for \(\alpha\) and \(\beta\) and assign those values to variables:
alpha <-
beta <-
Now, we can plot our prior distribution:
prob.range <- seq(0, 1, .001)
prior <- dbeta(prob.range, alpha, beta)
plot(prob.range, prior, type="l", main="Prior Distribution for FT%")
Our posterior distribution will be different for each player, of course, but there’s a shortcut to using Bayes’ Formula for each player. It turns out that every player’s posterior is also a beta distribution and we can calculate the parameters of their distribution (the alpha and beta) as follows:
\[posterior\ alpha = prior\ alpha + FT\ made\] \[posterior\ beta = prior\ beta + FT\ missed\]
In R code, we’ll do:
ncaa_small <- ncaa_small %>% mutate(posterior_alpha = FT+alpha, posterior_beta = FTmissed+beta,
posterior_FTpercent = posterior_alpha/(posterior_alpha+posterior_beta))
View(ncaa_small)
How has the best posterior FT percent? Is she the player with the highest observed FT percent? If not, why not?
We can also plot any player’s posterior distribution using their posterior alpha and posterior beta. Look back at the code for plotting a beta distribution. You should be able to make a plot that looks similar to this:
The plot above, shows that while we think that Annie Armstrong is an \(89%\) shooter, there’s some uncertainty in our estimate of her ability.
You may have noticed that almost all of the best free throw shooters are guards. We can use this information to inform our projections.
ncaa_small %>% group_by(pos) %>% summarize(sum(FT)/sum(FTA)) #0.68
## # A tibble: 4 × 2
## pos `sum(FT)/sum(FTA)`
## <fctr> <dbl>
## 1 - 0.6539538
## 2 C 0.6303045
## 3 F 0.6565237
## 4 G 0.7018407
ncaa_small %>% group_by(pos) %>% filter(FTA1>=20, FTA1<=30) %>% summarize(length(FT2), mean(FTA2), cor(FTpercent1, FTpercent2))
## # A tibble: 4 × 4
## pos `length(FT2)` `mean(FTA2)` `cor(FTpercent1, FTpercent2)`
## <fctr> <int> <dbl> <dbl>
## 1 - 39 25.07692 0.5448881
## 2 C 159 25.31447 0.4069788
## 3 F 742 25.16577 0.5462286
## 4 G 1216 25.00329 0.4820727
Use the information immediately above to calculate alphas and betas for guards, forwards and center and assign those values to variables:
alphaG <-
betaG <-
alphaC <-
betaC <-
alphaF <-
betaF <-
Next, we’ll put this information into a data frame and join it with our NCAA shooting data:
alphabetas <- data.frame(pos = c("G", "C", "F","-"),
alpha_pos = c(alphaG, alphaC, alphaF, alpha),
beta_pos = c(betaG, betaC, betaF, beta)
)
ncaa_small <- left_join(ncaa_small, alphabetas)
## Joining, by = "pos"
View(ncaa_small)
Finally, we can calculate new posterior alphas and betas for each player and a new estimate of their shooting percentage using the position based priors:
ncaa_small <- ncaa_small %>% mutate(posterior_alpha_pos = FT+alpha_pos, posterior_beta_pos = FTmissed+beta_pos,
posterior_FTpercent_pos = posterior_alpha_pos/(posterior_alpha_pos+posterior_beta_pos))
View(ncaa_small)
How do these forecasts compare to your earlier forecasts? Whose forecasts are most affected by taking into account their position?