g2017<-read.csv("../Data/2017games.csv", header=TRUE, stringsAsFactors = FALSE)
NCAA2017 <- g2017[c(1,3,6,10)]
NCAA2017$W.Rating<-g2017$Wrating
NCAA2017$L.Rating<-g2017$Lrating
NCAA2017$Abs<-g2017$Abs
NCAA2017$W.L<-g2017$Hit
Ken Pomeroy is a professional NCAA college basketball statistician who calculates team efficiency ratings.
AdjEM (adjusted efficiency margin) is the difference between a team’s offensive and defensive efficiency. It represents the number of points by which the team would be expected to outscore the average Division I team over 100 possessions, and it has the advantage of being a linear measure: the difference between +31 and +28 is the same as the difference between +4 and +1. In both cases it is three points per 100 possessions, which makes the rating much easier to interpret.
There were 67 games in the 2017 NCAA basketball tournament. Here is the data set of those games, along with the AdjEM for each team and whether the difference in those ratings correctly predicted the winner.
datatable(NCAA2017, extensions = "Responsive",options=list(lengthMenu = c(10,25,68)))
Can we create a model using the AdjEM for two teams to predict the winner?
For each of the 67 NCAA tournament games in 2017, I took the difference between the two teams’ AdjEM ratings from Ken Pomeroy and used that difference to predict the winner of the matchup. I then recorded whether the predicted winner actually won. Using logistic regression on these results, I created a model for the probability that the higher-rated team will actually win.
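The steps described above can be sketched in a few lines of R. This is a minimal illustration on made-up ratings, not the 2017 data: the `games` data frame and its values are hypothetical, and the column names simply mirror those used in `NCAA2017`.

```r
# Hypothetical toy data: AdjEM of the actual winner and loser of three games.
# These numbers are invented for illustration only.
games <- data.frame(
  W.Rating = c(25.1, 10.3, 18.7),
  L.Rating = c(12.4, 14.9, 18.0)
)

# Absolute difference in AdjEM between the two teams
games$Abs <- abs(games$W.Rating - games$L.Rating)

# 1 if the higher-rated team (the predicted winner) actually won, 0 otherwise
games$W.L <- as.integer(games$W.Rating > games$L.Rating)

print(games)
```

In this toy example the second game is a miss: the lower-rated team won, so `W.L` is 0 for that row.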
plot(W.L~Abs,data=NCAA2017, pch=16, xlab="Absolute Difference in AdjEM", ylab = "Predicted Winner Results", main="Ken Pomeroy's AdjEM Differential Prediction Model")
KP.glm <- glm(W.L~Abs,data=NCAA2017, family=binomial)
pander(summary(KP.glm))
Term | Estimate | Std. Error | z value | Pr(>\|z\|) |
---|---|---|---|---|
(Intercept) | -0.291 | 0.4714 | -0.6172 | 0.5371 |
Abs | 0.215 | 0.07865 | 2.734 | 0.00626 |
(Dispersion parameter for binomial family taken to be 1 )
Null deviance: | 75.90 on 66 degrees of freedom |
Residual deviance: | 61.24 on 65 degrees of freedom |
b<-KP.glm$coefficients
pvLR<-coef(summary(KP.glm))[2,4]
curve(exp(b[1]+b[2]*x)/(1+exp(b[1]+b[2]*x)), add=TRUE)
pc<-predict(KP.glm, data.frame(Abs=10), type='response')
Is logistic regression an appropriate model for using the difference in AdjEM to predict the probability of a team winning?
I want to test two things:
From the logistic regression, the p-value of 0.0062601 on the slope shows that there is a significant relationship between the difference in AdjEM and whether the higher-rated team wins.
Using the Hosmer and Lemeshow goodness of fit (GOF) test:
\[ H_0:\text{The logistic model is appropriate}\\H_a:\text{The logistic model is not appropriate} \]
library(ResourceSelection)
pander(hoslem.test(KP.glm$y, KP.glm$fitted, g=10))
Test statistic | df | P value |
---|---|---|
4.144 | 8 | 0.8439 |
pvHL<-hoslem.test(KP.glm$y, KP.glm$fitted, g=10)$p.value
# Note: doesn't give a p-value for g >= 7, default is g=10.
# Larger g is usually better than smaller g.
With a p-value of 0.8438981, there is very little evidence that this logistic regression model fits poorly, so we fail to reject the null hypothesis that the model is appropriate.
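Both p-values quoted above can be checked directly from the printed statistics. The sketch below assumes the rounded values shown in the output tables, so the results agree only to a few decimal places, and it takes the degrees of freedom for the Hosmer and Lemeshow statistic to be g - 2 = 8, as the test output reports.

```r
# Values copied (rounded) from the output tables
slope    <- 0.215
slope.se <- 0.07865
hl.stat  <- 4.144
hl.df    <- 8        # g - 2 with g = 10 groups

# Wald test on the slope: z = estimate / standard error, two-sided p-value
z      <- slope / slope.se
p.wald <- 2 * pnorm(-abs(z))

# Hosmer and Lemeshow test: upper-tail chi-squared probability
p.hl <- pchisq(hl.stat, df = hl.df, lower.tail = FALSE)

print(round(c(z = z, p.wald = p.wald, p.hl = p.hl), 4))
```

A small Wald p-value points to a real relationship with the AdjEM difference, while a large Hosmer and Lemeshow p-value means the observed and fitted win rates do not differ noticeably across the groups.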
plot(W.L~Abs,data=NCAA2017, pch=16, xlab="Absolute Difference in AdjEM", ylab = "Predicted Winner Results", main="Ken Pomeroy's AdjEM Differential Prediction Model")
KP.glm <- glm(W.L~Abs,data=NCAA2017, family=binomial)
b<-KP.glm$coefficients
curve(exp(b[1]+b[2]*x)/(1+exp(b[1]+b[2]*x)), add=TRUE)
Notice that when the difference between the AdjEM of two teams is near zero, the fitted probability that the higher-rated team wins is about 43%, close to the 50% one would expect for two evenly matched teams.
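To make the fitted curve concrete, the sketch below evaluates the logistic function at a few AdjEM differences. It uses the rounded coefficients from the summary table rather than `KP.glm` itself, so the numbers are approximate, and `win.prob` is a helper name introduced here for illustration.

```r
# Rounded coefficients copied from the model summary (an approximation)
b0 <- -0.291   # intercept
b1 <- 0.215    # slope on the absolute AdjEM difference

# Hypothetical helper: probability that the higher-rated team wins.
# plogis(x) is 1 / (1 + exp(-x)), the same curve drawn on the plot.
win.prob <- function(abs.diff) plogis(b0 + b1 * abs.diff)

probs <- win.prob(c(0, 5, 10, 20))
print(round(probs, 3))
```

With the fitted model in the session, `predict(KP.glm, data.frame(Abs=10), type='response')` gives the same quantity from the unrounded coefficients.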
It appears that Ken Pomeroy’s evaluation of team efficiency is an appropriate predictor of the winner in the 2017 NCAA tournament.
On 16 Mar 2018, the unthinkable happened. The University of Maryland, Baltimore County (UMBC), a No. 16 seed, defeated the top overall seed in the tournament, the University of Virginia, by 20 points. Ken Pomeroy had the difference in AdjEM between the two teams as 34 points.
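For context, the 2017 model can be evaluated at a difference of 34. This is a rough sketch only: it assumes the rounded coefficients from the summary table, and it applies a model fit on 2017 games to a 2018 matchup, possibly beyond the range of differences the model was fit on.

```r
# Rounded coefficients copied from the 2017 model summary (an approximation)
b0 <- -0.291
b1 <- 0.215

# Fitted probability that the higher-rated team wins at a difference of 34
p.favorite <- plogis(b0 + b1 * 34)

# Implied probability of the upset
p.upset <- 1 - p.favorite

print(c(favorite = p.favorite, upset = p.upset))
```

Under those assumptions the model gives the favorite a win probability above 99.9%, which is one way to quantify how unlikely the result was.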