[ source files available on GitHub ]
Libraries needed for data processing and plotting:
library("dplyr")
library("magrittr")
library("ggplot2")
Source external script with my own handy functions definitions:
source("./scripts/my_defs_u2.R")
The content of this external file is included in the Appendix at the end of this report.
Read the datasets NBA_train.csv
:
NBA <- read.csv("data/NBA_train.csv")
str(NBA)
## 'data.frame': 835 obs. of 20 variables:
## $ SeasonEnd: int 1980 1980 1980 1980 1980 1980 1980 1980 1980 1980 ...
## $ Team : Factor w/ 37 levels "Atlanta Hawks",..: 1 2 5 6 8 9 10 11 12 13 ...
## $ Playoffs : int 1 1 0 0 0 0 0 1 0 1 ...
## $ W : int 50 61 30 37 30 16 24 41 37 47 ...
## $ PTS : int 8573 9303 8813 9360 8878 8933 8493 9084 9119 8860 ...
## $ oppPTS : int 8334 8664 9035 9332 9240 9609 8853 9070 9176 8603 ...
## $ FG : int 3261 3617 3362 3811 3462 3643 3527 3599 3639 3582 ...
## $ FGA : int 7027 7387 6943 8041 7470 7596 7318 7496 7689 7489 ...
## $ X2P : int 3248 3455 3292 3775 3379 3586 3500 3495 3551 3557 ...
## $ X2PA : int 6952 6965 6668 7854 7215 7377 7197 7117 7375 7375 ...
## $ X3P : int 13 162 70 36 83 57 27 104 88 25 ...
## $ X3PA : int 75 422 275 187 255 219 121 379 314 114 ...
## $ FT : int 2038 1907 2019 1702 1871 1590 1412 1782 1753 1671 ...
## $ FTA : int 2645 2449 2592 2205 2539 2149 1914 2326 2333 2250 ...
## $ ORB : int 1369 1227 1115 1307 1311 1226 1155 1394 1398 1187 ...
## $ DRB : int 2406 2457 2465 2381 2524 2415 2437 2217 2326 2429 ...
## $ AST : int 1913 2198 2152 2108 2079 1950 2028 2149 2148 2123 ...
## $ STL : int 782 809 704 764 746 783 779 782 900 863 ...
## $ BLK : int 539 308 392 342 404 562 339 373 530 356 ...
## $ TOV : int 1495 1539 1684 1370 1533 1742 1492 1565 1517 1439 ...
Premise: There are quite a few variables that have the variable name and then the same variable with a ‘A’ suffix.
The variables with ‘A’ suffix refer to the number that were attempted, the corresponding variables without suffix refer to the number that were successful.
SeasonEnd
: year the season ended.Team
: name of the team.playoffs
: binary variable for whether or not a team made it to the playoffs that year.W
: the number of regular season wins.PTS
: points scored during the regular season.oppPTS
: opponent points scored during the regular season.FG
, FGA
: number of successful field goals, including two and three pointers,X2P
, X2PA
: 2-pointers.X3P
, X3PA
: 3-pointers.FT
, FTA
: free throws.ORB
, DRB
: offensive and defensive rebounds.AST
: assists.STL
: steals.BLK
: blocks.TOV
: turnovers.NOTE: the 2-pointer and 3-pointer variables have an ‘X’ in front of them, added by R
when reading the dataset to make variable names conform to R
rules.
tmp <- group_by(NBA, W) %>% summarise(nTot = n(), nPO = sum(Playoffs), fracPO = nPO/nTot)
print(tmp)
## Source: local data frame [59 x 4]
##
## W nTot nPO fracPO
## 1 11 2 0 0.0000000
## 2 12 2 0 0.0000000
## 3 13 2 0 0.0000000
## 4 14 2 0 0.0000000
## 5 15 10 0 0.0000000
## 6 16 2 0 0.0000000
## 7 17 11 0 0.0000000
## 8 18 5 0 0.0000000
## 9 19 10 0 0.0000000
## 10 20 10 0 0.0000000
## 11 21 12 0 0.0000000
## 12 22 11 0 0.0000000
## 13 23 11 0 0.0000000
## 14 24 18 0 0.0000000
## 15 25 11 0 0.0000000
## 16 26 17 0 0.0000000
## 17 27 10 0 0.0000000
## 18 28 18 0 0.0000000
## 19 29 12 0 0.0000000
## 20 30 20 1 0.0500000
## 21 31 16 1 0.0625000
## 22 32 12 0 0.0000000
## 23 33 17 0 0.0000000
## 24 34 16 0 0.0000000
## 25 35 16 3 0.1875000
## 26 36 21 4 0.1904762
## 27 37 19 4 0.2105263
## 28 38 15 7 0.4666667
## 29 39 20 10 0.5000000
## 30 40 22 13 0.5909091
## 31 41 37 26 0.7027027
## 32 42 37 29 0.7837838
## 33 43 20 18 0.9000000
## 34 44 29 27 0.9310345
## 35 45 25 22 0.8800000
## 36 46 16 15 0.9375000
## 37 47 28 28 1.0000000
## 38 48 15 14 0.9333333
## 39 49 17 17 1.0000000
## 40 50 32 32 1.0000000
## 41 51 12 12 1.0000000
## 42 52 20 20 1.0000000
## 43 53 17 17 1.0000000
## 44 54 18 18 1.0000000
## 45 55 24 24 1.0000000
## 46 56 16 16 1.0000000
## 47 57 23 23 1.0000000
## 48 58 13 13 1.0000000
## 49 59 14 14 1.0000000
## 50 60 8 8 1.0000000
## 51 61 10 10 1.0000000
## 52 62 13 13 1.0000000
## 53 63 7 7 1.0000000
## 54 64 3 3 1.0000000
## 55 65 3 3 1.0000000
## 56 66 2 2 1.0000000
## 57 67 4 4 1.0000000
## 58 69 1 1 1.0000000
## 59 72 1 1 1.0000000
Let’s take a look at the table, around the middle section.
fracPO
, does not rise from zero until after 35 wins.This can very clearly seen also visually:
plot(tmp$W, tmp$fracPO, pch = 21, col = "red2", bg = "orange",
xlab = "Wins", ylab = "Frequency", main = "Playoff Qualification Frequency vs. Number of Regular Season Wins")
abline(h = 0.9, lty = 2, col = "red2")
abline(v = 35, lty = 2, col = "blue2")
abline(v = 45, lty = 2, col = "blue2")
Games are won by scoring more points than the other team.
Can we use the difference between points scored and points allowed throughout the regular season in order to predict the number of games that a team will win?
We add a variable that is the difference between points scored and points allowed.
NBA$PTSdiff <- NBA$PTS - NBA$oppPTS
PTSdiff
and W
plot(NBA$PTSdiff, NBA$W, pch = 21, col = "red2", bg = "orange",
xlab = "PTS difference", ylab = "Wins", main = "Regular Season Wins vs. Total Regular Season Scored Points")
It looks like there is a very strong linear relationship between these two variables.
It seems like linear regression is going to be a good way to predict how many wins a team will have given the point difference.
W
)Let’s try to verify this. So we’re going to have PTSdiff
as the independent variable in our regression, and W
for wins as the dependent variable.
Wins_Reg <- lm(W ~ PTSdiff, data = NBA)
summary(Wins_Reg)
##
## Call:
## lm(formula = W ~ PTSdiff, data = NBA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.7393 -2.1018 -0.0672 2.0265 10.6026
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.0000000 0.1059358 387.0 <2e-16 ***
## PTSdiff 0.0325863 0.0002793 116.7 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.061 on 833 degrees of freedom
## Multiple R-squared: 0.9423, Adjusted R-squared: 0.9423
## F-statistic: 1.361e+04 on 1 and 833 DF, p-value: < 2.2e-16
Yielding the following relationship between the two variables \[ W = 41 + 0.0326*(PTSdiff) \]
We saw earlier with the table that a team would want to win about at least 45 games in order to have about \(>90\%\) chance of making it to the playoffs.
What does this mean in terms of their points difference?
With the linear regression model we can compute the PTSdiff
needed to get W
\(\ge45\), i.e. PTSdiff
\(\ge 122.8\).
Let’s build an equation to predict points scored using some common basketball statistics.
Our dependent variable would be PTS
, and our independent variables would be some of the common basketball statistics that we have in our data set. For example,
X2PA
,X3PA
,We can use all of these.
PointsReg1 <- lm(PTS ~ X2PA + X3PA + FTA + AST + ORB + DRB + TOV + STL + BLK, data = NBA)
summary(PointsReg1)
##
## Call:
## lm(formula = PTS ~ X2PA + X3PA + FTA + AST + ORB + DRB + TOV +
## STL + BLK, data = NBA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -527.40 -119.83 7.83 120.67 564.71
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2050.81081 203.48660 -10.078 <2e-16 ***
## X2PA 1.04288 0.02957 35.274 <2e-16 ***
## X3PA 1.25859 0.03843 32.747 <2e-16 ***
## FTA 1.12802 0.03373 33.440 <2e-16 ***
## AST 0.88577 0.04396 20.150 <2e-16 ***
## ORB -0.95539 0.07792 -12.261 <2e-16 ***
## DRB 0.03883 0.06157 0.631 0.5285
## TOV -0.02475 0.06118 -0.405 0.6859
## STL -0.19918 0.09181 -2.169 0.0303 *
## BLK -0.05576 0.08782 -0.635 0.5256
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 185.5 on 825 degrees of freedom
## Multiple R-squared: 0.8992, Adjusted R-squared: 0.8981
## F-statistic: 817.3 on 9 and 825 DF, p-value: < 2.2e-16
Taking a look at this, we can see that
STL
) only has one significance star.DRB
, TOV
, BLK
).We do have a pretty good \(R^2\) value, 0.8992, which shows that there really is a linear relationship between points and all of these basketball statistics.
Sum of Squared Errors (SSE) (not a very interpretable quantity)
SSE <- sum(PointsReg1$residuals^2)
SSE
## [1] 28394314
Root Mean Squared Error (RMSE), more interpretable, sort of the average error made.
# RMSE <- sqrt(SSE/nrow(NBA))
RMSE <- sqrt(SSE/PointsReg1$df.residual)
RMSE
## [1] 185.5191
It does seem line a very large error, but it should be seen in the context of the total number of points scored on average in a full season, which is:
mean(NBA$PTS)
## [1] 8370.24
So, the fractional error of this model is about 2.2%, which is fairly small.
It may be interesting to check the correlations between the variables that we included in this first model, to get some hints as to collinearity, which could be relevant to know if we wanted to remove some variables.
par(mar=c(5, 4, 4, 1)+0.1)
par(oma=c(0, 0, 0, 0))
pairs(NBA[, c("X2PA", "X3PA", "FTA", "AST", "ORB", "DRB", "TOV", "STL", "BLK")], gap=0.5, las=1,
pch=21, bg=rgb(0,0,1,0.25),
panel=mypanel, lower.panel=function(...) panel.cor(..., color.bg=TRUE), main="")
mtext(side=3, "pairs plot with correlation values", outer=TRUE, line=-1.2, font=2)
mtext(side=3, "Dashed lines are 'lm(y~x)' fits.\nCorrelation and scatterplot frames are color-coded on the strength of the correlation",
outer=TRUE, line=-1.6, padj=1, cex=0.8, font=1)
The first model seems to be quite accurate. However, we probably have room for improvement in this model, because not all the variables that we included were significant. Let’s see if we can remove some of the insignificant variables, and we will do it one at a time, incrementally.
TOV
PointsReg2 <- lm(PTS ~ X2PA + X3PA + FTA + AST + ORB + DRB + STL + BLK, data = NBA)
summary(PointsReg2)
##
## Call:
## lm(formula = PTS ~ X2PA + X3PA + FTA + AST + ORB + DRB + STL +
## BLK, data = NBA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -526.79 -121.09 6.37 120.74 565.94
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2076.67816 193.08441 -10.755 <2e-16 ***
## X2PA 1.04353 0.02951 35.366 <2e-16 ***
## X3PA 1.26273 0.03703 34.099 <2e-16 ***
## FTA 1.12537 0.03308 34.023 <2e-16 ***
## AST 0.88615 0.04393 20.173 <2e-16 ***
## ORB -0.95815 0.07758 -12.350 <2e-16 ***
## DRB 0.03892 0.06154 0.632 0.5273
## STL -0.20676 0.08984 -2.301 0.0216 *
## BLK -0.05863 0.08749 -0.670 0.5029
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 185.4 on 826 degrees of freedom
## Multiple R-squared: 0.8991, Adjusted R-squared: 0.8982
## F-statistic: 920.4 on 8 and 826 DF, p-value: < 2.2e-16
Let’s take a look at the \(R^2\) of PointsReg2
: \(R^2 = 0.8991\).
It is almost exactly identical to the \(R^2\) of the original model. It does go down, as we would expect, but very, very slightly. So it seems that we’re justified in removing turnovers, TOV
.
DRB
Let’s see if we can remove another one of the insignificant variables. The next one, based on p-value, that we would want to remove is defensive rebounds, DRB
.
PointsReg3 <- lm(PTS ~ X2PA + X3PA + FTA + AST + ORB + STL + BLK, data = NBA)
summary(PointsReg3)
##
## Call:
## lm(formula = PTS ~ X2PA + X3PA + FTA + AST + ORB + STL + BLK,
## data = NBA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -523.79 -121.64 6.07 120.81 573.64
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2015.46304 167.00875 -12.068 < 2e-16 ***
## X2PA 1.04828 0.02852 36.753 < 2e-16 ***
## X3PA 1.27079 0.03475 36.568 < 2e-16 ***
## FTA 1.12846 0.03270 34.506 < 2e-16 ***
## AST 0.89093 0.04326 20.597 < 2e-16 ***
## ORB -0.97018 0.07519 -12.903 < 2e-16 ***
## STL -0.22758 0.08356 -2.724 0.00659 **
## BLK -0.03882 0.08165 -0.475 0.63462
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 185.4 on 827 degrees of freedom
## Multiple R-squared: 0.8991, Adjusted R-squared: 0.8982
## F-statistic: 1053 on 7 and 827 DF, p-value: < 2.2e-16
Let’s look at the \(R^2\) again and see if it has changed. It is 0.8991.
Once again is basically unchanged. So it looks like we are justified again in removing defensive rebounds, DRB
.
BLK
Let’s try this one more time and see if we can remove blocks, BLK
.
PointsReg4 <- lm(PTS ~ X2PA + X3PA + FTA + AST + ORB + STL, data = NBA)
summary(PointsReg4)
##
## Call:
## lm(formula = PTS ~ X2PA + X3PA + FTA + AST + ORB + STL, data = NBA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -523.33 -122.02 6.93 120.68 568.26
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2032.71640 162.94178 -12.475 < 2e-16 ***
## X2PA 1.04997 0.02829 37.117 < 2e-16 ***
## X3PA 1.27306 0.03441 37.001 < 2e-16 ***
## FTA 1.12731 0.03260 34.581 < 2e-16 ***
## AST 0.88844 0.04292 20.701 < 2e-16 ***
## ORB -0.97430 0.07465 -13.051 < 2e-16 ***
## STL -0.22684 0.08350 -2.717 0.00673 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 185.3 on 828 degrees of freedom
## Multiple R-squared: 0.8991, Adjusted R-squared: 0.8983
## F-statistic: 1229 on 6 and 828 DF, p-value: < 2.2e-16
One more time, we check the \(R^2\): it is 0.8991.
It stayed the same again.
So now we have gotten down to a model which is a bit simpler. All the variables are significant. We’ve still got an similarly good \(R^2\).
Let’s take a look now at SSE and RMSE just to make sure we did not inflate them too much by removing a few variables.
SSE_4 <- sum(PointsReg4$residuals^2)
RMSE_4 <- sqrt(SSE_4/nrow(NBA))
The values for Model-4 (PointsReg4
) are SSE = 28421464.9 and RMSE = 184.493.
This latter to be compared with the RMSE of the first model, i.e. 185.5191.
Essentially, we’ve kept the RMSE the same. So it seems like we have narrowed down on a much better model because it is simpler, it is more interpretable, and it’s got just about the same amount of error.
In this last part we will try to make predictions for the 2012-2013 season. We need to load our test set because our training set only included data from 1980 up until the 2011-2012 season.
NBA_test <- read.csv("data/NBA_test.csv")
Let’s try to predict using our model PointReg4
how many points we will see in the 2012-2013 season. We use the predict()
command here, and we give it the model that we just determined to be the best one. The new data which is NBA_test
.
PointsPredictions <- predict(PointsReg4, newdata = NBA_test)
Now that we have our prediction, how good is it?
We can compute the out of sample \(R^2\). This is a measurement of how well the model predicts on test data.
The \(R^2\) value we had before from our model, the 0.8991, is the measure of an in-sample \(R^2\), which is how well the model fits the training data.
But to get a measure of the predictions goodness of fit, we need to calculate the out of sample \(R^2\).
First we need to compute the sum of squared errors (SSE), i.e. the sum of the predicted amount minus the actual amount of points squared
SSE <- sum((PointsPredictions - NBA_test$PTS)^2)
We also need the total sums of squares (SST), which is just the sum of the test set actual number of points minus the average number of points in the training set.
SST <- sum((mean(NBA$PTS) - NBA_test$PTS)^2)
The \(R^2\) then is calculated as usual, 1 minus the sum of squared errors divided by total sums of squares.
R2 <- 1 - SSE/SST
The Out Of Sample \(R^2\) is 0.8127.
RMSE <- sqrt(SSE/nrow(NBA_test))
At 196.37, it is a little higher than before, But it’s not too bad.
Predicting unseen data we are making an average error of about 196.
Additional locally defined functions, from the external file loaded at the beginning.
mypanel <- function(x, y, ...){
cvec.bg <- c(rgb(1.0, 0.8, 0.9) , rgb(0.9, 0.9, 1.0), rgb(0.8, 1.0, 0.8))
cvec.border <- c("red", "blue", rgb(0.0, 0.67, 0.0))
r <- abs(cor(x, y))
i.color <- 1+(r>0.7)+(r>0.8)
ll <- par("usr")
rect(ll[1], ll[3], ll[2], ll[4], border=cvec.border[i.color], lwd=3)
points(x, y, ... )
ok <- is.finite(x) & is.finite(y)
if( any(ok) ) { abline(lm(y[ok] ~ x[ok]), col="red", lty=2, ...) }
}
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, color.bg=FALSE, ...) {
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
cvec.bg <- c(rgb(1.0, 0.8, 0.9) , rgb(0.9, 0.9, 1.0), rgb(0.8, 1.0, 0.8))
cvec.text <- c("red", "blue", rgb(0.0, 0.67, 0.0))
i.color <- 1+(r>0.4)+(r>0.6)
if( color.bg ) {
ll <- par("usr")
rect(ll[1], ll[3], ll[2], ll[4], col=cvec.bg[i.color])
text(0.5, 0.5, txt, cex = cex.cor * r, col=cvec.text[i.color])
} else {
text(0.5, 0.5, txt, cex = cex.cor * r, col=cvec.text[i.color])
}
}
printStats <- function (m, with.cx=TRUE) {
if (class(m) != "lm") stop("Not an object of class 'lm' ")
f <- summary(m)$fstatistic
p <- pf(f[1], f[2], f[3], lower.tail=FALSE)
attributes(p) <- NULL
fml <- as.character(formula(m))
cx <- summary(m)$coeff
stars <- rep(" ", nrow(cx))
stars[cx[,4] <= 0.1] <- "."
stars[cx[,4] <= 0.05] <- "*"
stars[cx[,4] <= 0.01] <- "**"
stars[cx[,4] <= 0.001] <- "***"
cat("MODEL : ", sprintf("%s", paste(fml[c(2,1,3)], sep=" ", collapse=" ")), "\n", sep="")
cat("SUMMARY STATS: ")
cat("R^2 = ",sprintf("%6.4f",summary(m)$r.squared),
" (adj. = ", sprintf("%6.4f",summary(m)$adj.r.squared), ")", sep="")
cat("\n")
cat(" ")
cat("F-stats: ", sprintf("%.3f",f[1]), " on ", f[2], " and ", f[3], " DF, p-value: ", p, "\n", sep="")
if( with.cx ) {
cat("\n")
print(cbind(format(cx[,c(1,2,4)], scientific=TRUE, justify="right", digits=5), Signif=stars),
quote=FALSE, print.gap=3)
}
}
printStats.cpt <- function (m, with.cx=TRUE) {
if (class(m) != "lm") stop("Not an object of class 'lm' ")
f <- summary(m)$fstatistic
p <- pf(f[1], f[2], f[3], lower.tail=FALSE)
attributes(p) <- NULL
fml <- as.character(formula(m))
cx <- summary(m)$coeff
stars <- rep(" ", nrow(cx))
stars[cx[,4] <= 0.1] <- "."
stars[cx[,4] <= 0.05] <- "*"
stars[cx[,4] <= 0.01] <- "**"
stars[cx[,4] <= 0.001] <- "***"
cat("MODEL : ", sprintf("%s", paste(fml[c(2,1,3)], sep=" ", collapse=" ")), "\n", sep="")
cat(" : ")
cat("adj. R^2 = ", sprintf("%6.4f",summary(m)$adj.r.squared),
" / F-stats: ", sprintf("%.3f",f[1]), " on ", f[2],",", f[3], " Df, p-value: ", p, "\n", sep="")
if( with.cx ) {
print(cbind(format(cx[,c(1,2,4)], scientific=TRUE, justify="right", digits=5), Signif=stars),
quote=FALSE, print.gap=3)
cat("\n")
}
}