[ source files available on GitHub ]

PRELIMINARIES

Libraries needed for data processing and plotting:

library("dplyr")
library("magrittr")
library("ggplot2")

Source an external script with my own handy function definitions:

source("./scripts/my_defs_u2.R")

The content of this external file is included in the Appendix at the end of this report.

LOADING THE DATA

Read the dataset NBA_train.csv:

NBA <- read.csv("data/NBA_train.csv")
str(NBA)
## 'data.frame':    835 obs. of  20 variables:
##  $ SeasonEnd: int  1980 1980 1980 1980 1980 1980 1980 1980 1980 1980 ...
##  $ Team     : Factor w/ 37 levels "Atlanta Hawks",..: 1 2 5 6 8 9 10 11 12 13 ...
##  $ Playoffs : int  1 1 0 0 0 0 0 1 0 1 ...
##  $ W        : int  50 61 30 37 30 16 24 41 37 47 ...
##  $ PTS      : int  8573 9303 8813 9360 8878 8933 8493 9084 9119 8860 ...
##  $ oppPTS   : int  8334 8664 9035 9332 9240 9609 8853 9070 9176 8603 ...
##  $ FG       : int  3261 3617 3362 3811 3462 3643 3527 3599 3639 3582 ...
##  $ FGA      : int  7027 7387 6943 8041 7470 7596 7318 7496 7689 7489 ...
##  $ X2P      : int  3248 3455 3292 3775 3379 3586 3500 3495 3551 3557 ...
##  $ X2PA     : int  6952 6965 6668 7854 7215 7377 7197 7117 7375 7375 ...
##  $ X3P      : int  13 162 70 36 83 57 27 104 88 25 ...
##  $ X3PA     : int  75 422 275 187 255 219 121 379 314 114 ...
##  $ FT       : int  2038 1907 2019 1702 1871 1590 1412 1782 1753 1671 ...
##  $ FTA      : int  2645 2449 2592 2205 2539 2149 1914 2326 2333 2250 ...
##  $ ORB      : int  1369 1227 1115 1307 1311 1226 1155 1394 1398 1187 ...
##  $ DRB      : int  2406 2457 2465 2381 2524 2415 2437 2217 2326 2429 ...
##  $ AST      : int  1913 2198 2152 2108 2079 1950 2028 2149 2148 2123 ...
##  $ STL      : int  782 809 704 764 746 783 779 782 900 863 ...
##  $ BLK      : int  539 308 392 342 404 562 339 373 530 356 ...
##  $ TOV      : int  1495 1539 1684 1370 1533 1742 1492 1565 1517 1439 ...

Quick description of the variables

Premise: quite a few variables come in pairs, a base name plus the same name with an ‘A’ suffix.
The ‘A’-suffixed variables count the number attempted; the corresponding variables without the suffix count the number that were successful (e.g. FGA is field goals attempted, FG is field goals made).

  • SeasonEnd : year the season ended.
  • Team : name of the team.
  • Playoffs : binary variable for whether or not a team made it to the playoffs that year.
    If they made the playoffs it is 1, otherwise 0.
  • W : the number of regular season wins.
  • PTS : points scored during the regular season.
  • oppPTS : opponent points scored during the regular season.
  • FG, FGA : field goals, including both two and three pointers.
  • X2P, X2PA : 2-pointers.
  • X3P, X3PA : 3-pointers.
  • FT, FTA : free throws.
  • ORB, DRB : offensive and defensive rebounds.
  • AST : assists.
  • STL : steals.
  • BLK : blocks.
  • TOV : turnovers.

NOTE: the 2-pointer and 3-pointer variables have an ‘X’ in front of them, added by R when reading the dataset to make variable names conform to R rules.
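
This is the default behaviour of read.csv() (check.names = TRUE), which passes the column names through make.names(). A quick illustration:

# Names starting with a digit are not valid R names, so an 'X' is prepended
make.names(c("2P", "2PA", "3P", "3PA"))
## [1] "X2P"  "X2PA" "X3P"  "X3PA"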

Part 2 : HOW MANY WINS TO MAKE THE PLAYOFFS?

tmp <- NBA %>% group_by(W) %>% summarise(nTot = n(), nPO = sum(Playoffs), fracPO = nPO/nTot)
print(tmp)
## Source: local data frame [59 x 4]
## 
##     W nTot nPO    fracPO
## 1  11    2   0 0.0000000
## 2  12    2   0 0.0000000
## 3  13    2   0 0.0000000
## 4  14    2   0 0.0000000
## 5  15   10   0 0.0000000
## 6  16    2   0 0.0000000
## 7  17   11   0 0.0000000
## 8  18    5   0 0.0000000
## 9  19   10   0 0.0000000
## 10 20   10   0 0.0000000
## 11 21   12   0 0.0000000
## 12 22   11   0 0.0000000
## 13 23   11   0 0.0000000
## 14 24   18   0 0.0000000
## 15 25   11   0 0.0000000
## 16 26   17   0 0.0000000
## 17 27   10   0 0.0000000
## 18 28   18   0 0.0000000
## 19 29   12   0 0.0000000
## 20 30   20   1 0.0500000
## 21 31   16   1 0.0625000
## 22 32   12   0 0.0000000
## 23 33   17   0 0.0000000
## 24 34   16   0 0.0000000
## 25 35   16   3 0.1875000
## 26 36   21   4 0.1904762
## 27 37   19   4 0.2105263
## 28 38   15   7 0.4666667
## 29 39   20  10 0.5000000
## 30 40   22  13 0.5909091
## 31 41   37  26 0.7027027
## 32 42   37  29 0.7837838
## 33 43   20  18 0.9000000
## 34 44   29  27 0.9310345
## 35 45   25  22 0.8800000
## 36 46   16  15 0.9375000
## 37 47   28  28 1.0000000
## 38 48   15  14 0.9333333
## 39 49   17  17 1.0000000
## 40 50   32  32 1.0000000
## 41 51   12  12 1.0000000
## 42 52   20  20 1.0000000
## 43 53   17  17 1.0000000
## 44 54   18  18 1.0000000
## 45 55   24  24 1.0000000
## 46 56   16  16 1.0000000
## 47 57   23  23 1.0000000
## 48 58   13  13 1.0000000
## 49 59   14  14 1.0000000
## 50 60    8   8 1.0000000
## 51 61   10  10 1.0000000
## 52 62   13  13 1.0000000
## 53 63    7   7 1.0000000
## 54 64    3   3 1.0000000
## 55 65    3   3 1.0000000
## 56 66    2   2 1.0000000
## 57 67    4   4 1.0000000
## 58 69    1   1 1.0000000
## 59 72    1   1 1.0000000

Looking at the middle section of the table, the transition is quite sharp: with 35 or fewer wins a team very rarely makes the playoffs, while from about 45 wins onwards it almost always does.
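
A quick way to pull out that middle section, using dplyr's filter():

# Transition region: teams winning between 35 and 45 regular season games
filter(tmp, W >= 35, W <= 45)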

The same pattern can also be seen very clearly visually:

plot(tmp$W, tmp$fracPO, pch = 21, col = "red2", bg = "orange", 
     xlab = "Wins", ylab = "Frequency", main = "Playoff Qualification Frequency vs. Number of Regular Season Wins")
abline(h = 0.9, lty = 2, col = "red2")
abline(v = 35, lty = 2, col = "blue2")
abline(v = 45, lty = 2, col = "blue2")

How can we predict Wins?

Games are won by scoring more points than the other team.
Can we use the difference between points scored and points allowed throughout the regular season in order to predict the number of games that a team will win?

Compute points difference

We add a variable that is the difference between points scored and points allowed.

NBA$PTSdiff <- NBA$PTS - NBA$oppPTS

Check for linear relationship between PTSdiff and W

plot(NBA$PTSdiff, NBA$W, pch = 21, col = "red2", bg = "orange", 
     xlab = "PTS difference", ylab = "Wins", main = "Regular Season Wins vs. Regular Season Points Difference")

It looks like there is a very strong linear relationship between these two variables.
It seems like linear regression is going to be a good way to predict how many wins a team will have given the point difference.
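
A quick numeric check of that impression:

# Correlation between points difference and wins
cor(NBA$PTSdiff, NBA$W)

which comes out at about 0.97, the square root of the \(R^2\) reported for the regression below.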

Linear regression model for wins (W)

Let’s try to verify this. So we’re going to have PTSdiff as the independent variable in our regression, and W for wins as the dependent variable.

Wins_Reg <- lm(W ~ PTSdiff, data = NBA)

summary(Wins_Reg)
## 
## Call:
## lm(formula = W ~ PTSdiff, data = NBA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.7393 -2.1018 -0.0672  2.0265 10.6026 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 41.0000000  0.1059358   387.0   <2e-16 ***
## PTSdiff      0.0325863  0.0002793   116.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.061 on 833 degrees of freedom
## Multiple R-squared:  0.9423, Adjusted R-squared:  0.9423 
## F-statistic: 1.361e+04 on 1 and 833 DF,  p-value: < 2.2e-16

Yielding the following relationship between the two variables: \[ W = 41 + 0.0326 \cdot PTSdiff \]

We saw earlier from the table that a team wants to win at least about 45 games in order to have a \(>90\%\) chance of making it to the playoffs.
What does this mean in terms of their points difference?

With the linear regression model we can compute the PTSdiff needed to get W\(\ge 45\): setting \(41 + 0.0326 \cdot PTSdiff \ge 45\) and solving gives PTSdiff\(\ge 4/0.0326 \approx 122.8\).
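
The same number can be obtained directly from the coefficients stored in Wins_Reg:

# Invert the fitted line: points difference needed for at least 45 wins
cf <- coef(Wins_Reg)
(45 - cf[["(Intercept)"]]) / cf[["PTSdiff"]]   # ~ 122.8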

Part 3 : Linear regression model for points scored

Let’s build an equation to predict points scored using some common basketball statistics.

Our dependent variable would be PTS, and our independent variables would be some of the common basketball statistics that we have in our data set: for example, the number of two-point, three-point, and free throw attempts (X2PA, X3PA, FTA), assists (AST), offensive and defensive rebounds (ORB, DRB), turnovers (TOV), steals (STL), and blocks (BLK).

We can use all of these.

Model-1 : 9 predictors

PointsReg1 <- lm(PTS ~ X2PA + X3PA + FTA + AST + ORB + DRB + TOV + STL + BLK, data = NBA)

summary(PointsReg1)
## 
## Call:
## lm(formula = PTS ~ X2PA + X3PA + FTA + AST + ORB + DRB + TOV + 
##     STL + BLK, data = NBA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -527.40 -119.83    7.83  120.67  564.71 
## 
## Coefficients:
##                Estimate  Std. Error t value Pr(>|t|)    
## (Intercept) -2050.81081   203.48660 -10.078   <2e-16 ***
## X2PA            1.04288     0.02957  35.274   <2e-16 ***
## X3PA            1.25859     0.03843  32.747   <2e-16 ***
## FTA             1.12802     0.03373  33.440   <2e-16 ***
## AST             0.88577     0.04396  20.150   <2e-16 ***
## ORB            -0.95539     0.07792 -12.261   <2e-16 ***
## DRB             0.03883     0.06157   0.631   0.5285    
## TOV            -0.02475     0.06118  -0.405   0.6859    
## STL            -0.19918     0.09181  -2.169   0.0303 *  
## BLK            -0.05576     0.08782  -0.635   0.5256    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 185.5 on 825 degrees of freedom
## Multiple R-squared:  0.8992, Adjusted R-squared:  0.8981 
## F-statistic: 817.3 on 9 and 825 DF,  p-value: < 2.2e-16

Taking a look at this, we can see that

  • Some of the variables are indeed very significant.
  • Others less so: for example, steals (STL) only has one significance star.
  • Some don’t seem to be significant at all: for example, defensive rebounds, turnovers, and blocks (DRB, TOV, BLK).

We do have a pretty good \(R^2\) value, 0.8992, which shows that there really is a linear relationship between points and all of these basketball statistics.

Some summary statistics.

Sum of Squared Errors (SSE) (not a very interpretable quantity)

SSE <- sum(PointsReg1$residuals^2)
SSE
## [1] 28394314

Root Mean Squared Error (RMSE): more interpretable, roughly the average error made.

# Using the residual degrees of freedom rather than nrow(NBA) as denominator;
# with n = 835 the two give nearly identical values.
# RMSE <- sqrt(SSE/nrow(NBA))
RMSE <- sqrt(SSE/PointsReg1$df.residual)
RMSE
## [1] 185.5191

It does seem like a very large error, but it should be seen in the context of the total number of points scored on average in a full season, which is:

mean(NBA$PTS)
## [1] 8370.24

So, the fractional error of this model is about 2.2%, which is fairly small.
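
This number comes directly from the two quantities above:

# Average per-season error relative to the average points scored
RMSE / mean(NBA$PTS)   # ~ 0.022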

Correlations between predictors

It may be interesting to check the correlations between the variables that we included in this first model, to get some hints as to collinearity, which could be relevant to know if we wanted to remove some variables.
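
The correlation matrix itself gives a purely numeric view of this information:

# Pairwise correlations between the nine predictors, rounded
round(cor(NBA[, c("X2PA", "X3PA", "FTA", "AST", "ORB", "DRB", "TOV", "STL", "BLK")]), 2)

The pairs plot below adds the scatterplots and color-codes the strength of each correlation: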

par(mar=c(5, 4, 4, 1)+0.1)
par(oma=c(0, 0, 0, 0))
pairs(NBA[, c("X2PA", "X3PA", "FTA", "AST", "ORB", "DRB", "TOV", "STL", "BLK")], gap=0.5,  las=1, 
      pch=21, bg=rgb(0,0,1,0.25), 
      panel=mypanel, lower.panel=function(...) panel.cor(..., color.bg=TRUE), main="")
mtext(side=3, "pairs plot with correlation values", outer=TRUE, line=-1.2, font=2)
mtext(side=3, "Dashed lines are 'lm(y~x)' fits.\nCorrelation and scatterplot frames are color-coded on the strength of the correlation",
      outer=TRUE, line=-1.6, padj=1, cex=0.8, font=1)

More models by removing insignificant variables

The first model seems to be quite accurate. However, we probably have room for improvement in this model, because not all the variables that we included were significant. Let’s see if we can remove some of the insignificant variables, and we will do it one at a time, incrementally.
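
As an aside, each step below could equivalently be written with update(), dropping one term from the previous model, e.g.:

# Equivalent, more compact way to refit Model-1 without TOV
PointsReg2 <- update(PointsReg1, . ~ . - TOV)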

Model-2 : remove TOV

PointsReg2 <- lm(PTS ~ X2PA + X3PA + FTA + AST + ORB + DRB + STL + BLK, data = NBA)

summary(PointsReg2)
## 
## Call:
## lm(formula = PTS ~ X2PA + X3PA + FTA + AST + ORB + DRB + STL + 
##     BLK, data = NBA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -526.79 -121.09    6.37  120.74  565.94 
## 
## Coefficients:
##                Estimate  Std. Error t value Pr(>|t|)    
## (Intercept) -2076.67816   193.08441 -10.755   <2e-16 ***
## X2PA            1.04353     0.02951  35.366   <2e-16 ***
## X3PA            1.26273     0.03703  34.099   <2e-16 ***
## FTA             1.12537     0.03308  34.023   <2e-16 ***
## AST             0.88615     0.04393  20.173   <2e-16 ***
## ORB            -0.95815     0.07758 -12.350   <2e-16 ***
## DRB             0.03892     0.06154   0.632   0.5273    
## STL            -0.20676     0.08984  -2.301   0.0216 *  
## BLK            -0.05863     0.08749  -0.670   0.5029    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 185.4 on 826 degrees of freedom
## Multiple R-squared:  0.8991, Adjusted R-squared:  0.8982 
## F-statistic: 920.4 on 8 and 826 DF,  p-value: < 2.2e-16

Let’s take a look at the \(R^2\) of PointsReg2: \(R^2 = 0.8991\).

It is essentially identical to the \(R^2\) of the original model: it does go down, as we would expect, but only very slightly. So it seems that we are justified in removing turnovers, TOV.

Model-3 : remove DRB

Let’s see if we can remove another one of the insignificant variables. The next one, based on p-value, that we would want to remove is defensive rebounds, DRB.

PointsReg3 <- lm(PTS ~ X2PA + X3PA + FTA + AST + ORB + STL + BLK, data = NBA)

summary(PointsReg3)
## 
## Call:
## lm(formula = PTS ~ X2PA + X3PA + FTA + AST + ORB + STL + BLK, 
##     data = NBA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -523.79 -121.64    6.07  120.81  573.64 
## 
## Coefficients:
##                Estimate  Std. Error t value Pr(>|t|)    
## (Intercept) -2015.46304   167.00875 -12.068  < 2e-16 ***
## X2PA            1.04828     0.02852  36.753  < 2e-16 ***
## X3PA            1.27079     0.03475  36.568  < 2e-16 ***
## FTA             1.12846     0.03270  34.506  < 2e-16 ***
## AST             0.89093     0.04326  20.597  < 2e-16 ***
## ORB            -0.97018     0.07519 -12.903  < 2e-16 ***
## STL            -0.22758     0.08356  -2.724  0.00659 ** 
## BLK            -0.03882     0.08165  -0.475  0.63462    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 185.4 on 827 degrees of freedom
## Multiple R-squared:  0.8991, Adjusted R-squared:  0.8982 
## F-statistic:  1053 on 7 and 827 DF,  p-value: < 2.2e-16

Let’s look at the \(R^2\) again and see if it has changed: it is 0.8991, once again basically unchanged. So it looks like we are again justified in removing defensive rebounds, DRB.

Model-4 : remove BLK

Let’s try this one more time and see if we can remove blocks, BLK.

PointsReg4 <- lm(PTS ~ X2PA + X3PA + FTA + AST + ORB + STL, data = NBA)

summary(PointsReg4)
## 
## Call:
## lm(formula = PTS ~ X2PA + X3PA + FTA + AST + ORB + STL, data = NBA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -523.33 -122.02    6.93  120.68  568.26 
## 
## Coefficients:
##                Estimate  Std. Error t value Pr(>|t|)    
## (Intercept) -2032.71640   162.94178 -12.475  < 2e-16 ***
## X2PA            1.04997     0.02829  37.117  < 2e-16 ***
## X3PA            1.27306     0.03441  37.001  < 2e-16 ***
## FTA             1.12731     0.03260  34.581  < 2e-16 ***
## AST             0.88844     0.04292  20.701  < 2e-16 ***
## ORB            -0.97430     0.07465 -13.051  < 2e-16 ***
## STL            -0.22684     0.08350  -2.717  0.00673 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 185.3 on 828 degrees of freedom
## Multiple R-squared:  0.8991, Adjusted R-squared:  0.8983 
## F-statistic:  1229 on 6 and 828 DF,  p-value: < 2.2e-16

One more time, we check the \(R^2\): it is 0.8991.
It stayed the same again.

So now we have gotten down to a simpler model. All the variables are significant, and we still have a similarly good \(R^2\).

Summary statistics for Model-4.

Let’s take a look now at SSE and RMSE just to make sure we did not inflate them too much by removing a few variables.

SSE_4 <- sum(PointsReg4$residuals^2)
RMSE_4 <- sqrt(SSE_4/nrow(NBA))

The values for Model-4 (PointsReg4) are SSE = 28421464.9 and RMSE = 184.493.
The latter is to be compared with the RMSE of the first model, 185.5191 (there we divided by the residual degrees of freedom rather than nrow(NBA); with n = 835 the difference is negligible).

Essentially, we’ve kept the RMSE the same. So it seems we have narrowed down to a better model: it is simpler, it is more interpretable, and it’s got just about the same amount of error.
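
A compact way to compare all four models at a glance, using the fitted objects above:

# Adjusted R^2 and residual standard error for Models 1-4
sapply(list(PointsReg1, PointsReg2, PointsReg3, PointsReg4),
       function(m) c(adj.R2 = summary(m)$adj.r.squared,
                     sigma = summary(m)$sigma))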

Part 4 : MAKING PREDICTIONS

In this last part we will try to make predictions for the 2012-2013 season. We need to load our test set because our training set only included data from 1980 up until the 2011-2012 season.

Read-in test set

NBA_test <- read.csv("data/NBA_test.csv")

Model-4 predictions on test set

Let’s try to predict, using our model PointsReg4, how many points we will see in the 2012-2013 season. We use the predict() command, giving it the model that we just determined to be the best one and the new data, NBA_test.

PointsPredictions <- predict(PointsReg4, newdata = NBA_test)

Now that we have our prediction, how good is it?

We can compute the out-of-sample \(R^2\), a measure of how well the model predicts on test data.
The \(R^2\) value we had before from our model, 0.8991, is an in-sample \(R^2\): it measures how well the model fits the training data.
To get a measure of the goodness of fit of the predictions, we need to calculate the out-of-sample \(R^2\).

Out-Of-Sample \(R^2\) and RMSE

First we need to compute the sum of squared errors (SSE), i.e. the sum of the squared differences between the predicted and the actual points:

SSE <- sum((PointsPredictions - NBA_test$PTS)^2)

We also need the total sum of squares (SST), which is the sum of the squared differences between the actual points in the test set and the average points in the training set:

SST <- sum((mean(NBA$PTS) - NBA_test$PTS)^2)

The \(R^2\) is then calculated as usual: \(R^2 = 1 - SSE/SST\).

R2 <- 1 - SSE/SST

The Out Of Sample \(R^2\) is 0.8127.

RMSE <- sqrt(SSE/nrow(NBA_test))

At 196.37 it is a little higher than before, but not too bad: when predicting unseen data we are making an average error of about 196 points.
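
These two computations could be bundled into a small helper; a minimal sketch (oos_metrics is a hypothetical name, not part of the external script):

# Out-of-sample R^2 and RMSE for a vector of predictions
oos_metrics <- function(pred, actual, train_mean) {
    SSE <- sum((pred - actual)^2)
    SST <- sum((train_mean - actual)^2)
    c(R2 = 1 - SSE/SST, RMSE = sqrt(SSE/length(actual)))
}
oos_metrics(PointsPredictions, NBA_test$PTS, mean(NBA$PTS))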


APPENDIX : external functions

Additional locally defined functions, from the external file loaded at the beginning.

# Upper panel function for pairs(): scatterplot with a dashed 'lm' fit line,
# framed in a color reflecting the strength of the correlation.
mypanel <- function(x, y, ...){
   cvec.bg <- c(rgb(1.0, 0.8, 0.9) , rgb(0.9, 0.9, 1.0), rgb(0.8, 1.0, 0.8))
   cvec.border <- c("red", "blue", rgb(0.0, 0.67, 0.0))
   r <- abs(cor(x, y))
   i.color <- 1+(r>0.7)+(r>0.8)
   ll <- par("usr") 
   rect(ll[1], ll[3], ll[2], ll[4], border=cvec.border[i.color], lwd=3)
   points(x, y, ... ) 
   ok <- is.finite(x) & is.finite(y)
   if( any(ok) ) { abline(lm(y[ok] ~ x[ok]), col="red", lty=2, ...) }
}

# Lower panel function for pairs(): prints the absolute correlation value,
# optionally on a color-coded background (color.bg=TRUE).
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, color.bg=FALSE, ...) {
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(0, 1, 0, 1))
    r <- abs(cor(x, y))
    txt <- format(c(r, 0.123456789), digits = digits)[1]
    txt <- paste0(prefix, txt)
    if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
    cvec.bg <- c(rgb(1.0, 0.8, 0.9) , rgb(0.9, 0.9, 1.0), rgb(0.8, 1.0, 0.8))
    cvec.text <- c("red", "blue", rgb(0.0, 0.67, 0.0))
    i.color <- 1+(r>0.4)+(r>0.6)
    if( color.bg ) {
       ll <- par("usr") 
       rect(ll[1], ll[3], ll[2], ll[4], col=cvec.bg[i.color])
       text(0.5, 0.5, txt, cex = cex.cor * r, col=cvec.text[i.color])
    } else {
       text(0.5, 0.5, txt, cex = cex.cor * r, col=cvec.text[i.color])
    }
}

# Prints a compact summary of an 'lm' model: formula, R^2, F-statistic,
# and (optionally) the coefficient table with significance stars.
printStats <- function (m, with.cx=TRUE) {
    if (!inherits(m, "lm")) stop("Not an object of class 'lm' ")
    f <- summary(m)$fstatistic
    p <- pf(f[1], f[2], f[3], lower.tail=FALSE)
    attributes(p) <- NULL
    
    fml <- as.character(formula(m))
    cx <- summary(m)$coeff
    stars <- rep(" ", nrow(cx))
    stars[cx[,4] <= 0.1] <- "."
    stars[cx[,4] <= 0.05] <- "*"
    stars[cx[,4] <= 0.01] <- "**"
    stars[cx[,4] <= 0.001] <- "***"
    cat("MODEL        : ", sprintf("%s", paste(fml[c(2,1,3)], sep=" ", collapse=" ")), "\n", sep="")
    cat("SUMMARY STATS: ")
    cat("R^2 = ",sprintf("%6.4f",summary(m)$r.squared), 
        "  (adj. = ", sprintf("%6.4f",summary(m)$adj.r.squared), ")", sep="")
    cat("\n")
    cat("               ")
    cat("F-stats: ", sprintf("%.3f",f[1]), " on ", f[2], " and ", f[3], " DF,  p-value: ", p, "\n", sep="")
    if( with.cx ) {
        cat("\n")
        print(cbind(format(cx[,c(1,2,4)], scientific=TRUE, justify="right", digits=5), Signif=stars), 
              quote=FALSE, print.gap=3)
    }
}

# More compact variant of printStats(), printing the summary on fewer lines.
printStats.cpt <- function (m, with.cx=TRUE) {
    if (!inherits(m, "lm")) stop("Not an object of class 'lm' ")
    f <- summary(m)$fstatistic
    p <- pf(f[1], f[2], f[3], lower.tail=FALSE)
    attributes(p) <- NULL
    
    fml <- as.character(formula(m))
    cx <- summary(m)$coeff
    stars <- rep(" ", nrow(cx))
    stars[cx[,4] <= 0.1] <- "."
    stars[cx[,4] <= 0.05] <- "*"
    stars[cx[,4] <= 0.01] <- "**"
    stars[cx[,4] <= 0.001] <- "***"
    cat("MODEL : ", sprintf("%s", paste(fml[c(2,1,3)], sep=" ", collapse=" ")), "\n", sep="")
    cat("      : ")
    cat("adj. R^2 = ", sprintf("%6.4f",summary(m)$adj.r.squared), 
        " /  F-stats: ", sprintf("%.3f",f[1]), " on ", f[2],",", f[3], " Df,  p-value: ", p, "\n", sep="")
    if( with.cx ) {
        print(cbind(format(cx[,c(1,2,4)], scientific=TRUE, justify="right", digits=5), Signif=stars), 
              quote=FALSE, print.gap=3)
        cat("\n")
    }
}
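
For reference, a typical call on one of the models fitted above would be:

# Compact one-block summary of the final points model
printStats.cpt(PointsReg4)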