THE NBA DATA ANALYSIS

Our data is in the file NBA_train.csv and covers every team in every season since 1980, excluding seasons in which fewer than 82 games were played.

NBA = read.csv("NBA_train.csv")
str(NBA)
## 'data.frame':    835 obs. of  20 variables:
##  $ SeasonEnd: int  1980 1980 1980 1980 1980 1980 1980 1980 1980 1980 ...
##  $ Team     : Factor w/ 37 levels "Atlanta Hawks",..: 1 2 5 6 8 9 10 11 12 13 ...
##  $ Playoffs : int  1 1 0 0 0 0 0 1 0 1 ...
##  $ W        : int  50 61 30 37 30 16 24 41 37 47 ...
##  $ PTS      : int  8573 9303 8813 9360 8878 8933 8493 9084 9119 8860 ...
##  $ oppPTS   : int  8334 8664 9035 9332 9240 9609 8853 9070 9176 8603 ...
##  $ FG       : int  3261 3617 3362 3811 3462 3643 3527 3599 3639 3582 ...
##  $ FGA      : int  7027 7387 6943 8041 7470 7596 7318 7496 7689 7489 ...
##  $ X2P      : int  3248 3455 3292 3775 3379 3586 3500 3495 3551 3557 ...
##  $ X2PA     : int  6952 6965 6668 7854 7215 7377 7197 7117 7375 7375 ...
##  $ X3P      : int  13 162 70 36 83 57 27 104 88 25 ...
##  $ X3PA     : int  75 422 275 187 255 219 121 379 314 114 ...
##  $ FT       : int  2038 1907 2019 1702 1871 1590 1412 1782 1753 1671 ...
##  $ FTA      : int  2645 2449 2592 2205 2539 2149 1914 2326 2333 2250 ...
##  $ ORB      : int  1369 1227 1115 1307 1311 1226 1155 1394 1398 1187 ...
##  $ DRB      : int  2406 2457 2465 2381 2524 2415 2437 2217 2326 2429 ...
##  $ AST      : int  1913 2198 2152 2108 2079 1950 2028 2149 2148 2123 ...
##  $ STL      : int  782 809 704 764 746 783 779 782 900 863 ...
##  $ BLK      : int  539 308 392 342 404 562 339 373 530 356 ...
##  $ TOV      : int  1495 1539 1684 1370 1533 1742 1492 1565 1517 1439 ...

We have 835 observations of 20 variables. Let's take a look at what some of these variables are:

* SeasonEnd is the year the season ended.
* Team is the name of the team.
* Playoffs is a binary variable for whether or not a team made it to the playoffs that year: 1 if they made the playoffs, 0 if they did not.
* W is the number of regular season wins.
* PTS is the number of points scored during the regular season.
* oppPTS is the number of points scored by opponents during the regular season.

We also have several variables that come in pairs: a name, and the same name with an 'A' appended, namely FG/FGA, X2P/X2PA, X3P/X3PA, and FT/FTA. The version ending in 'A' counts the number attempted, while the version without counts the number that were successful. For example, FG is the number of successful field goals, including both two- and three-pointers, whereas FGA is the number of field goal attempts, which also includes the unsuccessful ones. So FGA is always at least as large as FG.

Moving on to the rest of the variables:

* ORB and DRB are offensive and defensive rebounds.
* AST is assists.
* STL is steals.
* BLK is blocks.
* TOV is turnovers.

The goal of a basketball team is to get into the playoffs. So how many games does a team need to win in order to make the playoffs?

table(NBA$W, NBA$Playoffs)
##     
##       0  1
##   11  2  0
##   12  2  0
##   13  2  0
##   14  2  0
##   15 10  0
##   16  2  0
##   17 11  0
##   18  5  0
##   19 10  0
##   20 10  0
##   21 12  0
##   22 11  0
##   23 11  0
##   24 18  0
##   25 11  0
##   26 17  0
##   27 10  0
##   28 18  0
##   29 12  0
##   30 19  1
##   31 15  1
##   32 12  0
##   33 17  0
##   34 16  0
##   35 13  3
##   36 17  4
##   37 15  4
##   38  8  7
##   39 10 10
##   40  9 13
##   41 11 26
##   42  8 29
##   43  2 18
##   44  2 27
##   45  3 22
##   46  1 15
##   47  0 28
##   48  1 14
##   49  0 17
##   50  0 32
##   51  0 12
##   52  0 20
##   53  0 17
##   54  0 18
##   55  0 24
##   56  0 16
##   57  0 23
##   58  0 13
##   59  0 14
##   60  0  8
##   61  0 10
##   62  0 13
##   63  0  7
##   64  0  3
##   65  0  3
##   66  0  2
##   67  0  4
##   69  0  1
##   72  0  1

So it seems like a good goal would be to win about 42 games: the table shows that teams with at least 42 wins almost always made it to the playoffs.
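
As a quick sanity check (a minimal sketch; the 42-win cutoff is a judgment call read off the table, not something the data hands us directly), we can compute the fraction of teams with at least 42 wins that made the playoffs:

mean(NBA$Playoffs[NBA$W >= 42])

Summing the playoff column of the table above for 42 or more wins gives 411 of 428 teams, so this proportion comes out to roughly 0.96.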

We will use the difference between points scored and points allowed throughout the regular season in order to predict the number of games that a team will win.

NBA$PTSdiff = NBA$PTS - NBA$oppPTS
plot(NBA$PTSdiff, NBA$W)

The plot shows an incredibly strong linear relationship between these two variables, so linear regression should be a good way to predict how many wins a team will have given its point difference.

Building a Regression Model

WinsReg = lm(W ~ PTSdiff, data = NBA)
summary(WinsReg)
## 
## Call:
## lm(formula = W ~ PTSdiff, data = NBA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.7393 -2.1018 -0.0672  2.0265 10.6026 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.100e+01  1.059e-01   387.0   <2e-16 ***
## PTSdiff     3.259e-02  2.793e-04   116.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.061 on 833 degrees of freedom
## Multiple R-squared:  0.9423, Adjusted R-squared:  0.9423 
## F-statistic: 1.361e+04 on 1 and 833 DF,  p-value: < 2.2e-16

The R-squared is 0.9423, which is very high and confirms what we saw in the scatter plot: a very strong linear relationship between wins and point difference.
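
We can also read the prediction equation directly off the coefficients: W is approximately 41 + 0.0326 * PTSdiff. So to expect the 42 wins we identified as a good playoff target, a team needs a point difference of at least (42 - 41) / 0.0326. As a small sketch, this can be computed from the fitted model:

# Point difference needed for an expected 42 wins, from the fitted coefficients
(42 - coef(WinsReg)[1]) / coef(WinsReg)[2]

This comes out to roughly 30.7, so a team that outscores its opponents by about 31 points over the season can expect to win around 42 games and make the playoffs.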

Predicting Points Scored

Now let's build an equation to predict points scored using some common basketball statistics. Our dependent variable is now PTS, and our independent variables are some of the other game statistics in our data set.

PointsReg = lm(PTS ~ X2PA + X3PA + FTA + AST + ORB + DRB + TOV + STL + BLK, data = NBA)
summary(PointsReg)
## 
## Call:
## lm(formula = PTS ~ X2PA + X3PA + FTA + AST + ORB + DRB + TOV + 
##     STL + BLK, data = NBA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -527.40 -119.83    7.83  120.67  564.71 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.051e+03  2.035e+02 -10.078   <2e-16 ***
## X2PA         1.043e+00  2.957e-02  35.274   <2e-16 ***
## X3PA         1.259e+00  3.843e-02  32.747   <2e-16 ***
## FTA          1.128e+00  3.373e-02  33.440   <2e-16 ***
## AST          8.858e-01  4.396e-02  20.150   <2e-16 ***
## ORB         -9.554e-01  7.792e-02 -12.261   <2e-16 ***
## DRB          3.883e-02  6.157e-02   0.631   0.5285    
## TOV         -2.475e-02  6.118e-02  -0.405   0.6859    
## STL         -1.992e-01  9.181e-02  -2.169   0.0303 *  
## BLK         -5.576e-02  8.782e-02  -0.635   0.5256    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 185.5 on 825 degrees of freedom
## Multiple R-squared:  0.8992, Adjusted R-squared:  0.8981 
## F-statistic: 817.3 on 9 and 825 DF,  p-value: < 2.2e-16

Calculating SSE and RMSE
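
The sum of squared errors (SSE) adds up the squared residuals of the model, and the root mean squared error, RMSE = sqrt(SSE/n), puts that error back on the same scale as points scored, which makes it easier to interpret.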

SSE = sum(PointsReg$residuals^2)
SSE
## [1] 28394314
RMSE = sqrt(SSE/nrow(NBA))
RMSE
## [1] 184.4049
mean(NBA$PTS)
## [1] 8370.24

Given that teams score an average of about 8,370 points in a season, being off by about 184.4 points is really not so bad.
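
To put that in relative terms (a quick sketch using the quantities just computed), the RMSE is only about 2% of the average number of points scored:

# RMSE as a fraction of average points scored: roughly 0.022
RMSE / mean(NBA$PTS)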

Improving the Regression Model

To improve our model, we can remove insignificant variables one at a time, and the first variable we want to remove is turnovers (TOV). Why turnovers? Its p-value, 0.6859, is the highest of all the p-values in the coefficients table, which makes turnovers the least statistically significant variable in the model. So let's create a new regression model without turnovers.

PointsReg_impv = lm(PTS ~ X2PA + X3PA + FTA + AST + ORB + DRB + STL + BLK, data = NBA)
summary(PointsReg_impv)
## 
## Call:
## lm(formula = PTS ~ X2PA + X3PA + FTA + AST + ORB + DRB + STL + 
##     BLK, data = NBA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -526.79 -121.09    6.37  120.74  565.94 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.077e+03  1.931e+02 -10.755   <2e-16 ***
## X2PA         1.044e+00  2.951e-02  35.366   <2e-16 ***
## X3PA         1.263e+00  3.703e-02  34.099   <2e-16 ***
## FTA          1.125e+00  3.308e-02  34.023   <2e-16 ***
## AST          8.861e-01  4.393e-02  20.173   <2e-16 ***
## ORB         -9.581e-01  7.758e-02 -12.350   <2e-16 ***
## DRB          3.892e-02  6.154e-02   0.632   0.5273    
## STL         -2.068e-01  8.984e-02  -2.301   0.0216 *  
## BLK         -5.863e-02  8.749e-02  -0.670   0.5029    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 185.4 on 826 degrees of freedom
## Multiple R-squared:  0.8991, Adjusted R-squared:  0.8982 
## F-statistic: 920.4 on 8 and 826 DF,  p-value: < 2.2e-16

In our first regression model, PointsReg, we had an R-squared of 0.8992. The R-squared of PointsReg_impv is 0.8991, which is almost exactly identical: it goes down, as we would expect when dropping a variable, but only very slightly. So we're justified in removing turnovers. Let's see if we can drop the remaining insignificant variables. Based on their p-values, the next ones to remove are defensive rebounds (DRB) and blocks (BLK).

PointsReg_impv2 = lm(PTS ~ X2PA + X3PA + FTA + AST + ORB + STL, data = NBA)
summary(PointsReg_impv2)
## 
## Call:
## lm(formula = PTS ~ X2PA + X3PA + FTA + AST + ORB + STL, data = NBA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -523.33 -122.02    6.93  120.68  568.26 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.033e+03  1.629e+02 -12.475  < 2e-16 ***
## X2PA         1.050e+00  2.829e-02  37.117  < 2e-16 ***
## X3PA         1.273e+00  3.441e-02  37.001  < 2e-16 ***
## FTA          1.127e+00  3.260e-02  34.581  < 2e-16 ***
## AST          8.884e-01  4.292e-02  20.701  < 2e-16 ***
## ORB         -9.743e-01  7.465e-02 -13.051  < 2e-16 ***
## STL         -2.268e-01  8.350e-02  -2.717  0.00673 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 185.3 on 828 degrees of freedom
## Multiple R-squared:  0.8991, Adjusted R-squared:  0.8983 
## F-statistic:  1229 on 6 and 828 DF,  p-value: < 2.2e-16

And again, the R-squared value stayed essentially the same. We've now arrived at a simpler model in which all the variables are significant, and we still have an R-squared of about 0.899. Let's take a look at the sum of squared errors and the root mean squared error, just to make sure we didn't inflate them too much by removing those variables.

Let's calculate the sum of squared errors and the RMSE for the new model, PointsReg_impv2.

SSE = sum(PointsReg_impv2$residuals^2)
SSE
## [1] 28421465
RMSE = sqrt(SSE/nrow(NBA))
RMSE
## [1] 184.493

Although we've increased the root mean squared error a little bit by removing those variables, it's a very small change: essentially, the RMSE stayed the same. So we've narrowed down to a better model, because it's simpler, it's more interpretable, and it has just about the same amount of error.
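
As a side-by-side check (a small sketch using the three models fitted above), we can extract the in-sample R-squared of each model at once:

# In-sample R-squared of each model: roughly 0.8992, 0.8991, 0.8991
sapply(list(PointsReg, PointsReg_impv, PointsReg_impv2),
       function(m) summary(m)$r.squared)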

Making Predictions

We’ll try to make predictions for the 2012-2013 season. We’ll need to load our test set because our training set only included data from 1980 up until the 2011-2012 season.

NBA_test = read.csv("NBA_test.csv")
str(NBA_test)
## 'data.frame':    28 obs. of  20 variables:
##  $ SeasonEnd: int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ Team     : Factor w/ 28 levels "Atlanta Hawks",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Playoffs : int  1 1 0 1 0 0 1 0 1 1 ...
##  $ W        : int  44 49 21 45 24 41 57 29 47 45 ...
##  $ PTS      : int  8032 7944 7661 7641 7913 8293 8704 7778 8296 8688 ...
##  $ oppPTS   : int  7999 7798 8418 7615 8297 8342 8287 8105 8223 8403 ...
##  $ FG       : int  3084 2942 2823 2926 2993 3182 3339 2979 3130 3124 ...
##  $ FGA      : int  6644 6544 6649 6698 6901 6892 6983 6638 6840 6782 ...
##  $ X2P      : int  2378 2314 2354 2480 2446 2576 2818 2466 2472 2257 ...
##  $ X2PA     : int  4743 4784 5250 5433 5320 5264 5465 5198 5208 4413 ...
##  $ X3P      : int  706 628 469 446 547 606 521 513 658 867 ...
##  $ X3PA     : int  1901 1760 1399 1265 1581 1628 1518 1440 1632 2369 ...
##  $ FT       : int  1158 1432 1546 1343 1380 1323 1505 1307 1378 1573 ...
##  $ FTA      : int  1619 1958 2060 1738 1826 1669 2148 1870 1744 2087 ...
##  $ ORB      : int  758 1047 917 1026 1004 767 1092 991 885 909 ...
##  $ DRB      : int  2593 2460 2389 2514 2359 2670 2601 2463 2801 2652 ...
##  $ AST      : int  2007 1668 1587 1886 1694 1906 2002 1742 1845 1902 ...
##  $ STL      : int  664 599 591 588 647 648 762 574 567 679 ...
##  $ BLK      : int  369 391 479 417 334 454 533 400 346 359 ...
##  $ TOV      : int  1219 1206 1153 1171 1149 1144 1253 1241 1236 1348 ...

PointsPrediction = predict(PointsReg_impv2, newdata = NBA_test)

OK, so now that we have our predictions, how good are they? We can compute the out-of-sample R-squared, a measure of how well the model predicts on test data. The R-squared of 0.8991 we saw before is an in-sample R-squared: it measures how well the model fits the training data. To measure the goodness of fit of our predictions on new data, we calculate the out-of-sample R-squared, where the baseline sum of squares (SST) compares the test-set points to the mean number of points in the training set.

SSE = sum((PointsPrediction - NBA_test$PTS)^2)
SST = sum((mean(NBA$PTS) - NBA_test$PTS)^2)
R_square = 1- SSE/SST
R_square
## [1] 0.8127142
RMSE = sqrt(SSE/nrow(NBA_test))
RMSE
## [1] 196.3723
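
The out-of-sample R-squared is about 0.81, and the out-of-sample RMSE is about 196, only a little higher than the in-sample RMSE of 184.5. So the model predicts points for the 2012-2013 season nearly as well as it fit the training data.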