Let’s load the data into the R console.
NBA = read.csv("C:\\Users\\aman96\\Desktop\\the analytics edge\\unit 2\\NBA_train.csv")
Now let’s look at the structure and summary of the data.
str(NBA)
## 'data.frame': 835 obs. of 20 variables:
## $ SeasonEnd: int 1980 1980 1980 1980 1980 1980 1980 1980 1980 1980 ...
## $ Team : Factor w/ 37 levels "Atlanta Hawks",..: 1 2 5 6 8 9 10 11 12 13 ...
## $ Playoffs : int 1 1 0 0 0 0 0 1 0 1 ...
## $ W : int 50 61 30 37 30 16 24 41 37 47 ...
## $ PTS : int 8573 9303 8813 9360 8878 8933 8493 9084 9119 8860 ...
## $ oppPTS : int 8334 8664 9035 9332 9240 9609 8853 9070 9176 8603 ...
## $ FG : int 3261 3617 3362 3811 3462 3643 3527 3599 3639 3582 ...
## $ FGA : int 7027 7387 6943 8041 7470 7596 7318 7496 7689 7489 ...
## $ X2P : int 3248 3455 3292 3775 3379 3586 3500 3495 3551 3557 ...
## $ X2PA : int 6952 6965 6668 7854 7215 7377 7197 7117 7375 7375 ...
## $ X3P : int 13 162 70 36 83 57 27 104 88 25 ...
## $ X3PA : int 75 422 275 187 255 219 121 379 314 114 ...
## $ FT : int 2038 1907 2019 1702 1871 1590 1412 1782 1753 1671 ...
## $ FTA : int 2645 2449 2592 2205 2539 2149 1914 2326 2333 2250 ...
## $ ORB : int 1369 1227 1115 1307 1311 1226 1155 1394 1398 1187 ...
## $ DRB : int 2406 2457 2465 2381 2524 2415 2437 2217 2326 2429 ...
## $ AST : int 1913 2198 2152 2108 2079 1950 2028 2149 2148 2123 ...
## $ STL : int 782 809 704 764 746 783 779 782 900 863 ...
## $ BLK : int 539 308 392 342 404 562 339 373 530 356 ...
## $ TOV : int 1495 1539 1684 1370 1533 1742 1492 1565 1517 1439 ...
summary(NBA)
## SeasonEnd Team Playoffs W
## Min. :1980 Atlanta Hawks : 31 Min. :0.0000 Min. :11.0
## 1st Qu.:1989 Boston Celtics : 31 1st Qu.:0.0000 1st Qu.:31.0
## Median :1996 Chicago Bulls : 31 Median :1.0000 Median :42.0
## Mean :1996 Cleveland Cavaliers: 31 Mean :0.5749 Mean :41.0
## 3rd Qu.:2005 Denver Nuggets : 31 3rd Qu.:1.0000 3rd Qu.:50.5
## Max. :2011 Detroit Pistons : 31 Max. :1.0000 Max. :72.0
## (Other) :649
## PTS oppPTS FG FGA
## Min. : 6901 Min. : 6909 Min. :2565 Min. :5972
## 1st Qu.: 7934 1st Qu.: 7934 1st Qu.:2974 1st Qu.:6564
## Median : 8312 Median : 8365 Median :3150 Median :6831
## Mean : 8370 Mean : 8370 Mean :3200 Mean :6873
## 3rd Qu.: 8784 3rd Qu.: 8768 3rd Qu.:3434 3rd Qu.:7157
## Max. :10371 Max. :10723 Max. :3980 Max. :8868
##
## X2P X2PA X3P X3PA
## Min. :1981 Min. :4153 Min. : 10.0 Min. : 75.0
## 1st Qu.:2510 1st Qu.:5269 1st Qu.:131.5 1st Qu.: 413.0
## Median :2718 Median :5706 Median :329.0 Median : 942.0
## Mean :2881 Mean :5956 Mean :319.0 Mean : 916.9
## 3rd Qu.:3296 3rd Qu.:6754 3rd Qu.:481.5 3rd Qu.:1347.5
## Max. :3954 Max. :7873 Max. :841.0 Max. :2284.0
##
## FT FTA ORB DRB
## Min. :1189 Min. :1475 Min. : 639.0 Min. :2044
## 1st Qu.:1502 1st Qu.:2008 1st Qu.: 953.5 1st Qu.:2346
## Median :1628 Median :2176 Median :1055.0 Median :2433
## Mean :1650 Mean :2190 Mean :1061.6 Mean :2427
## 3rd Qu.:1781 3rd Qu.:2352 3rd Qu.:1167.0 3rd Qu.:2516
## Max. :2388 Max. :3051 Max. :1520.0 Max. :2753
##
## AST STL BLK TOV
## Min. :1423 Min. : 455.0 Min. :204.0 Min. : 931
## 1st Qu.:1735 1st Qu.: 599.0 1st Qu.:359.0 1st Qu.:1192
## Median :1899 Median : 658.0 Median :410.0 Median :1289
## Mean :1912 Mean : 668.4 Mean :419.8 Mean :1303
## 3rd Qu.:2078 3rd Qu.: 729.0 3rd Qu.:469.5 3rd Qu.:1396
## Max. :2575 Max. :1053.0 Max. :716.0 Max. :1873
##
The goal of a basketball team is to make the playoffs.
Let’s use the table command to get a basic feel for the data and see how wins and playoff appearances relate to each other.
table(NBA$W, NBA$Playoffs)
##
## 0 1
## 11 2 0
## 12 2 0
## 13 2 0
## 14 2 0
## 15 10 0
## 16 2 0
## 17 11 0
## 18 5 0
## 19 10 0
## 20 10 0
## 21 12 0
## 22 11 0
## 23 11 0
## 24 18 0
## 25 11 0
## 26 17 0
## 27 10 0
## 28 18 0
## 29 12 0
## 30 19 1
## 31 15 1
## 32 12 0
## 33 17 0
## 34 16 0
## 35 13 3
## 36 17 4
## 37 15 4
## 38 8 7
## 39 10 10
## 40 9 13
## 41 11 26
## 42 8 29
## 43 2 18
## 44 2 27
## 45 3 22
## 46 1 15
## 47 0 28
## 48 1 14
## 49 0 17
## 50 0 32
## 51 0 12
## 52 0 20
## 53 0 17
## 54 0 18
## 55 0 24
## 56 0 16
## 57 0 23
## 58 0 13
## 59 0 14
## 60 0 8
## 61 0 10
## 62 0 13
## 63 0 7
## 64 0 3
## 65 0 3
## 66 0 2
## 67 0 4
## 69 0 1
## 72 0 1
We can see that a team that wins 35 games or fewer almost never makes the playoffs, while a team that wins 45 games or more almost always does.
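A quick way to quantify this from the same data (a small sketch; taking the mean of the 0/1 Playoffs column gives the proportion of playoff teams):
mean(NBA$Playoffs[NBA$W <= 35])  # close to 0: teams with 35 or fewer wins rarely make the playoffs
mean(NBA$Playoffs[NBA$W >= 45])  # close to 1: teams with 45 or more wins almost always do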
Now we will add a variable that is the difference between points scored and points allowed.
NBA$PTSdiff = NBA$PTS - NBA$oppPTS
Now let’s make a scatter plot to see whether there is a linear relationship between the number of games a team wins and its point difference.
plot(NBA$PTSdiff, NBA$W, xlab="Point Difference", ylab= "Wins", pch=19, col="blue", main = "PTSdiff Vs Wins")
The plot shows a very strong linear relationship between these two variables, so linear regression should be a good way to predict how many wins a team will have, given its point difference.
WinsReg = lm(W ~ PTSdiff, data=NBA)
summary(WinsReg)
##
## Call:
## lm(formula = W ~ PTSdiff, data = NBA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.7393 -2.1018 -0.0672 2.0265 10.6026
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.100e+01 1.059e-01 387.0 <2e-16 ***
## PTSdiff 3.259e-02 2.793e-04 116.7 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.061 on 833 degrees of freedom
## Multiple R-squared: 0.9423, Adjusted R-squared: 0.9423
## F-statistic: 1.361e+04 on 1 and 833 DF, p-value: < 2.2e-16
From the summary we see that both the intercept and the PTSdiff coefficient are highly significant, and the R-squared of 0.9423 is very high. The fitted model is:
\[ W = 41 + 0.0326 \cdot \text{PTSdiff} \]
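To visualize the fit, we can redraw the scatter plot and overlay the fitted line (a quick sketch reusing the earlier plot call):
# Scatter plot of wins against point difference, with the regression line on top
plot(NBA$PTSdiff, NBA$W, xlab="Point Difference", ylab="Wins", pch=19, col="blue", main="PTSdiff Vs Wins")
abline(WinsReg, col="red", lwd=2)  # the fitted line W = 41 + 0.0326*PTSdiff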
The table earlier suggested that a team wants to win at least about 42 games to have a good chance of making the playoffs. What does that mean in terms of point difference? If we want the predicted number of wins to be at least 42, then PTSdiff needs to be at least 42 minus 41, divided by 0.0326, which comes to 30.67. So a team needs to score at least 31 more points than it allows over the season in order to win at least 42 games.
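Written out, the calculation is:
\[ 41 + 0.0326 \cdot \text{PTSdiff} \ge 42 \;\Longrightarrow\; \text{PTSdiff} \ge \frac{42 - 41}{0.0326} \approx 30.67 \]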
Now let’s build an equation to predict points scored using some common basketball statistics.
First, a quick guide to the predictors: X2PA is two-point attempts, X3PA is three-point attempts, FTA is free throw attempts, AST is assists, ORB is offensive rebounds, DRB is defensive rebounds, TOV is turnovers, STL is steals, and BLK is blocks.
PointsReg = lm(PTS ~ X2PA + X3PA + FTA + AST + ORB + DRB + TOV + STL + BLK, data=NBA)
summary(PointsReg)
##
## Call:
## lm(formula = PTS ~ X2PA + X3PA + FTA + AST + ORB + DRB + TOV +
## STL + BLK, data = NBA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -527.40 -119.83 7.83 120.67 564.71
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.051e+03 2.035e+02 -10.078 <2e-16 ***
## X2PA 1.043e+00 2.957e-02 35.274 <2e-16 ***
## X3PA 1.259e+00 3.843e-02 32.747 <2e-16 ***
## FTA 1.128e+00 3.373e-02 33.440 <2e-16 ***
## AST 8.858e-01 4.396e-02 20.150 <2e-16 ***
## ORB -9.554e-01 7.792e-02 -12.261 <2e-16 ***
## DRB 3.883e-02 6.157e-02 0.631 0.5285
## TOV -2.475e-02 6.118e-02 -0.405 0.6859
## STL -1.992e-01 9.181e-02 -2.169 0.0303 *
## BLK -5.576e-02 8.782e-02 -0.635 0.5256
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 185.5 on 825 degrees of freedom
## Multiple R-squared: 0.8992, Adjusted R-squared: 0.8981
## F-statistic: 817.3 on 9 and 825 DF, p-value: < 2.2e-16
Looking at the summary, some of the variables are highly significant, others less so: steals (STL) has only one significance star, and defensive rebounds (DRB), turnovers (TOV), and blocks (BLK) do not appear significant at all. The R-squared of 0.8992 is quite good, so there really does seem to be a linear relationship between points scored and these basketball statistics.
SSE = sum(PointsReg$residuals^2)
SSE
## [1] 28394314
The SSE here is 28,394,314, which is enormous; the sum of squared errors is simply not a very interpretable quantity on its own.
The Root Mean Squared Error (RMSE) is much more interpretable: it is roughly the average error we make in our predictions. In formula form, with n the number of observations:
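\[ \text{RMSE} = \sqrt{\frac{\text{SSE}}{n}} = \sqrt{\frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{n}} \]
Let’s calculate it for our model: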
RMSE = sqrt(SSE/nrow(NBA))
RMSE
## [1] 184.4049
mean(NBA$PTS)
## [1] 8370.24
Our RMSE is 184.4 points, which is much easier to interpret: teams score 8,370 points on average over a season, so an average prediction error of about 184 points is small by comparison.
Now let’s remove the insignificant variables one at a time.
From the summary of the previous model, turnovers (TOV) has the highest p-value, 0.6859, so we remove it first.
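As an aside, R’s update function can drop a term without retyping the whole formula; this call (a sketch, not part of the original analysis) produces a fit equivalent to the PointsReg2 model below:
PointsReg2_alt = update(PointsReg, . ~ . - TOV)  # same as refitting with TOV removed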
PointsReg2 = lm(PTS ~ X2PA + X3PA + FTA + AST + ORB + DRB + STL + BLK, data=NBA)
summary(PointsReg2)
##
## Call:
## lm(formula = PTS ~ X2PA + X3PA + FTA + AST + ORB + DRB + STL +
## BLK, data = NBA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -526.79 -121.09 6.37 120.74 565.94
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.077e+03 1.931e+02 -10.755 <2e-16 ***
## X2PA 1.044e+00 2.951e-02 35.366 <2e-16 ***
## X3PA 1.263e+00 3.703e-02 34.099 <2e-16 ***
## FTA 1.125e+00 3.308e-02 34.023 <2e-16 ***
## AST 8.861e-01 4.393e-02 20.173 <2e-16 ***
## ORB -9.581e-01 7.758e-02 -12.350 <2e-16 ***
## DRB 3.892e-02 6.154e-02 0.632 0.5273
## STL -2.068e-01 8.984e-02 -2.301 0.0216 *
## BLK -5.863e-02 8.749e-02 -0.670 0.5029
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 185.4 on 826 degrees of freedom
## Multiple R-squared: 0.8991, Adjusted R-squared: 0.8982
## F-statistic: 920.4 on 8 and 826 DF, p-value: < 2.2e-16
Our first model, PointsReg, had an R-squared of 0.8992; PointsReg2 has an R-squared of 0.8991, which is essentially identical.
So we lose nothing by removing the turnovers variable.
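If you prefer to pull the R-squared values out directly instead of reading the printed summaries, a small sketch:
summary(PointsReg)$r.squared   # 0.8992
summary(PointsReg2)$r.squared  # 0.8991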
Next we remove defensive rebounds (DRB), which now has the highest p-value.
PointsReg3 = lm(PTS ~ X2PA + X3PA + FTA + AST + ORB + STL + BLK, data=NBA)
summary(PointsReg3)
##
## Call:
## lm(formula = PTS ~ X2PA + X3PA + FTA + AST + ORB + STL + BLK,
## data = NBA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -523.79 -121.64 6.07 120.81 573.64
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.015e+03 1.670e+02 -12.068 < 2e-16 ***
## X2PA 1.048e+00 2.852e-02 36.753 < 2e-16 ***
## X3PA 1.271e+00 3.475e-02 36.568 < 2e-16 ***
## FTA 1.128e+00 3.270e-02 34.506 < 2e-16 ***
## AST 8.909e-01 4.326e-02 20.597 < 2e-16 ***
## ORB -9.702e-01 7.519e-02 -12.903 < 2e-16 ***
## STL -2.276e-01 8.356e-02 -2.724 0.00659 **
## BLK -3.882e-02 8.165e-02 -0.475 0.63462
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 185.4 on 827 degrees of freedom
## Multiple R-squared: 0.8991, Adjusted R-squared: 0.8982
## F-statistic: 1053 on 7 and 827 DF, p-value: < 2.2e-16
Looking at the summary again, the R-squared is unchanged at 0.8991, which justifies removing defensive rebounds as well.
Finally, we remove blocks (BLK), which now has the highest remaining p-value.
PointsReg4 = lm(PTS ~ X2PA + X3PA + FTA + AST + ORB + STL, data=NBA)
summary(PointsReg4)
##
## Call:
## lm(formula = PTS ~ X2PA + X3PA + FTA + AST + ORB + STL, data = NBA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -523.33 -122.02 6.93 120.68 568.26
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.033e+03 1.629e+02 -12.475 < 2e-16 ***
## X2PA 1.050e+00 2.829e-02 37.117 < 2e-16 ***
## X3PA 1.273e+00 3.441e-02 37.001 < 2e-16 ***
## FTA 1.127e+00 3.260e-02 34.581 < 2e-16 ***
## AST 8.884e-01 4.292e-02 20.701 < 2e-16 ***
## ORB -9.743e-01 7.465e-02 -13.051 < 2e-16 ***
## STL -2.268e-01 8.350e-02 -2.717 0.00673 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 185.3 on 828 degrees of freedom
## Multiple R-squared: 0.8991, Adjusted R-squared: 0.8983
## F-statistic: 1229 on 6 and 828 DF, p-value: < 2.2e-16
With blocks removed, the R-squared stayed essentially the same, but the model is much simpler than the one we started with, since all of the insignificant variables are gone.
Now let’s compute the SSE and RMSE for the new model.
SSE_4 = sum(PointsReg4$residuals^2)
RMSE_4 = sqrt(SSE_4/nrow(NBA))
SSE_4
## [1] 28421465
RMSE_4
## [1] 184.493
The SSE and RMSE of the new model increased only slightly, a small price to pay for a much simpler model that is also less prone to overfitting.
Let’s load the NBA_test data, on which we will apply our model to predict points scored.
NBA_test = read.csv("C:\\Users\\aman96\\Desktop\\the analytics edge\\unit 2\\NBA_test.csv")
Now we make predictions on the test set by applying the regression model built previously.
PointsPredictions = predict(PointsReg4, newdata=NBA_test)
The real test of our model’s accuracy is how it performs on new data, so let’s compute the SSE, out-of-sample R-squared, and RMSE on the test set.
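The out-of-sample R-squared compares our model’s errors to those of a baseline that always predicts the training-set mean (note that the code below uses mean(NBA$PTS), the mean from the training data, as that baseline):
\[ R^2 = 1 - \frac{\text{SSE}}{\text{SST}}, \qquad \text{SST} = \sum_{i}\left(\bar{y}_{\text{train}} - y_i\right)^2 \]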
SSE = sum((PointsPredictions - NBA_test$PTS)^2)
SST = sum((mean(NBA$PTS) - NBA_test$PTS)^2)
R2 = 1 - SSE/SST
R2
## [1] 0.8127142
RMSE = sqrt(SSE/nrow(NBA_test))
RMSE
## [1] 196.3723
We get an out-of-sample R-squared of 0.8127 and an RMSE of 196.37. That is a little worse than on the training set, but still not bad: against a mean of about 8,370 points, an average error of roughly 196 points is small.