Let’s load the data into the R console.
NBA = read.csv("C:\\Users\\aman96\\Desktop\\the analytics edge\\unit 2\\NBA_train.csv")
Now let’s look at the structure and summary of the data.
str(NBA)
## 'data.frame': 835 obs. of 20 variables:
## $ SeasonEnd: int 1980 1980 1980 1980 1980 1980 1980 1980 1980 1980 ...
## $ Team : Factor w/ 37 levels "Atlanta Hawks",..: 1 2 5 6 8 9 10 11 12 13 ...
## $ Playoffs : int 1 1 0 0 0 0 0 1 0 1 ...
## $ W : int 50 61 30 37 30 16 24 41 37 47 ...
## $ PTS : int 8573 9303 8813 9360 8878 8933 8493 9084 9119 8860 ...
## $ oppPTS : int 8334 8664 9035 9332 9240 9609 8853 9070 9176 8603 ...
## $ FG : int 3261 3617 3362 3811 3462 3643 3527 3599 3639 3582 ...
## $ FGA : int 7027 7387 6943 8041 7470 7596 7318 7496 7689 7489 ...
## $ X2P : int 3248 3455 3292 3775 3379 3586 3500 3495 3551 3557 ...
## $ X2PA : int 6952 6965 6668 7854 7215 7377 7197 7117 7375 7375 ...
## $ X3P : int 13 162 70 36 83 57 27 104 88 25 ...
## $ X3PA : int 75 422 275 187 255 219 121 379 314 114 ...
## $ FT : int 2038 1907 2019 1702 1871 1590 1412 1782 1753 1671 ...
## $ FTA : int 2645 2449 2592 2205 2539 2149 1914 2326 2333 2250 ...
## $ ORB : int 1369 1227 1115 1307 1311 1226 1155 1394 1398 1187 ...
## $ DRB : int 2406 2457 2465 2381 2524 2415 2437 2217 2326 2429 ...
## $ AST : int 1913 2198 2152 2108 2079 1950 2028 2149 2148 2123 ...
## $ STL : int 782 809 704 764 746 783 779 782 900 863 ...
## $ BLK : int 539 308 392 342 404 562 339 373 530 356 ...
## $ TOV : int 1495 1539 1684 1370 1533 1742 1492 1565 1517 1439 ...
summary(NBA)
## SeasonEnd Team Playoffs W
## Min. :1980 Atlanta Hawks : 31 Min. :0.0000 Min. :11.0
## 1st Qu.:1989 Boston Celtics : 31 1st Qu.:0.0000 1st Qu.:31.0
## Median :1996 Chicago Bulls : 31 Median :1.0000 Median :42.0
## Mean :1996 Cleveland Cavaliers: 31 Mean :0.5749 Mean :41.0
## 3rd Qu.:2005 Denver Nuggets : 31 3rd Qu.:1.0000 3rd Qu.:50.5
## Max. :2011 Detroit Pistons : 31 Max. :1.0000 Max. :72.0
## (Other) :649
## PTS oppPTS FG FGA
## Min. : 6901 Min. : 6909 Min. :2565 Min. :5972
## 1st Qu.: 7934 1st Qu.: 7934 1st Qu.:2974 1st Qu.:6564
## Median : 8312 Median : 8365 Median :3150 Median :6831
## Mean : 8370 Mean : 8370 Mean :3200 Mean :6873
## 3rd Qu.: 8784 3rd Qu.: 8768 3rd Qu.:3434 3rd Qu.:7157
## Max. :10371 Max. :10723 Max. :3980 Max. :8868
##
## X2P X2PA X3P X3PA
## Min. :1981 Min. :4153 Min. : 10.0 Min. : 75.0
## 1st Qu.:2510 1st Qu.:5269 1st Qu.:131.5 1st Qu.: 413.0
## Median :2718 Median :5706 Median :329.0 Median : 942.0
## Mean :2881 Mean :5956 Mean :319.0 Mean : 916.9
## 3rd Qu.:3296 3rd Qu.:6754 3rd Qu.:481.5 3rd Qu.:1347.5
## Max. :3954 Max. :7873 Max. :841.0 Max. :2284.0
##
## FT FTA ORB DRB
## Min. :1189 Min. :1475 Min. : 639.0 Min. :2044
## 1st Qu.:1502 1st Qu.:2008 1st Qu.: 953.5 1st Qu.:2346
## Median :1628 Median :2176 Median :1055.0 Median :2433
## Mean :1650 Mean :2190 Mean :1061.6 Mean :2427
## 3rd Qu.:1781 3rd Qu.:2352 3rd Qu.:1167.0 3rd Qu.:2516
## Max. :2388 Max. :3051 Max. :1520.0 Max. :2753
##
## AST STL BLK TOV
## Min. :1423 Min. : 455.0 Min. :204.0 Min. : 931
## 1st Qu.:1735 1st Qu.: 599.0 1st Qu.:359.0 1st Qu.:1192
## Median :1899 Median : 658.0 Median :410.0 Median :1289
## Mean :1912 Mean : 668.4 Mean :419.8 Mean :1303
## 3rd Qu.:2078 3rd Qu.: 729.0 3rd Qu.:469.5 3rd Qu.:1396
## Max. :2575 Max. :1053.0 Max. :716.0 Max. :1873
##
The goal of a basketball team is to make the playoffs.
Let’s use the table command to get a basic feel for the data and see how wins and playoff appearances relate to each other.
table(NBA$W, NBA$Playoffs)
##
## 0 1
## 11 2 0
## 12 2 0
## 13 2 0
## 14 2 0
## 15 10 0
## 16 2 0
## 17 11 0
## 18 5 0
## 19 10 0
## 20 10 0
## 21 12 0
## 22 11 0
## 23 11 0
## 24 18 0
## 25 11 0
## 26 17 0
## 27 10 0
## 28 18 0
## 29 12 0
## 30 19 1
## 31 15 1
## 32 12 0
## 33 17 0
## 34 16 0
## 35 13 3
## 36 17 4
## 37 15 4
## 38 8 7
## 39 10 10
## 40 9 13
## 41 11 26
## 42 8 29
## 43 2 18
## 44 2 27
## 45 3 22
## 46 1 15
## 47 0 28
## 48 1 14
## 49 0 17
## 50 0 32
## 51 0 12
## 52 0 20
## 53 0 17
## 54 0 18
## 55 0 24
## 56 0 16
## 57 0 23
## 58 0 13
## 59 0 14
## 60 0 8
## 61 0 10
## 62 0 13
## 63 0 7
## 64 0 3
## 65 0 3
## 66 0 2
## 67 0 4
## 69 0 1
## 72 0 1
We can see that a team that wins 35 games or fewer almost never makes the playoffs, while a team that wins 45 games or more almost always does.
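A quick way to quantify this from the same data (a small sketch; taking the mean of the 0/1 Playoffs column gives the proportion of playoff teams):
mean(NBA$Playoffs[NBA$W <= 35])  # close to 0: teams with 35 or fewer wins rarely make the playoffs
mean(NBA$Playoffs[NBA$W >= 45])  # close to 1: teams with 45 or more wins almost always do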
Now we will add a variable that is the difference between points scored and points allowed.
NBA$PTSdiff = NBA$PTS - NBA$oppPTS
Now let’s make a scatter plot to see whether there is a linear relationship between the number of games a team wins and its point difference.
plot(NBA$PTSdiff, NBA$W, xlab="Point Difference", ylab= "Wins", pch=19, col="blue", main = "PTSdiff Vs Wins")
The plot shows a very strong linear relationship between these two variables, so linear regression should be a good way to predict how many wins a team will have, given its point difference.
WinsReg = lm(W ~ PTSdiff, data=NBA)
summary(WinsReg)
##
## Call:
## lm(formula = W ~ PTSdiff, data = NBA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.7393 -2.1018 -0.0672 2.0265 10.6026
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.100e+01 1.059e-01 387.0 <2e-16 ***
## PTSdiff 3.259e-02 2.793e-04 116.7 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.061 on 833 degrees of freedom
## Multiple R-squared: 0.9423, Adjusted R-squared: 0.9423
## F-statistic: 1.361e+04 on 1 and 833 DF, p-value: < 2.2e-16
From the summary we see that both the intercept and the PTSdiff coefficient are highly significant, and the R-squared of 0.9423 is very high. The fitted model is:
\[ W = 41 + 0.0326 \cdot \text{PTSdiff} \]
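To visualize the fit, we can redraw the scatter plot and overlay the fitted line (a quick sketch reusing the earlier plot call):
# Scatter plot of wins against point difference, with the regression line on top
plot(NBA$PTSdiff, NBA$W, xlab="Point Difference", ylab="Wins", pch=19, col="blue", main="PTSdiff Vs Wins")
abline(WinsReg, col="red", lwd=2)  # the fitted line W = 41 + 0.0326*PTSdiff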
The table earlier suggested that a team wants to win at least about 42 games to have a good chance of making the playoffs. What does that mean in terms of point difference? If we want the predicted number of wins to be at least 42, then PTSdiff needs to be at least 42 minus 41, divided by 0.0326, which comes to 30.67. So a team needs to score at least 31 more points than it allows over the season in order to win at least 42 games.
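Written out, the calculation is:
\[ 41 + 0.0326 \cdot \text{PTSdiff} \ge 42 \;\Longrightarrow\; \text{PTSdiff} \ge \frac{42 - 41}{0.0326} \approx 30.67 \]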
Now let’s build an equation to predict points scored using some common basketball statistics.
First, a quick guide to the predictors: X2PA is two-point attempts, X3PA is three-point attempts, FTA is free throw attempts, AST is assists, ORB is offensive rebounds, DRB is defensive rebounds, TOV is turnovers, STL is steals, and BLK is blocks.
PointsReg = lm(PTS ~ X2PA + X3PA + FTA + AST + ORB + DRB + TOV + STL + BLK, data=NBA)
summary(PointsReg)
##
## Call:
## lm(formula = PTS ~ X2PA + X3PA + FTA + AST + ORB + DRB + TOV +
## STL + BLK, data = NBA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -527.40 -119.83 7.83 120.67 564.71
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.051e+03 2.035e+02 -10.078 <2e-16 ***
## X2PA 1.043e+00 2.957e-02 35.274 <2e-16 ***
## X3PA 1.259e+00 3.843e-02 32.747 <2e-16 ***
## FTA 1.128e+00 3.373e-02 33.440 <2e-16 ***
## AST 8.858e-01 4.396e-02 20.150 <2e-16 ***
## ORB -9.554e-01 7.792e-02 -12.261 <2e-16 ***
## DRB 3.883e-02 6.157e-02 0.631 0.5285
## TOV -2.475e-02 6.118e-02 -0.405 0.6859
## STL -1.992e-01 9.181e-02 -2.169 0.0303 *
## BLK -5.576e-02 8.782e-02 -0.635 0.5256
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 185.5 on 825 degrees of freedom
## Multiple R-squared: 0.8992, Adjusted R-squared: 0.8981
## F-statistic: 817.3 on 9 and 825 DF, p-value: < 2.2e-16
Looking at the summary, some of the variables are highly significant, others less so: steals (STL) has only one significance star, and defensive rebounds (DRB), turnovers (TOV), and blocks (BLK) do not appear significant at all. The R-squared of 0.8992 is quite good, so there really does seem to be a linear relationship between points scored and these basketball statistics.
SSE = sum(PointsReg$residuals^2)
SSE
## [1] 28394314
The SSE here is 28,394,314, which is enormous; the sum of squared errors is simply not a very interpretable quantity on its own.
The Root Mean Squared Error (RMSE) is much more interpretable: it is roughly the average error we make in our predictions. In formula form, with n the number of observations:
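\[ \text{RMSE} = \sqrt{\frac{\text{SSE}}{n}} = \sqrt{\frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{n}} \]
Let’s calculate it for our model: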
RMSE = sqrt(SSE/nrow(NBA))
RMSE
## [1] 184.4049
mean(NBA$PTS)
## [1] 8370.24
Our RMSE is 184.4 points, which is much easier to interpret: teams score 8,370 points on average over a season, so an average prediction error of about 184 points is small by comparison.
Now let’s remove the insignificant variables one at a time.
From the summary of the previous model, turnovers (TOV) has the highest p-value, 0.6859, so we remove it first.
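As an aside, R’s update function can drop a term without retyping the whole formula; this call (a sketch, not part of the original analysis) produces a fit equivalent to the PointsReg2 model below:
PointsReg2_alt = update(PointsReg, . ~ . - TOV)  # same as refitting with TOV removed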
PointsReg2 = lm(PTS ~ X2PA + X3PA + FTA + AST + ORB + DRB + STL + BLK, data=NBA)
summary(PointsReg2)
##
## Call:
## lm(formula = PTS ~ X2PA + X3PA + FTA + AST + ORB + DRB + STL +
## BLK, data = NBA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -526.79 -121.09 6.37 120.74 565.94
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.077e+03 1.931e+02 -10.755 <2e-16 ***
## X2PA 1.044e+00 2.951e-02 35.366 <2e-16 ***
## X3PA 1.263e+00 3.703e-02 34.099 <2e-16 ***
## FTA 1.125e+00 3.308e-02 34.023 <2e-16 ***
## AST 8.861e-01 4.393e-02 20.173 <2e-16 ***
## ORB -9.581e-01 7.758e-02 -12.350 <2e-16 ***
## DRB 3.892e-02 6.154e-02 0.632 0.5273
## STL -2.068e-01 8.984e-02 -2.301 0.0216 *
## BLK -5.863e-02 8.749e-02 -0.670 0.5029
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 185.4 on 826 degrees of freedom
## Multiple R-squared: 0.8991, Adjusted R-squared: 0.8982
## F-statistic: 920.4 on 8 and 826 DF, p-value: < 2.2e-16
Our first model, PointsReg, had an R-squared of 0.8992; PointsReg2 has an R-squared of 0.8991, which is essentially identical.
So we lose nothing by removing the turnovers variable.
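If you prefer to pull the R-squared values out directly instead of reading the printed summaries, a small sketch:
summary(PointsReg)$r.squared   # 0.8992
summary(PointsReg2)$r.squared  # 0.8991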
Next we remove defensive rebounds (DRB), which now has the highest p-value.
PointsReg3 = lm(PTS ~ X2PA + X3PA + FTA + AST + ORB + STL + BLK, data=NBA)
summary(PointsReg3)
##
## Call:
## lm(formula = PTS ~ X2PA + X3PA + FTA + AST + ORB + STL + BLK,
## data = NBA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -523.79 -121.64 6.07 120.81 573.64
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.015e+03 1.670e+02 -12.068 < 2e-16 ***
## X2PA 1.048e+00 2.852e-02 36.753 < 2e-16 ***
## X3PA 1.271e+00 3.475e-02 36.568 < 2e-16 ***
## FTA 1.128e+00 3.270e-02 34.506 < 2e-16 ***
## AST 8.909e-01 4.326e-02 20.597 < 2e-16 ***
## ORB -9.702e-01 7.519e-02 -12.903 < 2e-16 ***
## STL -2.276e-01 8.356e-02 -2.724 0.00659 **
## BLK -3.882e-02 8.165e-02 -0.475 0.63462
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 185.4 on 827 degrees of freedom
## Multiple R-squared: 0.8991, Adjusted R-squared: 0.8982
## F-statistic: 1053 on 7 and 827 DF, p-value: < 2.2e-16
Looking at the summary again, the R-squared is unchanged at 0.8991, which justifies removing defensive rebounds as well.
Finally, we remove blocks (BLK), which now has the highest remaining p-value.
PointsReg4 = lm(PTS ~ X2PA + X3PA + FTA + AST + ORB + STL, data=NBA)
summary(PointsReg4)
##
## Call:
## lm(formula = PTS ~ X2PA + X3PA + FTA + AST + ORB + STL, data = NBA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -523.33 -122.02 6.93 120.68 568.26
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.033e+03 1.629e+02 -12.475 < 2e-16 ***
## X2PA 1.050e+00 2.829e-02 37.117 < 2e-16 ***
## X3PA 1.273e+00 3.441e-02 37.001 < 2e-16 ***
## FTA 1.127e+00 3.260e-02 34.581 < 2e-16 ***
## AST 8.884e-01 4.292e-02 20.701 < 2e-16 ***
## ORB -9.743e-01 7.465e-02 -13.051 < 2e-16 ***
## STL -2.268e-01 8.350e-02 -2.717 0.00673 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 185.3 on 828 degrees of freedom
## Multiple R-squared: 0.8991, Adjusted R-squared: 0.8983
## F-statistic: 1229 on 6 and 828 DF, p-value: < 2.2e-16
With blocks removed, the R-squared stayed essentially the same, but the model is much simpler than the one we started with, since all of the insignificant variables are gone.
Now let’s compute the SSE and RMSE for the new model.
SSE_4 = sum(PointsReg4$residuals^2)
RMSE_4 = sqrt(SSE_4/nrow(NBA))
SSE_4
## [1] 28421465
RMSE_4
## [1] 184.493
The SSE and RMSE of the new model increased only slightly, a small price to pay for a much simpler model that is also less prone to overfitting.
Let’s load the NBA_test data, on which we will apply our model to predict points scored.
NBA_test = read.csv("C:\\Users\\aman96\\Desktop\\the analytics edge\\unit 2\\NBA_test.csv")
Now we make predictions on the test set by applying the regression model built previously.
PointsPredictions = predict(PointsReg4, newdata=NBA_test)
The real test of our model’s accuracy is how it performs on new data, so let’s compute the SSE, out-of-sample R-squared, and RMSE on the test set.
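The out-of-sample R-squared compares our model’s errors to those of a baseline that always predicts the training-set mean (note that the code below uses mean(NBA$PTS), the mean from the training data, as that baseline):
\[ R^2 = 1 - \frac{\text{SSE}}{\text{SST}}, \qquad \text{SST} = \sum_{i}\left(\bar{y}_{\text{train}} - y_i\right)^2 \]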
SSE = sum((PointsPredictions - NBA_test$PTS)^2)
SST = sum((mean(NBA$PTS) - NBA_test$PTS)^2)
R2 = 1 - SSE/SST
R2
## [1] 0.8127142
RMSE = sqrt(SSE/nrow(NBA_test))
RMSE
## [1] 196.3723
We get an out-of-sample R-squared of 0.8127 and an RMSE of 196.37. That is a little worse than on the training set, but still not bad: against a mean of about 8,370 points, an average error of roughly 196 points is small.