We begin by loading our data set.
baseball <- read.csv("C:/Users/salil/Desktop/AnalyticsEdgeFolder/baseball.csv")
This is what our data set looks like:
rbind(head(baseball), tail(baseball))
## Team League Year RS RA W OBP SLG BA Playoffs RankSeason
## 1 ARI NL 2012 734 688 81 0.328 0.418 0.259 0 NA
## 2 ATL NL 2012 700 600 94 0.320 0.389 0.247 1 4
## 3 BAL AL 2012 712 705 93 0.311 0.417 0.247 1 5
## 4 BOS AL 2012 734 806 69 0.315 0.415 0.260 0 NA
## 5 CHC NL 2012 613 759 61 0.302 0.378 0.240 0 NA
## 6 CHW AL 2012 748 676 85 0.318 0.422 0.255 0 NA
## 1227 NYY AL 1962 817 680 96 0.337 0.426 0.267 1 2
## 1228 PHI NL 1962 705 759 81 0.330 0.390 0.260 0 NA
## 1229 PIT NL 1962 706 626 93 0.321 0.394 0.268 0 NA
## 1230 SFG NL 1962 878 690 103 0.341 0.441 0.278 1 1
## 1231 STL NL 1962 774 664 84 0.335 0.394 0.271 0 NA
## 1232 WSA AL 1962 599 716 60 0.308 0.373 0.250 0 NA
## RankPlayoffs G OOBP OSLG
## 1 NA 162 0.317 0.415
## 2 5 162 0.306 0.378
## 3 4 162 0.315 0.403
## 4 NA 162 0.331 0.428
## 5 NA 162 0.335 0.424
## 6 NA 162 0.319 0.405
## 1227 1 162 NA NA
## 1228 NA 161 NA NA
## 1229 NA 161 NA NA
## 1230 2 165 NA NA
## 1231 NA 163 NA NA
## 1232 NA 162 NA NA
We also check the structure of the data set using the ‘str’ function.
str(baseball)
## 'data.frame': 1232 obs. of 15 variables:
## $ Team : Factor w/ 39 levels "ANA","ARI","ATL",..: 2 3 4 5 7 8 9 10 11 12 ...
## $ League : Factor w/ 2 levels "AL","NL": 2 2 1 1 2 1 2 1 2 1 ...
## $ Year : int 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
## $ RS : int 734 700 712 734 613 748 669 667 758 726 ...
## $ RA : int 688 600 705 806 759 676 588 845 890 670 ...
## $ W : int 81 94 93 69 61 85 97 68 64 88 ...
## $ OBP : num 0.328 0.32 0.311 0.315 0.302 0.318 0.315 0.324 0.33 0.335 ...
## $ SLG : num 0.418 0.389 0.417 0.415 0.378 0.422 0.411 0.381 0.436 0.422 ...
## $ BA : num 0.259 0.247 0.247 0.26 0.24 0.255 0.251 0.251 0.274 0.268 ...
## $ Playoffs : int 0 1 1 0 0 0 1 0 0 1 ...
## $ RankSeason : int NA 4 5 NA NA NA 2 NA NA 6 ...
## $ RankPlayoffs: int NA 5 4 NA NA NA 4 NA NA 2 ...
## $ G : int 162 162 162 162 162 162 162 162 162 162 ...
## $ OOBP : num 0.317 0.306 0.315 0.331 0.335 0.319 0.305 0.336 0.357 0.314 ...
## $ OSLG : num 0.415 0.378 0.403 0.428 0.424 0.405 0.39 0.43 0.47 0.402 ...
We will briefly discuss each column and variable as and when required.
Moneyball is a book by Michael Lewis that discusses how sports analytics changed baseball. Since we want to verify the claims made in Moneyball using only the data that was available in 2002, we build our model on that portion of the data. We use the ‘subset’ function to keep the observations from before 2002.
moneyball <- subset(baseball, Year < 2002)
str(moneyball)
## 'data.frame': 902 obs. of 15 variables:
## $ Team : Factor w/ 39 levels "ANA","ARI","ATL",..: 1 2 3 4 5 7 8 9 10 11 ...
## $ League : Factor w/ 2 levels "AL","NL": 1 2 2 1 1 2 1 2 1 2 ...
## $ Year : int 2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 ...
## $ RS : int 691 818 729 687 772 777 798 735 897 923 ...
## $ RA : int 730 677 643 829 745 701 795 850 821 906 ...
## $ W : int 75 92 88 63 82 88 83 66 91 73 ...
## $ OBP : num 0.327 0.341 0.324 0.319 0.334 0.336 0.334 0.324 0.35 0.354 ...
## $ SLG : num 0.405 0.442 0.412 0.38 0.439 0.43 0.451 0.419 0.458 0.483 ...
## $ BA : num 0.261 0.267 0.26 0.248 0.266 0.261 0.268 0.262 0.278 0.292 ...
## $ Playoffs : int 0 1 1 0 0 0 0 0 1 0 ...
## $ RankSeason : int NA 5 7 NA NA NA NA NA 6 NA ...
## $ RankPlayoffs: int NA 1 3 NA NA NA NA NA 4 NA ...
## $ G : int 162 162 162 162 161 162 162 162 162 162 ...
## $ OOBP : num 0.331 0.311 0.314 0.337 0.329 0.321 0.334 0.341 0.341 0.35 ...
## $ OSLG : num 0.412 0.404 0.384 0.439 0.393 0.398 0.427 0.455 0.417 0.48 ...
We first wish to predict wins with a linear regression model in which the number of wins ‘W’ is the dependent variable and the run difference is the independent variable. Rather than using ‘RS’ (runs scored) and ‘RA’ (runs allowed) as two separate independent variables, we create a new column ‘RD’ holding the difference between the two.
moneyball$RD <- moneyball$RS - moneyball$RA
We now use the ‘lm’ function to run a regression where ‘W’ will be our dependent variable and ‘RD’ will be our independent variable.
WinsReg <- lm(W ~ RD, data = moneyball)
We now check the summary of the model created.
summary(WinsReg)
##
## Call:
## lm(formula = W ~ RD, data = moneyball)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.2662 -2.6509 0.1234 2.9364 11.6570
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 80.881375 0.131157 616.67 <2e-16 ***
## RD 0.105766 0.001297 81.55 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.939 on 900 degrees of freedom
## Multiple R-squared: 0.8808, Adjusted R-squared: 0.8807
## F-statistic: 6651 on 1 and 900 DF, p-value: < 2.2e-16
We can see that both the intercept and the ‘RD’ coefficient are significant, as indicated by the three asterisks beside each. The R sq is also high, at approximately 0.88.
Now we check the claim that a run difference of about 135 is needed to win at least 95 games and qualify for the playoffs.
From the linear regression model, the equation we get is
W = 80.881375 + 0.105766(RD)
According to the claim, the number of wins has to be 95, so 95 = 80.881375 + 0.105766(RD).
Solving for RD gives a run difference of about 133.4, which is approximately 135. So the claim holds.
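As a quick check, we can solve for the required run difference directly from the fitted coefficients:
# Solve 95 = intercept + slope * RD for RD
(95 - coef(WinsReg)[1]) / coef(WinsReg)[2]
# approximately 133.4, close to the claimed 135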
To engineer a run difference of around 135, we need to be able to predict ‘Runs Scored’ and ‘Runs Allowed’, using batting statistics and pitching statistics respectively.
RunsScored1 <- lm(RS ~ OBP + SLG + BA, data = moneyball)
summary(RunsScored1)
##
## Call:
## lm(formula = RS ~ OBP + SLG + BA, data = moneyball)
##
## Residuals:
## Min 1Q Median 3Q Max
## -70.941 -17.247 -0.621 16.754 90.998
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -788.46 19.70 -40.029 < 2e-16 ***
## OBP 2917.42 110.47 26.410 < 2e-16 ***
## SLG 1637.93 45.99 35.612 < 2e-16 ***
## BA -368.97 130.58 -2.826 0.00482 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.69 on 898 degrees of freedom
## Multiple R-squared: 0.9302, Adjusted R-squared: 0.93
## F-statistic: 3989 on 3 and 898 DF, p-value: < 2.2e-16
‘BA’, or ‘Batting Average’, has two asterisks beside it. It is a significant variable in determining ‘Runs Scored’, but not as significant as ‘On-Base Percentage’ (‘OBP’) or ‘Slugging Percentage’ (‘SLG’).
Also, the coefficient of ‘BA’ is negative, which is counterintuitive. This probably happens because of multicollinearity among the independent variables. Let us check the correlation coefficients between them.
cor(moneyball$OBP,moneyball$BA)
## [1] 0.8540549
cor(moneyball$SLG,moneyball$BA)
## [1] 0.8140681
cor(moneyball$OBP,moneyball$SLG)
## [1] 0.8061539
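The full correlation matrix can also be obtained in a single call:
# Pairwise correlations of the three batting statistics
cor(moneyball[, c("OBP", "SLG", "BA")])
All three pairwise correlations are above 0.8, confirming strong multicollinearity.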
We therefore build a different model without the ‘BA’ variable and check the R sq value.
RunsScored2 <- lm(RS ~ OBP + SLG, data = moneyball)
summary(RunsScored2)
##
## Call:
## lm(formula = RS ~ OBP + SLG, data = moneyball)
##
## Residuals:
## Min 1Q Median 3Q Max
## -70.838 -17.174 -1.108 16.770 90.036
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -804.63 18.92 -42.53 <2e-16 ***
## OBP 2737.77 90.68 30.19 <2e-16 ***
## SLG 1584.91 42.16 37.60 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.79 on 899 degrees of freedom
## Multiple R-squared: 0.9296, Adjusted R-squared: 0.9294
## F-statistic: 5934 on 2 and 899 DF, p-value: < 2.2e-16
Here the R sq value is almost the same, and the coefficient of ‘OBP’ is larger than that of ‘SLG’ even though the two variables are on similar scales. This suggests that ‘OBP’ is an extremely important variable and ‘SLG’ an important one.
We now try building models that keep ‘BA’ but drop one of the other two variables.
RunsScored3 <- lm(RS ~ SLG + BA, data = moneyball)
RunsScored4 <- lm(RS ~ OBP + BA, data = moneyball)
summary(RunsScored3)
##
## Call:
## lm(formula = RS ~ SLG + BA, data = moneyball)
##
## Residuals:
## Min 1Q Median 3Q Max
## -84.636 -23.771 -2.377 20.603 112.223
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -526.90 22.68 -23.23 <2e-16 ***
## SLG 2083.82 57.00 36.56 <2e-16 ***
## BA 1615.95 142.25 11.36 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.9 on 899 degrees of freedom
## Multiple R-squared: 0.876, Adjusted R-squared: 0.8757
## F-statistic: 3175 on 2 and 899 DF, p-value: < 2.2e-16
summary(RunsScored4)
##
## Call:
## lm(formula = RS ~ OBP + BA, data = moneyball)
##
## Residuals:
## Min 1Q Median 3Q Max
## -112.558 -27.370 1.166 26.810 111.769
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1107.98 27.22 -40.706 < 2e-16 ***
## OBP 4361.44 159.50 27.344 < 2e-16 ***
## BA 1528.12 185.06 8.257 5.29e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 38.33 on 899 degrees of freedom
## Multiple R-squared: 0.8316, Adjusted R-squared: 0.8313
## F-statistic: 2220 on 2 and 899 DF, p-value: < 2.2e-16
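For a side-by-side comparison, we can extract the R sq of all four candidate models at once:
# Multiple R-squared of each candidate model, in order
sapply(list(RunsScored1, RunsScored2, RunsScored3, RunsScored4),
       function(m) summary(m)$r.squared)
# 0.9302 0.9296 0.8760 0.8316, matching the summaries above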
In both cases the R sq drops considerably, to about 0.876 and 0.832 respectively. So ‘RunsScored2’ is our final model:
RS = -804.63 + 2737.77(OBP) + 1584.91(SLG)
We now repeat the same procedure to predict ‘Runs Allowed’, using the opponents’ on-base percentage (‘OOBP’) and opponents’ slugging percentage (‘OSLG’).
RunsAllowed1 <- lm(RA ~ OOBP + OSLG, data = moneyball)
summary(RunsAllowed1)
##
## Call:
## lm(formula = RA ~ OOBP + OSLG, data = moneyball)
##
## Residuals:
## Min 1Q Median 3Q Max
## -82.397 -15.178 -0.129 17.679 60.955
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -837.38 60.26 -13.897 < 2e-16 ***
## OOBP 2913.60 291.97 9.979 4.46e-16 ***
## OSLG 1514.29 175.43 8.632 2.55e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25.67 on 87 degrees of freedom
## (812 observations deleted due to missingness)
## Multiple R-squared: 0.9073, Adjusted R-squared: 0.9052
## F-statistic: 425.8 on 2 and 87 DF, p-value: < 2.2e-16
The R sq value is high: 0.9073, with an adjusted R sq of 0.9052. Note, however, that only 90 observations have ‘OOBP’ and ‘OSLG’ recorded; the other 812 were dropped due to missing values.
The final model we get is RA = -837.38 + 2913.60(OOBP) + 1514.29(OSLG)
Using these models, we can now try to predict the number of wins of any team.
For this we need to predict the runs the team will score and the runs it will allow. As there are new players on the team every year, we estimate the new team statistics from past player performance.
Let us consider the case of the Oakland A’s.
Using the 2001 regular season statistics for these players, team OBP = 0.339 and team SLG = 0.430.
Plugging these into RS = -804.63 + 2737.77(OBP) + 1584.91(SLG) gives predicted runs scored of about 805.
Similarly, using the other model with OOBP = 0.307 and OSLG = 0.373, we get RA of about 622.
Finally, Wins = 80.8814 + 0.1058(RS - RA) = 80.8814 + 0.1058(805 - 622), which predicts about 100 wins.
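The whole calculation can be scripted using the rounded coefficients quoted above:
# Estimated 2002 Oakland A's team statistics, built from 2001 player data
RS_pred <- -804.63 + 2737.77 * 0.339 + 1584.91 * 0.430   # runs scored, ~805
RA_pred <- -837.38 + 2913.60 * 0.307 + 1514.29 * 0.373   # runs allowed, ~622
80.8814 + 0.1058 * (RS_pred - RA_pred)                   # predicted wins, ~100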
We will now use a similar method to make predictions on NBA data. This time we will also check whether the basic assumptions of the multiple linear regression model hold.
We first load in our training data set.
NBAtrain <- read.csv("C:/Users/salil/Desktop/AnalyticsEdgeFolder/NBA_train.csv")
str(NBAtrain)
## 'data.frame': 835 obs. of 20 variables:
## $ SeasonEnd: int 1980 1980 1980 1980 1980 1980 1980 1980 1980 1980 ...
## $ Team : Factor w/ 37 levels "Atlanta Hawks",..: 1 2 5 6 8 9 10 11 12 13 ...
## $ Playoffs : int 1 1 0 0 0 0 0 1 0 1 ...
## $ W : int 50 61 30 37 30 16 24 41 37 47 ...
## $ PTS : int 8573 9303 8813 9360 8878 8933 8493 9084 9119 8860 ...
## $ oppPTS : int 8334 8664 9035 9332 9240 9609 8853 9070 9176 8603 ...
## $ FG : int 3261 3617 3362 3811 3462 3643 3527 3599 3639 3582 ...
## $ FGA : int 7027 7387 6943 8041 7470 7596 7318 7496 7689 7489 ...
## $ X2P : int 3248 3455 3292 3775 3379 3586 3500 3495 3551 3557 ...
## $ X2PA : int 6952 6965 6668 7854 7215 7377 7197 7117 7375 7375 ...
## $ X3P : int 13 162 70 36 83 57 27 104 88 25 ...
## $ X3PA : int 75 422 275 187 255 219 121 379 314 114 ...
## $ FT : int 2038 1907 2019 1702 1871 1590 1412 1782 1753 1671 ...
## $ FTA : int 2645 2449 2592 2205 2539 2149 1914 2326 2333 2250 ...
## $ ORB : int 1369 1227 1115 1307 1311 1226 1155 1394 1398 1187 ...
## $ DRB : int 2406 2457 2465 2381 2524 2415 2437 2217 2326 2429 ...
## $ AST : int 1913 2198 2152 2108 2079 1950 2028 2149 2148 2123 ...
## $ STL : int 782 809 704 764 746 783 779 782 900 863 ...
## $ BLK : int 539 308 392 342 404 562 339 373 530 356 ...
## $ TOV : int 1495 1539 1684 1370 1533 1742 1492 1565 1517 1439 ...
Each of these variables will be explained as and when required. Some variable names begin with an ‘X’ because R prepends one to column names that start with a number.
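For instance, read.csv (with its default check.names = TRUE) passes column names through make.names, which is where the prefix comes from:
# Column names that start with a digit are not valid R names
make.names(c("2P", "2PA", "3P", "3PA"))
## [1] "X2P"  "X2PA" "X3P"  "X3PA"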
Now let us see the number of games a team would probably need to win in order to make it to the playoffs.
table(NBAtrain$W,NBAtrain$Playoffs)
##
## 0 1
## 11 2 0
## 12 2 0
## 13 2 0
## 14 2 0
## 15 10 0
## 16 2 0
## 17 11 0
## 18 5 0
## 19 10 0
## 20 10 0
## 21 12 0
## 22 11 0
## 23 11 0
## 24 18 0
## 25 11 0
## 26 17 0
## 27 10 0
## 28 18 0
## 29 12 0
## 30 19 1
## 31 15 1
## 32 12 0
## 33 17 0
## 34 16 0
## 35 13 3
## 36 17 4
## 37 15 4
## 38 8 7
## 39 10 10
## 40 9 13
## 41 11 26
## 42 8 29
## 43 2 18
## 44 2 27
## 45 3 22
## 46 1 15
## 47 0 28
## 48 1 14
## 49 0 17
## 50 0 32
## 51 0 12
## 52 0 20
## 53 0 17
## 54 0 18
## 55 0 24
## 56 0 16
## 57 0 23
## 58 0 13
## 59 0 14
## 60 0 8
## 61 0 10
## 62 0 13
## 63 0 7
## 64 0 3
## 65 0 3
## 66 0 2
## 67 0 4
## 69 0 1
## 72 0 1
A ‘0’ means the team did not make the playoffs and a ‘1’ means it did.
According to this table, a team that wins 37 or fewer games has hardly any chance of making the playoffs, while a team that wins 42 or more games has a very good chance. Anything in between is ambiguous: the team’s playoff entry could go either way.
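We can quantify this by computing the fraction of teams that made the playoffs within each band of win totals:
# Share of playoff teams among those with <=37, 38-41, and >=42 wins
tapply(NBAtrain$Playoffs,
       cut(NBAtrain$W, breaks = c(0, 37, 41, 100)),
       mean)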
Let us now check whether a team’s total number of wins depends on its points difference over the season, that is, the difference between the points the team scored and the points its opponents scored.
NBAtrain$PtsDiff <- NBAtrain$PTS - NBAtrain$oppPTS
If we plot wins against the points difference, we get:
plot(NBAtrain$PtsDiff, NBAtrain$W)
We can observe a strong linear relationship between the two: the higher the points difference, the more wins. We now build a regression model for this relationship.
WinsNBA <- lm(W ~ PtsDiff, data = NBAtrain)
summary(WinsNBA)
##
## Call:
## lm(formula = W ~ PtsDiff, data = NBAtrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.7393 -2.1018 -0.0672 2.0265 10.6026
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.100e+01 1.059e-01 387.0 <2e-16 ***
## PtsDiff 3.259e-02 2.793e-04 116.7 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.061 on 833 degrees of freedom
## Multiple R-squared: 0.9423, Adjusted R-squared: 0.9423
## F-statistic: 1.361e+04 on 1 and 833 DF, p-value: < 2.2e-16
The R sq is quite high, at about 0.94, and both the intercept and the ‘PtsDiff’ coefficient are strongly significant.
From the regression model we get
W = 41 + 0.0326(PtsDiff)
From the table earlier, a team has to win roughly 42 games to have a good chance of making the playoffs. We can then calculate the points difference needed for any target number of wins: solving with W = 42 gives PtsDiff approximately equal to 31. So, to be in a comfortable position, this should be the minimum points difference.
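Solving with the fitted coefficients directly:
# Points difference needed for an expected 42 wins
(42 - coef(WinsNBA)[1]) / coef(WinsNBA)[2]
# approximately 31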
Now, let us model the points scored using common basketball statistics.
PointsReg <- lm(PTS ~ X2PA + X3PA + FTA + AST + ORB + DRB + TOV + STL + BLK, data = NBAtrain)
summary(PointsReg)
##
## Call:
## lm(formula = PTS ~ X2PA + X3PA + FTA + AST + ORB + DRB + TOV +
## STL + BLK, data = NBAtrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -527.40 -119.83 7.83 120.67 564.71
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.051e+03 2.035e+02 -10.078 <2e-16 ***
## X2PA 1.043e+00 2.957e-02 35.274 <2e-16 ***
## X3PA 1.259e+00 3.843e-02 32.747 <2e-16 ***
## FTA 1.128e+00 3.373e-02 33.440 <2e-16 ***
## AST 8.858e-01 4.396e-02 20.150 <2e-16 ***
## ORB -9.554e-01 7.792e-02 -12.261 <2e-16 ***
## DRB 3.883e-02 6.157e-02 0.631 0.5285
## TOV -2.475e-02 6.118e-02 -0.405 0.6859
## STL -1.992e-01 9.181e-02 -2.169 0.0303 *
## BLK -5.576e-02 8.782e-02 -0.635 0.5256
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 185.5 on 825 degrees of freedom
## Multiple R-squared: 0.8992, Adjusted R-squared: 0.8981
## F-statistic: 817.3 on 9 and 825 DF, p-value: < 2.2e-16
‘DRB’ (‘Defensive Rebounds’), ‘TOV’ (‘Turnovers’) and ‘BLK’ (‘Blocks’) are not statistically significant, as no asterisks appear beside them. We will now remove them one at a time and check the interpretability of the resulting models.
In the coefficient table, the ‘Estimate’ of the intercept is the predicted value of Y when all the X variables are zero. Each other ‘Estimate’ is the average change in Y for a one-unit change in that X, holding the other variables fixed. ‘Std. Error’ is the estimated variability of a coefficient due to sampling variability, and the t-value is the ratio of the two, Estimate/Std. Error: the larger the absolute t-value, the larger the estimated effect relative to its uncertainty.
The null hypothesis for each coefficient is that the slope parameter is zero; the alternative hypothesis is that it is non-zero, i.e. the variable matters. The p-value is the probability of observing a t-value at least this extreme if the null hypothesis were true, so a small p-value indicates a statistically significant variable.
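These quantities can also be extracted programmatically from the fitted model:
# The coefficient table as a matrix with columns
# Estimate, Std. Error, t value and Pr(>|t|)
coef(summary(PointsReg))
For example, for ‘X2PA’ the t-value is 1.043/0.02957, which is about 35.3, matching the summary above.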
As ‘TOV’ has the highest p-value, we remove it from the model first.
PointsRegA <- lm(PTS ~ X2PA + X3PA + FTA + AST + ORB + DRB + STL + BLK, data = NBAtrain)
summary(PointsRegA)
##
## Call:
## lm(formula = PTS ~ X2PA + X3PA + FTA + AST + ORB + DRB + STL +
## BLK, data = NBAtrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -526.79 -121.09 6.37 120.74 565.94
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.077e+03 1.931e+02 -10.755 <2e-16 ***
## X2PA 1.044e+00 2.951e-02 35.366 <2e-16 ***
## X3PA 1.263e+00 3.703e-02 34.099 <2e-16 ***
## FTA 1.125e+00 3.308e-02 34.023 <2e-16 ***
## AST 8.861e-01 4.393e-02 20.173 <2e-16 ***
## ORB -9.581e-01 7.758e-02 -12.350 <2e-16 ***
## DRB 3.892e-02 6.154e-02 0.632 0.5273
## STL -2.068e-01 8.984e-02 -2.301 0.0216 *
## BLK -5.863e-02 8.749e-02 -0.670 0.5029
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 185.4 on 826 degrees of freedom
## Multiple R-squared: 0.8991, Adjusted R-squared: 0.8982
## F-statistic: 920.4 on 8 and 826 DF, p-value: < 2.2e-16
The R sq value barely changes, so it was safe to remove the variable. ‘DRB’ now has the highest p-value (about 0.53), so we remove it next.
PointsRegB <- lm(PTS ~ X2PA + X3PA + FTA + AST + ORB + STL + BLK, data = NBAtrain)
summary(PointsRegB)
##
## Call:
## lm(formula = PTS ~ X2PA + X3PA + FTA + AST + ORB + STL + BLK,
## data = NBAtrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -523.79 -121.64 6.07 120.81 573.64
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.015e+03 1.670e+02 -12.068 < 2e-16 ***
## X2PA 1.048e+00 2.852e-02 36.753 < 2e-16 ***
## X3PA 1.271e+00 3.475e-02 36.568 < 2e-16 ***
## FTA 1.128e+00 3.270e-02 34.506 < 2e-16 ***
## AST 8.909e-01 4.326e-02 20.597 < 2e-16 ***
## ORB -9.702e-01 7.519e-02 -12.903 < 2e-16 ***
## STL -2.276e-01 8.356e-02 -2.724 0.00659 **
## BLK -3.882e-02 8.165e-02 -0.475 0.63462
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 185.4 on 827 degrees of freedom
## Multiple R-squared: 0.8991, Adjusted R-squared: 0.8982
## F-statistic: 1053 on 7 and 827 DF, p-value: < 2.2e-16
The R sq value is still essentially unchanged. Now let us remove ‘BLK’, which has the highest remaining p-value (about 0.63).
PointsRegC <- lm(PTS ~ X2PA + X3PA + FTA + AST + ORB + STL, data = NBAtrain)
summary(PointsRegC)
##
## Call:
## lm(formula = PTS ~ X2PA + X3PA + FTA + AST + ORB + STL, data = NBAtrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -523.33 -122.02 6.93 120.68 568.26
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.033e+03 1.629e+02 -12.475 < 2e-16 ***
## X2PA 1.050e+00 2.829e-02 37.117 < 2e-16 ***
## X3PA 1.273e+00 3.441e-02 37.001 < 2e-16 ***
## FTA 1.127e+00 3.260e-02 34.581 < 2e-16 ***
## AST 8.884e-01 4.292e-02 20.701 < 2e-16 ***
## ORB -9.743e-01 7.465e-02 -13.051 < 2e-16 ***
## STL -2.268e-01 8.350e-02 -2.717 0.00673 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 185.3 on 828 degrees of freedom
## Multiple R-squared: 0.8991, Adjusted R-squared: 0.8983
## F-statistic: 1229 on 6 and 828 DF, p-value: < 2.2e-16
Finally, although ‘STL’ is still significant at the 1% level, it is the weakest of the remaining variables, so we remove it as well and check whether the fit suffers.
PointsRegD <- lm(PTS ~ X2PA + X3PA + FTA + AST + ORB, data = NBAtrain)
summary(PointsRegD)
##
## Call:
## lm(formula = PTS ~ X2PA + X3PA + FTA + AST + ORB, data = NBAtrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -537.29 -123.28 7.56 122.02 572.92
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.039e+03 1.635e+02 -12.47 <2e-16 ***
## X2PA 1.045e+00 2.835e-02 36.88 <2e-16 ***
## X3PA 1.267e+00 3.446e-02 36.76 <2e-16 ***
## FTA 1.118e+00 3.253e-02 34.36 <2e-16 ***
## AST 8.651e-01 4.221e-02 20.50 <2e-16 ***
## ORB -1.018e+00 7.320e-02 -13.90 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 186 on 829 degrees of freedom
## Multiple R-squared: 0.8982, Adjusted R-squared: 0.8975
## F-statistic: 1462 on 5 and 829 DF, p-value: < 2.2e-16
So the R sq is still about 0.90, which is good considering we now use only five basketball statistics.
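Incidentally, the same sequence of refits can be produced more concisely with the ‘update’ function, dropping one variable at a time; the resulting models are identical to the ones fitted above.
# Drop one insignificant variable at a time from the full model
PointsRegA <- update(PointsReg,  . ~ . - TOV)
PointsRegB <- update(PointsRegA, . ~ . - DRB)
PointsRegC <- update(PointsRegB, . ~ . - BLK)
PointsRegD <- update(PointsRegC, . ~ . - STL)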
We now compute the sum of squared errors (SSE) and the root mean squared error (RMSE) of the initial model.
SSE <- sum(PointsReg$residuals^2)
SSE
## [1] 28394314
RMSE <- sqrt(SSE/nrow(NBAtrain))
RMSE
## [1] 184.4049
Next, the RMSE of the intermediate models:
SSEA <- sum(PointsRegA$residuals^2)
RMSEA <- sqrt(SSEA/nrow(NBAtrain))
RMSEA
## [1] 184.4232
SSEB <- sum(PointsRegB$residuals^2)
RMSEB <- sqrt(SSEB/nrow(NBAtrain))
RMSEB
## [1] 184.4678
SSEC <- sum(PointsRegC$residuals^2)
RMSEC <- sqrt(SSEC/nrow(NBAtrain))
RMSEC
## [1] 184.493
And then that of the final model.
SSED <- sum(PointsRegD$residuals^2)
SSED
## [1] 28674769
RMSED <- sqrt(SSED/nrow(NBAtrain))
RMSED
## [1] 185.3134
So the root mean squared error barely changes from model to model. For context, the mean number of points scored in a season is
mean(NBAtrain$PTS)
## [1] 8370.24
An RMSE of approximately 185 is small in the context of the average number of points scored in a season, so the magnitude of the error is acceptable.
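Expressed as a ratio:
# RMSE relative to the average points scored in a season
RMSED / mean(NBAtrain$PTS)   # roughly 0.022, i.e. about a 2% error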
Now that we have established a good model, we check whether the basic assumptions of multiple linear regression hold. First, the residuals should have mean zero.
mean(PointsRegD$residuals)
## [1] 1.700298e-14
This is essentially zero, as expected. We can also use a histogram to check whether the errors are normally distributed.
hist(PointsRegD$residuals)
So the errors are almost normally distributed.
The slope estimates are all close to 1 in absolute value, but a coefficient’s size alone does not establish linearity; to check the linearity assumption we plot the dependent variable against each of the five explanatory variables.
plot(NBAtrain$X2PA, NBAtrain$PTS)
plot(NBAtrain$X3PA, NBAtrain$PTS)
plot(NBAtrain$FTA, NBAtrain$PTS)
plot(NBAtrain$AST, NBAtrain$PTS)
plot(NBAtrain$ORB, NBAtrain$PTS)
In each plot the relationship looks linear and the spread of points around the trend is fairly uniform, so the homoscedasticity assumption appears to hold as well.
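A more direct diagnostic is to plot the residuals against the fitted values; a random, even scatter around zero supports both linearity and constant error variance:
# Residuals vs. fitted values for the final model
plot(PointsRegD$fitted.values, PointsRegD$residuals)
abline(h = 0)
Finally, we check that the residuals are uncorrelated with each of the predictors.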
cov(PointsRegD$residuals, NBAtrain$X3PA)
## [1] 7.117181e-12
cov(PointsRegD$residuals, NBAtrain$X2PA)
## [1] -6.011589e-12
cov(PointsRegD$residuals, NBAtrain$AST)
## [1] -4.699731e-13
cov(PointsRegD$residuals, NBAtrain$FTA)
## [1] -3.732276e-12
cov(PointsRegD$residuals, NBAtrain$ORB)
## [1] 1.885113e-12
The covariance between the residuals and each predictor is essentially zero, as the model assumptions require.
Now that we have a good, interpretable model built on the training data, we check how well it performs on a test set.
We load in the data set.
NBAtest <- read.csv("C:/Users/salil/Desktop/AnalyticsEdgeFolder/NBA_test.csv")
rbind(head(NBAtest), tail(NBAtest))
## SeasonEnd Team Playoffs W PTS oppPTS FG FGA X2P X2PA
## 1 2013 Atlanta Hawks 1 44 8032 7999 3084 6644 2378 4743
## 2 2013 Brooklyn Nets 1 49 7944 7798 2942 6544 2314 4784
## 3 2013 Charlotte Bobcats 0 21 7661 8418 2823 6649 2354 5250
## 4 2013 Chicago Bulls 1 45 7641 7615 2926 6698 2480 5433
## 5 2013 Cleveland Cavaliers 0 24 7913 8297 2993 6901 2446 5320
## 6 2013 Dallas Mavericks 0 41 8293 8342 3182 6892 2576 5264
## 23 2013 Portland Trail Blazers 0 33 7995 8255 3009 6715 2336 4811
## 24 2013 Sacramento Kings 0 28 8219 8619 3086 6904 2476 5223
## 25 2013 San Antonio Spurs 1 58 8448 7923 3210 6675 2547 4911
## 26 2013 Toronto Raptors 0 34 7971 8092 2979 6685 2408 5020
## 27 2013 Utah Jazz 0 43 8038 8045 3046 6710 2539 5325
## 28 2013 Washington Wizards 0 29 7644 7852 2910 6693 2365 5198
## X3P X3PA FT FTA ORB DRB AST STL BLK TOV
## 1 706 1901 1158 1619 758 2593 2007 664 369 1219
## 2 628 1760 1432 1958 1047 2460 1668 599 391 1206
## 3 469 1399 1546 2060 917 2389 1587 591 479 1153
## 4 446 1265 1343 1738 1026 2514 1886 588 417 1171
## 5 547 1581 1380 1826 1004 2359 1694 647 334 1149
## 6 606 1628 1323 1669 767 2670 1906 648 454 1144
## 23 673 1904 1304 1680 874 2474 1784 538 353 1203
## 24 610 1681 1437 1869 943 2385 1708 671 342 1199
## 25 663 1764 1365 1725 666 2721 2058 695 446 1206
## 26 571 1665 1442 1831 871 2426 1765 595 392 1124
## 27 507 1385 1439 1883 989 2457 1859 690 515 1210
## 28 545 1495 1279 1746 887 2652 1775 598 376 1238
These are the basketball statistics of the 2012-13 season.
Now using our model, we try to predict how many points we can expect to see in the 2012-13 season.
PointsPredict <- predict(PointsRegD, newdata = NBAtest)
We now measure the out-of-sample R sq, which tells us how well the model predicts on test data.
The formula for R sq is 1 - (Sum of Squared Errors)/(Total Sum of Squares), where the total sum of squares is computed against the baseline model that always predicts the training-set mean.
SSEE <- sum((PointsPredict - NBAtest$PTS)^2)
SSTE <- sum((mean(NBAtrain$PTS)-NBAtest$PTS)^2)
R2 <- 1 - SSEE/SSTE
R2
## [1] 0.8231327
So the out-of-sample R sq is 0.82, reasonably close to the in-sample value of about 0.90, which is good. The root mean squared error on the test set is
RMSEE <- sqrt(SSEE/nrow(NBAtest))
RMSEE
## [1] 190.8322
This is close to the training RMSE of about 185, so the model generalizes well.
We have thus established a good, predictive, and deployable model.