We begin by loading in our data set.

baseball <- read.csv("C:/Users/salil/Desktop/AnalyticsEdgeFolder/baseball.csv")

This is what our data set looks like.

rbind(head(baseball), tail(baseball))
##      Team League Year  RS  RA   W   OBP   SLG    BA Playoffs RankSeason
## 1     ARI     NL 2012 734 688  81 0.328 0.418 0.259        0         NA
## 2     ATL     NL 2012 700 600  94 0.320 0.389 0.247        1          4
## 3     BAL     AL 2012 712 705  93 0.311 0.417 0.247        1          5
## 4     BOS     AL 2012 734 806  69 0.315 0.415 0.260        0         NA
## 5     CHC     NL 2012 613 759  61 0.302 0.378 0.240        0         NA
## 6     CHW     AL 2012 748 676  85 0.318 0.422 0.255        0         NA
## 1227  NYY     AL 1962 817 680  96 0.337 0.426 0.267        1          2
## 1228  PHI     NL 1962 705 759  81 0.330 0.390 0.260        0         NA
## 1229  PIT     NL 1962 706 626  93 0.321 0.394 0.268        0         NA
## 1230  SFG     NL 1962 878 690 103 0.341 0.441 0.278        1          1
## 1231  STL     NL 1962 774 664  84 0.335 0.394 0.271        0         NA
## 1232  WSA     AL 1962 599 716  60 0.308 0.373 0.250        0         NA
##      RankPlayoffs   G  OOBP  OSLG
## 1              NA 162 0.317 0.415
## 2               5 162 0.306 0.378
## 3               4 162 0.315 0.403
## 4              NA 162 0.331 0.428
## 5              NA 162 0.335 0.424
## 6              NA 162 0.319 0.405
## 1227            1 162    NA    NA
## 1228           NA 161    NA    NA
## 1229           NA 161    NA    NA
## 1230            2 165    NA    NA
## 1231           NA 163    NA    NA
## 1232           NA 162    NA    NA

We also check the structure of the data set using the ‘str’ function.

str(baseball)
## 'data.frame':    1232 obs. of  15 variables:
##  $ Team        : Factor w/ 39 levels "ANA","ARI","ATL",..: 2 3 4 5 7 8 9 10 11 12 ...
##  $ League      : Factor w/ 2 levels "AL","NL": 2 2 1 1 2 1 2 1 2 1 ...
##  $ Year        : int  2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
##  $ RS          : int  734 700 712 734 613 748 669 667 758 726 ...
##  $ RA          : int  688 600 705 806 759 676 588 845 890 670 ...
##  $ W           : int  81 94 93 69 61 85 97 68 64 88 ...
##  $ OBP         : num  0.328 0.32 0.311 0.315 0.302 0.318 0.315 0.324 0.33 0.335 ...
##  $ SLG         : num  0.418 0.389 0.417 0.415 0.378 0.422 0.411 0.381 0.436 0.422 ...
##  $ BA          : num  0.259 0.247 0.247 0.26 0.24 0.255 0.251 0.251 0.274 0.268 ...
##  $ Playoffs    : int  0 1 1 0 0 0 1 0 0 1 ...
##  $ RankSeason  : int  NA 4 5 NA NA NA 2 NA NA 6 ...
##  $ RankPlayoffs: int  NA 5 4 NA NA NA 4 NA NA 2 ...
##  $ G           : int  162 162 162 162 162 162 162 162 162 162 ...
##  $ OOBP        : num  0.317 0.306 0.315 0.331 0.335 0.319 0.305 0.336 0.357 0.314 ...
##  $ OSLG        : num  0.415 0.378 0.403 0.428 0.424 0.405 0.39 0.43 0.47 0.402 ...

We will briefly discuss each column and variable as and when required.

Moneyball is a book by Michael Lewis that discusses how sports analytics changed baseball. Since we want to verify the claims made in Moneyball using only the data available in 2002, we build our model on that subset of the data. We use the ‘subset’ function to keep observations from before 2002.

moneyball <- subset(baseball, Year < 2002)
str(moneyball)
## 'data.frame':    902 obs. of  15 variables:
##  $ Team        : Factor w/ 39 levels "ANA","ARI","ATL",..: 1 2 3 4 5 7 8 9 10 11 ...
##  $ League      : Factor w/ 2 levels "AL","NL": 1 2 2 1 1 2 1 2 1 2 ...
##  $ Year        : int  2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 ...
##  $ RS          : int  691 818 729 687 772 777 798 735 897 923 ...
##  $ RA          : int  730 677 643 829 745 701 795 850 821 906 ...
##  $ W           : int  75 92 88 63 82 88 83 66 91 73 ...
##  $ OBP         : num  0.327 0.341 0.324 0.319 0.334 0.336 0.334 0.324 0.35 0.354 ...
##  $ SLG         : num  0.405 0.442 0.412 0.38 0.439 0.43 0.451 0.419 0.458 0.483 ...
##  $ BA          : num  0.261 0.267 0.26 0.248 0.266 0.261 0.268 0.262 0.278 0.292 ...
##  $ Playoffs    : int  0 1 1 0 0 0 0 0 1 0 ...
##  $ RankSeason  : int  NA 5 7 NA NA NA NA NA 6 NA ...
##  $ RankPlayoffs: int  NA 1 3 NA NA NA NA NA 4 NA ...
##  $ G           : int  162 162 162 162 161 162 162 162 162 162 ...
##  $ OOBP        : num  0.331 0.311 0.314 0.337 0.329 0.321 0.334 0.341 0.341 0.35 ...
##  $ OSLG        : num  0.412 0.404 0.384 0.439 0.393 0.398 0.427 0.455 0.417 0.48 ...

We first wish to predict wins by building a linear regression model where the number of wins is the dependent variable and the run differential, i.e. the difference between ‘runs scored’ (‘RS’) and ‘runs allowed’ (‘RA’), is the independent variable. Rather than using two separate independent variables, we create a new column that holds the difference between the two.

moneyball$RD <- moneyball$RS - moneyball$RA

We now use the ‘lm’ function to run a regression where ‘W’ will be our dependent variable and ‘RD’ will be our independent variable.

WinsReg <- lm(W ~ RD, data = moneyball)

We now check the summary of the model created.

summary(WinsReg)
## 
## Call:
## lm(formula = W ~ RD, data = moneyball)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.2662  -2.6509   0.1234   2.9364  11.6570 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 80.881375   0.131157  616.67   <2e-16 ***
## RD           0.105766   0.001297   81.55   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.939 on 900 degrees of freedom
## Multiple R-squared:  0.8808, Adjusted R-squared:  0.8807 
## F-statistic:  6651 on 1 and 900 DF,  p-value: < 2.2e-16

We can see that both the intercept and the ‘RD’ coefficient are highly significant, as indicated by the three asterisks beside them. Also, the R-squared is high, approximately 0.88.

Now we check the claim that the run differential should be about 135 for a team to win at least 95 games and qualify for the playoffs.

From the linear regression model, the equation we get is

W = 80.881375 + 0.105766(RD)

According to the claim, the number of wins has to be 95. So, 95 = 80.881375 + 0.105766(RD).

Rearranging, RD = (95 - 80.881375)/0.105766 ≈ 133.4, which is approximately equal to 135. So the claim holds.
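
We can also solve for the run differential directly from the fitted coefficients, for example:

# Run differential needed for 95 wins, from the fitted model
(95 - coef(WinsReg)["(Intercept)"]) / coef(WinsReg)["RD"]   # ~133.4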

For the run differential to reach 135, we need to predict ‘Runs Scored’ and ‘Runs Allowed’, using batting and pitching statistics respectively.

RunsScored1 <- lm(RS~OBP+SLG+BA,data = moneyball)
summary(RunsScored1)
## 
## Call:
## lm(formula = RS ~ OBP + SLG + BA, data = moneyball)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -70.941 -17.247  -0.621  16.754  90.998 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -788.46      19.70 -40.029  < 2e-16 ***
## OBP          2917.42     110.47  26.410  < 2e-16 ***
## SLG          1637.93      45.99  35.612  < 2e-16 ***
## BA           -368.97     130.58  -2.826  0.00482 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.69 on 898 degrees of freedom
## Multiple R-squared:  0.9302, Adjusted R-squared:   0.93 
## F-statistic:  3989 on 3 and 898 DF,  p-value: < 2.2e-16

‘BA’ or ‘Batting Average’ has two asterisks beside it. This is an important variable in determining ‘Runs Scored’, but not as important as ‘On-Base Percentage’ (OBP) or ‘Slugging Percentage’ (SLG).

Also, the coefficient of ‘BA’ is negative, which is counterintuitive. This probably happens because of multicollinearity among the independent variables. Let us check the correlation coefficients between the variables.

cor(moneyball$OBP,moneyball$BA)
## [1] 0.8540549
cor(moneyball$SLG,moneyball$BA)
## [1] 0.8140681
cor(moneyball$OBP,moneyball$SLG)
## [1] 0.8061539
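
Equivalently, we can get all three correlations in one call by passing the columns to ‘cor’ as a data frame:

# Pairwise correlation matrix of the three batting statistics
cor(moneyball[, c("OBP", "SLG", "BA")])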

We therefore build a different model without the ‘BA’ variable and check the R-squared value.

RunsScored2 <- lm(RS~OBP+SLG,data = moneyball)
summary(RunsScored2)
## 
## Call:
## lm(formula = RS ~ OBP + SLG, data = moneyball)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -70.838 -17.174  -1.108  16.770  90.036 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -804.63      18.92  -42.53   <2e-16 ***
## OBP          2737.77      90.68   30.19   <2e-16 ***
## SLG          1584.91      42.16   37.60   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.79 on 899 degrees of freedom
## Multiple R-squared:  0.9296, Adjusted R-squared:  0.9294 
## F-statistic:  5934 on 2 and 899 DF,  p-value: < 2.2e-16

Here the R-squared value is almost the same, and the coefficient of ‘OBP’ is higher than that of ‘SLG’. This suggests that ‘OBP’ is an extremely important variable and ‘SLG’ is also important.

We now try building models that keep ‘BA’ while removing each of the other two variables in turn.

RunsScored3 <- lm(RS~SLG+BA,data = moneyball)
RunsScored4 <- lm(RS~OBP+BA,data = moneyball)
summary(RunsScored3)
## 
## Call:
## lm(formula = RS ~ SLG + BA, data = moneyball)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -84.636 -23.771  -2.377  20.603 112.223 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -526.90      22.68  -23.23   <2e-16 ***
## SLG          2083.82      57.00   36.56   <2e-16 ***
## BA           1615.95     142.25   11.36   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32.9 on 899 degrees of freedom
## Multiple R-squared:  0.876,  Adjusted R-squared:  0.8757 
## F-statistic:  3175 on 2 and 899 DF,  p-value: < 2.2e-16
summary(RunsScored4)
## 
## Call:
## lm(formula = RS ~ OBP + BA, data = moneyball)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -112.558  -27.370    1.166   26.810  111.769 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1107.98      27.22 -40.706  < 2e-16 ***
## OBP          4361.44     159.50  27.344  < 2e-16 ***
## BA           1528.12     185.06   8.257 5.29e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38.33 on 899 degrees of freedom
## Multiple R-squared:  0.8316, Adjusted R-squared:  0.8313 
## F-statistic:  2220 on 2 and 899 DF,  p-value: < 2.2e-16

In both cases, the R-squared drops considerably. So the ‘RunsScored2’ model is our final model for runs scored:

RS = -804.63 + 2737.77(OBP) + 1584.91(SLG)

We now repeat the same procedure to predict ‘Runs Allowed’, using the opponents’ on-base percentage (‘OOBP’) and slugging percentage (‘OSLG’).

RunsAcq1 <- lm(RA~OOBP+OSLG,data = moneyball)
summary(RunsAcq1)
## 
## Call:
## lm(formula = RA ~ OOBP + OSLG, data = moneyball)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -82.397 -15.178  -0.129  17.679  60.955 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -837.38      60.26 -13.897  < 2e-16 ***
## OOBP         2913.60     291.97   9.979 4.46e-16 ***
## OSLG         1514.29     175.43   8.632 2.55e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.67 on 87 degrees of freedom
##   (812 observations deleted due to missingness)
## Multiple R-squared:  0.9073, Adjusted R-squared:  0.9052 
## F-statistic: 425.8 on 2 and 87 DF,  p-value: < 2.2e-16

The R-squared value is high: 0.9073 (adjusted 0.9052). Note that ‘OOBP’ and ‘OSLG’ are missing for the older seasons, so 812 observations were dropped when fitting this model.

The final model we get is RA = -837.38 + 2913.60(OOBP) + 1514.29(OSLG)

Using these models, we can now try to predict the number of wins for any team.

For this we need to predict the runs scored by the team and the runs allowed by the team. As every year there are new players on the team, we need to estimate the new team statistics using past player performance.

Let us consider the case of the Oakland A’s.

Using the 2001 regular-season statistics of the players on the 2002 roster, team OBP = 0.339 and team SLG = 0.430.

So for 2002, the model RS = -804.63 + 2737.77(OBP) + 1584.91(SLG) predicts roughly 805 runs scored.

Similarly, using the runs-allowed model with OOBP = 0.307 and OSLG = 0.373, we get RA ≈ 622.

Next we predict the number of wins: Wins = 80.8814 + 0.1058(RS - RA) = 80.8814 + 0.1058(805 - 622) ≈ 100.
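
As a quick sketch of this arithmetic in R (the team statistics are the estimates above):

# Estimated 2002 Oakland A's statistics from 2001 player performance
RS_pred <- -804.63 + 2737.77 * 0.339 + 1584.91 * 0.430   # ~805 runs scored
RA_pred <- -837.38 + 2913.60 * 0.307 + 1514.29 * 0.373   # ~622 runs allowed
W_pred  <- 80.8814 + 0.1058 * (RS_pred - RA_pred)        # ~100 wins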

We will now use a similar method to make predictions on NBA data. This time we will also check whether the basic assumptions of the multiple linear regression model hold.

We first load in our training data set.

NBAtrain <- read.csv("C:/Users/salil/Desktop/AnalyticsEdgeFolder/NBA_train.csv")
str(NBAtrain)
## 'data.frame':    835 obs. of  20 variables:
##  $ SeasonEnd: int  1980 1980 1980 1980 1980 1980 1980 1980 1980 1980 ...
##  $ Team     : Factor w/ 37 levels "Atlanta Hawks",..: 1 2 5 6 8 9 10 11 12 13 ...
##  $ Playoffs : int  1 1 0 0 0 0 0 1 0 1 ...
##  $ W        : int  50 61 30 37 30 16 24 41 37 47 ...
##  $ PTS      : int  8573 9303 8813 9360 8878 8933 8493 9084 9119 8860 ...
##  $ oppPTS   : int  8334 8664 9035 9332 9240 9609 8853 9070 9176 8603 ...
##  $ FG       : int  3261 3617 3362 3811 3462 3643 3527 3599 3639 3582 ...
##  $ FGA      : int  7027 7387 6943 8041 7470 7596 7318 7496 7689 7489 ...
##  $ X2P      : int  3248 3455 3292 3775 3379 3586 3500 3495 3551 3557 ...
##  $ X2PA     : int  6952 6965 6668 7854 7215 7377 7197 7117 7375 7375 ...
##  $ X3P      : int  13 162 70 36 83 57 27 104 88 25 ...
##  $ X3PA     : int  75 422 275 187 255 219 121 379 314 114 ...
##  $ FT       : int  2038 1907 2019 1702 1871 1590 1412 1782 1753 1671 ...
##  $ FTA      : int  2645 2449 2592 2205 2539 2149 1914 2326 2333 2250 ...
##  $ ORB      : int  1369 1227 1115 1307 1311 1226 1155 1394 1398 1187 ...
##  $ DRB      : int  2406 2457 2465 2381 2524 2415 2437 2217 2326 2429 ...
##  $ AST      : int  1913 2198 2152 2108 2079 1950 2028 2149 2148 2123 ...
##  $ STL      : int  782 809 704 764 746 783 779 782 900 863 ...
##  $ BLK      : int  539 308 392 342 404 562 339 373 530 356 ...
##  $ TOV      : int  1495 1539 1684 1370 1533 1742 1492 1565 1517 1439 ...

Each of these variables has a meaning, and they will be explained as and when required. Some variable names begin with an ‘X’ because R prepends one to column names that start with a number.
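
For example, the ‘make.names’ function, which ‘read.csv’ uses to sanitize column names, shows this behaviour:

make.names(c("2P", "2PA", "3P", "3PA"))
## [1] "X2P"  "X2PA" "X3P"  "X3PA"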

Now let us see the number of games a team would probably need to win in order to make it to the playoffs.

table(NBAtrain$W,NBAtrain$Playoffs)
##     
##       0  1
##   11  2  0
##   12  2  0
##   13  2  0
##   14  2  0
##   15 10  0
##   16  2  0
##   17 11  0
##   18  5  0
##   19 10  0
##   20 10  0
##   21 12  0
##   22 11  0
##   23 11  0
##   24 18  0
##   25 11  0
##   26 17  0
##   27 10  0
##   28 18  0
##   29 12  0
##   30 19  1
##   31 15  1
##   32 12  0
##   33 17  0
##   34 16  0
##   35 13  3
##   36 17  4
##   37 15  4
##   38  8  7
##   39 10 10
##   40  9 13
##   41 11 26
##   42  8 29
##   43  2 18
##   44  2 27
##   45  3 22
##   46  1 15
##   47  0 28
##   48  1 14
##   49  0 17
##   50  0 32
##   51  0 12
##   52  0 20
##   53  0 17
##   54  0 18
##   55  0 24
##   56  0 16
##   57  0 23
##   58  0 13
##   59  0 14
##   60  0  8
##   61  0 10
##   62  0 13
##   63  0  7
##   64  0  3
##   65  0  3
##   66  0  2
##   67  0  4
##   69  0  1
##   72  0  1

Here ‘0’ means the team did not make it to the playoffs and ‘1’ means it did.

So according to this, if a team wins 37 or fewer games, there is hardly any chance of making it to the playoffs. If a team wins 42 or more games, there is a good chance of making it to the playoffs. Anything in between is ambiguous, i.e. the team’s playoff entry could go either way.
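
One way to quantify this from the table is to compute, for each win total, the fraction of teams that made the playoffs, for example:

# Fraction of teams making the playoffs at each win total
playoffFrac <- tapply(NBAtrain$Playoffs, NBAtrain$W, mean)
playoffFrac[c("37", "42")]   # ~0.21 at 37 wins, ~0.78 at 42 wins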

Let us now check whether a team’s total number of wins depends on its points difference over the season, i.e. the difference between the points it scored and the points its opponents scored.

NBAtrain$PtsDiff <- NBAtrain$PTS - NBAtrain$oppPTS

If we plot this on a graph, we get

plot(NBAtrain$W,NBAtrain$PtsDiff)

We can observe a strong linear relationship between the two: the higher the points difference, the higher the number of wins. We now build a regression model for this relationship.

WinsNBA <- lm(W~PtsDiff,data = NBAtrain)
summary(WinsNBA)
## 
## Call:
## lm(formula = W ~ PtsDiff, data = NBAtrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.7393 -2.1018 -0.0672  2.0265 10.6026 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.100e+01  1.059e-01   387.0   <2e-16 ***
## PtsDiff     3.259e-02  2.793e-04   116.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.061 on 833 degrees of freedom
## Multiple R-squared:  0.9423, Adjusted R-squared:  0.9423 
## F-statistic: 1.361e+04 on 1 and 833 DF,  p-value: < 2.2e-16

The R-squared is quite high, at about 0.94. Also, both the intercept and the ‘PtsDiff’ coefficient are strongly significant.

From the regression model we get,

W = 41 + 0.0326(PtsDiff)

From the playoffs table above, a team has to win somewhere around 42 games to have a good chance of making the playoffs. We can then calculate the points difference needed for the number of wins we want. Solving for W = 42 gives PtsDiff = (42 - 41)/0.0326 ≈ 31. So, to be in a comfortable position, a team should aim for a points difference of at least about 31.
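
The same calculation can be done directly from the fitted coefficients:

# Points difference needed for about 42 wins
(42 - coef(WinsNBA)["(Intercept)"]) / coef(WinsNBA)["PtsDiff"]   # ~31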

Now, let us determine the points scored using the common basketball statistics.

PointsReg <- lm(PTS~X2PA+X3PA+FTA+AST+ORB+DRB+TOV+STL+BLK,data = NBAtrain)
summary(PointsReg)
## 
## Call:
## lm(formula = PTS ~ X2PA + X3PA + FTA + AST + ORB + DRB + TOV + 
##     STL + BLK, data = NBAtrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -527.40 -119.83    7.83  120.67  564.71 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.051e+03  2.035e+02 -10.078   <2e-16 ***
## X2PA         1.043e+00  2.957e-02  35.274   <2e-16 ***
## X3PA         1.259e+00  3.843e-02  32.747   <2e-16 ***
## FTA          1.128e+00  3.373e-02  33.440   <2e-16 ***
## AST          8.858e-01  4.396e-02  20.150   <2e-16 ***
## ORB         -9.554e-01  7.792e-02 -12.261   <2e-16 ***
## DRB          3.883e-02  6.157e-02   0.631   0.5285    
## TOV         -2.475e-02  6.118e-02  -0.405   0.6859    
## STL         -1.992e-01  9.181e-02  -2.169   0.0303 *  
## BLK         -5.576e-02  8.782e-02  -0.635   0.5256    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 185.5 on 825 degrees of freedom
## Multiple R-squared:  0.8992, Adjusted R-squared:  0.8981 
## F-statistic: 817.3 on 9 and 825 DF,  p-value: < 2.2e-16

‘DRB’ or ‘Defensive Rebounds’, ‘TOV’ or ‘Turnovers’ and ‘BLK’ or ‘Blocks’ are not statistically significant at all, as there are no asterisks beside them. We will now remove them one by one and check how each resulting model behaves.

The ‘Estimate’ of the intercept gives the value of Y when every X is equal to zero. The estimate for each independent variable gives the average change in Y for a one-unit change in that X, holding the others fixed. ‘Std. Error’ is the estimated variability of a coefficient due to sampling. The t value is the ratio of the two, i.e. Estimate/Std. Error, so the larger the absolute t value, the larger the estimate is relative to its sampling error, and the stronger the evidence that the variable matters.
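
We can verify this relationship directly from the coefficient table that ‘summary’ returns:

# The "t value" column is simply Estimate divided by Std. Error
coefs <- summary(PointsReg)$coefficients
coefs[, "Estimate"] / coefs[, "Std. Error"]   # matches coefs[, "t value"]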

The null hypothesis is that the slope parameter is zero; the alternative hypothesis is that it is not. The p-value reveals the statistical significance of the variable in the model: it is the probability of observing an estimate at least this far from zero if the null hypothesis were true, so a small p-value is strong evidence against the null.

So, as the p-value of ‘TOV’ is the highest, we first remove it from the model.

PointsRegA <- lm(PTS~X2PA+X3PA+FTA+AST+ORB+DRB+STL+BLK,data = NBAtrain)
summary(PointsRegA)
## 
## Call:
## lm(formula = PTS ~ X2PA + X3PA + FTA + AST + ORB + DRB + STL + 
##     BLK, data = NBAtrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -526.79 -121.09    6.37  120.74  565.94 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.077e+03  1.931e+02 -10.755   <2e-16 ***
## X2PA         1.044e+00  2.951e-02  35.366   <2e-16 ***
## X3PA         1.263e+00  3.703e-02  34.099   <2e-16 ***
## FTA          1.125e+00  3.308e-02  34.023   <2e-16 ***
## AST          8.861e-01  4.393e-02  20.173   <2e-16 ***
## ORB         -9.581e-01  7.758e-02 -12.350   <2e-16 ***
## DRB          3.892e-02  6.154e-02   0.632   0.5273    
## STL         -2.068e-01  8.984e-02  -2.301   0.0216 *  
## BLK         -5.863e-02  8.749e-02  -0.670   0.5029    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 185.4 on 826 degrees of freedom
## Multiple R-squared:  0.8991, Adjusted R-squared:  0.8982 
## F-statistic: 920.4 on 8 and 826 DF,  p-value: < 2.2e-16

The R-squared value barely changes, so it is safe to remove the variable. Next we remove ‘DRB’, which has the next highest p-value.

PointsRegB <- lm(PTS~X2PA+X3PA+FTA+AST+ORB+STL+BLK,data = NBAtrain)
summary(PointsRegB)
## 
## Call:
## lm(formula = PTS ~ X2PA + X3PA + FTA + AST + ORB + STL + BLK, 
##     data = NBAtrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -523.79 -121.64    6.07  120.81  573.64 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.015e+03  1.670e+02 -12.068  < 2e-16 ***
## X2PA         1.048e+00  2.852e-02  36.753  < 2e-16 ***
## X3PA         1.271e+00  3.475e-02  36.568  < 2e-16 ***
## FTA          1.128e+00  3.270e-02  34.506  < 2e-16 ***
## AST          8.909e-01  4.326e-02  20.597  < 2e-16 ***
## ORB         -9.702e-01  7.519e-02 -12.903  < 2e-16 ***
## STL         -2.276e-01  8.356e-02  -2.724  0.00659 ** 
## BLK         -3.882e-02  8.165e-02  -0.475  0.63462    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 185.4 on 827 degrees of freedom
## Multiple R-squared:  0.8991, Adjusted R-squared:  0.8982 
## F-statistic:  1053 on 7 and 827 DF,  p-value: < 2.2e-16

The R-squared value is still the same. Now let us remove the variable ‘BLK’.

PointsRegC <- lm(PTS~X2PA+X3PA+FTA+AST+ORB+STL,data = NBAtrain)
summary(PointsRegC)
## 
## Call:
## lm(formula = PTS ~ X2PA + X3PA + FTA + AST + ORB + STL, data = NBAtrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -523.33 -122.02    6.93  120.68  568.26 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.033e+03  1.629e+02 -12.475  < 2e-16 ***
## X2PA         1.050e+00  2.829e-02  37.117  < 2e-16 ***
## X3PA         1.273e+00  3.441e-02  37.001  < 2e-16 ***
## FTA          1.127e+00  3.260e-02  34.581  < 2e-16 ***
## AST          8.884e-01  4.292e-02  20.701  < 2e-16 ***
## ORB         -9.743e-01  7.465e-02 -13.051  < 2e-16 ***
## STL         -2.268e-01  8.350e-02  -2.717  0.00673 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 185.3 on 828 degrees of freedom
## Multiple R-squared:  0.8991, Adjusted R-squared:  0.8983 
## F-statistic:  1229 on 6 and 828 DF,  p-value: < 2.2e-16

We now finally remove ‘STL’, which is the least significant of the remaining variables (two asterisks rather than three).

PointsRegD <- lm(PTS~X2PA+X3PA+FTA+AST+ORB,data = NBAtrain)
summary(PointsRegD)
## 
## Call:
## lm(formula = PTS ~ X2PA + X3PA + FTA + AST + ORB, data = NBAtrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -537.29 -123.28    7.56  122.02  572.92 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.039e+03  1.635e+02  -12.47   <2e-16 ***
## X2PA         1.045e+00  2.835e-02   36.88   <2e-16 ***
## X3PA         1.267e+00  3.446e-02   36.76   <2e-16 ***
## FTA          1.118e+00  3.253e-02   34.36   <2e-16 ***
## AST          8.651e-01  4.221e-02   20.50   <2e-16 ***
## ORB         -1.018e+00  7.320e-02  -13.90   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 186 on 829 degrees of freedom
## Multiple R-squared:  0.8982, Adjusted R-squared:  0.8975 
## F-statistic:  1462 on 5 and 829 DF,  p-value: < 2.2e-16

So the R-squared is still about 0.89, which is good given that we are now using only five basketball statistics.

We now check the root mean square error (RMSE) of the initial model and the final model.

SSE <- sum(PointsReg$residuals^2)
SSE
## [1] 28394314
RMSE <- sqrt(SSE/nrow(NBAtrain))
RMSE
## [1] 184.4049

Now the RMSE of the intermediate models:

SSEA <- sum(PointsRegA$residuals^2)
RMSEA <- sqrt(SSEA/nrow(NBAtrain))
RMSEA
## [1] 184.4232
SSEB <- sum(PointsRegB$residuals^2)
RMSEB <- sqrt(SSEB/nrow(NBAtrain))
RMSEB
## [1] 184.4678
SSEC <- sum(PointsRegC$residuals^2)
RMSEC <- sqrt(SSEC/nrow(NBAtrain))
RMSEC
## [1] 184.493

And then that of the final model.

SSED <- sum(PointsRegD$residuals^2)
SSED
## [1] 28674769
RMSED <- sqrt(SSED/nrow(NBAtrain))
RMSED
## [1] 185.3134
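
Rather than repeating these lines for every model, we could wrap the calculation in a small helper function, for example:

# RMSE of a fitted lm model: square root of the mean squared residual
rmse <- function(model) sqrt(mean(model$residuals^2))
rmse(PointsReg)    # 184.4049
rmse(PointsRegD)   # 185.3134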

So the root mean square error barely changes across the models. For context, the mean number of points in a season is

mean(NBAtrain$PTS)
## [1] 8370.24

So an RMSE of approximately 185 is small in the context of the average number of points scored in a season (about 2% of 8370). The magnitude of the error is therefore acceptable.

Now that we have established a good model, we check whether the assumptions of multiple linear regression hold true.

  1. The mean value of the errors should be equal to zero. So
mean(PointsRegD$residuals)
## [1] 1.700298e-14

This value is almost equal to zero. We can also use a histogram to check the normality of errors.

hist(PointsRegD$residuals)

So the errors are almost normally distributed.

  2. There should be a linear relationship between each explanatory variable and the dependent variable.

As we can observe, the absolute values of the slope parameters for all five explanatory variables are close to 1, so there is a roughly linear relationship between each independent variable and the dependent variable.

  3. Now, we check the homoscedasticity assumption using scatter plots for the explanatory variables one by one.
plot(NBAtrain$X2PA, NBAtrain$PTS)

plot(NBAtrain$X3PA, NBAtrain$PTS)

plot(NBAtrain$FTA, NBAtrain$PTS)

plot(NBAtrain$AST, NBAtrain$PTS)

plot(NBAtrain$ORB, NBAtrain$PTS)

For all the variables, the spread of the points is roughly constant across the range, so homoscedasticity appears to hold as well.

  4. We now check whether the covariance between the error terms and each of the explanatory variables is equal to zero.
cov(PointsRegD$residuals, NBAtrain$X3PA)
## [1] 7.117181e-12
cov(PointsRegD$residuals, NBAtrain$X2PA)
## [1] -6.011589e-12
cov(PointsRegD$residuals, NBAtrain$AST)
## [1] -4.699731e-13
cov(PointsRegD$residuals, NBAtrain$FTA)
## [1] -3.732276e-12
cov(PointsRegD$residuals, NBAtrain$ORB)
## [1] 1.885113e-12

These covariances are all almost equal to zero.

Now that we have found a good, interpretable model using the training data set, we check how well it performs on a test set.

We load in the data set.

NBAtest <- read.csv("C:/Users/salil/Desktop/AnalyticsEdgeFolder/NBA_test.csv")
rbind(head(NBAtest), tail(NBAtest))
##    SeasonEnd                   Team Playoffs  W  PTS oppPTS   FG  FGA  X2P X2PA
## 1       2013          Atlanta Hawks        1 44 8032   7999 3084 6644 2378 4743
## 2       2013          Brooklyn Nets        1 49 7944   7798 2942 6544 2314 4784
## 3       2013      Charlotte Bobcats        0 21 7661   8418 2823 6649 2354 5250
## 4       2013          Chicago Bulls        1 45 7641   7615 2926 6698 2480 5433
## 5       2013    Cleveland Cavaliers        0 24 7913   8297 2993 6901 2446 5320
## 6       2013       Dallas Mavericks        0 41 8293   8342 3182 6892 2576 5264
## 23      2013 Portland Trail Blazers        0 33 7995   8255 3009 6715 2336 4811
## 24      2013       Sacramento Kings        0 28 8219   8619 3086 6904 2476 5223
## 25      2013      San Antonio Spurs        1 58 8448   7923 3210 6675 2547 4911
## 26      2013        Toronto Raptors        0 34 7971   8092 2979 6685 2408 5020
## 27      2013              Utah Jazz        0 43 8038   8045 3046 6710 2539 5325
## 28      2013     Washington Wizards        0 29 7644   7852 2910 6693 2365 5198
##    X3P X3PA   FT  FTA  ORB  DRB  AST STL BLK  TOV
## 1  706 1901 1158 1619  758 2593 2007 664 369 1219
## 2  628 1760 1432 1958 1047 2460 1668 599 391 1206
## 3  469 1399 1546 2060  917 2389 1587 591 479 1153
## 4  446 1265 1343 1738 1026 2514 1886 588 417 1171
## 5  547 1581 1380 1826 1004 2359 1694 647 334 1149
## 6  606 1628 1323 1669  767 2670 1906 648 454 1144
## 23 673 1904 1304 1680  874 2474 1784 538 353 1203
## 24 610 1681 1437 1869  943 2385 1708 671 342 1199
## 25 663 1764 1365 1725  666 2721 2058 695 446 1206
## 26 571 1665 1442 1831  871 2426 1765 595 392 1124
## 27 507 1385 1439 1883  989 2457 1859 690 515 1210
## 28 545 1495 1279 1746  887 2652 1775 598 376 1238

These are the basketball statistics of the 2012-13 season.

Now, using our model, we predict how many points each team could be expected to score in the 2012-13 season.

PointsPredict <- predict(PointsRegD, newdata = NBAtest)

We now measure the out-of-sample R-squared, which tells us how well the model predicts on test data.

The formula for R-squared is 1 - (Sum of Squared Errors)/(Total Sum of Squares), where the total sum of squares here uses the training-set mean as the baseline prediction.

SSEE <- sum((PointsPredict - NBAtest$PTS)^2)
SSTE <- sum((mean(NBAtrain$PTS)-NBAtest$PTS)^2)
R2 <- 1 - SSEE/SSTE
R2
## [1] 0.8231327

So the value of R-squared is 0.82, which is good. The root mean square error is

RMSEE = sqrt(SSEE/nrow(NBAtest))
RMSEE
## [1] 190.8322

This value does not deviate much from the training RMSE either.

So we have finally established a good, interpretable, and deployable predictive model.