NBA dataset:

MyData_train <- read.csv(file= "/Users/GD/Desktop/national-basketball-associationnba-dataset/NBA_train.csv", header=TRUE, sep=",")
summary(MyData_train)
##    SeasonEnd                     Team        Playoffs            W       
##  Min.   :1980   Atlanta Hawks      : 31   Min.   :0.0000   Min.   :11.0  
##  1st Qu.:1989   Boston Celtics     : 31   1st Qu.:0.0000   1st Qu.:31.0  
##  Median :1996   Chicago Bulls      : 31   Median :1.0000   Median :42.0  
##  Mean   :1996   Cleveland Cavaliers: 31   Mean   :0.5749   Mean   :41.0  
##  3rd Qu.:2005   Denver Nuggets     : 31   3rd Qu.:1.0000   3rd Qu.:50.5  
##  Max.   :2011   Detroit Pistons    : 31   Max.   :1.0000   Max.   :72.0  
##                 (Other)            :649                                  
##       PTS            oppPTS            FG            FGA      
##  Min.   : 6901   Min.   : 6909   Min.   :2565   Min.   :5972  
##  1st Qu.: 7934   1st Qu.: 7934   1st Qu.:2974   1st Qu.:6564  
##  Median : 8312   Median : 8365   Median :3150   Median :6831  
##  Mean   : 8370   Mean   : 8370   Mean   :3200   Mean   :6873  
##  3rd Qu.: 8784   3rd Qu.: 8768   3rd Qu.:3434   3rd Qu.:7157  
##  Max.   :10371   Max.   :10723   Max.   :3980   Max.   :8868  
##                                                               
##       X2P            X2PA           X3P             X3PA       
##  Min.   :1981   Min.   :4153   Min.   : 10.0   Min.   :  75.0  
##  1st Qu.:2510   1st Qu.:5269   1st Qu.:131.5   1st Qu.: 413.0  
##  Median :2718   Median :5706   Median :329.0   Median : 942.0  
##  Mean   :2881   Mean   :5956   Mean   :319.0   Mean   : 916.9  
##  3rd Qu.:3296   3rd Qu.:6754   3rd Qu.:481.5   3rd Qu.:1347.5  
##  Max.   :3954   Max.   :7873   Max.   :841.0   Max.   :2284.0  
##                                                                
##        FT            FTA            ORB              DRB      
##  Min.   :1189   Min.   :1475   Min.   : 639.0   Min.   :2044  
##  1st Qu.:1502   1st Qu.:2008   1st Qu.: 953.5   1st Qu.:2346  
##  Median :1628   Median :2176   Median :1055.0   Median :2433  
##  Mean   :1650   Mean   :2190   Mean   :1061.6   Mean   :2427  
##  3rd Qu.:1781   3rd Qu.:2352   3rd Qu.:1167.0   3rd Qu.:2516  
##  Max.   :2388   Max.   :3051   Max.   :1520.0   Max.   :2753  
##                                                               
##       AST            STL              BLK             TOV      
##  Min.   :1423   Min.   : 455.0   Min.   :204.0   Min.   : 931  
##  1st Qu.:1735   1st Qu.: 599.0   1st Qu.:359.0   1st Qu.:1192  
##  Median :1899   Median : 658.0   Median :410.0   Median :1289  
##  Mean   :1912   Mean   : 668.4   Mean   :419.8   Mean   :1303  
##  3rd Qu.:2078   3rd Qu.: 729.0   3rd Qu.:469.5   3rd Qu.:1396  
##  Max.   :2575   Max.   :1053.0   Max.   :716.0   Max.   :1873  
## 
str(MyData_train)
## 'data.frame':    835 obs. of  20 variables:
##  $ SeasonEnd: int  1980 1980 1980 1980 1980 1980 1980 1980 1980 1980 ...
##  $ Team     : Factor w/ 37 levels "Atlanta Hawks",..: 1 2 5 6 8 9 10 11 12 13 ...
##  $ Playoffs : int  1 1 0 0 0 0 0 1 0 1 ...
##  $ W        : int  50 61 30 37 30 16 24 41 37 47 ...
##  $ PTS      : int  8573 9303 8813 9360 8878 8933 8493 9084 9119 8860 ...
##  $ oppPTS   : int  8334 8664 9035 9332 9240 9609 8853 9070 9176 8603 ...
##  $ FG       : int  3261 3617 3362 3811 3462 3643 3527 3599 3639 3582 ...
##  $ FGA      : int  7027 7387 6943 8041 7470 7596 7318 7496 7689 7489 ...
##  $ X2P      : int  3248 3455 3292 3775 3379 3586 3500 3495 3551 3557 ...
##  $ X2PA     : int  6952 6965 6668 7854 7215 7377 7197 7117 7375 7375 ...
##  $ X3P      : int  13 162 70 36 83 57 27 104 88 25 ...
##  $ X3PA     : int  75 422 275 187 255 219 121 379 314 114 ...
##  $ FT       : int  2038 1907 2019 1702 1871 1590 1412 1782 1753 1671 ...
##  $ FTA      : int  2645 2449 2592 2205 2539 2149 1914 2326 2333 2250 ...
##  $ ORB      : int  1369 1227 1115 1307 1311 1226 1155 1394 1398 1187 ...
##  $ DRB      : int  2406 2457 2465 2381 2524 2415 2437 2217 2326 2429 ...
##  $ AST      : int  1913 2198 2152 2108 2079 1950 2028 2149 2148 2123 ...
##  $ STL      : int  782 809 704 764 746 783 779 782 900 863 ...
##  $ BLK      : int  539 308 392 342 404 562 339 373 530 356 ...
##  $ TOV      : int  1495 1539 1684 1370 1533 1742 1492 1565 1517 1439 ...

To get basic insights of data:

table(MyData_train$W,MyData_train$Playoffs)
##     
##       0  1
##   11  2  0
##   12  2  0
##   13  2  0
##   14  2  0
##   15 10  0
##   16  2  0
##   17 11  0
##   18  5  0
##   19 10  0
##   20 10  0
##   21 12  0
##   22 11  0
##   23 11  0
##   24 18  0
##   25 11  0
##   26 17  0
##   27 10  0
##   28 18  0
##   29 12  0
##   30 19  1
##   31 15  1
##   32 12  0
##   33 17  0
##   34 16  0
##   35 13  3
##   36 17  4
##   37 15  4
##   38  8  7
##   39 10 10
##   40  9 13
##   41 11 26
##   42  8 29
##   43  2 18
##   44  2 27
##   45  3 22
##   46  1 15
##   47  0 28
##   48  1 14
##   49  0 17
##   50  0 32
##   51  0 12
##   52  0 20
##   53  0 17
##   54  0 18
##   55  0 24
##   56  0 16
##   57  0 23
##   58  0 13
##   59  0 14
##   60  0  8
##   61  0 10
##   62  0 13
##   63  0  7
##   64  0  3
##   65  0  3
##   66  0  2
##   67  0  4
##   69  0  1
##   72  0  1

Adding the variable ‘diff’ to get difference between point scored and points allowed.

MyData_train$diff = MyData_train$PTS - MyData_train$oppPTS

Building a linear model

model <- lm(W ~ diff,data = MyData_train)
summary(model)
## 
## Call:
## lm(formula = W ~ diff, data = MyData_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.7393 -2.1018 -0.0672  2.0265 10.6026 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.100e+01  1.059e-01   387.0   <2e-16 ***
## diff        3.259e-02  2.793e-04   116.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.061 on 833 degrees of freedom
## Multiple R-squared:  0.9423, Adjusted R-squared:  0.9423 
## F-statistic: 1.361e+04 on 1 and 833 DF,  p-value: < 2.2e-16

If we take a look at R-squared the value is 0.9423 close to 1.

plot(model)

Observation for linear model

  1. Predictor variable = Diff
  2. Response variable = W
model$coefficients
## (Intercept)        diff 
## 41.00000000  0.03258633

Prediction with regression

The equation for regression becomes

w = 41 + 0.032586 * (diff)

As R-square is close to 1,the residual variance is lower and indicates that model is a good fit.

Data source : www.kaggle.com