Introduction

Simple linear regression is one of the most useful machine learning algorithms in data science. To demonstrate its effectiveness, we will use data from the 2018-2019 NBA regular season to build a linear regression model that projects a team’s wins from its point differential over the course of the season. Along the way we will determine whether a relationship exists between these two variables and how strong it is.

Using the ballr package in R, we can pull NBA team data from basketball-reference.com directly into the console.

NBA Statistics in R

Now that we have loaded all of the necessary dependencies, we can view and use the final standings of the 2018-2019 NBA regular season.
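As a rough sketch of how the standings might be pulled, the helper below wraps ballr's NBAStandingsByDate(). The helper name load_standings and the date "2019-04-10" (taken here as the last day of the regular season) are our own assumptions, not something shown in the original code.

```r
# Hypothetical helper (not from the original post): fetch the standings
# as of a given date using the ballr package.
load_standings <- function(date = "2019-04-10") {
  if (!requireNamespace("ballr", quietly = TRUE)) {
    stop("The ballr package is required for this example.")
  }
  # Returns a list of two data frames: $East and $West
  ballr::NBAStandingsByDate(date)
}

# Usage (requires network access, so it is left commented out):
# standings <- load_standings()
# standings$East
```

The call is wrapped in a function so that nothing is downloaded until it is explicitly invoked.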

NBA Standings
eastern_conference w l w_lpercent gb pw pl ps_g pa_g
Milwaukee Bucks* 60 22 0.732 — 61 21 118.1 109.3
Toronto Raptors* 58 24 0.707 2.0 56 26 114.4 108.4
Philadelphia 76ers* 51 31 0.622 9.0 48 34 115.2 112.5
Boston Celtics* 49 33 0.598 11.0 52 30 112.4 108.0
Indiana Pacers* 48 34 0.585 12.0 50 32 108.0 104.7
Orlando Magic* 42 40 0.512 18.0 43 39 107.3 106.6
Brooklyn Nets* 42 40 0.512 18.0 41 41 112.2 112.3
Detroit Pistons* 41 41 0.500 19.0 40 42 107.0 107.3
Charlotte Hornets 39 43 0.476 21.0 38 44 110.7 111.8
Miami Heat 39 43 0.476 21.0 40 42 105.7 105.9
Washington Wizards 32 50 0.390 28.0 34 48 114.0 116.9
Atlanta Hawks 29 53 0.354 31.0 27 55 113.3 119.4
Chicago Bulls 22 60 0.268 38.0 21 61 104.9 113.4
Cleveland Cavaliers 19 63 0.232 41.0 19 63 104.5 114.1
New York Knicks 17 65 0.207 43.0 19 63 104.6 113.8
western_conference w l w_lpercent gb pw pl ps_g pa_g
Golden State Warriors* 57 25 0.695 — 56 26 117.7 111.2
Denver Nuggets* 54 28 0.659 3.0 51 31 110.7 106.7
Houston Rockets* 53 29 0.646 4.0 53 29 113.9 109.1
Portland Trail Blazers* 53 29 0.646 4.0 51 31 114.7 110.5
Utah Jazz* 50 32 0.610 7.0 54 28 111.7 106.5
Oklahoma City Thunder* 49 33 0.598 8.0 50 32 114.5 111.1
Los Angeles Clippers* 48 34 0.585 9.0 43 39 115.1 114.3
San Antonio Spurs* 48 34 0.585 9.0 45 37 111.7 110.0
Sacramento Kings 39 43 0.476 18.0 38 44 114.2 115.3
Los Angeles Lakers 37 45 0.451 20.0 37 45 111.8 113.5
Minnesota Timberwolves 36 46 0.439 21.0 37 45 112.5 114.0
Dallas Mavericks 33 49 0.402 24.0 38 44 108.9 110.1
New Orleans Pelicans 33 49 0.402 24.0 38 44 115.4 116.8
Memphis Grizzlies 33 49 0.402 24.0 34 48 103.5 106.1
Phoenix Suns 19 63 0.232 38.0 19 63 107.5 116.8

A brief synopsis of each column abbreviation:

• w – number of wins

• l – number of losses

• w_lpercent – win-loss percentage, determined by dividing the number of wins by 82 (the total number of games played)

• gb – games behind the first-place team in the respective conference

• ps_g – points scored per game

• pa_g – points allowed per game

• pw, pl – will not be used in this project.

As indicated by the asterisks, the top 8 teams in each conference make the playoffs.

Feature Engineering

In order to conduct some exploratory data analysis, we first need to transform this data into more manageable data frames.

First, let’s examine the structure of the data frame for the final standings for the 2018-2019 NBA season.

## List of 2
##  $ East:'data.frame':    15 obs. of  9 variables:
##   ..$ eastern_conference: chr [1:15] "Milwaukee Bucks*" "Toronto Raptors*" "Philadelphia 76ers*" "Boston Celtics*" ...
##   ..$ w                 : int [1:15] 60 58 51 49 48 42 42 41 39 39 ...
##   ..$ l                 : int [1:15] 22 24 31 33 34 40 40 41 43 43 ...
##   ..$ w_lpercent        : num [1:15] 0.732 0.707 0.622 0.598 0.585 0.512 0.512 0.5 0.476 0.476 ...
##   ..$ gb                : chr [1:15] "—" "2.0" "9.0" "11.0" ...
##   ..$ pw                : int [1:15] 61 56 48 52 50 43 41 40 38 40 ...
##   ..$ pl                : int [1:15] 21 26 34 30 32 39 41 42 44 42 ...
##   ..$ ps_g              : num [1:15] 118 114 115 112 108 ...
##   ..$ pa_g              : num [1:15] 109 108 112 108 105 ...
##  $ West:'data.frame':    15 obs. of  9 variables:
##   ..$ western_conference: chr [1:15] "Golden State Warriors*" "Denver Nuggets*" "Houston Rockets*" "Portland Trail Blazers*" ...
##   ..$ w                 : int [1:15] 57 54 53 53 50 49 48 48 39 37 ...
##   ..$ l                 : int [1:15] 25 28 29 29 32 33 34 34 43 45 ...
##   ..$ w_lpercent        : num [1:15] 0.695 0.659 0.646 0.646 0.61 0.598 0.585 0.585 0.476 0.451 ...
##   ..$ gb                : chr [1:15] "—" "3.0" "4.0" "4.0" ...
##   ..$ pw                : int [1:15] 56 51 53 51 54 50 43 45 38 37 ...
##   ..$ pl                : int [1:15] 26 31 29 31 28 32 39 37 44 45 ...
##   ..$ ps_g              : num [1:15] 118 111 114 115 112 ...
##   ..$ pa_g              : num [1:15] 111 107 109 110 106 ...

Teams play more of their games against opponents from their own conference, so schedules are not balanced across conferences. We will therefore examine the Eastern and Western conferences separately.

Next we will create a new column for each team’s point differential, called pdf, calculated by subtracting points allowed per game (pa_g) from points scored per game (ps_g).
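A minimal sketch of that step, using the first three Eastern Conference rows from the standings table. The data frame is typed in by hand here so the example runs on its own; the round() call is our addition to keep floating-point noise out of the result.

```r
# First three Eastern Conference teams, copied from the standings table
east <- data.frame(
  team = c("Milwaukee Bucks*", "Toronto Raptors*", "Philadelphia 76ers*"),
  ps_g = c(118.1, 114.4, 115.2),  # points scored per game
  pa_g = c(109.3, 108.4, 112.5)   # points allowed per game
)

# Point differential: points scored minus points allowed, per game
east$pdf <- round(east$ps_g - east$pa_g, 1)
print(east)
```

A positive pdf means the team outscored its opponents on average over the season.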

Eastern Conference
eastern_conference w l w_lpercent gb pw pl ps_g pa_g pdf
Milwaukee Bucks* 60 22 0.732 — 61 21 118.1 109.3 8.8
Toronto Raptors* 58 24 0.707 2.0 56 26 114.4 108.4 6.0
Philadelphia 76ers* 51 31 0.622 9.0 48 34 115.2 112.5 2.7
Boston Celtics* 49 33 0.598 11.0 52 30 112.4 108.0 4.4
Indiana Pacers* 48 34 0.585 12.0 50 32 108.0 104.7 3.3
Orlando Magic* 42 40 0.512 18.0 43 39 107.3 106.6 0.7
Brooklyn Nets* 42 40 0.512 18.0 41 41 112.2 112.3 -0.1
Detroit Pistons* 41 41 0.500 19.0 40 42 107.0 107.3 -0.3
Charlotte Hornets 39 43 0.476 21.0 38 44 110.7 111.8 -1.1
Miami Heat 39 43 0.476 21.0 40 42 105.7 105.9 -0.2
Washington Wizards 32 50 0.390 28.0 34 48 114.0 116.9 -2.9
Atlanta Hawks 29 53 0.354 31.0 27 55 113.3 119.4 -6.1
Chicago Bulls 22 60 0.268 38.0 21 61 104.9 113.4 -8.5
Cleveland Cavaliers 19 63 0.232 41.0 19 63 104.5 114.1 -9.6
New York Knicks 17 65 0.207 43.0 19 63 104.6 113.8 -9.2
Western Conference
western_conference w l w_lpercent gb pw pl ps_g pa_g pdf
Golden State Warriors* 57 25 0.695 — 56 26 117.7 111.2 6.5
Denver Nuggets* 54 28 0.659 3.0 51 31 110.7 106.7 4.0
Houston Rockets* 53 29 0.646 4.0 53 29 113.9 109.1 4.8
Portland Trail Blazers* 53 29 0.646 4.0 51 31 114.7 110.5 4.2
Utah Jazz* 50 32 0.610 7.0 54 28 111.7 106.5 5.2
Oklahoma City Thunder* 49 33 0.598 8.0 50 32 114.5 111.1 3.4
Los Angeles Clippers* 48 34 0.585 9.0 43 39 115.1 114.3 0.8
San Antonio Spurs* 48 34 0.585 9.0 45 37 111.7 110.0 1.7
Sacramento Kings 39 43 0.476 18.0 38 44 114.2 115.3 -1.1
Los Angeles Lakers 37 45 0.451 20.0 37 45 111.8 113.5 -1.7
Minnesota Timberwolves 36 46 0.439 21.0 37 45 112.5 114.0 -1.5
Dallas Mavericks 33 49 0.402 24.0 38 44 108.9 110.1 -1.2
New Orleans Pelicans 33 49 0.402 24.0 38 44 115.4 116.8 -1.4
Memphis Grizzlies 33 49 0.402 24.0 34 48 103.5 106.1 -2.6
Phoenix Suns 19 63 0.232 38.0 19 63 107.5 116.8 -9.3

Let’s take a closer look at summary statistics for the Eastern and Western Conference data.

Eastern Conference
eastern_conference w l w_lpercent gb pw pl ps_g pa_g pdf
Length:15 Min. :17.0 Min. :22.0 Min. :0.2070 Length:15 Min. :19.00 Min. :21.00 Min. :104.5 Min. :104.7 Min. :-9.6000
Class :character 1st Qu.:30.5 1st Qu.:33.5 1st Qu.:0.3720 Class :character 1st Qu.:30.50 1st Qu.:33.00 1st Qu.:106.3 1st Qu.:107.7 1st Qu.:-4.5000
Mode :character Median :41.0 Median :41.0 Median :0.5000 Mode :character Median :40.00 Median :42.00 Median :110.7 Median :111.8 Median :-0.2000
NA Mean :39.2 Mean :42.8 Mean :0.4781 NA Mean :39.27 Mean :42.73 Mean :110.2 Mean :111.0 Mean :-0.8067
NA 3rd Qu.:48.5 3rd Qu.:51.5 3rd Qu.:0.5915 NA 3rd Qu.:49.00 3rd Qu.:51.50 3rd Qu.:113.7 3rd Qu.:113.6 3rd Qu.: 3.0000
NA Max. :60.0 Max. :65.0 Max. :0.7320 NA Max. :61.00 Max. :63.00 Max. :118.1 Max. :119.4 Max. : 8.8000

As we can see, the mean number of wins in the Eastern Conference is 39.2 (the median is 41), and the average point differential is -0.8067 points across all teams.

Western Conference
western_conference w l w_lpercent gb pw pl ps_g pa_g pdf
Length:15 Min. :19.0 Min. :25.0 Min. :0.2320 Length:15 Min. :19.00 Min. :26.00 Min. :103.5 Min. :106.1 Min. :-9.3000
Class :character 1st Qu.:34.5 1st Qu.:30.5 1st Qu.:0.4205 Class :character 1st Qu.:37.50 1st Qu.:31.00 1st Qu.:111.2 1st Qu.:109.5 1st Qu.:-1.4500
Mode :character Median :48.0 Median :34.0 Median :0.5850 Mode :character Median :43.00 Median :39.00 Median :112.5 Median :111.1 Median : 0.8000
NA Mean :42.8 Mean :39.2 Mean :0.5219 NA Mean :42.93 Mean :39.07 Mean :112.3 Mean :111.5 Mean : 0.7867
NA 3rd Qu.:51.5 3rd Qu.:47.5 3rd Qu.:0.6280 NA 3rd Qu.:51.00 3rd Qu.:44.50 3rd Qu.:114.6 3rd Qu.:114.2 3rd Qu.: 4.1000
NA Max. :57.0 Max. :63.0 Max. :0.6950 NA Max. :56.00 Max. :63.00 Max. :117.7 Max. :116.8 Max. : 6.5000

The mean number of wins is higher in the Western Conference at 42.8 (the median is 48), and the average point differential is 0.7867 points across all teams.
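Because summary() reports both a mean and a median for each column, it is easy to mix the two up. The sketch below recomputes both for the wins column of each conference, with the win totals typed in from the standings tables; the name wins_summary is ours.

```r
# Win totals from the standings tables, in table order
east_w <- c(60, 58, 51, 49, 48, 42, 42, 41, 39, 39, 32, 29, 22, 19, 17)
west_w <- c(57, 54, 53, 53, 50, 49, 48, 48, 39, 37, 36, 33, 33, 33, 19)

# Mean and median wins side by side for each conference
wins_summary <- data.frame(
  conference = c("East", "West"),
  mean_w     = c(mean(east_w), mean(west_w)),
  median_w   = c(median(east_w), median(west_w))
)
print(wins_summary)
```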

Exploratory Data Analysis

Next we will assess the direction and strength of the relationship between wins and point differential.

Graphing these two variables for each conference results in the following scatter plots.

As you can see, there is a clear linear relationship between point differential and wins for both conferences.
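Scatter plots like these could be drawn with base R graphics, as in the sketch below. The pdf and w values are typed in from the conference tables, and the side-by-side layout and axis labels are our own choices rather than the original plotting code.

```r
# Point differential and wins per team, from the conference tables
east <- data.frame(
  pdf = c(8.8, 6.0, 2.7, 4.4, 3.3, 0.7, -0.1, -0.3, -1.1, -0.2,
          -2.9, -6.1, -8.5, -9.6, -9.2),
  w   = c(60, 58, 51, 49, 48, 42, 42, 41, 39, 39, 32, 29, 22, 19, 17)
)
west <- data.frame(
  pdf = c(6.5, 4.0, 4.8, 4.2, 5.2, 3.4, 0.8, 1.7, -1.1, -1.7,
          -1.5, -1.2, -1.4, -2.6, -9.3),
  w   = c(57, 54, 53, 53, 50, 49, 48, 48, 39, 37, 36, 33, 33, 33, 19)
)

# Two panels side by side, one per conference
op <- par(mfrow = c(1, 2))
plot(east$pdf, east$w, pch = 19, main = "Eastern Conference",
     xlab = "Point differential (pdf)", ylab = "Wins (w)")
plot(west$pdf, west$w, pch = 19, main = "Western Conference",
     xlab = "Point differential (pdf)", ylab = "Wins (w)")
par(op)  # restore the previous graphics settings
```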

Below is the numerical correlation between point differential and wins.

Eastern Conference
pdf w
pdf 1.0000000 0.9892606
w 0.9892606 1.0000000
Western Conference
pdf w
pdf 1.0000000 0.9654652
w 0.9654652 1.0000000
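For reference, correlation matrices of this form can be reproduced with cor(). This is a sketch with the pdf and w columns typed in from the conference tables, not necessarily the exact call used to produce the output above.

```r
# Point differential and wins per team, from the conference tables
east <- data.frame(
  pdf = c(8.8, 6.0, 2.7, 4.4, 3.3, 0.7, -0.1, -0.3, -1.1, -0.2,
          -2.9, -6.1, -8.5, -9.6, -9.2),
  w   = c(60, 58, 51, 49, 48, 42, 42, 41, 39, 39, 32, 29, 22, 19, 17)
)
west <- data.frame(
  pdf = c(6.5, 4.0, 4.8, 4.2, 5.2, 3.4, 0.8, 1.7, -1.1, -1.7,
          -1.5, -1.2, -1.4, -2.6, -9.3),
  w   = c(57, 54, 53, 53, 50, 49, 48, 48, 39, 37, 36, 33, 33, 33, 19)
)

# Pearson correlation matrices for each conference
east_cor <- cor(east[, c("pdf", "w")])
west_cor <- cor(west[, c("pdf", "w")])
print(east_cor)
print(west_cor)
```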

As expected, there is a strong positive linear relationship between these two variables, with correlations of about 0.99 in the East and 0.97 in the West. Simple linear regression is therefore a natural way to predict how many wins a team will have given its point differential.

Linear Regression Model To Predict Wins

We can now start to build our linear regression model. First we will need to create a training set and a testing set for both the Eastern and Western Conference data.

Train and Test Data

We will split each conference into 10 training teams and 5 test teams, so that we can check the accuracy of our model on teams it has not seen.

We will use the lm() function to fit a simple linear regression model. The basic syntax is lm(y ~ x, data), where y is the response variable (wins) and x is the predictor variable (point differential). We will construct a simple linear regression model for each conference.
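The splitting code itself is not shown, so the sketch below is an assumption: the row labels in the prediction tables later on (3, 11, 12, 13 and 14) suggest those rows were held out for testing, and we reuse them here for the Eastern Conference. Etrain matches the name in the model output; Etest and test_rows are hypothetical names.

```r
# Point differential and wins for the Eastern Conference, in table order
east <- data.frame(
  pdf = c(8.8, 6.0, 2.7, 4.4, 3.3, 0.7, -0.1, -0.3, -1.1, -0.2,
          -2.9, -6.1, -8.5, -9.6, -9.2),
  w   = c(60, 58, 51, 49, 48, 42, 42, 41, 39, 39, 32, 29, 22, 19, 17)
)

# Assumption: held-out rows inferred from the prediction tables
test_rows <- c(3, 11, 12, 13, 14)

Etrain <- east[-test_rows, ]  # 10 teams used to fit the model
Etest  <- east[test_rows, ]   # 5 teams kept aside for evaluation
```

A random split (for example with sample() after set.seed()) would work the same way, only with a different test_rows vector.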
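As an illustration, the Eastern Conference fit can be sketched as follows. The training rows are an assumption (all rows except 3, 11, 12, 13 and 14, the row labels shown in the prediction tables later on), and east_fit is a name of our choosing.

```r
# Point differential and wins for the Eastern Conference, in table order
east <- data.frame(
  pdf = c(8.8, 6.0, 2.7, 4.4, 3.3, 0.7, -0.1, -0.3, -1.1, -0.2,
          -2.9, -6.1, -8.5, -9.6, -9.2),
  w   = c(60, 58, 51, 49, 48, 42, 42, 41, 39, 39, 32, 29, 22, 19, 17)
)

# Assumption: rows 3 and 11-14 are held out as the test set
Etrain <- east[-c(3, 11, 12, 13, 14), ]

# Response (wins) on the left of ~, predictor (point differential) on the right
east_fit <- lm(w ~ pdf, data = Etrain)
print(summary(east_fit))
```

The Western Conference model would be fitted the same way on its own training rows.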

For the Western Conference

## 
## Call:
## lm(formula = w ~ pdf, data = Wtrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5033 -1.0871 -0.5612  1.5801  3.9548 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  42.1437     0.8281   50.89 2.46e-11 ***
## pdf           2.3768     0.1813   13.11 1.09e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.498 on 8 degrees of freedom
## Multiple R-squared:  0.9555, Adjusted R-squared:   0.95 
## F-statistic: 171.8 on 1 and 8 DF,  p-value: 1.09e-06

For the Eastern Conference

## 
## Call:
## lm(formula = w ~ pdf, data = Etrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1766 -1.1911 -0.3647  1.1894  2.9488 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  40.5214     0.5834   69.46 2.05e-12 ***
## pdf           2.4216     0.1216   19.92 4.21e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.783 on 8 degrees of freedom
## Multiple R-squared:  0.9802, Adjusted R-squared:  0.9778 
## F-statistic: 396.6 on 1 and 8 DF,  p-value: 4.212e-08

Interpretation

The R-squared value is the proportion of variance in the response explained by the model, so it always takes a value between 0 and 1 and is independent of the scale of Y. Values of 0.9555 (West) and 0.9802 (East) are very high, indicating that the models fit the training data well.
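To make the "proportion of variance explained" definition concrete, the sketch below recomputes R-squared by hand for the Eastern Conference as 1 - SSE/SST and compares it with the value lm() reports. The training rows are an assumption (all rows except 3, 11, 12, 13 and 14), and the variable names are ours.

```r
# Point differential and wins for the Eastern Conference, in table order
east <- data.frame(
  pdf = c(8.8, 6.0, 2.7, 4.4, 3.3, 0.7, -0.1, -0.3, -1.1, -0.2,
          -2.9, -6.1, -8.5, -9.6, -9.2),
  w   = c(60, 58, 51, 49, 48, 42, 42, 41, 39, 39, 32, 29, 22, 19, 17)
)
Etrain   <- east[-c(3, 11, 12, 13, 14), ]  # assumed training rows
east_fit <- lm(w ~ pdf, data = Etrain)

sse <- sum(residuals(east_fit)^2)          # unexplained variation
sst <- sum((Etrain$w - mean(Etrain$w))^2)  # total variation in wins
r_squared <- 1 - sse / sst

print(r_squared)
print(summary(east_fit)$r.squared)  # should agree with the manual value
```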

Regression Equation

Below is the regression equation for the Eastern Conference, using the coefficients from the fitted models above,

Wins = 40.52 + 2.42 pdf

and for the Western Conference,

Wins = 42.14 + 2.38 pdf

The intercept is the projected number of wins for a team with a point differential of zero. In both conferences it is close to 41, half of an 82-game season. The Eastern Conference intercept of 40.52 is above that conference’s mean number of wins, 39.2, which is consistent with its negative average point differential (-0.8067). The Western Conference intercept of 42.14 is slightly below its mean of 42.8, consistent with its positive average point differential (0.7867). These results suggest that the Western Conference performed better overall than the Eastern Conference. The slopes indicate that each additional point of point differential per game is worth roughly 2.4 projected wins, so the larger a team’s point differential, the more wins it is projected to have.
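As a quick worked example, a regression equation can be applied by hand. The sketch below plugs the coefficients reported in the lm() summaries above into a small helper; project_wins is a hypothetical name, and the two point differentials are those of Philadelphia (East) and Houston (West).

```r
# Hypothetical helper: projected wins = intercept + slope * point differential
project_wins <- function(pdf, intercept, slope) {
  intercept + slope * pdf
}

# Coefficients copied from the lm() summaries
east_coef <- c(intercept = 40.5214, slope = 2.4216)
west_coef <- c(intercept = 42.1437, slope = 2.3768)

# Philadelphia 76ers (East, pdf = 2.7) and Houston Rockets (West, pdf = 4.8)
phi_wins <- project_wins(2.7, east_coef["intercept"], east_coef["slope"])
hou_wins <- project_wins(4.8, west_coef["intercept"], west_coef["slope"])
print(unname(c(phi_wins, hou_wins)))
```

These hand calculations should land within rounding error of the projections produced by the fitted models.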

Prediction

Lastly, we will test our model on the test set for each conference to see how well our projected wins compare to the actual totals.

Eastern Conference
Team Projected Actual
3 Philadelphia 76ers* 47.0597978990053 51
11 Washington Wizards 33.4986630456518 32
12 Atlanta Hawks 25.7494431294497 29
13 Chicago Bulls 19.9375281922982 22
14 Cleveland Cavaliers 17.2737338461038 19
Western Conference
Team Projected Actual
3 Houston Rockets* 53.5525308020923 53
11 Minnesota Timberwolves 38.578494634984 36
12 Dallas Mavericks 39.2915439762749 33
13 New Orleans Pelicans 38.8161777487477 33
14 Memphis Grizzlies 35.9639803835842 33
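A table like the Eastern Conference one can be sketched with predict(). The train/test split is an assumption based on the row labels shown above (rows 3, 11, 12, 13 and 14 held out), and the object names are ours.

```r
# Eastern Conference teams with point differential and wins, in table order
east <- data.frame(
  team = c("Milwaukee Bucks*", "Toronto Raptors*", "Philadelphia 76ers*",
           "Boston Celtics*", "Indiana Pacers*", "Orlando Magic*",
           "Brooklyn Nets*", "Detroit Pistons*", "Charlotte Hornets",
           "Miami Heat", "Washington Wizards", "Atlanta Hawks",
           "Chicago Bulls", "Cleveland Cavaliers", "New York Knicks"),
  pdf = c(8.8, 6.0, 2.7, 4.4, 3.3, 0.7, -0.1, -0.3, -1.1, -0.2,
          -2.9, -6.1, -8.5, -9.6, -9.2),
  w   = c(60, 58, 51, 49, 48, 42, 42, 41, 39, 39, 32, 29, 22, 19, 17)
)

test_rows <- c(3, 11, 12, 13, 14)  # assumed held-out rows
Etrain <- east[-test_rows, ]
Etest  <- east[test_rows, ]

east_fit <- lm(w ~ pdf, data = Etrain)

# Projected wins for the held-out teams next to their actual totals
east_pred <- data.frame(
  Team      = Etest$team,
  Projected = predict(east_fit, newdata = Etest),
  Actual    = Etest$w
)
print(east_pred)
```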

Here are some summary statistics for our linear regression model.

Sum of Squared Errors

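SSE is the sum of the squared differences between the observed values and the values fitted by the model, and so measures the discrepancy between the data and the model. The values below are computed on the training data: they match the residual standard errors reported above (for example, 1.783² × 8 ≈ 25.4 for the East).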

For the Eastern Conference

## [1] 25.43554

For the Western Conference

## [1] 49.92625
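A sketch of how the Eastern Conference value can be obtained from the model residuals. The training rows are an assumption (all rows except 3, 11, 12, 13 and 14), and the names are ours.

```r
# Point differential and wins for the Eastern Conference, in table order
east <- data.frame(
  pdf = c(8.8, 6.0, 2.7, 4.4, 3.3, 0.7, -0.1, -0.3, -1.1, -0.2,
          -2.9, -6.1, -8.5, -9.6, -9.2),
  w   = c(60, 58, 51, 49, 48, 42, 42, 41, 39, 39, 32, 29, 22, 19, 17)
)
Etrain   <- east[-c(3, 11, 12, 13, 14), ]  # assumed training rows
east_fit <- lm(w ~ pdf, data = Etrain)

# Sum of squared residuals on the training data
east_sse <- sum(residuals(east_fit)^2)
print(east_sse)
```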

Root Mean Squared Error

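RMSE measures the typical size of the differences between the values predicted by a model and the values observed; it is the square root of the mean squared error. The values below correspond to dividing the training SSE by 15, the number of teams in each conference. Dividing by the 10 training observations instead would give roughly 1.59 and 2.23.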

For the Eastern Conference

## [1] 1.302191

For the Western Conference

## [1] 1.824395
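The sketch below shows three ways an RMSE could be computed for the Eastern Conference: dividing the training SSE by all 15 teams (which appears to reproduce the figure above), dividing by the 10 training rows, and evaluating on the held-out test rows. The split (rows 3, 11, 12, 13 and 14 held out) and all names are assumptions.

```r
# Point differential and wins for the Eastern Conference, in table order
east <- data.frame(
  pdf = c(8.8, 6.0, 2.7, 4.4, 3.3, 0.7, -0.1, -0.3, -1.1, -0.2,
          -2.9, -6.1, -8.5, -9.6, -9.2),
  w   = c(60, 58, 51, 49, 48, 42, 42, 41, 39, 39, 32, 29, 22, 19, 17)
)
test_rows <- c(3, 11, 12, 13, 14)  # assumed held-out rows
Etrain <- east[-test_rows, ]
Etest  <- east[test_rows, ]
east_fit <- lm(w ~ pdf, data = Etrain)

train_sse <- sum(residuals(east_fit)^2)

# Training SSE divided by all 15 teams in the conference
rmse_all_rows <- sqrt(train_sse / nrow(east))

# Training SSE divided by the 10 rows the model was actually fitted on
rmse_train <- sqrt(train_sse / nrow(Etrain))

# Out-of-sample RMSE on the 5 held-out teams
test_err  <- Etest$w - predict(east_fit, newdata = Etest)
rmse_test <- sqrt(mean(test_err^2))

print(c(all_rows = rmse_all_rows, train = rmse_train, test = rmse_test))
```

The test-set version is usually the most honest measure of how well the model projects wins for teams it has not seen.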