Simple linear regression is one of the most useful machine learning algorithms in data science. To demonstrate it, we will use data from the 2018-2019 NBA regular season to build a linear regression model that projects a team’s wins from its point differential over the course of the season. Along the way we will check whether a relationship between these two variables exists and how strong it is.
Using the ballr package in R, we can pull NBA team data from basketballreference.com directly into the console.
Now that we have loaded all of the necessary dependencies, we can view and use the final standings of the 2018-2019 NBA regular season.
A brief synopsis of each column abbreviation:
• w - number of team wins
• l - number of team losses
• w_lpercent - the team’s win-loss percentage, calculated as the number of wins divided by 82 (the total number of games played)
• gb - games behind the first-place team in the team’s conference
• ps_g - the team’s points scored per game
• pa_g - the team’s points allowed per game
• pw, pl - will not be used in this project.
As indicated by the asterisks, the top 8 teams in each conference make the playoffs.
To conduct some exploratory data analysis, we will first need to transform this data into more manageable data frames.
First, let’s examine the structure of the data frame for the final standings for the 2018-2019 NBA season.
## List of 2
## $ East:'data.frame': 15 obs. of 9 variables:
## ..$ eastern_conference: chr [1:15] "Milwaukee Bucks*" "Toronto Raptors*" "Philadelphia 76ers*" "Boston Celtics*" ...
## ..$ w : int [1:15] 60 58 51 49 48 42 42 41 39 39 ...
## ..$ l : int [1:15] 22 24 31 33 34 40 40 41 43 43 ...
## ..$ w_lpercent : num [1:15] 0.732 0.707 0.622 0.598 0.585 0.512 0.512 0.5 0.476 0.476 ...
## ..$ gb : chr [1:15] "—" "2.0" "9.0" "11.0" ...
## ..$ pw : int [1:15] 61 56 48 52 50 43 41 40 38 40 ...
## ..$ pl : int [1:15] 21 26 34 30 32 39 41 42 44 42 ...
## ..$ ps_g : num [1:15] 118 114 115 112 108 ...
## ..$ pa_g : num [1:15] 109 108 112 108 105 ...
## $ West:'data.frame': 15 obs. of 9 variables:
## ..$ western_conference: chr [1:15] "Golden State Warriors*" "Denver Nuggets*" "Houston Rockets*" "Portland Trail Blazers*" ...
## ..$ w : int [1:15] 57 54 53 53 50 49 48 48 39 37 ...
## ..$ l : int [1:15] 25 28 29 29 32 33 34 34 43 45 ...
## ..$ w_lpercent : num [1:15] 0.695 0.659 0.646 0.646 0.61 0.598 0.585 0.585 0.476 0.451 ...
## ..$ gb : chr [1:15] "—" "3.0" "4.0" "4.0" ...
## ..$ pw : int [1:15] 56 51 53 51 54 50 43 45 38 37 ...
## ..$ pl : int [1:15] 26 31 29 31 28 32 39 37 44 45 ...
## ..$ ps_g : num [1:15] 118 111 114 115 112 ...
## ..$ pa_g : num [1:15] 111 107 109 110 106 ...
Each team’s schedule is weighted toward opponents in its own conference, so we will examine the Eastern and Western conferences separately.
Next we will create a new column for each team’s point differential, called pdf, calculated by subtracting the points allowed per game (pa_g) from the points scored per game (ps_g).
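A minimal sketch of this step, using three Eastern Conference rows copied from the standings. The data frame name `east` is an assumption made for illustration; the same line would be applied to each conference’s data frame.

```r
# Three Eastern Conference rows copied from the standings (illustrative subset)
east <- data.frame(
  eastern_conference = c("Milwaukee Bucks*", "Toronto Raptors*", "Philadelphia 76ers*"),
  ps_g = c(118.1, 114.4, 115.2),  # points scored per game
  pa_g = c(109.3, 108.4, 112.5)   # points allowed per game
)

# Point differential: points scored minus points allowed, per game
east$pdf <- east$ps_g - east$pa_g

print(east)
```

A positive pdf means the team outscored its opponents on average.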
| eastern_conference | w | l | w_lpercent | gb | pw | pl | ps_g | pa_g | pdf |
|---|---|---|---|---|---|---|---|---|---|
| Milwaukee Bucks* | 60 | 22 | 0.732 | — | 61 | 21 | 118.1 | 109.3 | 8.8 |
| Toronto Raptors* | 58 | 24 | 0.707 | 2.0 | 56 | 26 | 114.4 | 108.4 | 6.0 |
| Philadelphia 76ers* | 51 | 31 | 0.622 | 9.0 | 48 | 34 | 115.2 | 112.5 | 2.7 |
| Boston Celtics* | 49 | 33 | 0.598 | 11.0 | 52 | 30 | 112.4 | 108.0 | 4.4 |
| Indiana Pacers* | 48 | 34 | 0.585 | 12.0 | 50 | 32 | 108.0 | 104.7 | 3.3 |
| Orlando Magic* | 42 | 40 | 0.512 | 18.0 | 43 | 39 | 107.3 | 106.6 | 0.7 |
| Brooklyn Nets* | 42 | 40 | 0.512 | 18.0 | 41 | 41 | 112.2 | 112.3 | -0.1 |
| Detroit Pistons* | 41 | 41 | 0.500 | 19.0 | 40 | 42 | 107.0 | 107.3 | -0.3 |
| Charlotte Hornets | 39 | 43 | 0.476 | 21.0 | 38 | 44 | 110.7 | 111.8 | -1.1 |
| Miami Heat | 39 | 43 | 0.476 | 21.0 | 40 | 42 | 105.7 | 105.9 | -0.2 |
| Washington Wizards | 32 | 50 | 0.390 | 28.0 | 34 | 48 | 114.0 | 116.9 | -2.9 |
| Atlanta Hawks | 29 | 53 | 0.354 | 31.0 | 27 | 55 | 113.3 | 119.4 | -6.1 |
| Chicago Bulls | 22 | 60 | 0.268 | 38.0 | 21 | 61 | 104.9 | 113.4 | -8.5 |
| Cleveland Cavaliers | 19 | 63 | 0.232 | 41.0 | 19 | 63 | 104.5 | 114.1 | -9.6 |
| New York Knicks | 17 | 65 | 0.207 | 43.0 | 19 | 63 | 104.6 | 113.8 | -9.2 |
| western_conference | w | l | w_lpercent | gb | pw | pl | ps_g | pa_g | pdf |
|---|---|---|---|---|---|---|---|---|---|
| Golden State Warriors* | 57 | 25 | 0.695 | — | 56 | 26 | 117.7 | 111.2 | 6.5 |
| Denver Nuggets* | 54 | 28 | 0.659 | 3.0 | 51 | 31 | 110.7 | 106.7 | 4.0 |
| Houston Rockets* | 53 | 29 | 0.646 | 4.0 | 53 | 29 | 113.9 | 109.1 | 4.8 |
| Portland Trail Blazers* | 53 | 29 | 0.646 | 4.0 | 51 | 31 | 114.7 | 110.5 | 4.2 |
| Utah Jazz* | 50 | 32 | 0.610 | 7.0 | 54 | 28 | 111.7 | 106.5 | 5.2 |
| Oklahoma City Thunder* | 49 | 33 | 0.598 | 8.0 | 50 | 32 | 114.5 | 111.1 | 3.4 |
| Los Angeles Clippers* | 48 | 34 | 0.585 | 9.0 | 43 | 39 | 115.1 | 114.3 | 0.8 |
| San Antonio Spurs* | 48 | 34 | 0.585 | 9.0 | 45 | 37 | 111.7 | 110.0 | 1.7 |
| Sacramento Kings | 39 | 43 | 0.476 | 18.0 | 38 | 44 | 114.2 | 115.3 | -1.1 |
| Los Angeles Lakers | 37 | 45 | 0.451 | 20.0 | 37 | 45 | 111.8 | 113.5 | -1.7 |
| Minnesota Timberwolves | 36 | 46 | 0.439 | 21.0 | 37 | 45 | 112.5 | 114.0 | -1.5 |
| Dallas Mavericks | 33 | 49 | 0.402 | 24.0 | 38 | 44 | 108.9 | 110.1 | -1.2 |
| New Orleans Pelicans | 33 | 49 | 0.402 | 24.0 | 38 | 44 | 115.4 | 116.8 | -1.4 |
| Memphis Grizzlies | 33 | 49 | 0.402 | 24.0 | 34 | 48 | 103.5 | 106.1 | -2.6 |
| Phoenix Suns | 19 | 63 | 0.232 | 38.0 | 19 | 63 | 107.5 | 116.8 | -9.3 |
Let’s take a closer look at summary statistics for the Eastern and Western conference data frames.
| eastern_conference | w | l | w_lpercent | gb | pw | pl | ps_g | pa_g | pdf |
|---|---|---|---|---|---|---|---|---|---|
| Length:15 | Min. :17.0 | Min. :22.0 | Min. :0.2070 | Length:15 | Min. :19.00 | Min. :21.00 | Min. :104.5 | Min. :104.7 | Min. :-9.6000 |
| Class :character | 1st Qu.:30.5 | 1st Qu.:33.5 | 1st Qu.:0.3720 | Class :character | 1st Qu.:30.50 | 1st Qu.:33.00 | 1st Qu.:106.3 | 1st Qu.:107.7 | 1st Qu.:-4.5000 |
| Mode :character | Median :41.0 | Median :41.0 | Median :0.5000 | Mode :character | Median :40.00 | Median :42.00 | Median :110.7 | Median :111.8 | Median :-0.2000 |
| NA | Mean :39.2 | Mean :42.8 | Mean :0.4781 | NA | Mean :39.27 | Mean :42.73 | Mean :110.2 | Mean :111.0 | Mean :-0.8067 |
| NA | 3rd Qu.:48.5 | 3rd Qu.:51.5 | 3rd Qu.:0.5915 | NA | 3rd Qu.:49.00 | 3rd Qu.:51.50 | 3rd Qu.:113.7 | 3rd Qu.:113.6 | 3rd Qu.: 3.0000 |
| NA | Max. :60.0 | Max. :65.0 | Max. :0.7320 | NA | Max. :61.00 | Max. :63.00 | Max. :118.1 | Max. :119.4 | Max. : 8.8000 |
As we can see, the mean number of wins in the Eastern Conference is 39.2 (the median is 41), and the average point differential across all teams is -0.8067 points.
| western_conference | w | l | w_lpercent | gb | pw | pl | ps_g | pa_g | pdf |
|---|---|---|---|---|---|---|---|---|---|
| Length:15 | Min. :19.0 | Min. :25.0 | Min. :0.2320 | Length:15 | Min. :19.00 | Min. :26.00 | Min. :103.5 | Min. :106.1 | Min. :-9.3000 |
| Class :character | 1st Qu.:34.5 | 1st Qu.:30.5 | 1st Qu.:0.4205 | Class :character | 1st Qu.:37.50 | 1st Qu.:31.00 | 1st Qu.:111.2 | 1st Qu.:109.5 | 1st Qu.:-1.4500 |
| Mode :character | Median :48.0 | Median :34.0 | Median :0.5850 | Mode :character | Median :43.00 | Median :39.00 | Median :112.5 | Median :111.1 | Median : 0.8000 |
| NA | Mean :42.8 | Mean :39.2 | Mean :0.5219 | NA | Mean :42.93 | Mean :39.07 | Mean :112.3 | Mean :111.5 | Mean : 0.7867 |
| NA | 3rd Qu.:51.5 | 3rd Qu.:47.5 | 3rd Qu.:0.6280 | NA | 3rd Qu.:51.00 | 3rd Qu.:44.50 | 3rd Qu.:114.6 | 3rd Qu.:114.2 | 3rd Qu.: 4.1000 |
| NA | Max. :57.0 | Max. :63.0 | Max. :0.6950 | NA | Max. :56.00 | Max. :63.00 | Max. :117.7 | Max. :116.8 | Max. : 6.5000 |
The mean number of wins is higher in the Western Conference at 42.8 (the median is 48), and the average point differential across all teams is 0.7867 points.
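Because the mean and the median are easy to mix up when reading summary() output, here is a small sketch that computes both for the wins column of each conference. The vector names `east_w` and `west_w` are assumptions for illustration; the values are copied from the standings.

```r
# Wins per team, copied from the final standings
east_w <- c(60, 58, 51, 49, 48, 42, 42, 41, 39, 39, 32, 29, 22, 19, 17)
west_w <- c(57, 54, 53, 53, 50, 49, 48, 48, 39, 37, 36, 33, 33, 33, 19)

# Mean and median side by side for each conference
wins_summary <- data.frame(
  conference = c("East", "West"),
  mean_w     = c(mean(east_w), mean(west_w)),
  median_w   = c(median(east_w), median(west_w))
)

print(wins_summary)
```

In both conferences the median sits above the mean, which reflects a handful of teams with very low win totals pulling the mean down.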
Next, we will assess the strength of the relationship between wins and point differential.
Graphing these two variables for each conference produces the following scatter plots.
As you can see, there is a clear linear relationship between point differential and wins in both conferences.
Below is the numerical correlation between point differential and wins.
For the Eastern Conference

| | pdf | w |
|---|---|---|
| pdf | 1.0000000 | 0.9892606 |
| w | 0.9892606 | 1.0000000 |
For the Western Conference

| | pdf | w |
|---|---|---|
| pdf | 1.0000000 | 0.9654652 |
| w | 0.9654652 | 1.0000000 |
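The correlations above can be reproduced with cor(). This is a sketch under the assumption that each conference is held in a data frame with columns `w` and `pdf`; the names `east` and `west` are illustrative, and the values are copied from the standings tables.

```r
# Wins and point differential per team, copied from the standings tables
east <- data.frame(
  w   = c(60, 58, 51, 49, 48, 42, 42, 41, 39, 39, 32, 29, 22, 19, 17),
  pdf = c(8.8, 6.0, 2.7, 4.4, 3.3, 0.7, -0.1, -0.3, -1.1, -0.2,
          -2.9, -6.1, -8.5, -9.6, -9.2)
)
west <- data.frame(
  w   = c(57, 54, 53, 53, 50, 49, 48, 48, 39, 37, 36, 33, 33, 33, 19),
  pdf = c(6.5, 4.0, 4.8, 4.2, 5.2, 3.4, 0.8, 1.7, -1.1, -1.7,
          -1.5, -1.2, -1.4, -2.6, -9.3)
)

# Pearson correlation between point differential and wins
cor_east <- cor(east$pdf, east$w)
cor_west <- cor(west$pdf, west$w)

# Full 2x2 correlation matrix, as shown in the tables
print(cor(east[, c("pdf", "w")]))
print(cor(west[, c("pdf", "w")]))
```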
As expected, there is a strong linear relationship between these two variables in both conferences, which makes simple linear regression a natural choice for predicting a team’s wins from its point differential.
We can now start building our linear regression model. First, we need to create a training set and a testing set for both the Eastern and Western Conference data.
Splitting the data this way lets us test the accuracy of our model on teams it was not fit to.
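A sketch of the split for the Eastern Conference. How the original split was drawn is not shown, so the held-out rows here are an assumption: they are taken from the row numbers (3, 11, 12, 13, 14) that appear in the projection tables later in the article. `Etrain` matches the name in the lm() call; `Etest` and `east` are illustrative names.

```r
# Wins and point differential, rows in standings order
east <- data.frame(
  w   = c(60, 58, 51, 49, 48, 42, 42, 41, 39, 39, 32, 29, 22, 19, 17),
  pdf = c(8.8, 6.0, 2.7, 4.4, 3.3, 0.7, -0.1, -0.3, -1.1, -0.2,
          -2.9, -6.1, -8.5, -9.6, -9.2)
)

# Rows held out for testing (assumed from the projection tables)
test_idx <- c(3, 11, 12, 13, 14)

Etrain <- east[-test_idx, ]  # 10 teams used to fit the model
Etest  <- east[test_idx, ]   # 5 teams used to check the projections

cat("training rows:", nrow(Etrain), " test rows:", nrow(Etest), "\n")
```

With only 15 teams per conference, a 10/5 split leaves few observations on either side, so results will be sensitive to which teams land in the test set.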
We will use the lm() function to fit a simple linear regression model. The basic syntax is lm(y ~ x, data), where y is the response variable (wins) and x is the predictor variable (point differential). We will construct a simple linear regression model for each conference.
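A sketch of the fit for the Western Conference. The training rows are an assumption inferred from the projection tables (rows 3, 11, 12, 13 and 14 are held out); `west` and `west_fit` are illustrative names, and `Wtrain` matches the name in the lm() call.

```r
# Wins and point differential, rows in standings order
west <- data.frame(
  w   = c(57, 54, 53, 53, 50, 49, 48, 48, 39, 37, 36, 33, 33, 33, 19),
  pdf = c(6.5, 4.0, 4.8, 4.2, 5.2, 3.4, 0.8, 1.7, -1.1, -1.7,
          -1.5, -1.2, -1.4, -2.6, -9.3)
)

# Hold out the assumed test rows and fit on the rest
test_idx <- c(3, 11, 12, 13, 14)
Wtrain   <- west[-test_idx, ]

# Simple linear regression: wins as a function of point differential
west_fit <- lm(w ~ pdf, data = Wtrain)

print(coef(west_fit))     # intercept and slope
print(summary(west_fit))  # full regression summary
```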
For the Western Conference
##
## Call:
## lm(formula = w ~ pdf, data = Wtrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5033 -1.0871 -0.5612 1.5801 3.9548
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.1437 0.8281 50.89 2.46e-11 ***
## pdf 2.3768 0.1813 13.11 1.09e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.498 on 8 degrees of freedom
## Multiple R-squared: 0.9555, Adjusted R-squared: 0.95
## F-statistic: 171.8 on 1 and 8 DF, p-value: 1.09e-06
For the Eastern Conference
##
## Call:
## lm(formula = w ~ pdf, data = Etrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1766 -1.1911 -0.3647 1.1894 2.9488
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40.5214 0.5834 69.46 2.05e-12 ***
## pdf 2.4216 0.1216 19.92 4.21e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.783 on 8 degrees of freedom
## Multiple R-squared: 0.9802, Adjusted R-squared: 0.9778
## F-statistic: 396.6 on 1 and 8 DF, p-value: 4.212e-08
R-squared is the proportion of variance in the response that the model explains, so it always takes a value between 0 and 1 and is independent of the scale of Y. The values of 0.9555 (Western Conference) and 0.9802 (Eastern Conference) are very high, indicating that the model fits the training data well.
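To make "proportion of variance explained" concrete, the sketch below computes R-squared by hand as 1 - SSE/SST for the Eastern Conference and compares it with the value reported by summary(). The training rows are assumed from the projection tables, and the variable names are illustrative.

```r
# Eastern Conference wins and point differential, standings order
east <- data.frame(
  w   = c(60, 58, 51, 49, 48, 42, 42, 41, 39, 39, 32, 29, 22, 19, 17),
  pdf = c(8.8, 6.0, 2.7, 4.4, 3.3, 0.7, -0.1, -0.3, -1.1, -0.2,
          -2.9, -6.1, -8.5, -9.6, -9.2)
)
Etrain   <- east[-c(3, 11, 12, 13, 14), ]  # assumed training rows
east_fit <- lm(w ~ pdf, data = Etrain)

# SSE: variation left unexplained by the model
sse <- sum(residuals(east_fit)^2)
# SST: total variation of wins around their mean
sst <- sum((Etrain$w - mean(Etrain$w))^2)

r_squared <- 1 - sse / sst

cat("by hand:  ", r_squared, "\n")
cat("summary():", summary(east_fit)$r.squared, "\n")
```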
Below is the regression equation for the Eastern Conference, taken from the coefficient estimates above,
Wins = 40.52 + 2.42 pdf
and for the Western Conference,
Wins = 42.14 + 2.38 pdf
The intercept is the projected win total for a team with a point differential of zero. The Eastern Conference intercept of 40.52 wins is greater than that conference’s mean number of wins, 39.2, while the Western Conference intercept of 42.14 is below its mean of 42.8. This follows from the average point differential being negative in the East (-0.8067) and positive in the West (0.7867), and it suggests that the Western Conference performed better overall than the Eastern Conference. In both models the slope is positive: the larger a team’s point differential, the more wins it is projected to have, at roughly 2.4 additional wins per point of differential.
Lastly, we will test our model on the test set for each conference to see how well the projected wins compare to the actual totals.
For the Eastern Conference

| | Team | Projected | Actual |
|---|---|---|---|
| 3 | Philadelphia 76ers* | 47.0597978990053 | 51 |
| 11 | Washington Wizards | 33.4986630456518 | 32 |
| 12 | Atlanta Hawks | 25.7494431294497 | 29 |
| 13 | Chicago Bulls | 19.9375281922982 | 22 |
| 14 | Cleveland Cavaliers | 17.2737338461038 | 19 |

For the Western Conference

| | Team | Projected | Actual |
|---|---|---|---|
| 3 | Houston Rockets* | 53.5525308020923 | 53 |
| 11 | Minnesota Timberwolves | 38.578494634984 | 36 |
| 12 | Dallas Mavericks | 39.2915439762749 | 33 |
| 13 | New Orleans Pelicans | 38.8161777487477 | 33 |
| 14 | Memphis Grizzlies | 35.9639803835842 | 33 |
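Projections like those in the tables above can be generated with predict(). This sketch covers the Eastern Conference and assumes the held-out rows are the ones listed in the table (3, 11, 12, 13, 14); `Etest`, `east_fit` and `east_proj` are illustrative names.

```r
# Eastern Conference wins and point differential, standings order
east <- data.frame(
  w   = c(60, 58, 51, 49, 48, 42, 42, 41, 39, 39, 32, 29, 22, 19, 17),
  pdf = c(8.8, 6.0, 2.7, 4.4, 3.3, 0.7, -0.1, -0.3, -1.1, -0.2,
          -2.9, -6.1, -8.5, -9.6, -9.2)
)
test_idx <- c(3, 11, 12, 13, 14)   # assumed test rows
Etrain   <- east[-test_idx, ]
Etest    <- east[test_idx, ]

east_fit <- lm(w ~ pdf, data = Etrain)

# Projected wins for the held-out teams next to their actual wins
east_proj <- data.frame(
  Projected = predict(east_fit, newdata = Etest),
  Actual    = Etest$w
)

print(east_proj)
```

The row names of `east_proj` carry over the original row numbers, which is why the tables show 3, 11, 12, 13 and 14 in the first column.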
Here are some summary statistics for our linear regression model.
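SSE (the sum of squared errors) measures the discrepancy between the data and the fitted model: it is the sum of the squared residuals. The values below match the residuals of each training set.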
For the Eastern Conference
## [1] 25.43554
For the Western Conference
## [1] 49.92625
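RMSE (root mean squared error) measures the typical size of the differences between the values predicted by a model and the values observed. The values below correspond to the square root of the SSE above divided by 15, the number of teams in each conference.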
For the Eastern Conference
## [1] 1.302191
For the Western Conference
## [1] 1.824395
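A sketch of how these statistics can be computed for the Eastern Conference, under the assumption that rows 3, 11, 12, 13 and 14 were held out. It shows the training SSE, the RMSE using the two possible divisors (15 teams versus the 10 training rows), and, as an additional check not reported above, the RMSE on the held-out teams. All variable names are illustrative.

```r
# Eastern Conference wins and point differential, standings order
east <- data.frame(
  w   = c(60, 58, 51, 49, 48, 42, 42, 41, 39, 39, 32, 29, 22, 19, 17),
  pdf = c(8.8, 6.0, 2.7, 4.4, 3.3, 0.7, -0.1, -0.3, -1.1, -0.2,
          -2.9, -6.1, -8.5, -9.6, -9.2)
)
test_idx <- c(3, 11, 12, 13, 14)   # assumed test rows
Etrain   <- east[-test_idx, ]
Etest    <- east[test_idx, ]
east_fit <- lm(w ~ pdf, data = Etrain)

# SSE: sum of squared residuals on the training set
sse <- sum(residuals(east_fit)^2)

# RMSE with the two divisors discussed above
rmse_all_teams <- sqrt(sse / nrow(east))    # divides by 15
rmse_train     <- sqrt(sse / nrow(Etrain))  # divides by 10

# Out-of-sample RMSE on the held-out teams
test_err  <- Etest$w - predict(east_fit, newdata = Etest)
rmse_test <- sqrt(mean(test_err^2))

cat("SSE:", sse, "\n")
cat("RMSE (n = 15):", rmse_all_teams, "\n")
cat("RMSE (n = 10):", rmse_train, "\n")
cat("RMSE (test set):", rmse_test, "\n")
```

Dividing by the number of training rows gives the in-sample error per fitted team, while the test-set RMSE is the more direct measure of how well the projections generalize.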