Introduction

Using the R programming language, this project applies the K-Nearest-Neighbours (KNN) algorithm to predict the position of NBA players based on their regular season statistics.

The k-nearest neighbours algorithm predicts unknown values by matching them with the most similar known values. Since the KNN model is a straightforward, memory-based approach, it cannot be summarized in closed form: the training samples are required at run time, and predictions are made from the relationships among samples. The value of \(k\) in this algorithm can be any positive integer less than the number of rows in the data frame. The aim is to examine a small number of neighbours, because the less similar the neighbours are to the point being predicted, the worse the prediction will be.

In order to find the most similar NBA players we will use Euclidean distance, which simply measures the straight-line distance between two players in feature space, using data from basketballreference.com.
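
For two players represented by feature vectors \(p = (p_1, \ldots, p_n)\) and \(q = (q_1, \ldots, q_n)\), this distance is

\[
d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
\]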

NBA Statistics in R

We use the per game statistics from the 2018-2019 NBA regular season, as tracked and documented for each player throughout the season.

The statistics include:

• Games

• Games started

• Minutes played

• 2pt & 3pt field goals/attempts

• Free throws/attempts

• Assists, turnovers

• Offensive, defensive, and total rebounds

• Fouls

• Blocks

• Steals

• Total points

In order to view the per game statistics of each player in the 2019 regular season, we load the data and inspect the first few rows.
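
One way to obtain this data frame is sketched below, assuming the ballr package (which scrapes per game tables from basketball-reference.com); the object name PerGame is our own:

library(ballr)

# Per game statistics for the 2018-2019 regular season
PerGame <- NBAPerGameStatistics(season = 2019)
head(PerGame)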

rk player pos age tm g gs mp fg fga fgpercent x3p x3pa x3ppercent x2p x2pa x2ppercent efgpercent ft fta ftpercent orb drb trb ast stl blk tov pf pts link
1 Álex Abrines SG 25 OKC 31 2 19.0 1.8 5.1 0.357 1.3 4.1 0.323 0.5 1.0 0.500 0.487 0.4 0.4 0.923 0.2 1.4 1.5 0.6 0.5 0.2 0.5 1.7 5.3 /players/a/abrinal01.html
2 Quincy Acy PF 28 PHO 10 0 12.3 0.4 1.8 0.222 0.2 1.5 0.133 0.2 0.3 0.667 0.278 0.7 1.0 0.700 0.3 2.2 2.5 0.8 0.1 0.4 0.4 2.4 1.7 /players/a/acyqu01.html
3 Jaylen Adams PG 22 ATL 34 1 12.6 1.1 3.2 0.345 0.7 2.2 0.338 0.4 1.1 0.361 0.459 0.2 0.3 0.778 0.3 1.4 1.8 1.9 0.4 0.1 0.8 1.3 3.2 /players/a/adamsja01.html
4 Steven Adams C 25 OKC 80 80 33.4 6.0 10.1 0.595 0.0 0.0 0.000 6.0 10.1 0.596 0.595 1.8 3.7 0.500 4.9 4.6 9.5 1.6 1.5 1.0 1.7 2.6 13.9 /players/a/adamsst01.html
5 Bam Adebayo C 21 MIA 82 28 23.3 3.4 5.9 0.576 0.0 0.2 0.200 3.4 5.7 0.588 0.579 2.0 2.8 0.735 2.0 5.3 7.3 2.2 0.9 0.8 1.5 2.5 8.9 /players/a/adebaba01.html
6 Deng Adel SF 21 CLE 19 3 10.2 0.6 1.9 0.306 0.3 1.2 0.261 0.3 0.7 0.385 0.389 0.2 0.2 1.000 0.2 0.8 1.0 0.3 0.1 0.2 0.3 0.7 1.7 /players/a/adelde01.html

Here are the column names of our data frame:

##  [1] "rk"         "player"     "pos"        "age"        "tm"        
##  [6] "g"          "gs"         "mp"         "fg"         "fga"       
## [11] "fgpercent"  "x3p"        "x3pa"       "x3ppercent" "x2p"       
## [16] "x2pa"       "x2ppercent" "efgpercent" "ft"         "fta"       
## [21] "ftpercent"  "orb"        "drb"        "trb"        "ast"       
## [26] "stl"        "blk"        "tov"        "pf"         "pts"       
## [31] "link"

Now we will create a data frame containing only the player’s name, their position, and the numeric per game values, using the following columns:

“player”, “pos”, “g”, “gs”, “age”, “mp”, “fg”, “fga”, “fgpercent”, “x3p”, “x3pa”, “x3ppercent”, “x2p”, “x2pa”, “x2ppercent”, “efgpercent”, “ft”, “fta”, “ftpercent”, “orb”, “drb”, “trb”, “ast”, “stl”, “blk”, “tov”, “pf”, “pts”
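
One way to build this subset is sketched below, assuming dplyr and the PerGame data frame from earlier; the name PerGame2 is our own:

library(dplyr)

# Keep the player's name, position, and the per game numeric columns
PerGame2 <- PerGame %>%
  select(player, pos, g, gs, age, mp, fg, fga, fgpercent,
         x3p, x3pa, x3ppercent, x2p, x2pa, x2ppercent, efgpercent,
         ft, fta, ftpercent, orb, drb, trb, ast, stl, blk, tov, pf, pts)

head(PerGame2)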

player pos g gs age mp fg fga fgpercent x3p x3pa x3ppercent x2p x2pa x2ppercent efgpercent ft fta ftpercent orb drb trb ast stl blk tov pf pts
Álex Abrines SG 31 2 25 19.0 1.8 5.1 0.357 1.3 4.1 0.323 0.5 1.0 0.500 0.487 0.4 0.4 0.923 0.2 1.4 1.5 0.6 0.5 0.2 0.5 1.7 5.3
Quincy Acy PF 10 0 28 12.3 0.4 1.8 0.222 0.2 1.5 0.133 0.2 0.3 0.667 0.278 0.7 1.0 0.700 0.3 2.2 2.5 0.8 0.1 0.4 0.4 2.4 1.7
Jaylen Adams PG 34 1 22 12.6 1.1 3.2 0.345 0.7 2.2 0.338 0.4 1.1 0.361 0.459 0.2 0.3 0.778 0.3 1.4 1.8 1.9 0.4 0.1 0.8 1.3 3.2
Steven Adams C 80 80 25 33.4 6.0 10.1 0.595 0.0 0.0 0.000 6.0 10.1 0.596 0.595 1.8 3.7 0.500 4.9 4.6 9.5 1.6 1.5 1.0 1.7 2.6 13.9
Bam Adebayo C 82 28 21 23.3 3.4 5.9 0.576 0.0 0.2 0.200 3.4 5.7 0.588 0.579 2.0 2.8 0.735 2.0 5.3 7.3 2.2 0.9 0.8 1.5 2.5 8.9
Deng Adel SF 19 3 21 10.2 0.6 1.9 0.306 0.3 1.2 0.261 0.3 0.7 0.385 0.389 0.2 0.2 1.000 0.2 0.8 1.0 0.3 0.1 0.2 0.3 0.7 1.7

Here is the row of regular season statistics for the 2019 NBA MVP, Giannis Antetokounmpo.
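
That row can be looked up by player name, for example:

# Pull the MVP's row from our subset
PerGame2[PerGame2$player == "Giannis Antetokounmpo", ]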

player pos g gs age mp fg fga fgpercent x3p x3pa x3ppercent x2p x2pa x2ppercent efgpercent ft fta ftpercent orb drb trb ast stl blk tov pf pts
Giannis Antetokounmpo PF 72 72 24 32.8 10 17.3 0.578 0.7 2.8 0.256 9.3 14.5 0.641 0.599 6.9 9.5 0.729 2.2 10.3 12.5 5.9 1.3 1.5 3.7 3.2 27.7

Now we must standardize each column, because variables on larger scales have a greater effect on the distance between observations, and hence on the overall KNN classifier, than variables on smaller scales. After standardizing, each variable will have a mean of 0 and a standard deviation of 1.

Since we are trying to classify players by their position, we will save the ‘pos’ column separately; it serves as the class label for the KNN model.

Standardizing Data

Using the scale() function in R, we will now standardize the rest of the PerGame dataset.
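
A minimal sketch, reusing PerGame2 from above (num.cols is our own helper; standardized.PerGame is the data frame used in the rest of this section):

# Standardize only the numeric columns, keeping player and pos as-is
num.cols <- sapply(PerGame2, is.numeric)
standardized.PerGame <- PerGame2
standardized.PerGame[num.cols] <- scale(PerGame2[num.cols])

head(standardized.PerGame)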

player pos g gs age mp fg fga fgpercent x3p x3pa x3ppercent x2p x2pa x2ppercent efgpercent ft fta ftpercent orb drb trb ast stl blk tov pf pts
Álex Abrines SG -0.4521250 -0.6901915 -0.2759252 -0.0293946 -0.6104422 -0.4060435 -0.7300242 0.5271520 0.7483735 0.0569648 -0.9716835 -1.0309463 0.0654535 -0.1120092 -0.7383836 -0.8650535 1.2778046 -0.8061146 -0.7157392 -0.8267956 -0.7516947 -0.2607481 -0.4397157 -0.7125465 -0.0306029 -0.5292066
Quincy Acy PF -1.2511496 -0.7675349 0.4485064 -0.7718669 -1.2719380 -1.1567002 -1.9573290 -0.8673656 -0.5142110 -1.3036675 -1.1423630 -1.2477850 1.4879883 -1.8809823 -0.4965445 -0.4768076 -0.2761475 -0.6764329 -0.2880174 -0.4249205 -0.6299238 -1.2111196 0.0765674 -0.8461136 0.8311461 -1.1468930
Jaylen Adams PG -0.3379786 -0.7288632 -1.0003568 -0.7386219 -0.9411901 -0.8382398 -0.8391180 -0.2334940 -0.1742844 0.1643831 -1.0285767 -0.9999693 -1.1185725 -0.3490008 -0.8996097 -0.9297611 0.2673873 -0.6764329 -0.7157392 -0.7062330 0.0398163 -0.4983410 -0.6978572 -0.3118452 -0.5230308 -0.8895237
Steven Adams C 1.4122658 2.3262009 -0.2759252 1.5663668 1.3740455 0.7313152 1.4336686 -1.1209143 -1.2426251 -2.2561101 2.1574396 1.7879565 0.8831981 0.8021014 0.3901990 1.2702988 -1.6698266 5.2889251 0.9951478 2.3882051 -0.1428400 2.1151806 1.6254167 0.8902587 1.0773601 0.9463776
Bam Adebayo C 1.4883634 0.3152727 -1.2418341 0.4471175 0.1455531 -0.2240661 1.2609368 -1.1209143 -1.1455032 -0.8238656 0.6782177 0.4249706 0.8150527 0.6666776 0.5514251 0.6879300 -0.0322537 1.5281559 1.3694043 1.5040799 0.2224727 0.6896234 1.1091336 0.6231245 0.9542531 0.0884798
Deng Adel SF -0.9087105 -0.6515198 -1.2418341 -1.0045821 -1.1774386 -1.1339530 -1.1936727 -0.7405913 -0.6598938 -0.3870310 -1.0854698 -1.1238771 -0.9141364 -0.9414798 -0.8996097 -0.9944688 1.8143710 -0.8061146 -1.0365305 -1.0277331 -0.9343511 -1.2111196 -0.4397157 -0.9796807 -1.2616728 -1.1468930

Let’s do a quick check on the variance of the field goal column, ‘fg’, to verify that the scaling worked.
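
One way to check: var() on a one-column data frame returns its 1 x 1 covariance matrix, which should be exactly 1 after scaling.

var(standardized.PerGame['fg'])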

##    fg
## fg  1

We must create training and test sets to run the KNN algorithm, sampling rows at random from the standardized.PerGame data frame. All rows that contain an NA value must be removed first, since the knn() function will not work with missing values.

Training and Test Set

We will split the data into a training set (70%) and a test set (30%) for our KNN model, using the caTools package.
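
A sketch using caTools::sample.split(); the seed and the train/test object names are our own:

library(caTools)

# knn() cannot handle missing values, so drop incomplete rows first
complete.PerGame <- na.omit(standardized.PerGame)

set.seed(101)  # for reproducibility
split <- sample.split(complete.PerGame$pos, SplitRatio = 0.70)
train <- subset(complete.PerGame, split == TRUE)
test  <- subset(complete.PerGame, split == FALSE)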

KNN Model

Now we can construct our K Nearest Neighbour model, using the knn() function to predict the position of each player in our test set. For simplicity’s sake we will use k = 1.
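
A sketch with class::knn(), feeding it only the numeric feature columns and using ‘pos’ as the class label:

library(class)

num.cols <- sapply(train, is.numeric)
predicted.pos <- knn(train = train[num.cols],
                     test  = test[num.cols],
                     cl    = train$pos,
                     k     = 1)
predicted.pos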

##   [1] SG    C     SF    PG    PF    C     PF    C     PG    PF    PG    SG   
##  [13] SG    SF    PG    SG    PF    PG    C     C     SG    SG    PF    SG   
##  [25] SG    SG    SF    SG    SG    PG    PF    SG    SF    SF    SG    PG   
##  [37] SG    SG    PF    PG    PF    PG    PF    SG    C     PG    SF    PF   
##  [49] SF    SF    PG    SF    C     PF    SG    SG    SG    C     SF    PG   
##  [61] C     PF    PG    SF    SG    SG    PF    C     PF    SF    SG    SG   
##  [73] PF    SG    PF    SG    SG    SG    C     PG    SF    C     SF    SG   
##  [85] SG    SG    SF    SF    SF    SG    SG    SG    SG    PF-SF SG    SF   
##  [97] PF    PF    SG    PG    C     C     SG    SG-PF SF    SF    SF    SG   
## [109] PF    SG    SG    SF    PF    SG    C     PF    PG    SG    SG    SG   
## [121] SG    SF    SF    SG    SG    SF    PF    SG    C     C     SG    SG   
## [133] PF    SF    PG    SG    PG    SF    PG    SF    PF    C     SF    C    
## [145] PF    PF    PF    PG    PG    SF    PF    C     PG    SF    SF    SF   
## [157] SF    PF    PG    SG    PF    PF    SF    SF    SG    SG    SG    PF   
## [169] C     PG    SF    SF    PF    PG    SG    C     PG    SG    PF    SG   
## [181] PG    PF    PG    SG    C     PF    PF    C    
## Levels: C C-PF PF PF-C PF-SF PG SF SF-SG SG SG-PF SG-SF

The positions generated were as follows:

• Centre

• Centre/Power Forward

• Power Forward

• Power Forward/Centre

• Power Forward/Small Forward

• Point Guard

• Small Forward

• Small Forward/Shooting Guard

• Shooting Guard

• Shooting Guard/Power Forward

• Shooting Guard/Small Forward

Misclassification Rate

Now we can calculate the misclassification rate of our model: the proportion of test players whose predicted position differs from their actual one.
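
In code this is a single comparison across the test set:

# Proportion of test players whose predicted position is wrong
mean(predicted.pos != test$pos)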

## [1] 0.4946809

This value is fairly high, which is to be expected with k = 1 and this many observations in our data frame.

Choosing a more appropriate K value is key to the viability of our model. Charting the misclassification rate across a range of k values will help visualize the right choice of K; the k values here run from 1 to 10. This process can be automated with a for() loop.
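
A sketch of the loop, with a ggplot2 chart of the resulting error rates (object names as above):

library(ggplot2)

error.rate <- numeric(10)
for (i in 1:10) {
  pred <- knn(train[num.cols], test[num.cols], cl = train$pos, k = i)
  error.rate[i] <- mean(pred != test$pos)
}
error.rate

error.df <- data.frame(error.rate, k.values = 1:10)
ggplot(error.df, aes(x = k.values, y = error.rate)) +
  geom_point() +
  geom_line(lty = "dotted", color = "red")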

##  [1] 0.4361702 0.5159574 0.4521277 0.4521277 0.4946809 0.4893617 0.4840426
##  [8] 0.4414894 0.4468085 0.4627660
##    error.rate k.values
## 1   0.4361702        1
## 2   0.5159574        2
## 3   0.4521277        3
## 4   0.4521277        4
## 5   0.4946809        5
## 6   0.4893617        6
## 7   0.4840426        7
## 8   0.4414894        8
## 9   0.4468085        9
## 10  0.4627660       10
## 11  0.4361702       11
## 12  0.5159574       12
## 13  0.4521277       13
## 14  0.4521277       14
## 15  0.4946809       15
## 16  0.4893617       16
## 17  0.4840426       17
## 18  0.4414894       18
## 19  0.4468085       19
## 20  0.4627660       20

Here we can clearly see that increasing K beyond 11 does not improve our misclassification rate, so we can set K = 11 for our model during training.