This project uses R and the k-nearest-neighbours (KNN) algorithm to predict the position of NBA players from their regular-season statistics.
The k-nearest-neighbours algorithm predicts an unknown value by matching it with the most similar known values. Since KNN is a straightforward, memory-based approach, it cannot be summarized in a closed-form model: the training samples are required at run time, and predictions are made from the relationships among samples. The value of \(k\) can be any positive integer up to the number of rows in the training data. The aim is to examine a small number of neighbours, because the less similar the neighbours are to the observation being predicted, the worse the prediction will be.
To find the most similar NBA players we will use Euclidean distance, which simply measures the straight-line distance between two players' statistics. The data come from basketballreference.com.
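As a minimal sketch of the distance calculation, the snippet below compares two players on six per-game statistics. The values are those of Álex Abrines and Quincy Acy from the 2018-2019 per-game data used in this report; the vector names and the helper `euclidean()` are assumptions made for illustration, not the report's own code.

```r
# Per-game values for two players (mp, fg, fga, trb, ast, pts)
abrines <- c(mp = 19.0, fg = 1.8, fga = 5.1, trb = 1.5, ast = 0.6, pts = 5.3)
acy     <- c(mp = 12.3, fg = 0.4, fga = 1.8, trb = 2.5, ast = 0.8, pts = 1.7)

# Hypothetical helper: square root of the summed squared differences
euclidean <- function(a, b) sqrt(sum((a - b)^2))

d_manual <- euclidean(abrines, acy)

# The same distance via base R's dist(), which compares the rows of a matrix
d_builtin <- as.numeric(dist(rbind(abrines, acy), method = "euclidean"))

print(d_manual)
print(d_builtin)
```

Both calls should agree; `dist()` is convenient when many players are compared at once, since it returns all pairwise distances.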
We load the per-game statistics from the 2018-2019 NBA regular season, which were tracked and recorded for each player throughout the season.
The statistics include:
• Games
• Games started
• Minutes played
• 2pt & 3pt field goals/attempts
• Free throws/attempts
• Assists, turnovers
• Offensive, defensive and total rebounds
• Fouls
• Blocks
• Steals
• Total points
To view the per-game statistics of each NBA player for the 2019 regular season, we call the data-loading function. The first rows of the result look like this:
| rk | player | pos | age | tm | g | gs | mp | fg | fga | fgpercent | x3p | x3pa | x3ppercent | x2p | x2pa | x2ppercent | efgpercent | ft | fta | ftpercent | orb | drb | trb | ast | stl | blk | tov | pf | pts | link |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Álex Abrines | SG | 25 | OKC | 31 | 2 | 19.0 | 1.8 | 5.1 | 0.357 | 1.3 | 4.1 | 0.323 | 0.5 | 1.0 | 0.500 | 0.487 | 0.4 | 0.4 | 0.923 | 0.2 | 1.4 | 1.5 | 0.6 | 0.5 | 0.2 | 0.5 | 1.7 | 5.3 | /players/a/abrinal01.html |
| 2 | Quincy Acy | PF | 28 | PHO | 10 | 0 | 12.3 | 0.4 | 1.8 | 0.222 | 0.2 | 1.5 | 0.133 | 0.2 | 0.3 | 0.667 | 0.278 | 0.7 | 1.0 | 0.700 | 0.3 | 2.2 | 2.5 | 0.8 | 0.1 | 0.4 | 0.4 | 2.4 | 1.7 | /players/a/acyqu01.html |
| 3 | Jaylen Adams | PG | 22 | ATL | 34 | 1 | 12.6 | 1.1 | 3.2 | 0.345 | 0.7 | 2.2 | 0.338 | 0.4 | 1.1 | 0.361 | 0.459 | 0.2 | 0.3 | 0.778 | 0.3 | 1.4 | 1.8 | 1.9 | 0.4 | 0.1 | 0.8 | 1.3 | 3.2 | /players/a/adamsja01.html |
| 4 | Steven Adams | C | 25 | OKC | 80 | 80 | 33.4 | 6.0 | 10.1 | 0.595 | 0.0 | 0.0 | 0.000 | 6.0 | 10.1 | 0.596 | 0.595 | 1.8 | 3.7 | 0.500 | 4.9 | 4.6 | 9.5 | 1.6 | 1.5 | 1.0 | 1.7 | 2.6 | 13.9 | /players/a/adamsst01.html |
| 5 | Bam Adebayo | C | 21 | MIA | 82 | 28 | 23.3 | 3.4 | 5.9 | 0.576 | 0.0 | 0.2 | 0.200 | 3.4 | 5.7 | 0.588 | 0.579 | 2.0 | 2.8 | 0.735 | 2.0 | 5.3 | 7.3 | 2.2 | 0.9 | 0.8 | 1.5 | 2.5 | 8.9 | /players/a/adebaba01.html |
| 6 | Deng Adel | SF | 21 | CLE | 19 | 3 | 10.2 | 0.6 | 1.9 | 0.306 | 0.3 | 1.2 | 0.261 | 0.3 | 0.7 | 0.385 | 0.389 | 0.2 | 0.2 | 1.000 | 0.2 | 0.8 | 1.0 | 0.3 | 0.1 | 0.2 | 0.3 | 0.7 | 1.7 | /players/a/adelde01.html |
Here are the column names of our data frame:
## [1] "rk" "player" "pos" "age" "tm"
## [6] "g" "gs" "mp" "fg" "fga"
## [11] "fgpercent" "x3p" "x3pa" "x3ppercent" "x2p"
## [16] "x2pa" "x2ppercent" "efgpercent" "ft" "fta"
## [21] "ftpercent" "orb" "drb" "trb" "ast"
## [26] "stl" "blk" "tov" "pf" "pts"
## [31] "link"
Now we will create a data frame containing only the numeric columns plus each player’s name and position, using the following columns:
• “player”
• “pos”
• “g”
• “gs”
• “age”
• “mp”
• “fg”
• “fga”
• “fgpercent”
• “x3p”
• “x3pa”
• “x3ppercent”
• “x2p”
• “x2pa”
• “x2ppercent”
• “efgpercent”
• “ft”
• “fta”
• “ftpercent”
• “orb”
• “drb”
• “trb”
• “ast”
• “stl”
• “blk”
• “tov”
• “pf”
• “pts”
• “tpts”
| player | pos | g | gs | age | mp | fg | fga | fgpercent | x3p | x3pa | x3ppercent | x2p | x2pa | x2ppercent | efgpercent | ft | fta | ftpercent | orb | drb | trb | ast | stl | blk | tov | pf | pts |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Álex Abrines | SG | 31 | 2 | 25 | 19.0 | 1.8 | 5.1 | 0.357 | 1.3 | 4.1 | 0.323 | 0.5 | 1.0 | 0.500 | 0.487 | 0.4 | 0.4 | 0.923 | 0.2 | 1.4 | 1.5 | 0.6 | 0.5 | 0.2 | 0.5 | 1.7 | 5.3 |
| Quincy Acy | PF | 10 | 0 | 28 | 12.3 | 0.4 | 1.8 | 0.222 | 0.2 | 1.5 | 0.133 | 0.2 | 0.3 | 0.667 | 0.278 | 0.7 | 1.0 | 0.700 | 0.3 | 2.2 | 2.5 | 0.8 | 0.1 | 0.4 | 0.4 | 2.4 | 1.7 |
| Jaylen Adams | PG | 34 | 1 | 22 | 12.6 | 1.1 | 3.2 | 0.345 | 0.7 | 2.2 | 0.338 | 0.4 | 1.1 | 0.361 | 0.459 | 0.2 | 0.3 | 0.778 | 0.3 | 1.4 | 1.8 | 1.9 | 0.4 | 0.1 | 0.8 | 1.3 | 3.2 |
| Steven Adams | C | 80 | 80 | 25 | 33.4 | 6.0 | 10.1 | 0.595 | 0.0 | 0.0 | 0.000 | 6.0 | 10.1 | 0.596 | 0.595 | 1.8 | 3.7 | 0.500 | 4.9 | 4.6 | 9.5 | 1.6 | 1.5 | 1.0 | 1.7 | 2.6 | 13.9 |
| Bam Adebayo | C | 82 | 28 | 21 | 23.3 | 3.4 | 5.9 | 0.576 | 0.0 | 0.2 | 0.200 | 3.4 | 5.7 | 0.588 | 0.579 | 2.0 | 2.8 | 0.735 | 2.0 | 5.3 | 7.3 | 2.2 | 0.9 | 0.8 | 1.5 | 2.5 | 8.9 |
| Deng Adel | SF | 19 | 3 | 21 | 10.2 | 0.6 | 1.9 | 0.306 | 0.3 | 1.2 | 0.261 | 0.3 | 0.7 | 0.385 | 0.389 | 0.2 | 0.2 | 1.000 | 0.2 | 0.8 | 1.0 | 0.3 | 0.1 | 0.2 | 0.3 | 0.7 | 1.7 |
Here is the row of regular-season statistics for the 2019 NBA MVP, Giannis Antetokounmpo.
| player | pos | g | gs | age | mp | fg | fga | fgpercent | x3p | x3pa | x3ppercent | x2p | x2pa | x2ppercent | efgpercent | ft | fta | ftpercent | orb | drb | trb | ast | stl | blk | tov | pf | pts |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Giannis Antetokounmpo | PF | 72 | 72 | 24 | 32.8 | 10 | 17.3 | 0.578 | 0.7 | 2.8 | 0.256 | 9.3 | 14.5 | 0.641 | 0.599 | 6.9 | 9.5 | 0.729 | 2.2 | 10.3 | 12.5 | 5.9 | 1.3 | 1.5 | 3.7 | 3.2 | 27.7 |
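The column selection and single-player lookup above might look roughly like the following sketch. The data frame name `per_game`, the reduced set of columns, and the three toy rows (values copied from the tables above) are assumptions for illustration; the report's own loading code is not shown.

```r
# Toy stand-in for the full per-game data frame (three rows, a few columns)
per_game <- data.frame(
  rk     = c(2, 4, 5),
  player = c("Quincy Acy", "Steven Adams", "Bam Adebayo"),
  pos    = c("PF", "C", "C"),
  age    = c(28, 25, 21),
  tm     = c("PHO", "OKC", "MIA"),
  pts    = c(1.7, 13.9, 8.9),
  link   = c("/players/a/acyqu01.html",
             "/players/a/adamsst01.html",
             "/players/a/adebaba01.html"),
  stringsAsFactors = FALSE
)

# Keep the name, the position label and the numeric columns; drop rk, tm and link
keep <- c("player", "pos", "age", "pts")
per_game_small <- per_game[, keep]

# Look up a single player's row by name
adams <- per_game_small[per_game_small$player == "Steven Adams", ]

print(per_game_small)
print(adams)
```

Selecting by column name rather than by position keeps the code valid if the source table gains or reorders columns.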
Now we must standardize each column, because variables measured on larger scales have a greater effect on the distance between observations, and hence on the KNN classifier, than variables measured on smaller scales. After standardization each variable has a mean of 0 and a standard deviation of 1.
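To illustrate why scale matters, the sketch below takes minutes played and field-goal percentage for Steven Adams and Quincy Acy (values from the table above) and shows how much of the squared Euclidean distance each variable contributes before any standardization. The variable names are assumptions for illustration.

```r
# Minutes played ranges over tens of minutes; field-goal percentage lies in 0-1
adams <- c(mp = 33.4, fgpercent = 0.595)
acy   <- c(mp = 12.3, fgpercent = 0.222)

# Squared difference contributed by each variable
sq_diff <- (adams - acy)^2

# Share of the total squared distance attributable to each variable
share <- sq_diff / sum(sq_diff)

print(round(share, 4))
```

On the raw scale, minutes played accounts for nearly all of the distance, so shooting accuracy would barely influence which players count as neighbours.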
Since we are trying to classify players by position, we will save the ‘pos’ column separately as the label.
Using the scale function in R, we will now standardize the rest of the PerGame dataset:
| player | pos | g | gs | age | mp | fg | fga | fgpercent | x3p | x3pa | x3ppercent | x2p | x2pa | x2ppercent | efgpercent | ft | fta | ftpercent | orb | drb | trb | ast | stl | blk | tov | pf | pts |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Álex Abrines | SG | -0.4521250 | -0.6901915 | -0.2759252 | -0.0293946 | -0.6104422 | -0.4060435 | -0.7300242 | 0.5271520 | 0.7483735 | 0.0569648 | -0.9716835 | -1.0309463 | 0.0654535 | -0.1120092 | -0.7383836 | -0.8650535 | 1.2778046 | -0.8061146 | -0.7157392 | -0.8267956 | -0.7516947 | -0.2607481 | -0.4397157 | -0.7125465 | -0.0306029 | -0.5292066 |
| Quincy Acy | PF | -1.2511496 | -0.7675349 | 0.4485064 | -0.7718669 | -1.2719380 | -1.1567002 | -1.9573290 | -0.8673656 | -0.5142110 | -1.3036675 | -1.1423630 | -1.2477850 | 1.4879883 | -1.8809823 | -0.4965445 | -0.4768076 | -0.2761475 | -0.6764329 | -0.2880174 | -0.4249205 | -0.6299238 | -1.2111196 | 0.0765674 | -0.8461136 | 0.8311461 | -1.1468930 |
| Jaylen Adams | PG | -0.3379786 | -0.7288632 | -1.0003568 | -0.7386219 | -0.9411901 | -0.8382398 | -0.8391180 | -0.2334940 | -0.1742844 | 0.1643831 | -1.0285767 | -0.9999693 | -1.1185725 | -0.3490008 | -0.8996097 | -0.9297611 | 0.2673873 | -0.6764329 | -0.7157392 | -0.7062330 | 0.0398163 | -0.4983410 | -0.6978572 | -0.3118452 | -0.5230308 | -0.8895237 |
| Steven Adams | C | 1.4122658 | 2.3262009 | -0.2759252 | 1.5663668 | 1.3740455 | 0.7313152 | 1.4336686 | -1.1209143 | -1.2426251 | -2.2561101 | 2.1574396 | 1.7879565 | 0.8831981 | 0.8021014 | 0.3901990 | 1.2702988 | -1.6698266 | 5.2889251 | 0.9951478 | 2.3882051 | -0.1428400 | 2.1151806 | 1.6254167 | 0.8902587 | 1.0773601 | 0.9463776 |
| Bam Adebayo | C | 1.4883634 | 0.3152727 | -1.2418341 | 0.4471175 | 0.1455531 | -0.2240661 | 1.2609368 | -1.1209143 | -1.1455032 | -0.8238656 | 0.6782177 | 0.4249706 | 0.8150527 | 0.6666776 | 0.5514251 | 0.6879300 | -0.0322537 | 1.5281559 | 1.3694043 | 1.5040799 | 0.2224727 | 0.6896234 | 1.1091336 | 0.6231245 | 0.9542531 | 0.0884798 |
| Deng Adel | SF | -0.9087105 | -0.6515198 | -1.2418341 | -1.0045821 | -1.1774386 | -1.1339530 | -1.1936727 | -0.7405913 | -0.6598938 | -0.3870310 | -1.0854698 | -1.1238771 | -0.9141364 | -0.9414798 | -0.8996097 | -0.9944688 | 1.8143710 | -0.8061146 | -1.0365305 | -1.0277331 | -0.9343511 | -1.2111196 | -0.4397157 | -0.9796807 | -1.2616728 | -1.1468930 |
Let’s do a quick check on the variance of the field goal column, ‘fg’, to verify that the scaling worked.
## fg
## fg 1
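The standardization and the variance check can be sketched as follows. The data here are randomly generated stand-ins, and the names `stats` and `standardized` are assumptions for illustration; only `scale()`, `var()` and `colMeans()` from base R are used.

```r
set.seed(1)

# Hypothetical numeric columns on very different scales
stats <- data.frame(
  mp  = runif(50, min = 5, max = 36),
  fg  = runif(50, min = 0, max = 10),
  pts = runif(50, min = 0, max = 30)
)

# scale() centres each column to mean 0 and rescales it to standard deviation 1
standardized <- as.data.frame(scale(stats))

# Column means should be numerically zero and the variance of 'fg' should be 1
print(round(colMeans(standardized), 10))
print(var(standardized$fg))
```

A variance of 1 for any scaled column confirms that the column was divided by its own standard deviation.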
We must create training and test sets to run the KNN algorithm. We sample rows at random from the standardized.PerGame data frame and use the shuffled values to pick the rows for each set. All rows containing an NA value must be removed first, because the knn function will not work with missing values.
We will create a training set (70%) and a test set (30%) for our KNN model using caTools.
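A minimal sketch of the NA removal and the 70/30 split is shown below. The report uses caTools for the split; this sketch swaps in base R's `sample()` so that it runs without extra packages. The data frame `players` and its contents are hypothetical.

```r
set.seed(42)

# Hypothetical standardized data with a position label and two features
players <- data.frame(
  pos = factor(sample(c("C", "PF", "PG", "SF", "SG"), 100, replace = TRUE)),
  pts = rnorm(100),
  trb = rnorm(100)
)

# Introduce two missing values, then drop every row that contains an NA
players$trb[c(3, 17)] <- NA
players <- na.omit(players)

# Randomly assign 70% of the remaining rows to training, the rest to testing
n         <- nrow(players)
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train     <- players[train_idx, ]
test      <- players[-train_idx, ]

print(c(total = n, train = nrow(train), test = nrow(test)))
```

Removing the NA rows before sampling guarantees that neither set contains missing values.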
Now we can construct our k-nearest-neighbours model, using the knn function to predict the position of each player in our test set. For simplicity's sake we will use k=1.
## [1] SG C SF PG PF C PF C PG PF PG SG
## [13] SG SF PG SG PF PG C C SG SG PF SG
## [25] SG SG SF SG SG PG PF SG SF SF SG PG
## [37] SG SG PF PG PF PG PF SG C PG SF PF
## [49] SF SF PG SF C PF SG SG SG C SF PG
## [61] C PF PG SF SG SG PF C PF SF SG SG
## [73] PF SG PF SG SG SG C PG SF C SF SG
## [85] SG SG SF SF SF SG SG SG SG PF-SF SG SF
## [97] PF PF SG PG C C SG SG-PF SF SF SF SG
## [109] PF SG SG SF PF SG C PF PG SG SG SG
## [121] SG SF SF SG SG SF PF SG C C SG SG
## [133] PF SF PG SG PG SF PG SF PF C SF C
## [145] PF PF PF PG PG SF PF C PG SF SF SF
## [157] SF PF PG SG PF PF SF SF SG SG SG PF
## [169] C PG SF SF PF PG SG C PG SG PF SG
## [181] PG PF PG SG C PF PF C
## Levels: C C-PF PF PF-C PF-SF PG SF SF-SG SG SG-PF SG-SF
The position levels listed in the output are as follows:
• Centre
• Centre/Power Forward
• Power Forward
• Power Forward/Centre
• Power Forward/Small Forward
• Point Guard
• Small Forward
• Small Forward/Shooting Guard
• Shooting Guard
• Shooting Guard/Power Forward
• Shooting Guard/Small Forward
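The model-fitting step described above can be sketched with `knn()` from the `class` package, assuming that is the knn function the report refers to. The two simulated groups, the feature names and the 70/30 split are hypothetical and chosen only so that the example is self-contained.

```r
library(class)  # provides knn(); a recommended package shipped with standard R installs

set.seed(101)
n <- 40

# Two hypothetical, well-separated groups on standardized features
guards  <- data.frame(trb = rnorm(n, mean = -1, sd = 0.3),
                      ast = rnorm(n, mean =  1, sd = 0.3))
centres <- data.frame(trb = rnorm(n, mean =  1, sd = 0.3),
                      ast = rnorm(n, mean = -1, sd = 0.3))
features <- rbind(guards, centres)
labels   <- factor(rep(c("PG", "C"), each = n))

# 70% of rows for training, the remainder for testing
train_idx <- sample(seq_len(2 * n), size = 56)

# knn() takes the training features, the test features and the training labels
predicted <- knn(train = features[train_idx, ],
                 test  = features[-train_idx, ],
                 cl    = labels[train_idx],
                 k     = 1)

print(predicted)
```

`knn()` returns a factor of predicted labels, one per test row, with the same levels as the training labels.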
Now we can calculate the misclassification rate of our model:
## [1] 0.4946809
This value is fairly high: almost half of the test-set players are misclassified. That is to be expected with k=1 and so many data points in our data frame.
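The misclassification rate is simply the fraction of test rows whose predicted label differs from the true label. The sketch below uses two short, made-up label vectors; the names `actual` and `predicted` are assumptions for illustration.

```r
# Hypothetical true and predicted positions for eight test players
actual    <- c("SG", "C", "SF", "PG", "PF", "C",  "PF", "C")
predicted <- c("SG", "C", "PF", "PG", "PF", "PF", "PF", "SG")

# Mean of a logical vector is the proportion of TRUE values, i.e. of mismatches
misclass_rate <- mean(predicted != actual)
print(misclass_rate)

# A confusion table shows which positions are mistaken for which
print(table(predicted, actual))
```

When the labels are stored as factors, converting both sides with `as.character()` before comparing avoids errors caused by differing level sets.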
Choosing a more appropriate K value is key to the viability of our model. Creating a chart of the misclassification rate for a range of K values will help visualize the right choice of K. The K values in the chart will range from 1 to 10. This process can be automated with a for() loop.
## [1] 0.4361702 0.5159574 0.4521277 0.4521277 0.4946809 0.4893617 0.4840426
## [8] 0.4414894 0.4468085 0.4627660
## error.rate k.values
## 1 0.4361702 1
## 2 0.5159574 2
## 3 0.4521277 3
## 4 0.4521277 4
## 5 0.4946809 5
## 6 0.4893617 6
## 7 0.4840426 7
## 8 0.4414894 8
## 9 0.4468085 9
## 10 0.4627660 10
Here we can see that increasing K does not steadily reduce the misclassification rate: across K=1 to K=10 it stays between roughly 0.44 and 0.52, with the lowest values at K=1 (0.436) and K=8 (0.441). We can therefore choose between these two values when setting K for our model.
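A loop over candidate K values might look like the sketch below, assuming `knn()` from the `class` package. The simulated features and labels are hypothetical, while the names `error.rate` and `k.values` follow the column names in the output above.

```r
library(class)

set.seed(7)
n <- 150

# Hypothetical two-class data with overlapping classes
features <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
labels   <- factor(ifelse(features$x1 + features$x2 + rnorm(n, sd = 0.8) > 0, "A", "B"))

train_idx <- sample(seq_len(n), size = 105)   # 70% training rows
truth     <- as.character(labels[-train_idx])

k.values   <- 1:10
error.rate <- numeric(length(k.values))       # one slot per candidate K

for (i in seq_along(k.values)) {
  pred <- knn(train = features[train_idx, ],
              test  = features[-train_idx, ],
              cl    = labels[train_idx],
              k     = k.values[i])
  error.rate[i] <- mean(as.character(pred) != truth)
}

k.df   <- data.frame(error.rate, k.values)
best_k <- k.values[which.min(error.rate)]

print(k.df)
print(best_k)
```

Keeping `error.rate` the same length as `k.values` matters: `data.frame()` silently recycles a shorter vector when the longer length is an exact multiple of it, which would repeat the first error rates against later K values.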