Introduction

Using the R programming language, this project applies the K-Nearest-Neighbours (KNN) algorithm to predict the position of NBA players based on their regular season statistics.

The k-nearest neighbours algorithm predicts unknown values by matching them with the most similar known values. Since the KNN model is a straightforward, memory-based approach, it cannot be summarized in closed form: the training samples are required at run time, and predictions are made from the relationships among samples. The value of \(k\) in this algorithm can be any positive integer less than the number of rows in the data frame. The aim is to examine a small number of neighbours, because the less similar the neighbours are to the point being predicted, the worse the prediction will be.

In order to find the most similar NBA players we will use Euclidean distance, which simply measures the straight-line distance between two players in feature space, using data from basketballreference.com.
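
For two players represented by feature vectors \(p = (p_1, \ldots, p_n)\) and \(q = (q_1, \ldots, q_n)\), this distance is

\[
d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
\]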

NBA Statistics in R

We use the per game statistics from the 2018-2019 NBA regular season, as tracked and documented for each player throughout the season.

The statistics include:

• Games

• Games started

• Minutes played

• 2pt & 3pt field goals/attempts

• Free throws/attempts

• Assists, turnovers

• Offensive, defensive, and total rebounds

• Fouls

• Blocks

• Steals

• Total points

In order to view the per game statistics of each player in the 2019 regular season, we load the data and inspect the first few rows.
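
One way to obtain this data frame is sketched below, assuming the ballr package (which scrapes per game tables from basketball-reference.com); the object name PerGame is our own:

library(ballr)

# Per game statistics for the 2018-2019 regular season
PerGame <- NBAPerGameStatistics(season = 2019)
head(PerGame)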

rk player pos age tm g gs mp fg fga fgpercent x3p x3pa x3ppercent x2p x2pa x2ppercent efgpercent ft fta ftpercent orb drb trb ast stl blk tov pf pts link
1 Álex Abrines SG 25 OKC 31 2 19.0 1.8 5.1 0.357 1.3 4.1 0.323 0.5 1.0 0.500 0.487 0.4 0.4 0.923 0.2 1.4 1.5 0.6 0.5 0.2 0.5 1.7 5.3 /players/a/abrinal01.html
2 Quincy Acy PF 28 PHO 10 0 12.3 0.4 1.8 0.222 0.2 1.5 0.133 0.2 0.3 0.667 0.278 0.7 1.0 0.700 0.3 2.2 2.5 0.8 0.1 0.4 0.4 2.4 1.7 /players/a/acyqu01.html
3 Jaylen Adams PG 22 ATL 34 1 12.6 1.1 3.2 0.345 0.7 2.2 0.338 0.4 1.1 0.361 0.459 0.2 0.3 0.778 0.3 1.4 1.8 1.9 0.4 0.1 0.8 1.3 3.2 /players/a/adamsja01.html
4 Steven Adams C 25 OKC 80 80 33.4 6.0 10.1 0.595 0.0 0.0 0.000 6.0 10.1 0.596 0.595 1.8 3.7 0.500 4.9 4.6 9.5 1.6 1.5 1.0 1.7 2.6 13.9 /players/a/adamsst01.html
5 Bam Adebayo C 21 MIA 82 28 23.3 3.4 5.9 0.576 0.0 0.2 0.200 3.4 5.7 0.588 0.579 2.0 2.8 0.735 2.0 5.3 7.3 2.2 0.9 0.8 1.5 2.5 8.9 /players/a/adebaba01.html
6 Deng Adel SF 21 CLE 19 3 10.2 0.6 1.9 0.306 0.3 1.2 0.261 0.3 0.7 0.385 0.389 0.2 0.2 1.000 0.2 0.8 1.0 0.3 0.1 0.2 0.3 0.7 1.7 /players/a/adelde01.html

Here are the column names of our data frame:

##  [1] "rk"         "player"     "pos"        "age"        "tm"        
##  [6] "g"          "gs"         "mp"         "fg"         "fga"       
## [11] "fgpercent"  "x3p"        "x3pa"       "x3ppercent" "x2p"       
## [16] "x2pa"       "x2ppercent" "efgpercent" "ft"         "fta"       
## [21] "ftpercent"  "orb"        "drb"        "trb"        "ast"       
## [26] "stl"        "blk"        "tov"        "pf"         "pts"       
## [31] "link"

Now we will create a data frame containing only the player’s name, their position, and the numeric per game values, using the following columns:

“player”, “pos”, “g”, “gs”, “age”, “mp”, “fg”, “fga”, “fgpercent”, “x3p”, “x3pa”, “x3ppercent”, “x2p”, “x2pa”, “x2ppercent”, “efgpercent”, “ft”, “fta”, “ftpercent”, “orb”, “drb”, “trb”, “ast”, “stl”, “blk”, “tov”, “pf”, “pts”
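
One way to build this subset is sketched below, assuming dplyr and the PerGame data frame from earlier; the name PerGame2 is our own:

library(dplyr)

# Keep the player's name, position, and the per game numeric columns
PerGame2 <- PerGame %>%
  select(player, pos, g, gs, age, mp, fg, fga, fgpercent,
         x3p, x3pa, x3ppercent, x2p, x2pa, x2ppercent, efgpercent,
         ft, fta, ftpercent, orb, drb, trb, ast, stl, blk, tov, pf, pts)

head(PerGame2)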

player pos g gs age mp fg fga fgpercent x3p x3pa x3ppercent x2p x2pa x2ppercent efgpercent ft fta ftpercent orb drb trb ast stl blk tov pf pts
Álex Abrines SG 31 2 25 19.0 1.8 5.1 0.357 1.3 4.1 0.323 0.5 1.0 0.500 0.487 0.4 0.4 0.923 0.2 1.4 1.5 0.6 0.5 0.2 0.5 1.7 5.3
Quincy Acy PF 10 0 28 12.3 0.4 1.8 0.222 0.2 1.5 0.133 0.2 0.3 0.667 0.278 0.7 1.0 0.700 0.3 2.2 2.5 0.8 0.1 0.4 0.4 2.4 1.7
Jaylen Adams PG 34 1 22 12.6 1.1 3.2 0.345 0.7 2.2 0.338 0.4 1.1 0.361 0.459 0.2 0.3 0.778 0.3 1.4 1.8 1.9 0.4 0.1 0.8 1.3 3.2
Steven Adams C 80 80 25 33.4 6.0 10.1 0.595 0.0 0.0 0.000 6.0 10.1 0.596 0.595 1.8 3.7 0.500 4.9 4.6 9.5 1.6 1.5 1.0 1.7 2.6 13.9
Bam Adebayo C 82 28 21 23.3 3.4 5.9 0.576 0.0 0.2 0.200 3.4 5.7 0.588 0.579 2.0 2.8 0.735 2.0 5.3 7.3 2.2 0.9 0.8 1.5 2.5 8.9
Deng Adel SF 19 3 21 10.2 0.6 1.9 0.306 0.3 1.2 0.261 0.3 0.7 0.385 0.389 0.2 0.2 1.000 0.2 0.8 1.0 0.3 0.1 0.2 0.3 0.7 1.7

Here is the row of regular season statistics for the 2019 NBA MVP, Giannis Antetokounmpo.
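
That row can be looked up by player name, for example:

# Pull the MVP's row from our subset
PerGame2[PerGame2$player == "Giannis Antetokounmpo", ]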

player pos g gs age mp fg fga fgpercent x3p x3pa x3ppercent x2p x2pa x2ppercent efgpercent ft fta ftpercent orb drb trb ast stl blk tov pf pts
Giannis Antetokounmpo PF 72 72 24 32.8 10 17.3 0.578 0.7 2.8 0.256 9.3 14.5 0.641 0.599 6.9 9.5 0.729 2.2 10.3 12.5 5.9 1.3 1.5 3.7 3.2 27.7

Now we must standardize each column, because variables on larger scales have a greater effect on the distance between observations, and hence on the overall KNN classifier, than variables on smaller scales. After standardizing, each variable will have a mean of 0 and a standard deviation of 1.

Since we are trying to classify players by their position, we will save the ‘pos’ column separately; it serves as the class label for the KNN model.

Standardizing Data

Using the scale() function in R, we will now standardize the rest of the PerGame dataset.
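
A minimal sketch, reusing PerGame2 from above (num.cols is our own helper; standardized.PerGame is the data frame used in the rest of this section):

# Standardize only the numeric columns, keeping player and pos as-is
num.cols <- sapply(PerGame2, is.numeric)
standardized.PerGame <- PerGame2
standardized.PerGame[num.cols] <- scale(PerGame2[num.cols])

head(standardized.PerGame)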

player pos g gs age mp fg fga fgpercent x3p x3pa x3ppercent x2p x2pa x2ppercent efgpercent ft fta ftpercent orb drb trb ast stl blk tov pf pts
Álex Abrines SG -0.4521250 -0.6901915 -0.2759252 -0.0293946 -0.6104422 -0.4060435 -0.7300242 0.5271520 0.7483735 0.0569648 -0.9716835 -1.0309463 0.0654535 -0.1120092 -0.7383836 -0.8650535 1.2778046 -0.8061146 -0.7157392 -0.8267956 -0.7516947 -0.2607481 -0.4397157 -0.7125465 -0.0306029 -0.5292066
Quincy Acy PF -1.2511496 -0.7675349 0.4485064 -0.7718669 -1.2719380 -1.1567002 -1.9573290 -0.8673656 -0.5142110 -1.3036675 -1.1423630 -1.2477850 1.4879883 -1.8809823 -0.4965445 -0.4768076 -0.2761475 -0.6764329 -0.2880174 -0.4249205 -0.6299238 -1.2111196 0.0765674 -0.8461136 0.8311461 -1.1468930
Jaylen Adams PG -0.3379786 -0.7288632 -1.0003568 -0.7386219 -0.9411901 -0.8382398 -0.8391180 -0.2334940 -0.1742844 0.1643831 -1.0285767 -0.9999693 -1.1185725 -0.3490008 -0.8996097 -0.9297611 0.2673873 -0.6764329 -0.7157392 -0.7062330 0.0398163 -0.4983410 -0.6978572 -0.3118452 -0.5230308 -0.8895237
Steven Adams C 1.4122658 2.3262009 -0.2759252 1.5663668 1.3740455 0.7313152 1.4336686 -1.1209143 -1.2426251 -2.2561101 2.1574396 1.7879565 0.8831981 0.8021014 0.3901990 1.2702988 -1.6698266 5.2889251 0.9951478 2.3882051 -0.1428400 2.1151806 1.6254167 0.8902587 1.0773601 0.9463776
Bam Adebayo C 1.4883634 0.3152727 -1.2418341 0.4471175 0.1455531 -0.2240661 1.2609368 -1.1209143 -1.1455032 -0.8238656 0.6782177 0.4249706 0.8150527 0.6666776 0.5514251 0.6879300 -0.0322537 1.5281559 1.3694043 1.5040799 0.2224727 0.6896234 1.1091336 0.6231245 0.9542531 0.0884798
Deng Adel SF -0.9087105 -0.6515198 -1.2418341 -1.0045821 -1.1774386 -1.1339530 -1.1936727 -0.7405913 -0.6598938 -0.3870310 -1.0854698 -1.1238771 -0.9141364 -0.9414798 -0.8996097 -0.9944688 1.8143710 -0.8061146 -1.0365305 -1.0277331 -0.9343511 -1.2111196 -0.4397157 -0.9796807 -1.2616728 -1.1468930

Let’s do a quick check on the variance of the field goal column, ‘fg’, to verify that the scaling worked.
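
One way to check: var() on a one-column data frame returns its 1 x 1 covariance matrix, which should be exactly 1 after scaling.

var(standardized.PerGame['fg'])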

##    fg
## fg  1

We must create training and test sets to run the KNN algorithm, sampling rows at random from the standardized.PerGame data frame. All rows that contain an NA value must be removed first, since the knn() function will not work with missing values.

Training and Test Set

We will split the data into a training set (70%) and a test set (30%) for our KNN model, using the caTools package.
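
A sketch using caTools::sample.split(); the seed and the train/test object names are our own:

library(caTools)

# knn() cannot handle missing values, so drop incomplete rows first
complete.PerGame <- na.omit(standardized.PerGame)

set.seed(101)  # for reproducibility
split <- sample.split(complete.PerGame$pos, SplitRatio = 0.70)
train <- subset(complete.PerGame, split == TRUE)
test  <- subset(complete.PerGame, split == FALSE)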

KNN Model

Now we can construct our K Nearest Neighbour model, using the knn() function to predict the position of each player in our test set. For simplicity’s sake we will use k = 1.
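
A sketch with class::knn(), feeding it only the numeric feature columns and using ‘pos’ as the class label:

library(class)

num.cols <- sapply(train, is.numeric)
predicted.pos <- knn(train = train[num.cols],
                     test  = test[num.cols],
                     cl    = train$pos,
                     k     = 1)
predicted.pos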

##   [1] SG    C     SF    PG    PF    C     PF    C     PG    PF    PG    SG   
##  [13] SG    SF    PG    SG    PF    PG    C     C     SG    SG    PF    SG   
##  [25] SG    SG    SF    SG    SG    PG    PF    SG    SF    SF    SG    PG   
##  [37] SG    SG    PF    PG    PF    PG    PF    SG    C     PG    SF    PF   
##  [49] SF    SF    PG    SF    C     PF    SG    SG    SG    C     SF    PG   
##  [61] C     PF    PG    SF    SG    SG    PF    C     PF    SF    SG    SG   
##  [73] PF    SG    PF    SG    SG    SG    C     PG    SF    C     SF    SG   
##  [85] SG    SG    SF    SF    SF    SG    SG    SG    SG    PF-SF SG    SF   
##  [97] PF    PF    SG    PG    C     C     SG    SG-PF SF    SF    SF    SG   
## [109] PF    SG    SG    SF    PF    SG    C     PF    PG    SG    SG    SG   
## [121] SG    SF    SF    SG    SG    SF    PF    SG    C     C     SG    SG   
## [133] PF    SF    PG    SG    PG    SF    PG    SF    PF    C     SF    C    
## [145] PF    PF    PF    PG    PG    SF    PF    C     PG    SF    SF    SF   
## [157] SF    PF    PG    SG    PF    PF    SF    SF    SG    SG    SG    PF   
## [169] C     PG    SF    SF    PF    PG    SG    C     PG    SG    PF    SG   
## [181] PG    PF    PG    SG    C     PF    PF    C    
## Levels: C C-PF PF PF-C PF-SF PG SF SF-SG SG SG-PF SG-SF

The positions generated were as follows:

• Centre

• Centre/Power Forward

• Power Forward

• Power Forward/Centre

• Power Forward/Small Forward

• Point Guard

• Small Forward

• Small Forward/Shooting Guard

• Shooting Guard

• Shooting Guard/Power Forward

• Shooting Guard/Small Forward

Misclassification Rate

Now we can calculate the misclassification rate of our model: the proportion of test players whose predicted position differs from their actual one.
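
In code this is a single comparison across the test set:

# Proportion of test players whose predicted position is wrong
mean(predicted.pos != test$pos)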

## [1] 0.4946809

This value is fairly high, which is to be expected with k = 1 and this many observations in our data frame.

Choosing a more appropriate K value is key to the viability of our model. Charting the misclassification rate across a range of k values will help visualize the right choice of K; the k values here run from 1 to 10. This process can be automated with a for() loop.
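
A sketch of the loop, with a ggplot2 chart of the resulting error rates (object names as above):

library(ggplot2)

error.rate <- numeric(10)
for (i in 1:10) {
  pred <- knn(train[num.cols], test[num.cols], cl = train$pos, k = i)
  error.rate[i] <- mean(pred != test$pos)
}
error.rate

error.df <- data.frame(error.rate, k.values = 1:10)
ggplot(error.df, aes(x = k.values, y = error.rate)) +
  geom_point() +
  geom_line(lty = "dotted", color = "red")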

##  [1] 0.4361702 0.5159574 0.4521277 0.4521277 0.4946809 0.4893617 0.4840426
##  [8] 0.4414894 0.4468085 0.4627660
##    error.rate k.values
## 1   0.4361702        1
## 2   0.5159574        2
## 3   0.4521277        3
## 4   0.4521277        4
## 5   0.4946809        5
## 6   0.4893617        6
## 7   0.4840426        7
## 8   0.4414894        8
## 9   0.4468085        9
## 10  0.4627660       10
## 11  0.4361702       11
## 12  0.5159574       12
## 13  0.4521277       13
## 14  0.4521277       14
## 15  0.4946809       15
## 16  0.4893617       16
## 17  0.4840426       17
## 18  0.4414894       18
## 19  0.4468085       19
## 20  0.4627660       20

Here we can clearly see that increasing K beyond 11 does not improve our misclassification rate, so we can set K = 11 for our model during training.