Clustering Basketball Players by Position

Introduction

This project performs K-means clustering on NBA data to identify position groups among players. K-means clustering is a straightforward unsupervised learning approach for partitioning a data set into \(K\) distinct, non-overlapping clusters based on similarity. To perform K-means clustering, we first need to choose the number of clusters \(K\). The K-means algorithm then assigns each observation to exactly one of the \(K\) clusters. Below is a brief outline of the process.

K-Means Clustering Procedure

1. Assign a number at random, from 1 to \(K\), to each of the observations. These are the initial cluster assignments for the observations.
2. Iterate until the cluster assignments stop changing:
   a. For each of the \(K\) clusters, compute the cluster centroid. The \(k^{th}\) cluster centroid is the vector of the feature means for the observations in the \(k^{th}\) cluster.
   b. Assign each observation to the cluster whose centroid is closest, where closeness is measured by Euclidean distance. Each pass reduces (or leaves unchanged) the total within-cluster variation, that is, the sum of squared Euclidean distances between the observations and their centroids.
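
The procedure above can be sketched in a few lines of base R. This is a minimal illustration, not the code used for the analysis: the function name simple_kmeans, its arguments, the toy data and the rule for handling an empty cluster are all assumptions made for the example.

```r
# Minimal K-means sketch. simple_kmeans and its arguments are illustrative
# names, not part of any package.
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Step 1: random initial labels from 1 to k. rep_len() guarantees that
  # every label is used at least once, so no cluster starts out empty.
  cluster <- sample(rep_len(seq_len(k), nrow(x)))
  centers <- NULL
  for (iter in seq_len(max_iter)) {
    # Step 2a: each centroid is the vector of feature means of its cluster.
    new_centers <- do.call(rbind, lapply(seq_len(k), function(j) {
      colMeans(x[cluster == j, , drop = FALSE])
    }))
    # Assumption: a cluster that lost all its points keeps its old centroid.
    empty <- !(seq_len(k) %in% cluster)
    if (any(empty)) new_centers[empty, ] <- centers[empty, ]
    centers <- new_centers
    # Step 2b: squared Euclidean distance from each observation to each
    # centroid, then reassign every observation to the nearest centroid.
    d <- vapply(seq_len(k), function(j) {
      rowSums(sweep(x, 2, centers[j, ])^2)
    }, numeric(nrow(x)))
    new_cluster <- max.col(-d, ties.method = "first")
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
  }
  list(cluster = cluster, centers = centers, iter = iter)
}

# Toy data (made up): two groups of 20 points in two dimensions.
set.seed(1)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 10), ncol = 2))
fit <- simple_kmeans(toy, k = 2)
table(fit$cluster)
```

In practice the built-in kmeans function is preferable, since it supports multiple random starts and more refined algorithms.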

To perform K-means clustering, we must decide how many clusters we expect in the data. Traditionally there are 5 positions on a basketball team:

• Point guard

• Shooting guard

• Small forward

• Power forward

• Center

It is natural to assume that there will be 5 cluster groups. However, the game of basketball has evolved over time into a more position-less game, so we will ultimately determine whether there are more or fewer well-defined cluster groups. K-means is an unsupervised learning algorithm, meaning that there is no pre-determined outcome; the algorithm just tries to find patterns in the data. We will describe each cluster by the mean statistics of the players assigned to it.

To best define a player, we will use per 100 possession stats across various categories, using data from the website basketball-reference.com.

NBA Statistics in R

Now let’s view the NBA statistics in our R console. Each player is described by statistics that have been tracked and documented throughout the season.

This includes:

• Games

• Games started

• Minutes

• 2pt & 3pt field goals/attempts/percentage

• Free throws/attempts/percentage

• Assists, turnovers

• Offensive/Defensive and Total Rebounds

• Fouls

• Blocks

• Steals

Below are some of the Per-100 possession statistics for each player during the 2018-2019 NBA regular season.

rk	player	pos	age	tm	g	gs	mp	fg	fga	fgpercent	x3p	x3pa	x3ppercent	x2p	x2pa	x2ppercent	ft	fta	ftpercent	orb	drb	trb	ast	stl	blk	tov	pf	pts	x	ortg	drtg	link
1	Álex Abrines	SG	25	OKC	31	2	588	4.4	12.5	0.357	3.3	10.1	0.323	1.2	2.4	0.500	1.0	1.0	0.923	0.4	3.4	3.8	1.6	1.3	0.5	1.1	4.2	13.1	NA	103	111	/players/a/abrinal01.html
2	Quincy Acy	PF	28	PHO	10	0	123	1.6	7.0	0.222	0.8	5.8	0.133	0.8	1.2	0.667	2.7	3.9	0.700	1.2	8.5	9.7	3.1	0.4	1.6	1.6	9.3	6.6	NA	87	116	/players/a/acyqu01.html
3	Jaylen Adams	PG	22	ATL	34	1	428	4.1	11.9	0.345	2.7	8.0	0.338	1.4	3.9	0.361	0.8	1.0	0.778	1.2	5.3	6.5	7.0	1.5	0.5	3.0	4.9	11.7	NA	99	115	/players/a/adamsja01.html
4	Steven Adams	C	25	OKC	80	80	2669	8.4	14.1	0.595	0.0	0.0	0.000	8.4	14.1	0.596	2.6	5.1	0.500	6.8	6.5	13.3	2.2	2.0	1.3	2.4	3.6	19.4	NA	120	106	/players/a/adamsst01.html
5	Bam Adebayo	C	21	MIA	82	28	1913	7.2	12.4	0.576	0.1	0.4	0.200	7.1	12.0	0.588	4.2	5.8	0.735	4.2	11.0	15.2	4.7	1.8	1.7	3.1	5.2	18.6	NA	120	104	/players/a/adebaba01.html
6	Deng Adel	SF	21	CLE	19	3	194	2.8	9.2	0.306	1.5	5.9	0.261	1.3	3.3	0.385	1.0	1.0	1.000	0.8	4.1	4.9	1.3	0.3	1.0	1.5	3.3	8.2	NA	85	121	/players/a/adelde01.html

Now let’s view the structure of this data set.

## 'data.frame':    708 obs. of  33 variables:
##  $ rk        : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ player    : chr  "Álex Abrines" "Quincy Acy" "Jaylen Adams" "Steven Adams" ...
##  $ pos       : chr  "SG" "PF" "PG" "C" ...
##  $ age       : num  25 28 22 25 21 21 25 33 21 23 ...
##  $ tm        : chr  "OKC" "PHO" "ATL" "OKC" ...
##  $ g         : num  31 10 34 80 82 19 7 81 10 38 ...
##  $ gs        : num  2 0 1 80 28 3 0 81 1 2 ...
##  $ mp        : num  588 123 428 2669 1913 ...
##  $ fg        : num  4.4 1.6 4.1 8.4 7.2 2.8 6.7 12.4 5.3 7.7 ...
##  $ fga       : num  12.5 7 11.9 14.1 12.4 9.2 22.3 24 15.8 20.5 ...
##  $ fgpercent : num  0.357 0.222 0.345 0.595 0.576 0.306 0.3 0.519 0.333 0.376 ...
##  $ x3p       : num  3.3 0.8 2.7 0 0.1 1.5 0 0.2 1.2 3.7 ...
##  $ x3pa      : num  10.1 5.8 8 0 0.4 5.9 8.9 0.8 4.8 11.4 ...
##  $ x3ppercent: num  0.323 0.133 0.338 0 0.2 0.261 0 0.238 0.25 0.323 ...
##  $ x2p       : num  1.2 0.8 1.4 8.4 7.1 1.3 6.7 12.3 4 4 ...
##  $ x2pa      : num  2.4 1.2 3.9 14.1 12 3.3 13.4 23.2 10.9 9.1 ...
##  $ x2ppercent: num  0.5 0.667 0.361 0.596 0.588 0.385 0.5 0.528 0.37 0.443 ...
##  $ ft        : num  1 2.7 0.8 2.6 4.2 1 2.2 6.3 3.2 5.2 ...
##  $ fta       : num  1 3.9 1 5.1 5.8 1 4.5 7.5 4.8 6.9 ...
##  $ ftpercent : num  0.923 0.7 0.778 0.5 0.735 1 0.5 0.847 0.667 0.75 ...
##  $ orb       : num  0.4 1.2 1.2 6.8 4.2 0.8 2.2 4.6 4.4 0.3 ...
##  $ drb       : num  3.4 8.5 5.3 6.5 11 4.1 6.7 9 6.1 2.3 ...
##  $ trb       : num  3.8 9.7 6.5 13.3 15.2 4.9 8.9 13.5 10.5 2.6 ...
##  $ ast       : num  1.6 3.1 7 2.2 4.7 1.3 13.4 3.5 5.3 2.9 ...
##  $ stl       : num  1.3 0.4 1.5 2 1.8 0.3 4.5 0.8 0.4 0.7 ...
##  $ blk       : num  0.5 1.6 0.5 1.3 1.7 1 0 1.9 0 0.7 ...
##  $ tov       : num  1.1 1.6 3 2.4 3.1 1.5 4.5 2.6 3.2 3.8 ...
##  $ pf        : num  4.2 9.3 4.9 3.6 5.2 3.3 8.9 3.3 2.8 5.4 ...
##  $ pts       : num  13.1 6.6 11.7 19.4 18.6 8.2 15.6 31.4 15 24.3 ...
##  $ x         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ ortg      : num  103 87 99 120 120 85 84 117 93 95 ...
##  $ drtg      : num  111 116 115 106 104 121 104 110 117 111 ...
##  $ link      : chr  "/players/a/abrinal01.html" "/players/a/acyqu01.html" "/players/a/adamsja01.html" "/players/a/adamsst01.html" ...

We will not be using the following variables in our analysis:

• rk

• player

• tm

• g, gs

• age

• fg/x3p/x2p percent

• o/d rb

• x

• o/d rtg

• link

These variables do not contribute meaningfully to the clustering.

Exploratory Data Analysis

Now we will conduct some basic exploratory data analysis to get a better understanding of the data set that we will be using. First, we will create a data frame containing only the numeric columns that will be used in the clustering process.

fg	fga	x3p	x3pa	x2p	x2pa	ft	fta	trb	ast	blk	stl	tov	pts
4.4	12.5	3.3	10.1	1.2	2.4	1.0	1.0	3.8	1.6	0.5	1.3	1.1	13.1
1.6	7.0	0.8	5.8	0.8	1.2	2.7	3.9	9.7	3.1	1.6	0.4	1.6	6.6
4.1	11.9	2.7	8.0	1.4	3.9	0.8	1.0	6.5	7.0	0.5	1.5	3.0	11.7
8.4	14.1	0.0	0.0	8.4	14.1	2.6	5.1	13.3	2.2	1.3	2.0	2.4	19.4
7.2	12.4	0.1	0.4	7.1	12.0	4.2	5.8	15.2	4.7	1.7	1.8	3.1	18.6
2.8	9.2	1.5	5.9	1.3	3.3	1.0	1.0	4.9	1.3	1.0	0.3	1.5	8.2
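
A subset like the one above can be built by selecting columns by name. The sketch below is an illustration rather than the original code: the object names per100, stat_cols and nba_stats are assumptions, and the data frame holds only the first three players from the table shown earlier.

```r
# Small stand-in for the full data set: the first three players shown above.
# The object name per100 is an assumption made for this example.
per100 <- data.frame(
  player = c("\u00c1lex Abrines", "Quincy Acy", "Jaylen Adams"),
  pos    = c("SG", "PF", "PG"),
  age    = c(25, 28, 22),
  fg   = c(4.4, 1.6, 4.1),  fga  = c(12.5, 7.0, 11.9),
  x3p  = c(3.3, 0.8, 2.7),  x3pa = c(10.1, 5.8, 8.0),
  x2p  = c(1.2, 0.8, 1.4),  x2pa = c(2.4, 1.2, 3.9),
  ft   = c(1.0, 2.7, 0.8),  fta  = c(1.0, 3.9, 1.0),
  trb  = c(3.8, 9.7, 6.5),  ast  = c(1.6, 3.1, 7.0),
  blk  = c(0.5, 1.6, 0.5),  stl  = c(1.3, 0.4, 1.5),
  tov  = c(1.1, 1.6, 3.0),  pts  = c(13.1, 6.6, 11.7),
  stringsAsFactors = FALSE
)

# Keep only the 14 per 100 possession columns used for clustering.
stat_cols <- c("fg", "fga", "x3p", "x3pa", "x2p", "x2pa", "ft", "fta",
               "trb", "ast", "blk", "stl", "tov", "pts")
nba_stats <- per100[, stat_cols]
str(nba_stats)
```

Selecting by name rather than by position keeps the subset correct even if the column order of the source table changes.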

Correlation

Now let’s examine the correlation between each of these variables to determine which axes to use when constructing and evaluating our cluster plot. The correlation plot between the different statistical categories will show whether there are any interesting interactions among them. We will use the ggcorrplot library to help visualize the correlation between each pair of variables.

As expected, there are no surprises here. I will refer back to this graph in my final assessment of each cluster group.
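
A correlation matrix of this kind can be computed with base R's cor function and, if the ggcorrplot package is installed, drawn with its ggcorrplot function. The sketch below uses simulated stand-in data (the columns fg, pts and trb and their relationships are made up), so it shows the mechanics only, not the real correlations.

```r
# Simulated stand-in data: pts is constructed to depend strongly on fg,
# while trb is independent of both. These values are not real NBA data.
set.seed(42)
n  <- 50
fg <- runif(n, 2, 12)
stats <- data.frame(
  fg  = fg,
  pts = 2.5 * fg + rnorm(n),
  trb = runif(n, 3, 15)
)

# Pairwise Pearson correlations between all columns.
corr <- round(cor(stats), 2)
print(corr)

# Optional plot, only attempted when ggcorrplot is available.
if (requireNamespace("ggcorrplot", quietly = TRUE)) {
  p <- ggcorrplot::ggcorrplot(corr, type = "lower", lab = TRUE)
}
```

Strongly correlated pairs, such as field goals and points, carry overlapping information, which is worth keeping in mind when choosing axes for a cluster plot.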

Choosing a K value

There are two ways that we can determine how the value of \(k\) affects the clustering, so that we can draw sensible conclusions about the data: the elbow method and the gap statistic.

The Elbow Method

The elbow method uses the sum of squared deviations between each observation and its cluster centroid, computed within each cluster. A cluster that has a small sum of squares is more tightly compact than one with a large sum of squares. The score will only get smaller as we increase the value of \(k\) and the clusters get smaller. Thus the point of interest in the plot is the “elbow” (or “knee”), where the within-cluster sum of squares drops considerably and then levels out for larger values of \(k\).

To apply the elbow method and find the optimal number of clusters, we first seed the random number generator so that the simulation is reproducible. We then perform k-means clustering on our data frame for each value from \(k = 2\) to \(k = 10\) and compute the within-cluster sum of squares for each fit.

Below is the plot of the Elbow Method for values \(k = 2\) to \(k = 10\).

As we can see from this plot, there is no distinctive elbow/knee present.
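
An elbow plot of this kind can be produced with the built-in kmeans function. The sketch below is an assumption about how such a plot might be generated, not the original code: the data matrix x is random placeholder data, and the seed, nstart and iter.max values are arbitrary choices.

```r
# Placeholder data (random noise), standing in for the player statistics.
set.seed(123)
x <- matrix(rnorm(200 * 4), ncol = 4)

# Total within-cluster sum of squares for each k from 2 to 10.
k_values <- 2:10
wss <- sapply(k_values, function(k) {
  kmeans(x, centers = k, nstart = 10, iter.max = 50)$tot.withinss
})

# Elbow plot: look for the k after which the curve flattens out.
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

On unstructured data like this placeholder the curve declines smoothly, which is the same "no clear elbow" pattern described above.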

The Gap Statistic

Thus we will move on to the second method for finding the optimal \(K\) value, known as the gap statistic. The higher the value of the gap statistic, the better our value of \(k\) is. Again, our goal is to get the best clusters while minimizing \(k\).

We first seed the random number generator so that the simulation is reproducible. To determine and visualize the optimal number of clusters, we will then plot the gap statistic.

As you can see from the above plot, the gap statistic peaks at \(k = 8\).
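
One way to compute the gap statistic in R is the clusGap function from the cluster package. The sketch below is illustrative only: the original code is not shown, so the data x, the choices K.max = 6 and B = 20, and the "globalmax" selection rule are assumptions.

```r
library(cluster)

# Placeholder data: two made-up groups of 50 points in two dimensions.
set.seed(123)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 6), ncol = 2))

# Gap statistic for k = 1..6, using B = 20 reference data sets.
# Extra arguments such as nstart are passed on to kmeans.
gap <- clusGap(x, FUNcluster = kmeans, K.max = 6, B = 20, nstart = 10)
gap_tab <- gap$Tab

# "globalmax" picks the k at which the gap statistic is largest.
best_k <- maxSE(gap_tab[, "gap"], gap_tab[, "SE.sim"], method = "globalmax")

plot(gap, main = "Gap statistic")
```

Other selection rules offered by maxSE, such as "firstSEmax", favour smaller values of \(k\) by taking the simulation standard error into account.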

K-Means Cluster Analysis

To perform k-means clustering on the per 100 possession data, we will use the kmeans function, which partitions the points into \(k\) groups such that the sum of squares from points to the assigned cluster centres is minimized.
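
A call of this form might look like the sketch below. The data matrix x is random placeholder data with made-up column names, and the seed and nstart values are arbitrary; only the use of kmeans with 8 centres follows the text.

```r
# Placeholder data: 300 made-up observations of three statistics.
set.seed(2019)
x <- matrix(runif(300 * 3), ncol = 3,
            dimnames = list(NULL, c("fg", "trb", "ast")))

# Fit k-means with 8 clusters; nstart = 25 keeps the best of 25 random starts.
km <- kmeans(x, centers = 8, nstart = 25)

# The total sum of squares splits into within- and between-cluster parts.
ratio <- km$betweenss / km$totss
str(km)
```

The ratio betweenss / totss gives a rough sense of how much of the variation the clusters capture. For the fit reported in this document it is 93307 / 146662, or roughly 0.64.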

Let’s take a look at the structure of the result.

## List of 9
##  $ cluster     : int [1:708] 6 8 8 4 4 8 7 1 7 3 ...
##  $ centers     : num [1:8, 1:14] 10.62 39 9.32 6.87 12.17 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:8] "1" "2" "3" "4" ...
##   .. ..$ : chr [1:14] "fg" "fga" "x3p" "x3pa" ...
##  $ totss       : num 146662
##  $ withinss    : num [1:8] 7967 1685 5573 8345 3869 ...
##  $ tot.withinss: num 53355
##  $ betweenss   : num 93307
##  $ size        : int [1:8] 52 2 110 107 39 146 160 92
##  $ iter        : int 5
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"

Here is a summary of this object.

##              Length Class  Mode   
## cluster      708    -none- numeric
## centers      112    -none- numeric
## totss          1    -none- numeric
## withinss       8    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           8    -none- numeric
## iter           1    -none- numeric
## ifault         1    -none- numeric

We will use the aggregate function to compute the mean of each statistic within each cluster, shown below.

Cluster Groups
Group.1	fg	fga	x3p	x3pa	x2p	x2pa	ft	fta	trb	ast	blk	stl	tov	pts
1	10.623077	20.830769	0.5269231	2.019231	10.109615	18.809615	4.738461	7.057692	17.609615	3.026923	1.6769231	1.180769	3.403846	26.521154
2	39.000000	48.650000	0.0000000	0.000000	39.000000	48.650000	0.000000	0.000000	14.450000	4.800000	0.0000000	0.000000	0.000000	77.950000
3	9.320000	20.960000	2.8600000	8.170909	6.462727	12.786364	3.822727	4.968182	7.421818	4.700000	0.7163636	1.399091	2.710909	25.321818
4	6.871963	13.998131	0.5925234	2.141122	6.272897	11.862617	3.471028	5.148598	14.450467	3.697196	1.7925234	1.568224	2.519626	17.796262
5	12.166667	25.502564	3.0948718	8.117949	9.076923	17.376923	6.333333	7.823077	9.558974	8.484615	0.8820513	1.846154	3.933333	33.761539
6	5.899315	15.307534	3.2623288	9.693151	2.642466	5.615068	1.845890	2.383562	6.244521	3.117123	0.6575342	1.273973	1.778082	16.909589
7	6.803125	16.301250	1.9556250	6.092500	4.841875	10.207500	2.690625	3.580000	6.594375	5.117500	0.5693750	1.658125	2.513750	18.256875
8	3.385870	9.869565	1.2793478	4.773913	2.108696	5.095652	1.226087	1.796739	8.044565	3.669565	0.6532609	1.568478	2.081522	9.279348
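
A table of cluster means like this can be produced by passing the cluster labels to aggregate. The sketch below uses made-up data and three clusters to stay small; the object names stats, km and cluster_means are assumptions.

```r
# Made-up data with three statistics for 60 hypothetical players.
set.seed(7)
stats <- data.frame(fg  = runif(60, 1, 12),
                    trb = runif(60, 2, 18),
                    pts = runif(60, 5, 35))
km <- kmeans(stats, centers = 3, nstart = 10)

# Mean of every column within each cluster. Grouping by an unnamed list
# element produces the "Group.1" column seen in the table above.
cluster_means <- aggregate(stats, by = list(km$cluster), FUN = mean)
print(cluster_means)
```

These means are the same values that kmeans stores in its centers component, so the aggregate call mainly serves to present them as a data frame.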

Visualization

Now we will create a simple cluster plot to illustrate the various cluster groups.

Here is a fancier plot to help distinguish each cluster.

Another cluster plot to further illustrate each group.
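
The plotting code for these figures is not shown. One common way to draw a cluster plot for data with many columns is to project the observations onto their first two principal components and colour them by cluster. The sketch below does this in base R on made-up data; it is an assumption, not a reproduction of the plots above.

```r
# Made-up data with four statistics for 90 hypothetical players.
set.seed(11)
stats <- data.frame(fg  = runif(90, 1, 12),
                    trb = runif(90, 2, 18),
                    ast = runif(90, 1, 9),
                    pts = runif(90, 5, 35))
km <- kmeans(stats, centers = 3, nstart = 10)

# Project the observations onto the first two principal components.
pca <- prcomp(stats)
scores <- pca$x[, 1:2]
plot(scores, col = km$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2", main = "K-means clusters (PCA projection)")

# Project the cluster centroids into the same space and mark them.
center_scores <- predict(pca, newdata = km$centers)[, 1:2]
points(center_scores, pch = 8, cex = 2)
```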

Breakdown of Each Cluster Group

Now that we have generated the cluster groups, I will identify each group’s stat line and provide an example of a player who would fall into this category.

Group 1

HIGH LEVEL STARTER FRINGE ALL STAR

This is typically an all-around player who can contribute to a team’s success in various statistical categories, such as scoring, defending multiple positions, shooting and distributing the ball well.

STAT LINE PER 100 POSSESSIONS

9.3 FG, 20.8 FGA, 2.9 X3P, 8.2 X3PA, 6.4 X2P, 12.6 X2PA, 3.8 FT, 4.9 FTA, 7.6 TRB, 4.5 AST, 0.7 BLK, 1.4 STL, 2.7 TOV, 25.2 PTS

Example: Malcolm Brogdon, starting point guard from the Milwaukee Bucks

Group 3

SUPERSTAR ALL NBA

This player is usually the best player on the team and one of the 15 best players in the entire NBA, a league leader in various categories such as points and assists.

STAT LINE PER 100 POSSESSIONS

11.9 FG, 25.6 FGA, 3.3 X3P, 8.5 X3PA, 8.6 X2P, 17.1 X2PA, 5.5 FT, 6.8 FTA, 7.4 TRB, 9.0 AST, 0.6 BLK, 1.9 STL, 3.8 TOV, 32.6 PTS

Example: James Harden, shooting guard for the Houston Rockets

Group 4

OFF THE BENCH SCORING 6TH MAN

This player comes into the game to give the star players rest and fills their role to a lesser extent, usually less efficiently.

STAT LINE PER 100 POSSESSIONS

6.8 FG, 16.3 FGA, 1.9 X3P, 6.1 X3PA, 4.8 X2P, 10.2 X2PA, 2.7 FT, 3.6 FTA, 6.7 TRB, 5.1 AST, 0.6 BLK, 1.6 STL, 2.5 TOV, 18.2 PTS

Example: Yogi Ferrell, back up point guard for the Sacramento Kings

Group 5

TRADITIONAL BIG REBOUNDER INSIDE SCORING

Usually playing the center position, this player does most of his work near the basket, with interior scoring and elite rebounding capabilities.

STAT LINE PER 100 POSSESSIONS

7.0 FG, 14.1 FGA, 0.5 X3P, 1.9 X3PA, 6.5 X2P, 12.2 X2PA, 3.6 FT, 5.3 FTA, 14.5 TRB, 3.7 AST, 1.8 BLK, 1.6 STL, 2.6 TOV, 18.1 PTS

Example: Bam Adebayo, point centre for the Miami Heat

Group 6

3 AND D

Elite three point shooter and great perimeter defender.

STAT LINE PER 100 POSSESSIONS

5.9 FG, 15.3 FGA, 3.3 X3P, 9.7 X3PA, 2.6 X2P, 5.6 X2PA, 1.8 FT, 2.4 FTA, 6.2 TRB, 3.1 AST, 0.7 BLK, 1.2 STL, 1.8 TOV, 16.9 PTS

Example: JR Smith, shooting guard for the Cleveland Cavaliers

Group 7

BACK TO THE BASKET SCORING BIG MAN REBOUNDER

A more polished interior and mid-range scorer, a superb rebounder and a focal point of the offense.

STAT LINE PER 100 POSSESSIONS

11.1 FG, 21.8 FGA, 0.7 X3P, 2.6 X3PA, 10.4 X2P, 19.2 X2PA, 5.4 FT, 7.6 FTA, 17.8 TRB, 3.4 AST, 1.7 BLK, 1.2 STL, 3.5 TOV, 28.4 PTS

Example: Enes Kanter, power forward for the Boston Celtics

Group 8

ALL AROUND ROLE PLAYER

Can essentially fill any position on the court at both the defensive and offensive ends, and works well with any line-up combination.

STAT LINE PER 100 POSSESSIONS

3.4 FG, 9.9 FGA, 1.3 X3P, 4.8 X3PA, 2.1 X2P, 5.1 X2PA, 1.2 FT, 1.8 FTA, 8.0 TRB, 3.7 AST, 0.7 BLK, 1.5 STL, 2.1 TOV, 9.3 PTS

Example: PJ Tucker, stretch 4 man for the Houston Rockets

Group 2

An outlier based on inflation of statistics on a per 100 possession basis.
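
The inflation effect can be illustrated with simple arithmetic. The sketch below uses a simplified scaling formula and made-up numbers (the function name per_100 and both example players are hypothetical); it shows only why a player with very few possessions can end up with an extreme per 100 possession line.

```r
# Simplified per 100 possession scaling (an assumption for illustration).
per_100 <- function(stat, possessions) 100 * stat / possessions

# Hypothetical regular: 600 made field goals over 6000 possessions.
regular_fg <- per_100(600, 6000)

# Hypothetical end-of-bench player: 2 made field goals in 5 possessions.
brief_fg <- per_100(2, 5)

c(regular = regular_fg, brief = brief_fg)
```

Two made shots in a handful of possessions scale up to a rate that no regular rotation player sustains, which is how a cluster of only two players with a mean of 39 field goals per 100 possessions can arise.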

Nicholas Burke

05 December 2019
