Introduction

This project performs K-means clustering on NBA data to identify position groups among players. K-means clustering is a straightforward unsupervised learning approach for partitioning a data set into \(K\) distinct, non-overlapping clusters based on similarity. To perform K-means clustering, we first need to choose the number of clusters \(K\). The K-means algorithm then assigns each observation to exactly one of the \(K\) clusters. Below is a brief outline of the process.

K-Means Clustering Procedure

  1. Assign a number at random, from 1 to \(K\), to each of the observations. These are the initial cluster assignments for the observations.

  2. Iterate until the cluster assignments stop changing:

     a. For each of the \(K\) clusters, compute the cluster centroid. The \(k^{th}\) cluster centroid is the vector of the feature means for the observations in the \(k^{th}\) cluster.

     b. Assign each observation to the cluster whose centroid is closest, where closeness is measured by the Euclidean distance between the observation and the centroid. Each reassignment reduces the within-cluster variation, that is, the sum of squared Euclidean distances between the data points and their centroids.
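
The procedure above can be sketched in a few lines of base R. This is only a minimal illustration under stated assumptions, not the implementation used later in the analysis: the helper name `simple_kmeans`, the `max_iter` cap, and the simulated toy data are all introduced here for the example.

```r
# Illustrative k-means following the procedure outlined above.
# `simple_kmeans` is a hypothetical helper, not a function from any package.
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Random initial assignments (shuffled so that every cluster starts non-empty)
  cluster <- sample(rep_len(seq_len(k), nrow(x)))
  for (iter in seq_len(max_iter)) {
    # This sketch does not try to repair empty clusters
    if (any(tabulate(cluster, nbins = k) == 0)) stop("a cluster became empty")
    # Centroid of each cluster: the vector of feature means
    centers <- do.call(rbind, lapply(seq_len(k), function(j) {
      colMeans(x[cluster == j, , drop = FALSE])
    }))
    # Squared Euclidean distance from every observation to every centroid
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_cluster <- max.col(-d, ties.method = "first")  # index of nearest centroid
    if (all(new_cluster == cluster)) break              # assignments stopped changing
    cluster <- new_cluster
  }
  list(cluster = cluster, centers = centers)
}

# Toy data (simulated, not the NBA data): two well-separated groups of 20 points
set.seed(42)
toy <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))
fit <- simple_kmeans(toy, k = 2)
table(fit$cluster)
```

In practice R's built-in `kmeans` function is the better choice; the sketch is only meant to make the iteration concrete.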

In order to perform K-means clustering, we must decide how many clusters we expect in the data. Traditionally, there are five positions on a basketball team:

• Point guard

• Shooting guard

• Small forward

• Power forward

• Center

We naturally start from the assumption that there will be 5 cluster groups. However, the game of basketball has evolved over time into a more position-less game, so we will ultimately determine whether there are more or fewer well-defined cluster groups. K-means is an unsupervised learning algorithm, meaning that there is no pre-determined outcome; the algorithm just tries to find patterns in the data. We will describe each cluster by the mean statistics of its players.

To best describe a player, we will use per-100-possession statistics across various categories, using data from the website basketballreference.com.

NBA Statistics in R

Now we will view the players' statistics in our R console, as tracked and documented throughout the season.

This includes:

• Games

• Games started

• Minutes

• 2pt & 3pt field goals/attempts/percentage

• Free throws/attempts/percentage

• Assists, turnovers

• Offensive, defensive, and total rebounds

• Fouls

• Blocks

• Steals

Below are some of the Per-100 possession statistics for each player during the 2018-2019 NBA regular season.

rk player pos age tm g gs mp fg fga fgpercent x3p x3pa x3ppercent x2p x2pa x2ppercent ft fta ftpercent orb drb trb ast stl blk tov pf pts x ortg drtg link
1 Álex Abrines SG 25 OKC 31 2 588 4.4 12.5 0.357 3.3 10.1 0.323 1.2 2.4 0.500 1.0 1.0 0.923 0.4 3.4 3.8 1.6 1.3 0.5 1.1 4.2 13.1 NA 103 111 /players/a/abrinal01.html
2 Quincy Acy PF 28 PHO 10 0 123 1.6 7.0 0.222 0.8 5.8 0.133 0.8 1.2 0.667 2.7 3.9 0.700 1.2 8.5 9.7 3.1 0.4 1.6 1.6 9.3 6.6 NA 87 116 /players/a/acyqu01.html
3 Jaylen Adams PG 22 ATL 34 1 428 4.1 11.9 0.345 2.7 8.0 0.338 1.4 3.9 0.361 0.8 1.0 0.778 1.2 5.3 6.5 7.0 1.5 0.5 3.0 4.9 11.7 NA 99 115 /players/a/adamsja01.html
4 Steven Adams C 25 OKC 80 80 2669 8.4 14.1 0.595 0.0 0.0 0.000 8.4 14.1 0.596 2.6 5.1 0.500 6.8 6.5 13.3 2.2 2.0 1.3 2.4 3.6 19.4 NA 120 106 /players/a/adamsst01.html
5 Bam Adebayo C 21 MIA 82 28 1913 7.2 12.4 0.576 0.1 0.4 0.200 7.1 12.0 0.588 4.2 5.8 0.735 4.2 11.0 15.2 4.7 1.8 1.7 3.1 5.2 18.6 NA 120 104 /players/a/adebaba01.html
6 Deng Adel SF 21 CLE 19 3 194 2.8 9.2 0.306 1.5 5.9 0.261 1.3 3.3 0.385 1.0 1.0 1.000 0.8 4.1 4.9 1.3 0.3 1.0 1.5 3.3 8.2 NA 85 121 /players/a/adelde01.html

Now let’s view the structure of this data set.

## 'data.frame':    708 obs. of  33 variables:
##  $ rk        : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ player    : chr  "Álex Abrines" "Quincy Acy" "Jaylen Adams" "Steven Adams" ...
##  $ pos       : chr  "SG" "PF" "PG" "C" ...
##  $ age       : num  25 28 22 25 21 21 25 33 21 23 ...
##  $ tm        : chr  "OKC" "PHO" "ATL" "OKC" ...
##  $ g         : num  31 10 34 80 82 19 7 81 10 38 ...
##  $ gs        : num  2 0 1 80 28 3 0 81 1 2 ...
##  $ mp        : num  588 123 428 2669 1913 ...
##  $ fg        : num  4.4 1.6 4.1 8.4 7.2 2.8 6.7 12.4 5.3 7.7 ...
##  $ fga       : num  12.5 7 11.9 14.1 12.4 9.2 22.3 24 15.8 20.5 ...
##  $ fgpercent : num  0.357 0.222 0.345 0.595 0.576 0.306 0.3 0.519 0.333 0.376 ...
##  $ x3p       : num  3.3 0.8 2.7 0 0.1 1.5 0 0.2 1.2 3.7 ...
##  $ x3pa      : num  10.1 5.8 8 0 0.4 5.9 8.9 0.8 4.8 11.4 ...
##  $ x3ppercent: num  0.323 0.133 0.338 0 0.2 0.261 0 0.238 0.25 0.323 ...
##  $ x2p       : num  1.2 0.8 1.4 8.4 7.1 1.3 6.7 12.3 4 4 ...
##  $ x2pa      : num  2.4 1.2 3.9 14.1 12 3.3 13.4 23.2 10.9 9.1 ...
##  $ x2ppercent: num  0.5 0.667 0.361 0.596 0.588 0.385 0.5 0.528 0.37 0.443 ...
##  $ ft        : num  1 2.7 0.8 2.6 4.2 1 2.2 6.3 3.2 5.2 ...
##  $ fta       : num  1 3.9 1 5.1 5.8 1 4.5 7.5 4.8 6.9 ...
##  $ ftpercent : num  0.923 0.7 0.778 0.5 0.735 1 0.5 0.847 0.667 0.75 ...
##  $ orb       : num  0.4 1.2 1.2 6.8 4.2 0.8 2.2 4.6 4.4 0.3 ...
##  $ drb       : num  3.4 8.5 5.3 6.5 11 4.1 6.7 9 6.1 2.3 ...
##  $ trb       : num  3.8 9.7 6.5 13.3 15.2 4.9 8.9 13.5 10.5 2.6 ...
##  $ ast       : num  1.6 3.1 7 2.2 4.7 1.3 13.4 3.5 5.3 2.9 ...
##  $ stl       : num  1.3 0.4 1.5 2 1.8 0.3 4.5 0.8 0.4 0.7 ...
##  $ blk       : num  0.5 1.6 0.5 1.3 1.7 1 0 1.9 0 0.7 ...
##  $ tov       : num  1.1 1.6 3 2.4 3.1 1.5 4.5 2.6 3.2 3.8 ...
##  $ pf        : num  4.2 9.3 4.9 3.6 5.2 3.3 8.9 3.3 2.8 5.4 ...
##  $ pts       : num  13.1 6.6 11.7 19.4 18.6 8.2 15.6 31.4 15 24.3 ...
##  $ x         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ ortg      : num  103 87 99 120 120 85 84 117 93 95 ...
##  $ drtg      : num  111 116 115 106 104 121 104 110 117 111 ...
##  $ link      : chr  "/players/a/abrinal01.html" "/players/a/acyqu01.html" "/players/a/adamsja01.html" "/players/a/adamsst01.html" ...

We will not be using the following variables in our analysis:

• rk

• player

• tm

• g, gs

• age

• fgpercent, x3ppercent, x2ppercent

• orb, drb

• x

• ortg, drtg

• link

These variables do not contribute meaningfully to the clustering.

Exploratory Data Analysis

Now we will conduct some basic exploratory data analysis to get a better understanding of the data set that we will be using. First, we will create a data frame containing only the numeric columns that will be used in the clustering process.

fg fga x3p x3pa x2p x2pa ft fta trb ast blk stl tov pts
4.4 12.5 3.3 10.1 1.2 2.4 1.0 1.0 3.8 1.6 0.5 1.3 1.1 13.1
1.6 7.0 0.8 5.8 0.8 1.2 2.7 3.9 9.7 3.1 1.6 0.4 1.6 6.6
4.1 11.9 2.7 8.0 1.4 3.9 0.8 1.0 6.5 7.0 0.5 1.5 3.0 11.7
8.4 14.1 0.0 0.0 8.4 14.1 2.6 5.1 13.3 2.2 1.3 2.0 2.4 19.4
7.2 12.4 0.1 0.4 7.1 12.0 4.2 5.8 15.2 4.7 1.7 1.8 3.1 18.6
2.8 9.2 1.5 5.9 1.3 3.3 1.0 1.0 4.9 1.3 1.0 0.3 1.5 8.2
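
One way to build such a data frame is to select the wanted columns by name. The sketch below assumes a data frame called `per100` (a name chosen here for illustration) and, to stay short, uses three of the rows shown earlier with only a handful of their columns; the full analysis keeps all 14 columns listed above.

```r
# Toy stand-in for the full table: three rows from the listing above,
# reduced to a few columns. `per100` is a hypothetical name.
per100 <- data.frame(
  player = c("Quincy Acy", "Jaylen Adams", "Steven Adams"),
  pos    = c("PF", "PG", "C"),
  age    = c(28, 22, 25),
  fg     = c(1.6, 4.1, 8.4),
  fga    = c(7.0, 11.9, 14.1),
  trb    = c(9.7, 6.5, 13.3),
  ast    = c(3.1, 7.0, 2.2),
  pts    = c(6.6, 11.7, 19.4)
)

# Keep only the numeric columns used for clustering
# (a subset of the 14 columns in the real analysis)
keep  <- c("fg", "fga", "trb", "ast", "pts")
stats <- per100[, keep]
stats
```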

Correlation

Now let’s examine the correlation between each of these variables to determine which axes to use when constructing and evaluating our cluster plot. The correlation plot between the different statistical categories will show whether there are any interesting interactions among them. We will use the ggcorrplot library to help visualize the correlation between each pair of variables.

As expected, there are no surprises here. I will refer back to this graph in my final assessment of each cluster group.
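
As a rough sketch of how such a correlation matrix can be computed, the example below runs base R's `cor` on the six rows of the numeric data frame printed earlier. With only six players the values are illustrative rather than meaningful. The data frame name `stats` is an assumption, and the plotting call only runs if the ggcorrplot package happens to be installed.

```r
# The six rows of the numeric data frame shown earlier
stats <- read.table(header = TRUE, text = "
fg fga x3p x3pa x2p x2pa ft fta trb ast blk stl tov pts
4.4 12.5 3.3 10.1 1.2 2.4 1.0 1.0 3.8 1.6 0.5 1.3 1.1 13.1
1.6 7.0 0.8 5.8 0.8 1.2 2.7 3.9 9.7 3.1 1.6 0.4 1.6 6.6
4.1 11.9 2.7 8.0 1.4 3.9 0.8 1.0 6.5 7.0 0.5 1.5 3.0 11.7
8.4 14.1 0.0 0.0 8.4 14.1 2.6 5.1 13.3 2.2 1.3 2.0 2.4 19.4
7.2 12.4 0.1 0.4 7.1 12.0 4.2 5.8 15.2 4.7 1.7 1.8 3.1 18.6
2.8 9.2 1.5 5.9 1.3 3.3 1.0 1.0 4.9 1.3 1.0 0.3 1.5 8.2
")

# Pairwise Pearson correlations between the 14 statistics
corr <- cor(stats)
round(corr[c("pts", "trb", "ast"), c("fg", "x3pa", "blk")], 2)

# Optional plot; skipped when ggcorrplot is not available
if (requireNamespace("ggcorrplot", quietly = TRUE)) {
  print(ggcorrplot::ggcorrplot(corr))
}
```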

Choosing a K value

There are two ways we can assess how the value of \(k\) affects the clustering, so that we can draw sensible conclusions about the data: the elbow method and the gap statistic.

The Elbow Method

The elbow method uses the sum of squared deviations between each observation and its cluster centroid, computed within each cluster. A cluster with a small sum of squares is more compact than one with a large sum of squares. The score only gets smaller as we increase the value of \(k\) and the clusters get smaller. The point of interest in the plot is therefore the “elbow” (or “knee”), where the within-cluster sum of squares drops considerably and then levels out for larger values of \(k\).
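
For reference, the quantity plotted by the elbow method can be written as follows. The symbols are assumptions introduced here: \(C_j\) denotes the set of observations assigned to cluster \(j\), and \(\mu_j\) denotes its centroid.

```latex
\[
W(k) \;=\; \sum_{j=1}^{k} \; \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^{2}
\]
```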

To apply the elbow method and find the optimal number of clusters, we first seed the random number generator so that the simulation is reproducible. We then perform k-means clustering on our data frame for each value from \(k = 2\) to \(k = 10\) and compute the within-cluster sum of squares for each.

Below is the plot of the Elbow Method for values \(k = 2\) to \(k = 10\).

As we can see from this plot, there is no distinctive elbow/knee present.
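
The elbow computation can be sketched with base R's `kmeans`. The example below is an illustration on simulated data, not the NBA data, and the seed value and `nstart = 10` are assumptions made for the sketch.

```r
set.seed(123)  # k-means starts from random centres, so fix the seed

# Simulated stand-in data: three groups in two dimensions
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2),
           matrix(rnorm(100, mean = 8), ncol = 2))

ks <- 2:10
# Total within-cluster sum of squares for each value of k
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 10)$tot.withinss)

plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

On data with clear group structure the curve bends sharply at the true number of groups; on data like the player statistics the bend can be much harder to see.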

The Gap Statistic

We therefore move on to the second method for finding the optimal value of \(k\), known as the gap statistic. The higher the gap statistic, the better our value of \(k\). Again, our goal is to get the best clusters while keeping \(k\) small.

Because the gap statistic relies on simulation, we will first seed the random number generator. Then, to determine and visualize the optimal number of clusters, we will graph the gap statistic.

As you can see from the above plot, the gap statistic peaks at \(k = 8\).
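
One way to compute a gap statistic in R is `clusGap` from the cluster package. The sketch below is an assumption about tooling, not necessarily what produced the plot discussed here; it uses simulated data, and a small `K.max` and `B` to keep it fast.

```r
library(cluster)  # provides clusGap()

set.seed(123)
# Simulated stand-in data: three groups in two dimensions (not the NBA data)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2),
           matrix(rnorm(100, mean = 8), ncol = 2))

# Gap statistic for k = 1..6, using B = 20 simulated reference data sets
gap <- clusGap(x, FUNcluster = kmeans, nstart = 10,
               K.max = 6, B = 20, verbose = FALSE)

gap_values <- gap$Tab[, "gap"]
best_k <- which.max(gap_values)  # k with the largest gap statistic
best_k

plot(gap, main = "Gap statistic")
```

Picking the global maximum is the simplest rule; `cluster::maxSE` offers alternatives that take the simulation standard error into account.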

K-Means Cluster Analysis

In order to perform k-means clustering on the per-100-possession data, we will use the kmeans function, which partitions the points into \(k\) groups such that the sum of squares from the points to their assigned cluster centres is minimized.
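
A minimal sketch of such a call is shown below. It runs on simulated data rather than the player statistics, and the seed and `nstart = 25` are assumptions made for the example.

```r
set.seed(123)
# Simulated stand-in for the numeric data frame (not the NBA data)
x <- matrix(rnorm(400), ncol = 4)

# Partition the rows into 8 clusters; nstart tries several random starts
fit <- kmeans(x, centers = 8, nstart = 25)

fit$size          # number of observations in each cluster
fit$tot.withinss  # total within-cluster sum of squares
```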

Let’s take a look at the structure of the result.

## List of 9
##  $ cluster     : int [1:708] 6 8 8 4 4 8 7 1 7 3 ...
##  $ centers     : num [1:8, 1:14] 10.62 39 9.32 6.87 12.17 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:8] "1" "2" "3" "4" ...
##   .. ..$ : chr [1:14] "fg" "fga" "x3p" "x3pa" ...
##  $ totss       : num 146662
##  $ withinss    : num [1:8] 7967 1685 5573 8345 3869 ...
##  $ tot.withinss: num 53355
##  $ betweenss   : num 93307
##  $ size        : int [1:8] 52 2 110 107 39 146 160 92
##  $ iter        : int 5
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"

Here is a summary of this object.

##              Length Class  Mode   
## cluster      708    -none- numeric
## centers      112    -none- numeric
## totss          1    -none- numeric
## withinss       8    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           8    -none- numeric
## iter           1    -none- numeric
## ifault         1    -none- numeric
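
The sums of squares reported in the structure output can be used for a quick sanity check and for a rough measure of how much of the total variation the clustering accounts for. The values below are copied from that output; the variable names are chosen here for illustration.

```r
# Values copied from the str() output shown earlier
totss        <- 146662
tot_withinss <- 53355
betweenss    <- 93307
size         <- c(52, 2, 110, 107, 39, 146, 160, 92)

# The total sum of squares splits into within and between parts,
# and the cluster sizes add up to the 708 observations
stopifnot(tot_withinss + betweenss == totss)
stopifnot(sum(size) == 708)

# Share of total variation accounted for by the clustering (roughly 0.64)
betweenss / totss
```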

We will use the aggregate function to generate the cluster means below.

Cluster Groups
Group.1 fg fga x3p x3pa x2p x2pa ft fta trb ast blk stl tov pts
1 10.623077 20.830769 0.5269231 2.019231 10.109615 18.809615 4.738461 7.057692 17.609615 3.026923 1.6769231 1.180769 3.403846 26.521154
2 39.000000 48.650000 0.0000000 0.000000 39.000000 48.650000 0.000000 0.000000 14.450000 4.800000 0.0000000 0.000000 0.000000 77.950000
3 9.320000 20.960000 2.8600000 8.170909 6.462727 12.786364 3.822727 4.968182 7.421818 4.700000 0.7163636 1.399091 2.710909 25.321818
4 6.871963 13.998131 0.5925234 2.141122 6.272897 11.862617 3.471028 5.148598 14.450467 3.697196 1.7925234 1.568224 2.519626 17.796262
5 12.166667 25.502564 3.0948718 8.117949 9.076923 17.376923 6.333333 7.823077 9.558974 8.484615 0.8820513 1.846154 3.933333 33.761539
6 5.899315 15.307534 3.2623288 9.693151 2.642466 5.615068 1.845890 2.383562 6.244521 3.117123 0.6575342 1.273973 1.778082 16.909589
7 6.803125 16.301250 1.9556250 6.092500 4.841875 10.207500 2.690625 3.580000 6.594375 5.117500 0.5693750 1.658125 2.513750 18.256875
8 3.385870 9.869565 1.2793478 4.773913 2.108696 5.095652 1.226087 1.796739 8.044565 3.669565 0.6532609 1.568478 2.081522 9.279348
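
A table of this shape, including the `Group.1` column, is what `aggregate` produces when the grouping is passed as an unnamed list. The sketch below uses simulated data with three hypothetical columns; only the pattern of the call is the point.

```r
set.seed(123)
# Simulated stand-in data with three illustrative columns (not the NBA data)
x <- as.data.frame(matrix(rnorm(300), ncol = 3,
                          dimnames = list(NULL, c("fg", "trb", "ast"))))

fit <- kmeans(x, centers = 4, nstart = 25)

# Mean of every column within each cluster; the grouping column
# is named Group.1 because the `by` list is unnamed
cluster_means <- aggregate(x, by = list(fit$cluster), FUN = mean)
cluster_means
```

These means should agree with `fit$centers`, since each k-means centre is the mean of the observations assigned to it.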

Visualization

Now we will create a simple cluster plot to illustrate the various cluster groups.

Here is a fancier plot to help distinguish each cluster.

Another cluster plot to further illustrate each group.
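
One simple way to draw a cluster plot for data with more than two variables is to project the observations onto their first two principal components and colour them by cluster. The sketch below uses base R and simulated data; it is an assumption about approach and does not reproduce the plots referred to here.

```r
set.seed(123)
# Simulated stand-in data: three groups in four dimensions (not the NBA data)
x <- rbind(matrix(rnorm(200, mean = 0), ncol = 4),
           matrix(rnorm(200, mean = 3), ncol = 4),
           matrix(rnorm(200, mean = 6), ncol = 4))

fit <- kmeans(x, centers = 3, nstart = 25)

# Project onto the first two principal components for plotting
pc <- prcomp(x, scale. = TRUE)
plot(pc$x[, 1:2], col = fit$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2", main = "Clusters on the first two components")
```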

Breakdown of Each Cluster Group

Now that we have generated the cluster groups, I will identify each group’s stat line and provide an example player who falls into that category.

Group 1

HIGH LEVEL STARTER FRINGE ALL STAR

This is typically an all-around player who can contribute to a team’s success in various statistical categories, such as scoring, defending multiple positions, shooting, and distributing the ball well.

STAT LINE PER 100 POSSESSIONS

9.3 FG, 20.8 FGA, 2.9 X3P, 8.2 X3PA, 6.4 X2P, 12.6 X2PA, 3.8 FT, 4.9 FTA, 7.6 TRB, 4.5 AST, 0.7 BLK, 1.4 STL, 2.7 TOV, 25.2 PTS

Example: Malcolm Brogdon, starting point guard from the Milwaukee Bucks

Group 3

SUPERSTAR ALL NBA

This player is usually the best player on the team and one of the 15 best players in the entire NBA, a league leader in various categories such as points and assists.

STAT LINE PER 100 POSSESSIONS

11.9 FG, 25.6 FGA, 3.3 X3P, 8.5 X3PA, 8.6 X2P, 17.1 X2PA, 5.5 FT, 6.8 FTA, 7.4 TRB, 9.0 AST, 0.6 BLK, 1.9 STL, 3.8 TOV, 32.6 PTS

Example: James Harden, shooting guard for the Houston Rockets

Group 4

OFF THE BENCH SCORING 6TH MAN

This player comes into the game to give the star players rest and fills their role to a lesser extent, usually less efficiently.

STAT LINE PER 100 POSSESSIONS

6.8 FG, 16.3 FGA, 1.9 X3P, 6.1 X3PA, 4.8 X2P, 10.2 X2PA, 2.7 FT, 3.6 FTA, 6.7 TRB, 5.1 AST, 0.6 BLK, 1.6 STL, 2.5 TOV, 18.2 PTS

Example: Yogi Ferrell, back up point guard for the Sacramento Kings

Group 5

TRADITIONAL BIG REBOUNDER INSIDE SCORING

Usually a center who does most of his work near the basket, with interior scoring and elite rebounding capabilities.

STAT LINE PER 100 POSSESSIONS

7.0 FG, 14.1 FGA, 0.5 X3P, 1.9 X3PA, 6.5 X2P, 12.2 X2PA, 3.6 FT, 5.3 FTA, 14.5 TRB, 3.7 AST, 1.8 BLK, 1.6 STL, 2.6 TOV, 18.1 PTS

Example: Bam Adebayo, point centre for the Miami Heat

Group 6

3 AND D

Elite three point shooter and great perimeter defender.

STAT LINE PER 100 POSSESSIONS

5.9 FG, 15.3 FGA, 3.3 X3P, 9.7 X3PA, 2.6 X2P, 5.6 X2PA, 1.8 FT, 2.4 FTA, 6.2 TRB, 3.1 AST, 0.7 BLK, 1.2 STL, 1.8 TOV, 16.9 PTS

Example: JR Smith, shooting guard for the Cleveland Cavaliers

Group 7

BACK TO THE BASKET SCORING BIG MAN REBOUNDER

A more polished interior and mid-range scorer, superb rebounder, and focal point of the offense.

STAT LINE PER 100 POSSESSIONS

11.1 FG, 21.8 FGA, 0.7 X3P, 2.6 X3PA, 10.4 X2P, 19.2 X2PA, 5.4 FT, 7.6 FTA, 17.8 TRB, 3.4 AST, 1.7 BLK, 1.2 STL, 3.5 TOV, 28.4 PTS

Example: Enes Kanter, power forward for the Boston Celtics

Group 8

ALL AROUND ROLE PLAYER

Can essentially fill any position on the court at both the defensive and offensive ends, and works well with any line-up combination.

STAT LINE PER 100 POSSESSIONS

3.4 FG, 9.9 FGA, 1.3 X3P, 4.8 X3PA, 2.1 X2P, 5.1 X2PA, 1.2 FT, 1.8 FTA, 8.0 TRB, 3.7 AST, 0.7 BLK, 1.5 STL, 2.1 TOV, 9.3 PTS

Example: PJ Tucker, stretch 4 man for the Houston Rockets

Group 2

An outlier based on inflation of statistics on a per 100 possession basis.