This project applies K-means clustering to NBA data to identify position groups among players. K-means clustering is a straightforward unsupervised learning approach for partitioning a data set into \(K\) distinct, non-overlapping clusters based on similarity. To perform K-means clustering, we first need to choose the number of clusters \(K\). The K-means algorithm then assigns each observation to exactly one of the \(K\) clusters. Below is a brief outline of the process.
1. Assign a number at random, from 1 to \(K\), to each of the observations. These serve as the initial cluster assignments.
2. Iterate until the cluster assignments stop changing:
   a. For each of the \(K\) clusters, compute the cluster centroid. The \(k^{th}\) cluster centroid is the vector of feature means for the observations in the \(k^{th}\) cluster.
   b. Assign each observation to the cluster whose centroid is closest, where closeness is measured by the Euclidean distance between the observation and the centroid.
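The outline above can be sketched in a few lines of R. This is only an illustrative sketch, not the code used for the analysis: the function name `simple_kmeans`, the toy data, and the seed are all assumptions, and in practice R's built-in `kmeans` function is the better choice.

```r
# Minimal K-means sketch following the outline (illustrative only).
simple_kmeans <- function(x, K, max_iter = 100) {
  x <- as.matrix(x)
  n <- nrow(x)
  # Step 1: random initial assignments; shuffling a repeated 1..K vector
  # guarantees that every cluster starts non-empty (requires n >= K).
  cluster <- sample(rep_len(seq_len(K), n))
  for (iter in seq_len(max_iter)) {
    # Step 2a: each centroid is the vector of feature means of its members.
    # An empty cluster is re-seeded with a randomly chosen observation.
    centroids <- do.call(rbind, lapply(seq_len(K), function(k) {
      members <- x[cluster == k, , drop = FALSE]
      if (nrow(members) == 0) x[sample(n, 1), ] else colMeans(members)
    }))
    # Step 2b: squared Euclidean distance from every point to every centroid,
    # then pick the nearest centroid for each point.
    d2 <- sapply(seq_len(K), function(k) rowSums(sweep(x, 2, centroids[k, ])^2))
    new_cluster <- apply(d2, 1, which.min)
    if (all(new_cluster == cluster)) break  # assignments stopped changing
    cluster <- new_cluster
  }
  list(cluster = cluster, centers = centroids, iter = iter)
}

# Toy example: two well-separated groups of 20 points in two dimensions.
set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))
fit <- simple_kmeans(x, K = 2)
table(fit$cluster)
```

Because the starting assignments are random, different seeds can lead to different final clusters on harder data, which is why `kmeans` offers the `nstart` argument to try several random starts.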
In order to perform K-means clustering, we must decide how many clusters we expect in the data. Traditionally, there are five positions on a basketball team:
• Point guard
• Shooting guard
• Small forward
• Power forward
• Center
We might naturally assume that there will be five cluster groups; however, the game of basketball has evolved over time into a more position-less game. Ultimately, we will determine whether there are more or fewer well-defined cluster groups. K-means is an unsupervised learning algorithm, meaning that there is no pre-determined outcome; the algorithm simply tries to find patterns in the data. We will characterize each cluster by the mean statistics of its players.
To best describe a player, we will use per-100-possession statistics across various categories, using data from the website basketballreference.com.
Now let's view the NBA data in our R console. Each player is described by statistics that have been tracked and documented throughout the season.
This includes:
• Games
• Games started
• Minutes
• 2pt & 3pt field goals/attempts/percentage
• Free throws/attempts/percentage
• Assists, turnovers
• Offensive, defensive, and total rebounds
• Fouls
• Blocks
• Steals
Below are some of the Per-100 possession statistics for each player during the 2018-2019 NBA regular season.
| rk | player | pos | age | tm | g | gs | mp | fg | fga | fgpercent | x3p | x3pa | x3ppercent | x2p | x2pa | x2ppercent | ft | fta | ftpercent | orb | drb | trb | ast | stl | blk | tov | pf | pts | x | ortg | drtg | link |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Álex Abrines | SG | 25 | OKC | 31 | 2 | 588 | 4.4 | 12.5 | 0.357 | 3.3 | 10.1 | 0.323 | 1.2 | 2.4 | 0.500 | 1.0 | 1.0 | 0.923 | 0.4 | 3.4 | 3.8 | 1.6 | 1.3 | 0.5 | 1.1 | 4.2 | 13.1 | NA | 103 | 111 | /players/a/abrinal01.html |
| 2 | Quincy Acy | PF | 28 | PHO | 10 | 0 | 123 | 1.6 | 7.0 | 0.222 | 0.8 | 5.8 | 0.133 | 0.8 | 1.2 | 0.667 | 2.7 | 3.9 | 0.700 | 1.2 | 8.5 | 9.7 | 3.1 | 0.4 | 1.6 | 1.6 | 9.3 | 6.6 | NA | 87 | 116 | /players/a/acyqu01.html |
| 3 | Jaylen Adams | PG | 22 | ATL | 34 | 1 | 428 | 4.1 | 11.9 | 0.345 | 2.7 | 8.0 | 0.338 | 1.4 | 3.9 | 0.361 | 0.8 | 1.0 | 0.778 | 1.2 | 5.3 | 6.5 | 7.0 | 1.5 | 0.5 | 3.0 | 4.9 | 11.7 | NA | 99 | 115 | /players/a/adamsja01.html |
| 4 | Steven Adams | C | 25 | OKC | 80 | 80 | 2669 | 8.4 | 14.1 | 0.595 | 0.0 | 0.0 | 0.000 | 8.4 | 14.1 | 0.596 | 2.6 | 5.1 | 0.500 | 6.8 | 6.5 | 13.3 | 2.2 | 2.0 | 1.3 | 2.4 | 3.6 | 19.4 | NA | 120 | 106 | /players/a/adamsst01.html |
| 5 | Bam Adebayo | C | 21 | MIA | 82 | 28 | 1913 | 7.2 | 12.4 | 0.576 | 0.1 | 0.4 | 0.200 | 7.1 | 12.0 | 0.588 | 4.2 | 5.8 | 0.735 | 4.2 | 11.0 | 15.2 | 4.7 | 1.8 | 1.7 | 3.1 | 5.2 | 18.6 | NA | 120 | 104 | /players/a/adebaba01.html |
| 6 | Deng Adel | SF | 21 | CLE | 19 | 3 | 194 | 2.8 | 9.2 | 0.306 | 1.5 | 5.9 | 0.261 | 1.3 | 3.3 | 0.385 | 1.0 | 1.0 | 1.000 | 0.8 | 4.1 | 4.9 | 1.3 | 0.3 | 1.0 | 1.5 | 3.3 | 8.2 | NA | 85 | 121 | /players/a/adelde01.html |
Now let’s view the structure of this data set.
## 'data.frame': 708 obs. of 33 variables:
## $ rk : num 1 2 3 4 5 6 7 8 9 10 ...
## $ player : chr "Álex Abrines" "Quincy Acy" "Jaylen Adams" "Steven Adams" ...
## $ pos : chr "SG" "PF" "PG" "C" ...
## $ age : num 25 28 22 25 21 21 25 33 21 23 ...
## $ tm : chr "OKC" "PHO" "ATL" "OKC" ...
## $ g : num 31 10 34 80 82 19 7 81 10 38 ...
## $ gs : num 2 0 1 80 28 3 0 81 1 2 ...
## $ mp : num 588 123 428 2669 1913 ...
## $ fg : num 4.4 1.6 4.1 8.4 7.2 2.8 6.7 12.4 5.3 7.7 ...
## $ fga : num 12.5 7 11.9 14.1 12.4 9.2 22.3 24 15.8 20.5 ...
## $ fgpercent : num 0.357 0.222 0.345 0.595 0.576 0.306 0.3 0.519 0.333 0.376 ...
## $ x3p : num 3.3 0.8 2.7 0 0.1 1.5 0 0.2 1.2 3.7 ...
## $ x3pa : num 10.1 5.8 8 0 0.4 5.9 8.9 0.8 4.8 11.4 ...
## $ x3ppercent: num 0.323 0.133 0.338 0 0.2 0.261 0 0.238 0.25 0.323 ...
## $ x2p : num 1.2 0.8 1.4 8.4 7.1 1.3 6.7 12.3 4 4 ...
## $ x2pa : num 2.4 1.2 3.9 14.1 12 3.3 13.4 23.2 10.9 9.1 ...
## $ x2ppercent: num 0.5 0.667 0.361 0.596 0.588 0.385 0.5 0.528 0.37 0.443 ...
## $ ft : num 1 2.7 0.8 2.6 4.2 1 2.2 6.3 3.2 5.2 ...
## $ fta : num 1 3.9 1 5.1 5.8 1 4.5 7.5 4.8 6.9 ...
## $ ftpercent : num 0.923 0.7 0.778 0.5 0.735 1 0.5 0.847 0.667 0.75 ...
## $ orb : num 0.4 1.2 1.2 6.8 4.2 0.8 2.2 4.6 4.4 0.3 ...
## $ drb : num 3.4 8.5 5.3 6.5 11 4.1 6.7 9 6.1 2.3 ...
## $ trb : num 3.8 9.7 6.5 13.3 15.2 4.9 8.9 13.5 10.5 2.6 ...
## $ ast : num 1.6 3.1 7 2.2 4.7 1.3 13.4 3.5 5.3 2.9 ...
## $ stl : num 1.3 0.4 1.5 2 1.8 0.3 4.5 0.8 0.4 0.7 ...
## $ blk : num 0.5 1.6 0.5 1.3 1.7 1 0 1.9 0 0.7 ...
## $ tov : num 1.1 1.6 3 2.4 3.1 1.5 4.5 2.6 3.2 3.8 ...
## $ pf : num 4.2 9.3 4.9 3.6 5.2 3.3 8.9 3.3 2.8 5.4 ...
## $ pts : num 13.1 6.6 11.7 19.4 18.6 8.2 15.6 31.4 15 24.3 ...
## $ x : num NA NA NA NA NA NA NA NA NA NA ...
## $ ortg : num 103 87 99 120 120 85 84 117 93 95 ...
## $ drtg : num 111 116 115 106 104 121 104 110 117 111 ...
## $ link : chr "/players/a/abrinal01.html" "/players/a/acyqu01.html" "/players/a/adamsja01.html" "/players/a/adamsst01.html" ...
We will not be using the following variables in our analysis:
• rk
• player
• tm
• g, gs
• age
• fg/x3p/x2p percent
• o/d rb
• x
• o/d rtg
• link
These variables do not contribute meaningfully to the clustering.
Now we will conduct some basic exploratory data analysis to get a better understanding of the data set that we will be using. First, we will create a data frame containing only the numeric columns that will be used for clustering.
| fg | fga | x3p | x3pa | x2p | x2pa | ft | fta | trb | ast | blk | stl | tov | pts |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4.4 | 12.5 | 3.3 | 10.1 | 1.2 | 2.4 | 1.0 | 1.0 | 3.8 | 1.6 | 0.5 | 1.3 | 1.1 | 13.1 |
| 1.6 | 7.0 | 0.8 | 5.8 | 0.8 | 1.2 | 2.7 | 3.9 | 9.7 | 3.1 | 1.6 | 0.4 | 1.6 | 6.6 |
| 4.1 | 11.9 | 2.7 | 8.0 | 1.4 | 3.9 | 0.8 | 1.0 | 6.5 | 7.0 | 0.5 | 1.5 | 3.0 | 11.7 |
| 8.4 | 14.1 | 0.0 | 0.0 | 8.4 | 14.1 | 2.6 | 5.1 | 13.3 | 2.2 | 1.3 | 2.0 | 2.4 | 19.4 |
| 7.2 | 12.4 | 0.1 | 0.4 | 7.1 | 12.0 | 4.2 | 5.8 | 15.2 | 4.7 | 1.7 | 1.8 | 3.1 | 18.6 |
| 2.8 | 9.2 | 1.5 | 5.9 | 1.3 | 3.3 | 1.0 | 1.0 | 4.9 | 1.3 | 1.0 | 0.3 | 1.5 | 8.2 |
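One way to build such a data frame is to select the clustering columns by name. The sketch below is an assumption about how this could be done, not the original code: the names `players`, `keep`, and `per100` are hypothetical, and the toy data frame holds only a few columns from the first rows shown earlier.

```r
# Toy stand-in for the full data frame (values taken from the first rows above).
players <- data.frame(
  player = c("Quincy Acy", "Jaylen Adams", "Steven Adams"),
  pos    = c("PF", "PG", "C"),
  fg     = c(1.6, 4.1, 8.4),
  fga    = c(7.0, 11.9, 14.1),
  trb    = c(9.7, 6.5, 13.3),
  ast    = c(3.1, 7.0, 2.2),
  pts    = c(6.6, 11.7, 19.4),
  stringsAsFactors = FALSE
)

# The 14 clustering variables listed in the table above.
keep <- c("fg", "fga", "x3p", "x3pa", "x2p", "x2pa", "ft", "fta",
          "trb", "ast", "blk", "stl", "tov", "pts")

# Keep only the columns that are present in this toy example.
cluster_vars <- intersect(keep, names(players))
per100 <- players[, cluster_vars]

# K-means needs numeric input, so check every selected column.
stopifnot(all(sapply(per100, is.numeric)))
per100
```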
Now let’s examine the correlation between each of these variables to determine which axes to use when constructing and evaluating our cluster plot. The correlation plot between the different statistical categories will show whether there are any interesting interactions among them. We will use the ggcorrplot library to help visualize the correlation between each pair of variables.
As expected, there are no surprises here. I will refer back to this graph in my final assessment of each cluster group.
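A correlation matrix of this kind can be computed with base R's `cor` function and then passed to `ggcorrplot` for plotting. The sketch below is hedged: it uses only the six rows and a handful of columns from the table shown earlier (the name `per100_head` is hypothetical), and the plotting call runs only if the ggcorrplot package happens to be installed.

```r
# Six rows and five of the columns from the table above.
per100_head <- data.frame(
  fg   = c(4.4, 1.6, 4.1, 8.4, 7.2, 2.8),
  x3p  = c(3.3, 0.8, 2.7, 0.0, 0.1, 1.5),
  x3pa = c(10.1, 5.8, 8.0, 0.0, 0.4, 5.9),
  trb  = c(3.8, 9.7, 6.5, 13.3, 15.2, 4.9),
  pts  = c(13.1, 6.6, 11.7, 19.4, 18.6, 8.2)
)

# Pearson correlation between every pair of columns.
corr <- cor(per100_head)
round(corr, 2)

# Optional visualization, skipped when ggcorrplot is not available.
if (requireNamespace("ggcorrplot", quietly = TRUE)) {
  print(ggcorrplot::ggcorrplot(corr, type = "lower", lab = TRUE))
}
```

With only six rows the coefficients are not meaningful estimates; the point is the shape of the workflow. Makes and attempts of the same shot type (for example `x3p` and `x3pa`) are strongly correlated even in this tiny sample.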
There are two ways to assess how the value of \(k\) affects the clustering, so that we can draw sensible conclusions about the data: the elbow method and the gap statistic.
The elbow method uses the within-cluster sum of squared deviations between each observation and its cluster centroid. A cluster with a small sum of squares is more compact than one with a large sum of squares. The total will only get smaller as we increase the value of \(k\) and the clusters get smaller. Thus the point of interest in the plot is the “elbow” (or “knee”), where the within-cluster sum of squares drops considerably and then levels out for larger values of \(k\).
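For reference, the quantity tracked by the elbow method can be written as follows. The notation is an assumption introduced here for illustration: \(C_k\) is the set of observations in cluster \(k\), \(x_{ij}\) is the value of feature \(j\) for observation \(i\), \(\bar{x}_{kj}\) is the mean of feature \(j\) in cluster \(k\), and \(p\) is the number of features.

\[
W(C_k) = \sum_{i \in C_k} \sum_{j=1}^{p} \left( x_{ij} - \bar{x}_{kj} \right)^2,
\qquad
W_{\text{total}} = \sum_{k=1}^{K} W(C_k)
\]

Here \(W(C_k)\) is the sum of squares within a single cluster and \(W_{\text{total}}\) is the value plotted against the number of clusters.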
To apply the elbow method and look for the optimal number of clusters, we first set a random seed so that the results are reproducible. We then run k-means clustering on our data frame for each value from \(k = 2\) to \(k = 10\) and record the within-cluster sum of squares.
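A minimal sketch of this computation is given here. It runs on simulated data rather than the NBA data frame, and the seed, the `toy` matrix, and the `nstart` and `iter.max` settings are assumptions chosen for illustration.

```r
set.seed(123)  # arbitrary seed, used only for reproducibility

# Simulated stand-in for the per-100 data: three groups in four dimensions.
toy <- rbind(
  matrix(rnorm(200, mean = 0), ncol = 4),
  matrix(rnorm(200, mean = 4), ncol = 4),
  matrix(rnorm(200, mean = 8), ncol = 4)
)

# Total within-cluster sum of squares for k = 2, ..., 10.
k_values <- 2:10
wss <- sapply(k_values, function(k) {
  kmeans(toy, centers = k, nstart = 10, iter.max = 50)$tot.withinss
})

# Elbow plot: look for the k after which the curve flattens out.
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

On this simulated data the curve should bend sharply at three clusters because the groups are well separated; real data often gives a much smoother curve.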
Below is the plot of the Elbow Method for values \(k = 2\) to \(k = 10\).
As we can see from this plot, there is no distinctive elbow/knee present.
Thus we will move on to the second method for finding the optimal value of \(k\), known as the gap statistic. The higher the gap statistic, the better our value of \(k\). Again, our goal is to get the best clusters while keeping \(k\) small.
As before, we set a random seed for the simulation. To determine and visualize the optimal number of clusters, we will plot the gap statistic.
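One possible way to compute the gap statistic is the `clusGap` function from the cluster package. This is a sketch under assumptions, not necessarily the function used for the plot in this report: the data are simulated, and the seed, the number of bootstrap samples `B`, and the `nstart` and `iter.max` settings are illustrative choices.

```r
library(cluster)  # provides clusGap(); a recommended package shipped with most R installs

set.seed(123)  # arbitrary seed so the bootstrap reference data are reproducible

# Simulated stand-in for the per-100 data: three groups in four dimensions.
toy <- rbind(
  matrix(rnorm(120, mean = 0), ncol = 4),
  matrix(rnorm(120, mean = 5), ncol = 4),
  matrix(rnorm(120, mean = 10), ncol = 4)
)

# Gap statistic for k = 1, ..., 10; extra arguments are passed on to kmeans().
gap <- clusGap(toy, FUNcluster = kmeans, K.max = 10, B = 20,
               verbose = FALSE, nstart = 10, iter.max = 50)

# Extract the gap values and the k at which they peak.
gap_values <- gap$Tab[, "gap"]
best_k <- which.max(gap_values)

plot(gap, main = "Gap statistic (simulated data)")
```

Taking the global maximum is the simplest selection rule; the cluster package also offers `maxSE` for rules that account for the simulation standard error.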
As you can see from the above plot, the gap statistic peaks at \(k = 8\).
In order to perform k-means clustering on the per-100 data, we will use the kmeans function, which partitions the points into \(k\) groups such that the sum of squares from the points to their assigned cluster centres is minimized.
Let’s take a look at the structure of the result.
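A hedged sketch of such a call with \(k = 8\) follows. It uses simulated data in place of the per-100 data frame, and the seed and the `nstart` and `iter.max` values are assumptions rather than the settings used in this report.

```r
set.seed(123)  # arbitrary seed; k-means starts from random centres

# Simulated stand-in for the per-100 data frame (150 rows, 4 features).
toy <- rbind(
  matrix(rnorm(200, mean = 0), ncol = 4),
  matrix(rnorm(200, mean = 4), ncol = 4),
  matrix(rnorm(200, mean = 8), ncol = 4)
)

# Fit k-means with 8 clusters, keeping the best of 25 random starts.
fit <- kmeans(toy, centers = 8, nstart = 25, iter.max = 50)

fit$size             # number of observations in each cluster
head(fit$cluster)    # cluster label of the first few observations
dim(fit$centers)     # one row per cluster, one column per feature
```

Using several random starts through `nstart` reduces the chance of reporting a poor local optimum.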
## List of 9
## $ cluster : int [1:708] 6 8 8 4 4 8 7 1 7 3 ...
## $ centers : num [1:8, 1:14] 10.62 39 9.32 6.87 12.17 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:8] "1" "2" "3" "4" ...
## .. ..$ : chr [1:14] "fg" "fga" "x3p" "x3pa" ...
## $ totss : num 146662
## $ withinss : num [1:8] 7967 1685 5573 8345 3869 ...
## $ tot.withinss: num 53355
## $ betweenss : num 93307
## $ size : int [1:8] 52 2 110 107 39 146 160 92
## $ iter : int 5
## $ ifault : int 0
## - attr(*, "class")= chr "kmeans"
Here is a summary of this object.
## Length Class Mode
## cluster 708 -none- numeric
## centers 112 -none- numeric
## totss 1 -none- numeric
## withinss 8 -none- numeric
## tot.withinss 1 -none- numeric
## betweenss 1 -none- numeric
## size 8 -none- numeric
## iter 1 -none- numeric
## ifault 1 -none- numeric
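The sums of squares printed in the structure above are related: the total sum of squares equals the within-cluster total plus the between-cluster sum of squares. The short check below uses the rounded values from that output (the variable names are chosen here for illustration) and also computes the share of the total variation that lies between clusters.

```r
# Rounded values copied from the structure output above.
totss        <- 146662
tot_withinss <- 53355
betweenss    <- 93307

# Decomposition: total = within + between.
stopifnot(tot_withinss + betweenss == totss)

# Share of the total sum of squares explained by the clustering.
explained <- betweenss / totss
round(explained, 3)  # 0.636
```

In other words, the eight clusters account for roughly 64% of the total sum of squares.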
We will use the aggregate function to generate the cluster means below.
| Group.1 | fg | fga | x3p | x3pa | x2p | x2pa | ft | fta | trb | ast | blk | stl | tov | pts |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 10.623077 | 20.830769 | 0.5269231 | 2.019231 | 10.109615 | 18.809615 | 4.738461 | 7.057692 | 17.609615 | 3.026923 | 1.6769231 | 1.180769 | 3.403846 | 26.521154 |
| 2 | 39.000000 | 48.650000 | 0.0000000 | 0.000000 | 39.000000 | 48.650000 | 0.000000 | 0.000000 | 14.450000 | 4.800000 | 0.0000000 | 0.000000 | 0.000000 | 77.950000 |
| 3 | 9.320000 | 20.960000 | 2.8600000 | 8.170909 | 6.462727 | 12.786364 | 3.822727 | 4.968182 | 7.421818 | 4.700000 | 0.7163636 | 1.399091 | 2.710909 | 25.321818 |
| 4 | 6.871963 | 13.998131 | 0.5925234 | 2.141122 | 6.272897 | 11.862617 | 3.471028 | 5.148598 | 14.450467 | 3.697196 | 1.7925234 | 1.568224 | 2.519626 | 17.796262 |
| 5 | 12.166667 | 25.502564 | 3.0948718 | 8.117949 | 9.076923 | 17.376923 | 6.333333 | 7.823077 | 9.558974 | 8.484615 | 0.8820513 | 1.846154 | 3.933333 | 33.761539 |
| 6 | 5.899315 | 15.307534 | 3.2623288 | 9.693151 | 2.642466 | 5.615068 | 1.845890 | 2.383562 | 6.244521 | 3.117123 | 0.6575342 | 1.273973 | 1.778082 | 16.909589 |
| 7 | 6.803125 | 16.301250 | 1.9556250 | 6.092500 | 4.841875 | 10.207500 | 2.690625 | 3.580000 | 6.594375 | 5.117500 | 0.5693750 | 1.658125 | 2.513750 | 18.256875 |
| 8 | 3.385870 | 9.869565 | 1.2793478 | 4.773913 | 2.108696 | 5.095652 | 1.226087 | 1.796739 | 8.044565 | 3.669565 | 0.6532609 | 1.568478 | 2.081522 | 9.279348 |
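A table of this form, including the `Group.1` column, is what `aggregate` produces when it is grouped by an unnamed list. The sketch below shows the pattern on simulated data; the data frame `toy_df`, its column names, the seed, and the choice of three clusters are assumptions.

```r
set.seed(123)  # arbitrary seed for reproducibility

# Simulated stand-in data frame with two made-up features.
toy_df <- data.frame(
  pts = c(rnorm(30, mean = 10), rnorm(30, mean = 25)),
  trb = c(rnorm(30, mean = 5),  rnorm(30, mean = 12))
)

fit <- kmeans(toy_df, centers = 3, nstart = 25)

# Mean of every column within each cluster; the grouping column is
# named Group.1 because the list passed to `by` has no names.
cluster_means <- aggregate(toy_df, by = list(fit$cluster), FUN = mean)
cluster_means
```

The resulting means are the same values that `kmeans` stores in `fit$centers`, so either can be used to describe the clusters.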
Now we will create a simple cluster plot to illustrate the various cluster groups.
Here is a fancier plot to help distinguish each cluster.
Here is another cluster plot to further illustrate each group.
Now that we have generated the cluster groups, I will give each group’s stat line and provide an example of a player who falls into that category.
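Cluster plots of this kind usually project the data onto two dimensions and colour the points by cluster. The sketch below uses a principal component projection with base R graphics; this is a swapped-in technique for illustration, not necessarily the plotting functions used for the figures in this report, and the data, seed, and number of clusters are assumptions.

```r
set.seed(123)  # arbitrary seed for reproducibility

# Simulated stand-in for the per-100 data: three groups in four dimensions.
toy <- rbind(
  matrix(rnorm(200, mean = 0), ncol = 4),
  matrix(rnorm(200, mean = 4), ncol = 4),
  matrix(rnorm(200, mean = 8), ncol = 4)
)

fit <- kmeans(toy, centers = 3, nstart = 25)

# Project the data onto its first two principal components.
pca <- prcomp(toy, scale. = TRUE)
scores <- pca$x[, 1:2]

# Scatter plot of the projection, coloured by cluster label.
plot(scores, col = fit$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2",
     main = "K-means clusters (simulated data)")
legend("topright",
       legend = paste("Cluster", sort(unique(fit$cluster))),
       col = sort(unique(fit$cluster)), pch = 19)
```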
HIGH LEVEL STARTER FRINGE ALL STAR
This is typically an all-around player who contributes to the team’s success across various statistical categories, such as scoring, defending multiple positions, shooting, and distributing the ball well.
STAT LINE PER 100 POSSESSIONS
9.3 FG, 20.8 FGA, 2.9 X3P, 8.2 X3PA, 6.4 X2P, 12.6 X2PA, 3.8 FT, 4.9 FTA, 7.6 TRB, 4.5 AST, 0.7 BLK, 1.4 STL, 2.7 TOV, 25.2 PTS
Example: Malcolm Brogdon, starting point guard from the Milwaukee Bucks
SUPERSTAR ALL NBA
This player is usually the best player on the team and one of the 15 best players in the entire NBA, and is among the league leaders in various categories such as points and assists.
STAT LINE PER 100 POSSESSIONS
11.9 FG, 25.6 FGA, 3.3 X3P, 8.5 X3PA, 8.6 X2P, 17.1 X2PA, 5.5 FT, 6.8 FTA, 7.4 TRB, 9.0 AST, 0.6 BLK, 1.9 STL, 3.8 TOV, 32.6 PTS
Example: James Harden, shooting guard for the Houston Rockets
OFF THE BENCH SCORING 6TH MAN
This player comes into the game to give the star players rest, filling their role to a lesser extent and usually less efficiently.
STAT LINE PER 100 POSSESSIONS
6.8 FG, 16.3 FGA, 1.9 X3P, 6.1 X3PA, 4.8 X2P, 10.2 X2PA, 2.7 FT, 3.6 FTA, 6.7 TRB, 5.1 AST, 0.6 BLK, 1.6 STL, 2.5 TOV, 18.2 PTS
Example: Yogi Ferrell, back up point guard for the Sacramento Kings
TRADITIONAL BIG REBOUNDER INSIDE SCORING
Usually a center who does most of his work near the basket, with interior scoring and elite rebounding capabilities.
STAT LINE PER 100 POSSESSIONS
7.0 FG, 14.1 FGA, 0.5 X3P, 1.9 X3PA, 6.5 X2P, 12.2 X2PA, 3.6 FT, 5.3 FTA, 14.5 TRB, 3.7 AST, 1.8 BLK, 1.6 STL, 2.6 TOV, 18.1 PTS
Example: Bam Adebayo, point centre for the Miami Heat
3 AND D
Elite three point shooter and great perimeter defender.
STAT LINE PER 100 POSSESSIONS
5.9 FG, 15.3 FGA, 3.3 X3P, 9.7 X3PA, 2.6 X2P, 5.6 X2PA, 1.8 FT, 2.4 FTA, 6.2 TRB, 3.1 AST, 0.7 BLK, 1.2 STL, 1.8 TOV, 16.9 PTS
Example: JR Smith, shooting guard for the Cleveland Cavaliers
BACK TO THE BASKET SCORING BIG MAN REBOUNDER
A more polished interior and mid-range scorer, a superb rebounder, and a focal point of the offense.
STAT LINE PER 100 POSSESSIONS
11.1 FG, 21.8 FGA, 0.7 X3P, 2.6 X3PA, 10.4 X2P, 19.2 X2PA, 5.4 FT, 7.6 FTA, 17.8 TRB, 3.4 AST, 1.7 BLK, 1.2 STL, 3.5 TOV, 28.4 PTS
Example: Enes Kanter, power forward for the Boston Celtics
ALL AROUND ROLE PLAYER
Can essentially fill any position on the court at both the defensive and offensive ends, and works well with any line-up combination.
STAT LINE PER 100 POSSESSIONS
3.4 FG, 9.9 FGA, 1.3 X3P, 4.8 X3PA, 2.1 X2P, 5.1 X2PA, 1.2 FT, 1.8 FTA, 8.0 TRB, 3.7 AST, 0.7 BLK, 1.5 STL, 2.1 TOV, 9.3 PTS
Example: PJ Tucker, stretch 4 man for the Houston Rockets
The remaining cluster contains only two players and is an outlier: their statistics are inflated by the conversion to a per-100-possession basis.