Introduction
Description of the Dataset
Data Preparation
Principal Component Analysis
Clustering of PCA Results from old and modern seasons
Market Basket Analysis
Conclusions
References
Each and every day, enormous amounts of data are being collected. One may think that the more data we have, the better the models we can build. However, working with many variables can actually cause significant problems for machine learning techniques such as clustering. This issue even has its own name: the Curse of Dimensionality. Too many variables may introduce noise into the data, making it harder to work with, and the computation time of the algorithms increases as well. Therefore, it is important to choose the right tools for the job.
The goal of this analysis is to use unsupervised learning techniques, namely Principal Component Analysis (PCA), Clustering and Market Basket Analysis (MBA), to obtain valuable insights about NBA players. PCA reduces the number of dimensions of the data while retaining as much variance as possible. The task of clustering is to group data points that are similar to each other and different from the rest. MBA mines association rules of the form A => B: if a player has a set of attributes A, he is likely to also have attribute B.
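For a rule A => B, the two standard measures are support (how often A and B occur together in the data) and confidence (how often B occurs given A): \[supp(A \Rightarrow B) = \frac{|\{t : A \cup B \subseteq t\}|}{N}, \qquad conf(A \Rightarrow B) = \frac{supp(A \cup B)}{supp(A)}\]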
I want to see how NBA players have changed over the last forty years and what patterns can be discovered in the data! I hope these methods will let me learn more about basketball and see what story the data carries.
I used Python and the Beautiful Soup package to scrape the data from the place where data science meets basketball: Basketball-reference.com, by far the best site for basketball data. I decided to analyze player stats per game from 1980 to 2020, because the 3-point line was first introduced in the 1979-1980 NBA season. Moreover, the further back we go with basketball data, the more missing values there are.
My dataset contains around 21 000 observations and 29 variables:
Player - player’s name
Pos - position
Age - player’s age on February 1 of the season
Tm - team
G - number of played games
GS - number of games this player started
MP - minutes played per game
FG - field goals made per game (how many shots per game the player made)
FGA - field goal attempts per game (how many shots per game the player attempted)
FG. - field goal percentage (FG/FGA)
3P - 3-point field goals per game
3PA - 3-point field goal attempts per game
3P. - 3-point field goal percentage
2P - 2-point field goals per game
2PA - 2-point field goal attempts per game
2P. - 2-point field goal percentage
eFG. - effective field goal percentage; this statistic adjusts for the fact that a 3-point field goal is worth more than a 2-point one
FT - free throws made per game
FTA - free throw attempts per game
FT. - free throw percentage
ORB - offensive rebounds per game
DRB - defensive rebounds per game
TRB - total rebounds per game
AST - assists per game
STL - steals per game
BLK - blocks per game
TOV - turnovers per game (losing the ball during a possession)
PF - personal fouls per game
PTS - points per game
Let’s load the necessary packages:
library(dplyr)
library(ggplot2)
library(VIM)
library(factoextra)
library(psych)
library(corrplot)
library(maptools)
library(ggfortify)
library(cluster)
library(NbClust)
library(clustertend)
library(ggpubr)
Let’s have a look at the data using the VIM package:
data <- read.csv("nba_player_stats.csv") ##load the data
aggr_plot_0 <- aggr(data, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(data), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## X3P. 0.20627926
## FT. 0.07872028
## GS 0.06926467
## X2P. 0.04553383
## FG. 0.04296337
## eFG. 0.04296337
## Age 0.03837327
## G 0.03837327
## MP 0.03837327
## FG 0.03837327
## FGA 0.03837327
## X3P 0.03837327
## X3PA 0.03837327
## X2P 0.03837327
## X2PA 0.03837327
## FT 0.03837327
## FTA 0.03837327
## ORB 0.03837327
## DRB 0.03837327
## TRB 0.03837327
## AST 0.03837327
## STL 0.03837327
## BLK 0.03837327
## TOV 0.03837327
## PF 0.03837327
## PTS 0.03837327
## Player 0.00000000
## Pos 0.00000000
## Tm 0.00000000
The scrape picked up some empty rows, so let’s remove them. The only rows missing a value in the Age column are the fully empty ones.
df <- data %>%
filter(!is.na(Age))
aggr_plot_1 <- aggr(df, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(data), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## X3P. 0.174606205
## FT. 0.041957041
## GS 0.032124105
## X2P. 0.007446301
## FG. 0.004773270
## eFG. 0.004773270
## Player 0.000000000
## Pos 0.000000000
## Age 0.000000000
## Tm 0.000000000
## G 0.000000000
## MP 0.000000000
## FG 0.000000000
## FGA 0.000000000
## X3P 0.000000000
## X3PA 0.000000000
## X2P 0.000000000
## X2PA 0.000000000
## FT 0.000000000
## FTA 0.000000000
## ORB 0.000000000
## DRB 0.000000000
## TRB 0.000000000
## AST 0.000000000
## STL 0.000000000
## BLK 0.000000000
## TOV 0.000000000
## PF 0.000000000
## PTS 0.000000000
summary(df)
## Player Pos Age Tm
## Length:20950 Length:20950 Min. :18.00 Length:20950
## Class :character Class :character 1st Qu.:24.00 Class :character
## Mode :character Mode :character Median :26.00 Mode :character
## Mean :26.76
## 3rd Qu.:29.00
## Max. :44.00
##
## G GS MP FG
## Min. : 1.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.:25.00 1st Qu.: 0.00 1st Qu.:11.40 1st Qu.: 1.300
## Median :54.00 Median : 8.00 Median :19.10 Median : 2.500
## Mean :48.85 Mean :23.16 Mean :19.85 Mean : 3.062
## 3rd Qu.:74.00 3rd Qu.:44.00 3rd Qu.:28.20 3rd Qu.: 4.400
## Max. :85.00 Max. :83.00 Max. :44.50 Max. :13.400
## NA's :673
## FGA FG. X3P X3PA
## Min. : 0.000 Min. :0.0000 Min. :0.0000 Min. : 0.000
## 1st Qu.: 3.100 1st Qu.:0.4000 1st Qu.:0.0000 1st Qu.: 0.000
## Median : 5.600 Median :0.4440 Median :0.1000 Median : 0.500
## Mean : 6.767 Mean :0.4385 Mean :0.4236 Mean : 1.236
## 3rd Qu.: 9.500 3rd Qu.:0.4870 3rd Qu.:0.7000 3rd Qu.: 2.000
## Max. :27.800 Max. :1.0000 Max. :5.1000 Max. :13.200
## NA's :100
## X3P. X2P X2PA X2P.
## Min. :0.000 Min. : 0.000 Min. : 0.00 Min. :0.000
## 1st Qu.:0.129 1st Qu.: 1.000 1st Qu.: 2.40 1st Qu.:0.425
## Median :0.300 Median : 2.050 Median : 4.40 Median :0.470
## Mean :0.256 Mean : 2.638 Mean : 5.53 Mean :0.461
## 3rd Qu.:0.365 3rd Qu.: 3.700 3rd Qu.: 7.80 3rd Qu.:0.507
## Max. :1.000 Max. :13.200 Max. :27.00 Max. :1.000
## NA's :3658 NA's :156
## eFG. FT FTA FT.
## Min. :0.0000 Min. : 0.000 Min. : 0.000 Min. :0.0000
## 1st Qu.:0.4350 1st Qu.: 0.500 1st Qu.: 0.800 1st Qu.:0.6610
## Median :0.4780 Median : 1.100 Median : 1.500 Median :0.7500
## Mean :0.4677 Mean : 1.523 Mean : 2.042 Mean :0.7236
## 3rd Qu.:0.5150 3rd Qu.: 2.100 3rd Qu.: 2.800 3rd Qu.:0.8150
## Max. :1.5000 Max. :10.300 Max. :13.100 Max. :1.0000
## NA's :100 NA's :879
## ORB DRB TRB AST
## Min. :0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.:0.400 1st Qu.: 1.200 1st Qu.: 1.600 1st Qu.: 0.600
## Median :0.800 Median : 2.000 Median : 2.800 Median : 1.300
## Mean :1.008 Mean : 2.462 Mean : 3.468 Mean : 1.846
## 3rd Qu.:1.400 3rd Qu.: 3.300 3rd Qu.: 4.700 3rd Qu.: 2.500
## Max. :7.000 Max. :13.000 Max. :18.700 Max. :14.500
##
## STL BLK TOV PF
## Min. :0.0000 Min. :0.0000 Min. :0.000 Min. :0.000
## 1st Qu.:0.3000 1st Qu.:0.1000 1st Qu.:0.600 1st Qu.:1.300
## Median :0.6000 Median :0.2000 Median :1.100 Median :1.900
## Mean :0.6561 Mean :0.4035 Mean :1.236 Mean :1.908
## 3rd Qu.:0.9000 3rd Qu.:0.5000 3rd Qu.:1.700 3rd Qu.:2.500
## Max. :3.7000 Max. :6.0000 Max. :5.700 Max. :6.000
##
## PTS
## Min. : 0.000
## 1st Qu.: 3.400
## Median : 6.500
## Mean : 8.068
## 3rd Qu.:11.500
## Max. :37.100
##
We can observe that there are a lot of missing values in the X3P. column. This is because a player who did not attempt a single 3-point shot during the season has no 3-point field goal percentage. There are 3658 missing values in total, and removing the affected observations would delete 17.5% of the dataset. Furthermore, these values are not missing at random, so substitution is definitely a better choice than removing the variable.
Replacing the NAs with the mean value doesn’t make much sense: players who never attempted a 3-pointer in their career would suddenly be making 1 out of 4 three-point attempts per game. Instead, I will substitute them with zeros. This drags down the mean of the variable, but it seems to be the best solution here.
A little more than 4% of the free throw percentage values are missing, and it’s the same story as with the 3-point percentage. The last variable with more than 1% missing values is GS, which stands for Games Started. Here it’s also better to substitute the NAs with zeros. Even though some first-team players may end up recorded as starting from the bench, it doesn’t affect the data much.
The remaining variables with NAs are 2-point field goal percentage (X2P.), field goal percentage (FG.) and effective field goal percentage (eFG.). The missing values in FG. and eFG. occur in exactly the same rows and constitute less than 0.5% of all observations, so we may simply remove them. We may also remove the rows with NAs in the X2P. column, because only 56 of them remain afterwards.
df <- df %>%
mutate(X3P. = ifelse(is.na(X3P.), 0, X3P.))
df <- df %>% mutate(FT. = ifelse(is.na(FT.), 0, FT.),
GS = ifelse(is.na(GS), 0, GS))
df <- df %>% filter(complete.cases(.))
summary(df)
## Player Pos Age Tm
## Length:20794 Length:20794 Min. :18.00 Length:20794
## Class :character Class :character 1st Qu.:24.00 Class :character
## Mode :character Mode :character Median :26.00 Mode :character
## Mean :26.77
## 3rd Qu.:29.00
## Max. :44.00
## G GS MP FG
## Min. : 1.00 Min. : 0.00 Min. : 0.70 Min. : 0.000
## 1st Qu.:25.00 1st Qu.: 0.00 1st Qu.:11.60 1st Qu.: 1.300
## Median :54.00 Median : 7.00 Median :19.20 Median : 2.500
## Mean :49.21 Mean :22.58 Mean :19.98 Mean : 3.084
## 3rd Qu.:75.00 3rd Qu.:42.00 3rd Qu.:28.20 3rd Qu.: 4.400
## Max. :85.00 Max. :83.00 Max. :44.50 Max. :13.400
## FGA FG. X3P X3PA
## Min. : 0.200 Min. :0.000 Min. :0.0000 Min. : 0.000
## 1st Qu.: 3.200 1st Qu.:0.400 1st Qu.:0.0000 1st Qu.: 0.000
## Median : 5.700 Median :0.444 Median :0.1000 Median : 0.500
## Mean : 6.815 Mean :0.439 Mean :0.4262 Mean : 1.243
## 3rd Qu.: 9.500 3rd Qu.:0.487 3rd Qu.:0.7000 3rd Qu.: 2.000
## Max. :27.800 Max. :1.000 Max. :5.1000 Max. :13.200
## X3P. X2P X2PA X2P.
## Min. :0.000 Min. : 0.000 Min. : 0.100 Min. :0.000
## 1st Qu.:0.000 1st Qu.: 1.100 1st Qu.: 2.400 1st Qu.:0.425
## Median :0.250 Median : 2.100 Median : 4.400 Median :0.470
## Mean :0.212 Mean : 2.657 Mean : 5.571 Mean :0.461
## 3rd Qu.:0.353 3rd Qu.: 3.800 3rd Qu.: 7.800 3rd Qu.:0.507
## Max. :1.000 Max. :13.200 Max. :27.000 Max. :1.000
## eFG. FT FTA FT.
## Min. :0.000 Min. : 0.000 Min. : 0.000 Min. :0.0000
## 1st Qu.:0.435 1st Qu.: 0.600 1st Qu.: 0.800 1st Qu.:0.6440
## Median :0.478 Median : 1.100 Median : 1.500 Median :0.7450
## Mean :0.468 Mean : 1.533 Mean : 2.055 Mean :0.6975
## 3rd Qu.:0.515 3rd Qu.: 2.100 3rd Qu.: 2.800 3rd Qu.:0.8120
## Max. :1.000 Max. :10.300 Max. :13.100 Max. :1.0000
## ORB DRB TRB AST
## Min. :0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.:0.400 1st Qu.: 1.200 1st Qu.: 1.700 1st Qu.: 0.600
## Median :0.800 Median : 2.100 Median : 2.800 Median : 1.300
## Mean :1.015 Mean : 2.478 Mean : 3.491 Mean : 1.858
## 3rd Qu.:1.400 3rd Qu.: 3.300 3rd Qu.: 4.700 3rd Qu.: 2.500
## Max. :7.000 Max. :13.000 Max. :18.700 Max. :14.500
## STL BLK TOV PF
## Min. :0.0000 Min. :0.0000 Min. :0.000 Min. :0.000
## 1st Qu.:0.3000 1st Qu.:0.1000 1st Qu.:0.600 1st Qu.:1.300
## Median :0.6000 Median :0.2000 Median :1.100 Median :1.900
## Mean :0.6605 Mean :0.4064 Mean :1.244 Mean :1.919
## 3rd Qu.:0.9000 3rd Qu.:0.5000 3rd Qu.:1.700 3rd Qu.:2.500
## Max. :3.7000 Max. :6.0000 Max. :5.700 Max. :6.000
## PTS
## Min. : 0.000
## 1st Qu.: 3.500
## Median : 6.600
## Mean : 8.126
## 3rd Qu.:11.600
## Max. :37.100
aggr_plot_2 <- aggr(df, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(df), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## Player 0
## Pos 0
## Age 0
## Tm 0
## G 0
## GS 0
## MP 0
## FG 0
## FGA 0
## FG. 0
## X3P 0
## X3PA 0
## X3P. 0
## X2P 0
## X2PA 0
## X2P. 0
## eFG. 0
## FT 0
## FTA 0
## FT. 0
## ORB 0
## DRB 0
## TRB 0
## AST 0
## STL 0
## BLK 0
## TOV 0
## PF 0
## PTS 0
Our data is almost ready, but we still have to remove the columns containing character values, because the PCA and clustering algorithms work only with numeric data. It’s worth creating a new data frame, as we will need these variables later.
ready_df <- df %>% select(c(-Tm, -Pos, -Player))
Let’s first reduce the number of dimensions and see which variables are useful and which are problematic. We want to center and scale the data: the variables are measured on different scales, so without standardization some of them would impact the results drastically more than others. The following formula is used to scale the data: \[\frac{x_i - mean(x)}{sd(x)}\]
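As a quick sanity check (a sketch using the PTS column of ready_df), this is exactly what scale() does, and what prcomp() will do internally when asked to center and scale:
#standardize one column by hand and compare with scale()
z_manual <- (ready_df$PTS - mean(ready_df$PTS)) / sd(ready_df$PTS)
all.equal(z_manual, as.numeric(scale(ready_df$PTS))) ## TRUE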
ready_df.pcal <- prcomp(ready_df, center = TRUE, scale = TRUE)
ready_df.pcal
## Standard deviations (1, .., p=26):
## [1] 3.561799206 1.885727606 1.561046383 1.147309396 1.009314289 0.935816411
## [7] 0.910441143 0.759759604 0.732612388 0.642501627 0.618182080 0.585415085
## [13] 0.536973188 0.513600264 0.368946454 0.343751603 0.332258308 0.247095855
## [19] 0.130672191 0.118218973 0.104348406 0.095259857 0.015096195 0.013896727
## [25] 0.009448687 0.007090579
##
## Rotation (n x k) = (26 x 26):
## PC1 PC2 PC3 PC4 PC5 PC6
## Age -0.01097663 -0.035546454 0.01893441 -0.28159556 0.822755216 -0.46472679
## G -0.17715283 0.003064152 0.09566874 -0.12373119 0.171271196 0.55971901
## GS -0.22078411 -0.016912450 -0.03230595 -0.11972426 0.029376117 0.15640818
## MP -0.26540505 -0.069528474 -0.01998409 -0.09375760 0.048051023 0.05311946
## FG -0.26864635 -0.058908883 -0.01973744 0.08585258 -0.058425460 -0.12004979
## FGA -0.26190878 -0.123923612 -0.06215291 0.05775409 -0.066511576 -0.11497976
## FG. -0.11182958 0.240254258 0.48445283 0.16687118 0.025415817 -0.06228079
## X3P -0.09522200 -0.395780386 0.16836627 -0.33994321 -0.208060037 -0.13338128
## X3PA -0.09656538 -0.403831860 0.14147653 -0.31708060 -0.210959792 -0.13705921
## X3P. -0.05592784 -0.328628102 0.18544724 -0.17945013 -0.071843182 0.09125819
## X2P -0.26091211 0.053095193 -0.07090421 0.19247092 -0.001730109 -0.09022775
## X2PA -0.25748236 0.020374537 -0.12646944 0.19127869 0.008609941 -0.07564859
## X2P. -0.10059577 0.179548318 0.50934341 0.14331253 -0.015192810 -0.08649943
## eFG. -0.10987305 0.101898971 0.56139700 0.02519108 -0.031795840 -0.07752961
## FT -0.24435126 -0.035091290 -0.09696219 0.17892805 -0.024074051 -0.14339216
## FTA -0.24596114 0.014672209 -0.10983122 0.16389781 -0.056944269 -0.15949958
## FT. -0.10408341 -0.154760194 0.12193760 0.01939703 0.325732902 0.45059268
## ORB -0.17193764 0.340205024 -0.09647138 -0.18796311 -0.052780543 0.02904160
## DRB -0.22133084 0.189387185 -0.05911917 -0.31530015 -0.054999445 -0.04615678
## TRB -0.21536228 0.249960218 -0.07483518 -0.28713789 -0.056891871 -0.02263865
## AST -0.17439935 -0.243541141 -0.02521452 0.24164710 0.184531147 0.07664074
## STL -0.20184963 -0.146796511 -0.03067479 0.11304823 0.092389653 0.15906012
## BLK -0.13249623 0.294497942 -0.06344197 -0.30927260 -0.120391134 0.01266459
## TOV -0.24280020 -0.066962738 -0.09926577 0.17133968 0.028301401 -0.01840197
## PF -0.20528357 0.152555514 -0.03350578 -0.15124416 0.077198494 0.19157970
## PTS -0.26866069 -0.093012157 -0.02031625 0.07146426 -0.070694932 -0.13707429
## PC7 PC8 PC9 PC10 PC11
## Age 0.0203982618 -0.08468745 0.05590760 -0.029997309 -6.060989e-02
## G 0.0975116295 -0.25254532 0.47482504 -0.016809470 -1.541947e-01
## GS 0.2451674357 -0.12245490 0.38618510 -0.162826836 3.616645e-01
## MP 0.1047014692 0.03284434 -0.01131910 0.074564859 -1.405055e-02
## FG -0.0835850482 -0.10580637 0.10351242 0.032818722 -8.761433e-02
## FGA -0.0806953331 -0.06581741 0.09651680 0.053290707 -1.119590e-01
## FG. 0.0393721228 -0.01657724 -0.01122879 -0.004880908 -1.075426e-03
## X3P 0.0002843858 0.20914338 0.12670227 0.045465681 -5.896981e-02
## X3PA 0.0122301480 0.25869615 0.12594992 0.043960123 -4.728158e-02
## X3P. -0.1157952988 -0.69914293 -0.49559640 -0.055498474 1.942082e-02
## X2P -0.0902443806 -0.17481165 0.07447223 0.021991758 -7.672006e-02
## X2PA -0.0960613889 -0.17721063 0.05876967 0.042663224 -1.077308e-01
## X2P. 0.0574419814 0.12880475 0.04862627 -0.007154679 4.664632e-02
## eFG. 0.0226842106 -0.03159506 -0.05208011 -0.007338568 -6.513683e-05
## FT -0.2411542622 0.05543881 0.04512654 -0.105349623 4.660993e-02
## FTA -0.1887955973 0.03992203 0.02540222 -0.094873307 3.336778e-02
## FT. -0.6509761494 0.33640646 -0.14627190 -0.048719126 1.632419e-01
## ORB -0.0383700620 -0.03666841 -0.10320467 0.311598473 1.834831e-01
## DRB -0.0101470282 0.05264768 -0.13079176 0.099601632 3.212110e-01
## TRB -0.0200284283 0.02473147 -0.12757786 0.176566109 2.899322e-01
## AST 0.3903355236 0.15019415 -0.21506369 -0.186274198 2.374682e-01
## STL 0.3824940832 0.14056644 -0.29626509 0.034776039 4.131361e-02
## BLK -0.0188450210 0.07835256 -0.13756015 -0.813966598 -2.332334e-01
## TOV 0.1505650366 0.12558366 -0.18955809 -0.053276732 -5.430948e-02
## PF 0.1171717716 0.17120126 -0.20763440 0.302229813 -6.523564e-01
## PTS -0.1193740441 -0.04532223 0.10138611 0.004765958 -6.107688e-02
## PC12 PC13 PC14 PC15 PC16
## Age -0.024931136 0.064301714 0.007228253 -0.024846628 -0.046951029
## G -0.275761833 0.238291999 -0.376149173 0.041727787 0.004799647
## GS 0.085611876 -0.339456398 0.612671593 -0.038200095 -0.047110343
## MP 0.102147772 -0.060603392 0.034280548 0.017817773 0.011015715
## FG 0.232552871 -0.003248433 -0.077529701 -0.012428025 0.021974345
## FGA 0.242226953 -0.003928528 -0.086908150 -0.004397996 -0.021452374
## FG. -0.011499757 -0.029375538 0.027242264 -0.174392017 0.284124972
## X3P -0.029892305 0.024021771 -0.036255203 -0.169869273 0.074991114
## X3PA -0.009308867 0.047932048 -0.063577202 -0.015225408 -0.155187113
## X3P. -0.113850267 -0.058001557 0.094001852 0.065575977 -0.148277402
## X2P 0.259065029 -0.010753979 -0.072699468 0.036538265 0.002587060
## X2PA 0.277559224 -0.023554022 -0.072685301 0.001082137 0.037299374
## X2P. 0.044142887 0.032662503 -0.012341250 0.423153901 -0.653686853
## eFG. -0.044696752 -0.021796851 0.022300596 -0.246304453 0.373900527
## FT -0.470905885 0.140625242 0.187352142 0.052063083 0.023974659
## FTA -0.499006001 0.155607112 0.166893813 0.030280533 0.009084964
## FT. 0.189692733 -0.072386614 0.063081697 -0.048091950 -0.011911142
## ORB -0.036553205 0.130024943 -0.035667666 -0.629302768 -0.401015948
## DRB -0.014548159 -0.049825814 -0.189652213 0.472366494 0.302289319
## TRB -0.022861842 0.009007682 -0.146515026 0.121744429 0.078320000
## AST -0.088408139 -0.274497540 -0.325283890 -0.119248479 0.002360862
## STL 0.192149302 0.696199555 0.268848586 0.045017407 0.072288828
## BLK 0.132233561 0.077835531 -0.046872098 -0.088173550 -0.058650242
## TOV -0.198332885 -0.297566237 -0.230005443 -0.108597989 -0.154739727
## PF -0.128921256 -0.293947996 0.288181550 0.122071060 0.030497775
## PTS 0.060400328 0.032977716 -0.017980950 -0.014111146 0.029856035
## PC17 PC18 PC19 PC20 PC21
## Age 0.0464189209 0.0294312240 -0.005814938 -0.003857841 -0.008733541
## G 0.0383475299 0.0308946539 -0.011998283 -0.005451421 0.006329323
## GS 0.1110859566 0.1149177149 -0.024459619 -0.030082074 -0.005216374
## MP -0.1956541256 -0.9066309524 0.097451418 0.068325585 -0.012977694
## FG -0.0211967527 0.1355891380 0.343278187 0.105581124 -0.114840024
## FGA -0.0082335044 0.0189060870 -0.445269979 -0.198312878 0.076359807
## FG. 0.0289059494 -0.0065971892 -0.414216705 0.503027052 -0.231960950
## X3P 0.0140493176 0.0898692064 0.151377816 0.396745526 -0.050232178
## X3PA -0.0007583355 0.0528341119 -0.267204817 -0.289611736 -0.036580590
## X3P. -0.0073347680 0.0171681683 -0.040341274 0.053951368 -0.025042163
## X2P -0.0274387053 0.1199221453 0.328726721 0.006456309 -0.108592871
## X2PA -0.0090977829 0.0004969614 -0.397791138 -0.112618378 0.101739512
## X2P. -0.0536522974 0.0055148583 0.055860548 0.016406178 0.032565428
## eFG. 0.0553017720 -0.0344446476 0.256348977 -0.557002386 0.225319624
## FT -0.1131430838 -0.0116809898 -0.052834174 0.189040283 0.643792032
## FTA -0.0855061997 -0.0454078599 -0.017755165 -0.246367623 -0.652840044
## FT. 0.0445695639 0.0317623922 0.003947488 -0.008996662 -0.052066891
## ORB -0.1038332659 0.0533713946 -0.002094248 -0.001813434 0.032563323
## DRB 0.0591555042 0.0759363719 -0.023880503 0.003764678 0.002893477
## TRB 0.0064723587 0.0722439911 -0.018393223 0.003329964 0.012941469
## AST -0.5074368242 0.1931754960 -0.011503073 -0.020457035 -0.002628955
## STL 0.1591894034 0.0821588537 0.005041825 0.008059840 0.005580777
## BLK -0.0411812020 -0.0038116606 -0.012229569 0.006061897 0.013117198
## TOV 0.7670662271 -0.0926003106 0.035521753 0.027788082 0.019788081
## PF -0.1623503993 0.1903780748 0.003591677 -0.016514985 0.007865294
## PTS -0.0409627885 0.1072655280 0.255644005 0.159136822 0.055881686
## PC22 PC23 PC24 PC25 PC26
## Age -0.0009743366 -8.792264e-05 -1.556063e-04 -3.921777e-05 -2.640656e-05
## G 0.0013692409 3.426063e-04 -5.357593e-04 -3.788918e-04 2.282033e-04
## GS 0.0052415902 4.868648e-05 -2.124374e-04 -6.161283e-05 -1.023642e-04
## MP -0.0465196004 -1.267897e-04 -3.073885e-04 -7.143599e-05 8.864286e-05
## FG -0.1408043369 -4.018397e-02 4.193445e-01 -6.648942e-01 1.236655e-04
## FGA 0.1430414288 -3.643081e-03 4.273442e-03 2.022663e-03 -7.242944e-01
## FG. -0.2618256307 8.755595e-04 -1.856335e-03 -4.470153e-04 -2.649511e-05
## X3P 0.5224744073 3.023341e-02 -2.456062e-01 -3.691908e-02 8.945460e-04
## X3PA -0.5505184911 2.441823e-03 -1.496398e-02 -2.461625e-03 2.513845e-01
## X3P. -0.0233954860 8.764703e-05 -8.156167e-04 -1.574820e-04 -1.080143e-04
## X2P -0.2844527040 8.386062e-02 -7.188104e-01 1.212598e-01 -6.724184e-03
## X2PA 0.3737213068 3.449313e-03 -2.059911e-03 -4.777931e-03 6.419243e-01
## X2P. 0.1690500512 -2.235465e-04 2.787143e-03 2.604653e-04 7.574355e-04
## eFG. 0.1265786016 -6.849750e-04 -1.245347e-03 -5.814467e-05 -7.175942e-04
## FT -0.1203030059 1.570995e-02 -1.090061e-01 -1.658676e-01 -2.312652e-03
## FTA 0.1581592885 1.707601e-04 -1.132206e-03 -2.740174e-03 4.717143e-04
## FT. 0.0043629042 -1.759244e-04 2.280596e-05 -1.040878e-04 4.385737e-05
## ORB -0.0041812612 2.648108e-01 3.151845e-02 4.318650e-03 -1.209294e-03
## DRB 0.0095309797 5.535313e-01 6.642349e-02 8.201958e-03 -2.347897e-03
## TRB 0.0057998965 -7.804821e-01 -9.313889e-02 -1.178684e-02 3.498629e-03
## AST 0.0221146644 -2.321055e-05 4.895013e-04 1.220175e-04 3.010996e-05
## STL 0.0063937685 1.359198e-04 1.773733e-04 2.800215e-04 1.556614e-05
## BLK 0.0024539063 -1.494622e-04 4.596989e-05 2.071876e-04 -4.107934e-05
## TOV -0.0117931271 -2.040208e-04 -2.054870e-04 2.419361e-04 8.732850e-05
## PF -0.0027528254 2.062095e-04 -1.626481e-04 -9.216130e-05 -9.330674e-05
## PTS -0.0828069281 -6.697081e-02 4.700156e-01 7.169872e-01 8.040200e-03
The output may seem a little scary at first, but it is just the rotation matrix, whose columns are the Principal Components. The number of components can never exceed the smaller of the number of rows and the number of columns of the dataset; ours has 20794 rows and 26 columns, so we get 26 components and the rotation matrix is 26x26.
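A one-line check on the object created above confirms the shape:
dim(ready_df.pcal$rotation) #one loading vector per variable, one column per component
## [1] 26 26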
Let’s look at the eigen values of Principal Components.
eig.val <- get_eigenvalue(ready_df.pcal)
eig.val
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 1.268641e+01 4.879390e+01 48.79390
## Dim.2 3.555969e+00 1.367680e+01 62.47070
## Dim.3 2.436866e+00 9.372561e+00 71.84326
## Dim.4 1.316319e+00 5.062765e+00 76.90603
## Dim.5 1.018715e+00 3.918136e+00 80.82416
## Dim.6 8.757524e-01 3.368278e+00 84.19244
## Dim.7 8.289031e-01 3.188089e+00 87.38053
## Dim.8 5.772347e-01 2.220133e+00 89.60066
## Dim.9 5.367209e-01 2.064311e+00 91.66497
## Dim.10 4.128083e-01 1.587724e+00 93.25270
## Dim.11 3.821491e-01 1.469804e+00 94.72250
## Dim.12 3.427108e-01 1.318119e+00 96.04062
## Dim.13 2.883402e-01 1.109001e+00 97.14962
## Dim.14 2.637852e-01 1.014559e+00 98.16418
## Dim.15 1.361215e-01 5.235442e-01 98.68772
## Dim.16 1.181652e-01 4.544814e-01 99.14221
## Dim.17 1.103956e-01 4.245984e-01 99.56680
## Dim.18 6.105636e-02 2.348322e-01 99.80164
## Dim.19 1.707522e-02 6.567393e-02 99.86731
## Dim.20 1.397573e-02 5.375279e-02 99.92106
## Dim.21 1.088859e-02 4.187919e-02 99.96294
## Dim.22 9.074440e-03 3.490169e-02 99.99784
## Dim.23 2.278951e-04 8.765197e-04 99.99872
## Dim.24 1.931190e-04 7.427655e-04 99.99946
## Dim.25 8.927769e-05 3.433757e-04 99.99981
## Dim.26 5.027631e-05 1.933704e-04 100.00000
Eigenvalues correspond to the amount of variance explained by the Principal Components. The first component explains the largest share of the dataset’s variance, and every subsequent one explains correspondingly less. We may see that the first 3 PCs explain about 72% of the variance and the first 5 about 81%.
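Before the scree plot, a minimal numeric check: these eigenvalues are simply the squared standard deviations printed by prcomp, and with standardized data they sum to the number of variables.
eig_manual <- ready_df.pcal$sdev^2
sum(eig_manual) #26, the number of variables
head(cumsum(eig_manual) / sum(eig_manual), 3) #0.488 0.625 0.718, as in the table above
The scree plot makes this even more visible.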
fviz_eig(ready_df.pcal, addlabels = TRUE, ylim = c(0, 50))
There is no common rule for how many PCs should be kept; it is for us to decide. We may look at the scree plot and see where the components start to look similar and insignificant; in this case it would make sense to retain 3, 5 or 7 of them. Another approach is to retain only the PCs with eigenvalues greater than 1: according to the Kaiser rule, an eigenvalue above 1 means the PC explains more variance than a single original (standardized) variable, which in our case would mean keeping 5. We may also run a parallel analysis, which compares the eigenvalues obtained from the PCA with those generated from Monte Carlo simulated data of the same size.
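The parallel-analysis call itself is not shown above; a typical invocation uses the already loaded psych package (the iteration count here is my assumption, not necessarily the original setting):
#compare observed eigenvalues with eigenvalues of simulated data of the same size
fa.parallel(ready_df, fa = "pc", n.iter = 100)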
As we can see, it suggests retaining only 4 of them, but the fifth component’s adjusted eigenvalue is 0.98, which is really close to 1. Let’s keep 5 and see what that brings.
Let’s see how our variables look:
var <- get_pca_var(ready_df.pcal)
#variable correlation plot
fviz_pca_var(ready_df.pcal, col.var = "cos2",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
The graph above shows how well our variables are represented by the first 2 PCs. The longer the arrows (the closer they reach to the circumference of the correlation circle), the better the quality of representation. The colors help with interpretation as well: the warmer the color, the better the representation. And the closer two arrows are to each other, the more strongly the corresponding variables are correlated.
We can observe that points (PTS), field goals per game (FG), field goal attempts (FGA) and 2-point field goals per game (X2P) have the best quality of representation, while age, effective field goal percentage (eFG.) and 2-point field goal percentage are not well represented by the first 2 PCs. Another thing worth noticing is that players shooting from long distance (X3P, X3PA, X3P.) are unlikely to record many rebounds (DRB, ORB) or blocks (BLK), and vice versa. We will see later whether this trend has changed over the years.
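The same information is available numerically through the var object created above; a short check (fviz_cos2 is an optional ranked-bar view from factoextra):
#squared coordinates (cos2) measure quality of representation on each PC
head(round(var$cos2[, 1:2], 3))
#top 10 best represented variables on the first two PCs
fviz_cos2(ready_df.pcal, choice = "var", axes = 1:2, top = 10)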
We may also group the variables:
# Color variables by groups
res.km <- kmeans(var$coord, centers = 4, nstart = 25)
grp <- as.factor(res.km$cluster)
fviz_pca_var(ready_df.pcal, col.var = grp,
palette = c("#E495A5", "#ABB065", "#39BEB1", "#ACA4E2"),
legend.title = "Cluster", repel = TRUE)
Let’s also look at the second and third PCs.
fviz_pca_var(ready_df.pcal, col.var = "cos2",
axes = (c(2,3)),
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE)
We may see that the second and third dimensions represent well the effectiveness of scoring, in other words how efficient and consistent players are. The negative correlation between shooting threes and defense is even more pronounced here.
Before proceeding further with the analysis, let’s check which variables may be problematic by examining their uniqueness and complexity, with the help of the psych package. We will obtain rotated PCs, which are easier to interpret, and we will also be able to inspect the two features just mentioned.
xxx.pca1<-principal(ready_df, nfactors=5, rotate="varimax")
print(loadings(xxx.pca1), digits=3, cutoff=0.5, sort=TRUE)
##
## Loadings:
## RC1 RC4 RC3 RC2 RC5
## GS 0.616
## MP 0.781
## FG 0.870
## FGA 0.878
## X2P 0.875
## X2PA 0.898
## FT 0.864
## FTA 0.842
## AST 0.768
## STL 0.738
## TOV 0.874
## PTS 0.877
## ORB 0.820
## DRB 0.826
## TRB 0.864
## BLK 0.797
## PF 0.614
## FG. 0.948
## X2P. 0.934
## eFG. 0.942
## X3P 0.945
## X3PA 0.931
## X3P. 0.704
## Age 0.890
## G
## FT.
##
## RC1 RC4 RC3 RC2 RC5
## SS loadings 9.401 4.540 2.982 2.976 1.114
## Proportion Var 0.362 0.175 0.115 0.114 0.043
## Cumulative Var 0.362 0.536 0.651 0.765 0.808
plot(xxx.pca1$complexity)
plot(xxx.pca1$uniquenesses)
We may see that a few variables have high complexity and uniqueness values. A high uniqueness value means that the variable’s variance is not shared with the other variables; it is simply unique. This is an undesirable feature in PCA, because it makes reducing the number of dimensions harder.
High complexity is not desirable either, for it means that a variable loads on many factors, which in turn makes the factor loadings harder to interpret; we will look at those shortly.
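In terms of the loadings of variable j on factor k, uniqueness is one minus the communality, and psych reports Hofmann’s complexity index: \[u_j = 1 - \sum_{k} \lambda_{jk}^2, \qquad C_j = \frac{\left(\sum_{k} \lambda_{jk}^2\right)^2}{\sum_{k} \lambda_{jk}^4}\] A variable loading on exactly one factor has complexity 1; one loading evenly on two factors has complexity 2, and so on.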
plot(xxx.pca1$complexity, xxx.pca1$uniqueness, xlim=c(0, 4))
pointLabel(xxx.pca1$complexity, xxx.pca1$uniqueness, labels=names(xxx.pca1$uniqueness), cex=0.8)
abline(h=c(0.38, 0.75), lty=3, col=2)
abline(v=c(1.8), lty=3, col=2)
It seems that we may remove free throw percentage (FT.), minutes played (MP), games started (GS) and games played (G). I would like to keep personal fouls and 3-point field goal percentage.
final_df <- ready_df %>%
dplyr::select(c(-G, -GS, -MP, -FT.))
final_df.pcal <- prcomp(final_df, center = TRUE, scale = TRUE, rank. = 5)
#Scree plot
fviz_eig(final_df.pcal, addlabels = TRUE, ylim = c(0, 50))
Now our first 5 PCs explain as much as 85% of the variance, but this was to be expected, since the data frame now contains fewer variables.
Let’s have a look now at the contribution of variables to single PCs:
# Contribution of variables to PC1
fviz_contrib(final_df.pcal, choice = "var", axes = 1, top = 15)
# Contribution of variables to PC2
fviz_contrib(final_df.pcal, choice = "var", axes = 2, top = 15)
# Contribution of variables to PC3
fviz_contrib(final_df.pcal, choice = "var", axes = 3, top = 15)
# Contribution of variables to PC4
fviz_contrib(final_df.pcal, choice = "var", axes = 4, top = 15)
# Contribution of variables to PC5
fviz_contrib(final_df.pcal, choice = "var", axes = 5, top = 15)
We may see that a lot of variables contribute to the first PC, with the scoring-related ones leading. 3-point shooting contributes the most to the second PC, and scoring contributes to PC3. The variables in PC4 are also related to the 3-point shot, but to defense (DRB, BLK) as well. Age accounts for almost all of PC5, which is why its eigenvalue was close to 1 when we were choosing the right number of PCs.
Now let’s rotate the PCs, which will help us name the components when looking at the factor loadings.
xxx.pca2<-principal(final_df, nfactors=5, rotate="varimax")
print(loadings(xxx.pca2), digits=3, cutoff=0.5, sort=TRUE)
##
## Loadings:
## RC1 RC4 RC3 RC2 RC5
## FG 0.863
## FGA 0.871
## X2P 0.869
## X2PA 0.892
## FT 0.853
## FTA 0.833
## AST 0.783
## STL 0.747
## TOV 0.877
## PTS 0.869
## ORB 0.830
## DRB 0.832
## TRB 0.872
## BLK 0.804
## PF 0.612
## FG. 0.951
## X2P. 0.938
## eFG. 0.946
## X3P 0.949
## X3PA 0.935
## X3P. 0.711
## Age 0.984
##
## RC1 RC4 RC3 RC2 RC5
## SS loadings 7.995 4.181 2.890 2.781 1.018
## Proportion Var 0.363 0.190 0.131 0.126 0.046
## Cumulative Var 0.363 0.553 0.685 0.811 0.857
I would call these factors the following:
PC1: Ability to score points, in other words how many points a player can score for the team, but also how well he passes the ball and steals it from the opposing team.
PC2: 3-point shooting, i.e. how well the player shoots threes.
PC3: Effectiveness and efficiency: this factor explains how efficient a player is when shooting. One player may score 15 points on 20 field goal attempts (low efficiency), while another scores 15 points on only 5 attempts, and that makes a huge difference for the team.
PC4: Defense: the interpretation is simple here. This factor identifies players who grab a lot of rebounds and block shots. Interestingly, they also tend to commit more fouls, which is quite logical: defending in the post (close to the rim) is more physical than defending on the perimeter (far from the rim), so players under the basket foul more.
PC5: Age, which needs no interpretation.
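To see where individual players land on these dimensions, we can bind the component scores back to the player names (a small illustrative sketch; these are the unrotated prcomp scores):
#scores of the first few players on the five retained components
scores <- as.data.frame(final_df.pcal$x)
head(cbind(df["Player"], round(scores, 2)))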
Let’s see how the players look when plotted in the first 2 dimensions.
autoplot(
final_df.pcal,
data = df,
colour = "Pos",
loadings = TRUE, loadings.colour = "red", loadings.size = 4,
loadings.label = TRUE, loadings.label.size = 6, repel = TRUE, loadings.label.colour = "blue"
) + xlab("Ability to score points") + ylab("3-point shooting")
autoplot(
final_df.pcal,
x = 2,
y = 3,
data = df,
colour = "Pos",
loadings = TRUE, loadings.colour = "red", loadings.size = 4,
loadings.label = TRUE, loadings.label.size = 6, repel = TRUE, loadings.label.colour = "blue"
) + xlab("3-point shooting") + ylab("Effectiveness and efficiency")
autoplot(
final_df.pcal,
x = 1,
y = 3,
data = df,
colour = "Pos",
loadings = TRUE, loadings.colour = "red", loadings.size = 4,
loadings.label = TRUE, loadings.label.size = 6, repel = TRUE, loadings.label.colour = "blue"
) + xlab("Ability to score points") + ylab("Effectiveness and efficiency")
autoplot(
final_df.pcal,
x = 3,
y = 4,
data = df,
colour = "Pos",
loadings = TRUE, loadings.colour = "red", loadings.size = 4,
loadings.label = TRUE, loadings.label.size = 6, repel = TRUE, loadings.label.colour = "blue"
) + xlab("Effectiveness and efficiency") + ylab("Defense")
autoplot(
final_df.pcal,
x = 4,
y = 5,
data = df,
colour = "Pos",
loadings = TRUE, loadings.colour = "red", loadings.size = 4,
loadings.label = TRUE, loadings.label.size = 6, repel = TRUE, loadings.label.colour = "blue"
) + xlab("Defense") + ylab("Age")
autoplot(
final_df.pcal,
x = 2,
y = 4,
data = df,
colour = "Pos",
loadings = TRUE, loadings.colour = "red", loadings.size = 4,
loadings.label = TRUE, loadings.label.size = 6, repel = TRUE, loadings.label.colour = "blue"
) + xlab("3-point shooting") + ylab("Defense")
autoplot(
final_df.pcal,
x = 1,
y = 4,
data = df,
colour = "Pos",
loadings = TRUE, loadings.colour = "red", loadings.size = 4,
loadings.label = TRUE, loadings.label.size = 6, repel = TRUE, loadings.label.colour = "blue"
) + xlab("Ability to score points") + ylab("Defense")
We may see that there are a lot of ways to look at NBA players. The first graph shows a clear distinction between all positions in terms of shooting threes. Players of every position may score well or poorly, but the bigger guys (centers, power forwards) tend to shoot fewer threes, while point guards and shooting guards are good at them. Small forwards fall somewhere in the middle.
PCs 2 and 3 show that efficient scorers (high field goal percentage) are slightly more common among centers and forwards than among guards. Some players group together into clearly visible clusters. It’s a valuable insight that the most efficient players are also those who tend to shoot fewer threes; shooting from long distance is generally harder than shooting from close range. It’s fascinating that one clearly visible cluster of players stands out from the rest while scoring mainly from close range. There is also another cluster at the bottom, representing players who are not efficient at all; their role is probably more related to defense or other tasks.
PCs 1 and 3 show that, generally, the more points players score, the lower their efficiency and effectiveness.
Clustering the PCA results from all seasons doesn’t take very long, but extracting any information from the clustering demands a lot of time and computing power. Let’s use clustering differently and see whether we can observe differences in player stats between historic and modern players.
I will perform PCA on 500 players from modern times (the 2020-2021 season) and 500 from historic times (the 1980-1981 and 1981-1982 seasons).
df_1980 <- df %>% dplyr::slice_head(n = 500) #First 500 observations from the dataset
df_2020 <- df %>% dplyr::slice_tail(n = 500) #Last 500 observations
#Remove unnecessary variables (rdf = ready data frame)
rdf_1980 <- df_1980 %>% dplyr::select(c(-Tm, -Pos, -Player, -G, -GS, -MP, -FT.))
rdf_2020 <- df_2020 %>% dplyr::select(c(-Tm, -Pos, -Player, -G, -GS, -MP, -FT.))
#PCA
pca_1980 <- prcomp(rdf_1980, center = TRUE, scale = TRUE, rank. = 5)
pca_2020 <- prcomp(rdf_2020, center = TRUE, scale = TRUE, rank. = 5)
Let’s check whether the data is clusterable using the Hopkins statistic; the smaller the value we obtain, the better.
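Formally, for n sampled points the statistic compares nearest-neighbour distances u_i, measured from uniformly generated points to the data, with nearest-neighbour distances w_i measured within the data. As implemented in the clustertend package it is \[H = \frac{\sum_{i=1}^{n} w_i}{\sum_{i=1}^{n} u_i + \sum_{i=1}^{n} w_i}\] so values near 0.5 suggest random data, while values near 0 suggest real cluster structure.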
# Hopkins statistic for 1980
hopkins(rdf_1980, n=nrow(pca_1980$x)-1)
## $H
## [1] 0.1352564
#Hopkins statistic for 2020
hopkins(rdf_2020, n=nrow(pca_2020$x)-1)
## $H
## [1] 0.1477269
Both values are below 0.15, which indicates that our data is clusterable, though not ideal for clustering.
Let’s check the optimal number of clusters with Silhouette Statistic and Gap Statistic.
fviz_nbclust(pca_1980$x, FUNcluster=kmeans, k.max = 8)
fviz_nbclust(pca_1980$x, FUNcluster=kmeans, method="gap_stat", k.max = 8) + theme_classic()
The optimal number of clusters may be 2 (Silhouette statistic) or 5 (Gap statistic).
fviz_nbclust(pca_2020$x, FUNcluster=kmeans, k.max = 8)
fviz_nbclust(pca_2020$x, FUNcluster=kmeans, method="gap_stat", k.max = 8) + theme_classic()
The optimal number of clusters may be 2 (Silhouette statistic) or 3 (Gap statistic), though 5 doesn’t look bad either.
We may also use the handy NbClust package, which computes many more indices for us.
number <- NbClust(pca_1980$x, distance="euclidean", min.nc=2, max.nc=10, method="ward.D2", index="all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 5 proposed 2 as the best number of clusters
## * 9 proposed 3 as the best number of clusters
## * 1 proposed 4 as the best number of clusters
## * 2 proposed 6 as the best number of clusters
## * 1 proposed 7 as the best number of clusters
## * 3 proposed 8 as the best number of clusters
## * 2 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
According to 23 different indices, the optimal number of clusters with the Euclidean distance is 3.
Let’s see whether the Manhattan and Minkowski distances give a different answer.
number <- NbClust(pca_1980$x, distance="manhattan", min.nc=2, max.nc=10, method="ward.D2", index="all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 7 proposed 2 as the best number of clusters
## * 9 proposed 3 as the best number of clusters
## * 3 proposed 5 as the best number of clusters
## * 1 proposed 6 as the best number of clusters
## * 1 proposed 7 as the best number of clusters
## * 1 proposed 8 as the best number of clusters
## * 2 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
number <- NbClust(pca_1980$x, distance="minkowski", min.nc=2, max.nc=10, method="ward.D2", index="all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 5 proposed 2 as the best number of clusters
## * 9 proposed 3 as the best number of clusters
## * 1 proposed 4 as the best number of clusters
## * 2 proposed 6 as the best number of clusters
## * 1 proposed 7 as the best number of clusters
## * 3 proposed 8 as the best number of clusters
## * 2 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
I will go with 3 clusters then.
number <- NbClust(pca_2020$x, distance="euclidean", min.nc=2, max.nc=10, method="ward.D2", index="all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 5 proposed 2 as the best number of clusters
## * 4 proposed 3 as the best number of clusters
## * 3 proposed 4 as the best number of clusters
## * 2 proposed 5 as the best number of clusters
## * 3 proposed 6 as the best number of clusters
## * 1 proposed 7 as the best number of clusters
## * 5 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
number <- NbClust(pca_2020$x, distance="manhattan", min.nc=2, max.nc=10, method="ward.D2", index="all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 8 proposed 2 as the best number of clusters
## * 7 proposed 3 as the best number of clusters
## * 1 proposed 4 as the best number of clusters
## * 1 proposed 5 as the best number of clusters
## * 2 proposed 6 as the best number of clusters
## * 3 proposed 7 as the best number of clusters
## * 2 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
number <- NbClust(pca_2020$x, distance="minkowski", min.nc=2, max.nc=10, method="ward.D2", index="all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 5 proposed 2 as the best number of clusters
## * 4 proposed 3 as the best number of clusters
## * 3 proposed 4 as the best number of clusters
## * 2 proposed 5 as the best number of clusters
## * 3 proposed 6 as the best number of clusters
## * 1 proposed 7 as the best number of clusters
## * 5 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
According to all the indices across the 3 different distances, the optimal number is 2.
Let’s plot the results.
autoplot(kmeans(pca_1980$x, 3), data = pca_1980, frame = TRUE)
autoplot(kmeans(pca_2020$x, 2), data = pca_2020, frame = TRUE)
Looking at the plots, in the old NBA we could divide the players into 3 groups:
For the modern era we may describe our clusters in the following way:
The general conclusion is that nowadays more players attempt the 3-point shot and more of them shoot it well. In the old days, right after the introduction of the 3-point line, few players were good 3-point shooters, and even those who tried weren’t the most efficient scorers. Today the players who attempt and make 3-point shots are the most lethal scorers, which is why more and more players shoot from long distance: the way the game is played has changed.
As we can see on both plots, the clusters seem to be strongly influenced by outliers. Let’s try the PAM clustering method and see whether we obtain different results. Partitioning Around Medoids (PAM) is more robust to outliers, but takes longer to compute the clusters. Our single-season data is not that big, so we can afford to run PAM.
fviz_nbclust(pca_1980$x, FUNcluster=cluster::pam, k.max = 7)
fviz_nbclust(pca_1980$x, FUNcluster=cluster::pam, method="gap_stat", k.max = 7)
Let’s try 7 clusters.
fviz_nbclust(pca_2020$x, FUNcluster=cluster::pam, k.max = 7)
fviz_nbclust(pca_2020$x, FUNcluster=cluster::pam, method="gap_stat", k.max = 7)
Let’s try 5 clusters.
autoplot(pam(pca_1980$x, 7), data = df_1980, frame = TRUE) + geom_text(label = df_1980$Player, check_overlap = TRUE)
autoplot(pam(pca_2020$x, 5), data = pca_2020, frame = TRUE) + geom_text(label = df_2020$Player, check_overlap = TRUE)
The players in the right-hand clusters dominated the league in the 80s. The big center Kareem Abdul-Jabbar and forward Julius Erving scored lots of points and made spectacular plays. Kareem was a six-time NBA champion, which is extremely rare for an NBA player, and he holds first place on the NBA all-time scoring list (38387 points). It’s no surprise that he sits on the edge of the scoring cluster.
In the bottom-right cluster we can also see excellent scorers, like Larry Bird, but they are power and small forwards rather than centers. In the blue middle cluster we may see some point guards, for example the absolute superstar Pete Maravich, who was one of the best point guards in NBA history. His role was not solely to score points, but also to distribute the ball and make brilliant plays.
In the left clusters we may observe players who had roles on the court other than scoring: they usually started from the bench and supported the superstars, as did the players in the middle bottom cluster.
What’s worth noting is that the best 3-point shooters in the early 80s averaged fewer than one 3-point attempt per game.
The right blue cluster represents all the superstars of the modern NBA. The closer to the top of the cluster, the better the 3-point shooters. For example, Trae Young, Stephen Curry, James Harden and Damian Lillard are among the best 3-point shooting guards of all time, and they are exceptional playmakers as well, dishing a lot of passes to their teammates. At the bottom of the same cluster we may find players such as Andre Drummond (doesn’t shoot 3s at all, an old-school player who dominates the paint), Anthony Davis (an incredible defender and 2-point scorer) and Nikola Jokic (an incredibly intelligent big man who can score from literally anywhere on the court and whose passes are insane). They are all centers and power forwards (bigger players).
Players such as Tacko Fall or JaVale McGee sit in the middle-left clusters. They can score with high efficiency under the basket, but their main role is defense. The top-left green cluster represents players who are not good at scoring but also have few games played; in other words, they were contracted for a short time and didn’t work out in the NBA (Malik Newman, Stanton Kidd).
Market Basket Analysis is mostly used to analyze customer behavior in order to increase sales. Here, however, I will use it to find rules about basketball players and wrap up my analysis.
#Necessary packages
library(arules)
library(arulesViz)
library(arulesCBA)
In order to perform the Market Basket Analysis, we have to recode the data.
#recode the variables
library(tidyverse)
df2 <- df
df2 <- df2 %>% mutate(Age = ifelse(Age < 27, "Young", ifelse(Age > 26 & Age <34, "Middle Aged", "Veteran")),
FGA = case_when(FGA < 4 ~ "Few FGA", FGA >= 4 & FGA < 14 ~ "Moderate FGA", FGA >=14 & FGA <18 ~ "Lots of FGA", FGA >= 18 ~ "Shooting Machine"),
G = case_when(G < 15 ~ "Few Games", G >= 15 & G < 50 ~ "Moderate Amount of Games", G >= 50 ~ "Playing Machine" ),
MP = case_when(MP < 10 ~ "Very few MP", MP >=10 & MP < 25 ~ "Moderate MP", MP >= 25 ~ "Lots of MP" ),
FG. = case_when(FG. < 0.25 ~ "Low FG", FG. >= 0.25 & FG. < 0.4 ~ "Moderate FG", FG. >=0.4 & FG. <0.5 ~ "Good FG", FG. >= 0.5 & FG. <0.6 ~ "Excellent FG", FG. >= 0.6 ~ "Lethal FG" ),
X3P = case_when(X3P < 0.5 ~ "Doesn't Shoot 3s", X3P >=0.5 & X3P < 3 ~ "Sometimes 3", X3P >= 3 & X3P < 5 ~ "Regular 3-point shooter", X3P >= 5 ~ "3-point Beast"),
X3P. = case_when(X3P. < 0.2 ~ "Tragic 3-point Efficiency", X3P. >= 0.2 & X3P. <0.4 ~ "Fair 3-point Efficiency", X3P. >=0.4 ~ "Lethal 3-point Efficiency" ),
X2P = case_when(X2P < 1 ~ "Doesn't shoot 2s", X2P >=1 & X2P < 4 ~ "Sometimes 2", X2P >= 4 & X2P < 7 ~ "regular 2-point shooter", X2P >= 7 ~ "2-point Beast"),
X2P. = case_when(X2P. < 0.2 ~ "Tragic 2-point Efficiency", X2P. >= 0.2 & X2P. <0.5 ~ "Fair 2-point Efficiency", X2P. >=0.5 & X2P. <0.7 ~ "Good 2-point Efficiency", X2P. >=0.7 ~ "Lethal 2-point Efficiency" ),
FT. = case_when(FT. <0.5 ~ "Tragic FT Efficiency", FT. >=0.5 & FT. < 0.7 ~ "Poor FT Efficiency", FT. >=0.7 & FT. < 0.8 ~ "Good FT", FT. >= 0.8 ~ "Excellent FT Efficiency" ),
TRB = case_when(TRB < 1 ~ "Doesn't Rebound", TRB >=1 & TRB < 4 ~ "Sometimes Rebound", TRB >= 4 & TRB <8 ~ "Good Rebound", TRB >= 8 & TRB < 12 ~ "Very Good Rebound", TRB >= 12 ~ "Rebounding Beast"),
ORB = case_when(ORB < 1 ~ "Bad ORB", ORB >= 1 & ORB < 2.5 ~ "Good ORB", ORB >= 2.5 & ORB < 3.5 ~ "Great ORB", ORB >= 3.5 ~ "ORB Beast"),
DRB = case_when(DRB < 3 ~ "Bad DRB", DRB >= 3 & DRB < 6 ~ "Good DRB", DRB >= 6 & DRB < 8 ~ "Great DRB", DRB >= 8 ~ "DRB Beast"),
AST = case_when(AST < 0.5 ~ "Doesn't Pass", AST >= 0.5 & AST < 1.5 ~ "Sometimes Pass", AST >= 1.5 & AST < 2.5 ~ "Decent Passer", AST >= 2.5 & AST < 3.5 ~ "Good Passer", AST >= 3.5 & AST < 5 ~ "Great Passer", AST >= 5 ~ "Passing Prodigy"),
STL = case_when(STL < 0.3 ~ "Doesn't Steal", STL >= 0.3 & STL < 1 ~ "Sometimes Steal", STL >= 1 & STL < 1.8 ~ "Great Steal", STL >= 1.8 ~ "Incredible Steal"),
BLK = case_when(BLK < 0.2 ~ "Doesn't Block", BLK >= 0.2 & BLK < 0.4 ~ "Sometimes Block", BLK >= 0.4 & BLK < 0.8 ~ "Good Block", BLK >= 0.8 ~ "Great Block" ),
TOV = case_when(TOV < 0.3 ~ "Few TOV", TOV >= 0.3 & TOV < 0.6 ~ "Sometimes TOV", TOV >= 0.6 & TOV < 1.5 ~ "Frequent TOV", TOV >= 1.5 ~ "Lots of TOV"),
PF = case_when(PF < 0.3 ~ "Few PF", PF >= 0.3 & PF < 1 ~ "Sometimes PF", PF >= 1 & PF < 3 ~ "Frequent PF", PF >= 3 ~ "Lots of PF"),
PTS = case_when(PTS < 5 ~ "Few PTS", PTS >=5 & PTS < 10 ~ "Moderate PTS", PTS >= 10 & PTS < 13 ~ "Good PTS", PTS >= 13 & PTS < 17 ~ "Great PTS", PTS >= 17 ~ "PTS Beast")
)
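A quick look at one recoded column serves as a sanity check on the bucketing:
#distribution of the discretized scoring variable
table(df2$PTS)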
Let’s create 2 datasets - old times and modern times.
df2 <- df2 %>% select(c(-Tm, -FT, -eFG., -X3PA, -X2PA, -FG, -GS, -FTA, -Player, -PF, -ORB, -DRB))
df_old <- df2 %>% dplyr::slice_head(n = 2500) #First 2500 observations from the dataset
df_modern <- df2 %>% dplyr::slice_tail(n = 2500) #Last 2500 observations
write.csv(df_old, file="nba_basket_old.csv")
write.csv(df_modern, file="nba_basket_modern.csv")
Now let’s read the data back in as transactions, so we can mine the association rules. I chose 2500 players from modern times and the same number from historic times.
basket_old <- read.transactions("nba_basket_old.csv", format="basket", sep=",", skip=1) # reading the file as transactions
basket_modern <- read.transactions("nba_basket_modern.csv", format="basket", sep=",", skip=1) # reading the file as transactions
We may start by comparing the item frequencies between the baskets, in other words which attributes appear most often.
itemFrequencyPlot(basket_old, topN=37, type="relative", main="Item Frequency for historic times")
itemFrequencyPlot(basket_modern, topN=37, type="relative", main="Item Frequency for modern times")
Although these graphs are simple, they convey important information about how the NBA has changed over 40 years. The first noticeable difference, already highlighted by PCA and clustering, is the number of players who did not shoot 3s in the historic era. What’s interesting is that nowadays fair 3-point efficiency is the most frequent feature among players! Being a good 3-point shooter is no longer something special; it’s something most players can and should do if they want to be good offensive players.
In both periods young players appear more frequently than middle-aged players and veterans. There are relatively few veterans in the league, as playing for a long time is very taxing on athletes’ bodies.
We can also see that there are many more players with excellent free throw efficiency nowadays than in the past. Free throws have become a very important part of the game as well; now that the big guys are learning to shoot from distance, free throws come even easier to them.
For me personally, a very nice discovery is that fewer players commit a lot of turnovers nowadays than in the past, and the difference is clearly visible. There may be many reasons: the way the game is played has changed, players have simply become better, or the rules are now more player friendly.
Let’s proceed with comparing the top scorers of historic times with those of modern times. I am more interested in confidence than in support, because there are not many players who are exceptional scorers. The support metric is computed over the total number of observations, so the support threshold must be low: I will go with a minimum support of 0.025, meaning a rule’s items, “PTS Beast” together with something else, must appear in at least 2.5% of all observations. I am setting the confidence threshold to 0.7.
rules<-apriori(data=basket_old, parameter=list(supp=0.025, conf=0.7), appearance=list(default="lhs", rhs="PTS Beast"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.7 0.1 1 none FALSE TRUE 5 0.025 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 62
##
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[2575 item(s), 2500 transaction(s)] done [0.04s].
## sorting and recoding items ... [63 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10 done [0.51s].
## writing ... [1395 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
rules.clean<-rules[!is.redundant(rules)] # drop redundant rules (a more general rule with at least the same confidence exists)
rules.clean<-rules.clean[is.significant(rules.clean, basket_old)] # keep rules that pass a significance test (Fisher's exact test by default)
rules.clean<-rules.clean[is.maximal(rules.clean)] # keep only maximal rules (no superset rule in the set)
rules.bysupp<-sort(rules.clean, by="support", decreasing=TRUE) # sort by support
inspectDT(rules.bysupp) # interactive table of the cleaned rules
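Sorting by support favors the most common patterns; sorting the same cleaned rule set by lift instead would surface the left-hand sides most strongly associated with “PTS Beast” (a sketch using standard arules functions):
rules.bylift <- sort(rules.clean, by="lift", decreasing=TRUE)
inspect(head(rules.bylift, 5)) # top 5 rules by lift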
rules2<-apriori(data=basket_modern, parameter=list(supp=0.025, conf=0.7), appearance=list(default="lhs", rhs="PTS Beast"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.7 0.1 1 none FALSE TRUE 5 0.025 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 62
##
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[2580 item(s), 2500 transaction(s)] done [0.03s].
## sorting and recoding items ... [64 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10 done [0.37s].
## writing ... [240 rule(s)] done [0.02s].
## creating S4 object ... done [0.00s].
rules2.clean<-rules2[!is.redundant(rules2)]
rules2.clean<-rules2.clean[is.significant(rules2.clean, basket_modern)]
rules2.clean<-rules2.clean[is.maximal(rules2.clean)]
rules2.bysupp<-sort(rules2.clean, by="support", decreasing=TRUE)
inspectDT(rules2.bysupp)
Obviously, players who score a lot of points must have a high FGA (shots attempted) as well. They play a lot of minutes, but they also have to be efficient. And since they have the ball a lot, they naturally commit more turnovers.
In the historic times table, we may see that the rule with “Good Block” has quite high support (0.032) and a confidence of 1. We do not see this in modern times, however. This may indicate that big, athletic players were the great scorers back then. In the modern table we can see more rules relating to the 3-point shot.
We may also look at the general rules with the highest support for both periods.
rules<-apriori(data=basket_old, parameter=list(supp=0.1, conf=0.7))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.7 0.1 1 none FALSE TRUE 5 0.1 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 250
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[2575 item(s), 2500 transaction(s)] done [0.02s].
## sorting and recoding items ... [51 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 done [0.06s].
## writing ... [13891 rule(s)] done [0.00s].
## creating S4 object ... done [0.01s].
rules.clean<-rules[!is.redundant(rules)]
rules.clean<-rules.clean[is.significant(rules.clean, basket_old)]
rules.clean<-rules.clean[is.maximal(rules.clean)]
rules.bysupp<-sort(rules.clean, by="support", decreasing=TRUE)
inspectDT(rules.bysupp)
rules2<-apriori(data=basket_modern, parameter=list(supp=0.1, conf=0.7))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.7 0.1 1 none FALSE TRUE 5 0.1 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 250
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[2580 item(s), 2500 transaction(s)] done [0.02s].
## sorting and recoding items ... [49 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 done [0.04s].
## writing ... [9351 rule(s)] done [0.00s].
## creating S4 object ... done [0.01s].
rules2.clean<-rules2[!is.redundant(rules2)]
rules2.clean<-rules2.clean[is.significant(rules2.clean, basket_modern)]
rules2.clean<-rules2.clean[is.maximal(rules2.clean)]
rules2.bysupp<-sort(rules2.clean, by="support", decreasing=TRUE)
inspectDT(rules2.bysupp)
Inspecting the rules with high confidence and high support does not seem very meaningful in this case. The rules we get are rather obvious, such as that players who sometimes shoot 3s and 2s have a moderate FGA (shots attempted).
One may play with an interactive tool such as ruleExplorer and adjust the parameters. I am fully confident that remarkable insights about the game of basketball can be found this way, but the difficulty lies in the low support. Confidence here is not a problem, but given the low support threshold, many rules are found and it’s hard to dig through them.
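For reference, the explorer can be launched directly on a cleaned rule set or on the raw transactions (assuming arulesViz is loaded; this opens an interactive Shiny app):
ruleExplorer(rules2.clean) # browse the cleaned modern-era rules interactively
# ruleExplorer(basket_modern) # or mine and explore straight from the transactions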
Conclusions
Running a PCA on the NBA dataset allowed us to see which components explain the most variance. The main one happens to be the offensive ability to score points, which was to be expected. Looking at the pairs of different components, however, one may find many fascinating insights about the NBA that are not visible at all when looking at big tables of data.
We may see from the visualization of the PCA results that NBA players nowadays are much more diverse. In historic times the variables grouped together on the components, whereas nowadays they are more spread out.
Clustering the PCA results from different seasons also allowed us to see how the league has changed over the years. We can clearly see that nowadays more players shoot 3s and more 3-point attempts are made on average. In the 80s the 3-point shot was something very new; players had to take their time to adapt and master the art of long-distance shooting.
Superstars in the NBA may be considered outliers, which makes the PAM method more effective than k-means in this case. We can see from the clusters that nowadays the players who score the most points are also the best 3-point shooters. In the past it was the opposite: it was the era of big centers who dominated the paint, while point guards supported them and shot from distance only occasionally. In today’s basketball, fewer and fewer players are unable to shoot the 3-pointer; it’s a crucial skill that any player should learn.
Market Basket Analysis has provided us with some knowledge about the players in the NBA, and it definitely expanded my understanding of basketball. However, finding remarkable insights turned out to be a very time-consuming and difficult process. I am sure there are plenty more rules which I was unable to find. I tried to recode the data to the best of my abilities in order to perform MBA, but I am aware that my solution is not perfect.
I will definitely come back to this data at some point and see what I can find and improve.
Comments on PAM results
As partitioning around medoids is robust to outliers, it resulted in more clusters being suggested, and the results are quite satisfying in my opinion. Let’s start with the old times.