Dimension Reduction Project for National Basketball Association
(NBA) Players
Name: Hamed Ahmed Hamed Ahmed
Student Id: 454827
I will apply different dimension reduction techniques like PCA, MDS
and T-SNE
Dataset: collection of relevant, historical, statistics for National
Basketball Association (NBA) Players
To provide more context, here are the meanings of the columns in an
NBA dataset:
- gp: Games played
- min: Minutes played
- fgm: Field goals made
- fga: Field goals attempted
- fg_pct: Field goal percentage
- fg3m: Three-point field goals made
- fg3a: Three-point field goals attempted
- fg3_pct: Three-point field goal percentage
- ftm: Free throws made
- fta: Free throws attempted
- ft_pct: Free throw percentage
- oreb: Offensive rebounds
- dreb: Defensive rebounds
- reb: Total rebounds
- ast: Assists
- stl: Steals
- blk: Blocks
- tov: Turnovers
- pf: Personal fouls
- pts: Total points
- ast_tov: Assist-to-turnover ratio
- stl_tov: Steal-to-turnover ratio
- efg_pct: Effective field goal percentage
- ts_pct: True shooting percentage
Dataset source: from Kaggle
Problem Statement:
- Determining the best all-time NBA players based on basketball
skills
- requires analyzing various statistics that reflect different aspects
of the game,
- such as rebounds, assists, steals, blocks, and field goal
percentage,
- as well as advanced metrics like Player Efficiency Rating
Import the libraries
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(corrplot)
## corrplot 0.92 loaded
library(ggplot2) # for data visualization
library(Rtsne) # for t-SNE
Load and prepare dataset
setwd("D:/Data Science/winter semester/UL/Final Project/layoff/dimensionReducction")
nba_df <- read.csv("nba.csv",header=T, sep =";")
#Showing the Structure of the data
str(nba_df)
## 'data.frame': 1214 obs. of 26 variables:
## $ player_id : int 893 76375 76127 201142 2544 78497 947 77847 76804 600015 ...
## $ player_name: chr "Michael Jordan" "Wilt Chamberlain" "Elgin Baylor" "Kevin Durant" ...
## $ gp : int 1072 1045 846 703 1061 932 914 792 791 1040 ...
## $ min : num 38.3 45.8 40 37.4 38.9 ...
## $ fgm : num 11.37 12.13 10.28 9.19 9.82 ...
## $ fga : num 22.9 22.5 23.8 18.9 19.6 ...
## $ fg_pct : num 0.497 0.54 0.431 0.488 0.501 0.474 0.425 0.436 0.511 0.485 ...
## $ fg3m : num 0.542 NA NA 1.792 1.383 ...
## $ fg3a : num 1.66 NA NA 4.72 4.05 ...
## $ fg3_pct : num 0.327 NA NA 0.379 0.342 NA 0.313 NA 0.297 NA ...
## $ ftm : num 6.83 5.8 6.81 7.02 6.1 ...
## $ fta : num 8.18 11.35 8.74 7.96 8.25 ...
## $ ft_pct : num 0.835 0.511 0.78 0.882 0.74 0.814 0.78 0.761 0.844 0.838 ...
## $ oreb : num 1.556 NA NA 0.787 1.215 ...
## $ dreb : num 4.67 NA NA 6.37 6.05 ...
## $ reb : num 6.22 22.89 13.55 7.16 7.26 ...
## $ ast : num 5.25 4.44 4.31 3.79 7.03 ...
## $ stl : num 2.35 NA NA 1.19 1.65 ...
## $ blk : num 0.833 NA NA 1.05 0.77 ...
## $ tov : num 2.73 NA NA 3.16 3.41 ...
## $ pf : num 2.6 1.99 3.07 1.89 1.86 ...
## $ pts : num 30.1 30.1 27.4 27.2 27.1 ...
## $ ast_tov : num 1.93 NA NA 1.2 2.06 ...
## $ stl_tov : num 0.86 NA NA 0.378 0.483 ...
## $ efg_pct : num 0.509 0.54 0.431 0.535 0.536 ...
## $ ts_pct : num 0.569 0.547 0.494 0.608 0.584 ...
# View the rows
head(nba_df)
## player_id player_name gp min fgm fga fg_pct fg3m
## 1 893 Michael Jordan 1072 38.25653 11.37313 22.88899 0.497 0.54198
## 2 76375 Wilt Chamberlain 1045 45.79809 12.13493 22.48517 0.540 NA
## 3 76127 Elgin Baylor 846 40.02719 10.27541 23.84279 0.431 NA
## 4 201142 Kevin Durant 703 37.37980 9.19346 18.85349 0.488 1.79232
## 5 2544 LeBron James 1061 38.90009 9.82375 19.60697 0.501 1.38266
## 6 78497 Jerry West 932 39.23927 9.67382 20.42060 0.474 NA
## fg3a fg3_pct ftm fta ft_pct oreb dreb reb ast
## 1 1.65858 0.327 6.83489 8.18284 0.835 1.55597 4.66791 6.22388 5.25466
## 2 NA NA 5.79617 11.35120 0.511 NA NA 22.89378 4.44306
## 3 NA NA 6.81206 8.73641 0.780 NA NA 13.54965 4.31442
## 4 4.72404 0.379 7.01991 7.96017 0.882 0.78663 6.36984 7.15647 3.78805
## 5 4.04807 0.342 6.10179 8.24882 0.740 1.21489 6.04807 7.26296 7.03205
## 6 NA NA 7.68240 9.44313 0.814 0.96774 2.77419 5.76824 6.69313
## stl blk tov pf pts ast_tov stl_tov efg_pct ts_pct
## 1 2.34515 0.83302 2.72761 2.59608 30.12313 1.92647 0.85978 0.50872 0.56859
## 2 NA NA NA 1.98565 30.06603 NA NA 0.53969 0.54706
## 3 NA NA NA 3.06856 27.36288 NA NA 0.43097 0.49415
## 4 1.19488 1.04979 3.15932 1.89189 27.19915 1.19901 0.37821 0.53516 0.60832
## 5 1.64844 0.77003 3.41093 1.86334 27.13195 2.06162 0.48328 0.53629 0.58382
## 6 2.61290 0.74194 NA 2.61266 27.03004 NA NA 0.47373 0.54994
# Summary of the dataset
summary(nba_df)
## player_id player_name gp min
## Min. : 2.0 Length:1214 Min. : 400.0 Min. : 8.947
## 1st Qu.: 960.5 Class :character 1st Qu.: 522.2 1st Qu.:20.834
## Median : 76248.5 Mode :character Median : 675.5 Median :25.428
## Mean : 60550.1 Mean : 707.0 Mean :25.514
## 3rd Qu.: 77959.5 3rd Qu.: 845.0 3rd Qu.:30.326
## Max. :600015.0 Max. :1611.0 Max. :45.798
##
## fgm fga fg_pct fg3m
## Min. : 0.6232 Min. : 1.531 Min. :0.2720 Min. :0.00000
## 1st Qu.: 2.8580 1st Qu.: 6.313 1st Qu.:0.4310 1st Qu.:0.00626
## Median : 3.8835 Median : 8.621 Median :0.4560 Median :0.13778
## Mean : 4.1978 Mean : 9.190 Mean :0.4575 Mean :0.43017
## 3rd Qu.: 5.2848 3rd Qu.:11.608 3rd Qu.:0.4840 3rd Qu.:0.79338
## Max. :12.1349 Max. :23.843 Max. :0.6770 Max. :3.33972
## NA's :212
## fg3a fg3_pct ftm fta
## Min. :0.00000 Min. :0.0000 Min. :0.2141 Min. : 0.2441
## 1st Qu.:0.04473 1st Qu.:0.1520 1st Qu.:1.2723 1st Qu.: 1.7757
## Median :0.47784 Median :0.2945 Median :1.9217 Median : 2.5686
## Mean :1.22112 Mean :0.2492 Mean :2.2011 Mean : 2.9206
## 3rd Qu.:2.20637 3rd Qu.:0.3510 3rd Qu.:2.8106 3rd Qu.: 3.7288
## Max. :7.62892 Max. :1.0000 Max. :7.8056 Max. :11.3512
## NA's :212 NA's :212
## ft_pct oreb dreb reb
## Min. :0.4140 Min. :0.0000 Min. :0.5294 Min. : 1.007
## 1st Qu.:0.7033 1st Qu.:0.6642 1st Qu.:1.9791 1st Qu.: 2.851
## Median :0.7600 Median :1.1662 Median :2.7810 Median : 4.176
## Mean :0.7471 Mean :1.3223 Mean :3.1343 Mean : 4.756
## 3rd Qu.:0.8030 3rd Qu.:1.8467 3rd Qu.:3.9862 3rd Qu.: 6.135
## Max. :0.9050 Max. :5.0647 Max. :9.7748 Max. :22.894
## NA's :119 NA's :119
## ast stl blk tov
## Min. : 0.1975 Min. :0.08197 Min. :0.0000 Min. :0.1734
## 1st Qu.: 1.1933 1st Qu.:0.54012 1st Qu.:0.1788 1st Qu.:1.0703
## Median : 2.0158 Median :0.77330 Median :0.3286 Median :1.5042
## Mean : 2.4697 Mean :0.85038 Mean :0.5104 Mean :1.5930
## 3rd Qu.: 3.3460 3rd Qu.:1.08312 3rd Qu.:0.6667 3rd Qu.:2.0228
## Max. :11.1932 Max. :2.71117 Max. :3.5017 Max. :3.9222
## NA's :119 NA's :119 NA's :187
## pf pts ast_tov stl_tov
## Min. :0.7944 Min. : 1.729 Min. :0.0000 Min. :0.1154
## 1st Qu.:1.9845 1st Qu.: 7.399 1st Qu.:0.9081 1st Qu.:0.4058
## Median :2.3795 Median :10.069 Median :1.3041 Median :0.5207
## Mean :2.4150 Mean :10.952 Mean :1.4638 Mean :0.5582
## 3rd Qu.:2.8274 3rd Qu.:13.970 3rd Qu.:1.9191 3rd Qu.:0.6709
## Max. :4.2415 Max. :30.123 Max. :4.6936 Max. :1.9345
## NA's :187 NA's :187
## efg_pct ts_pct
## Min. :0.2716 Min. :0.3089
## 1st Qu.:0.4570 1st Qu.:0.5000
## Median :0.4805 Median :0.5230
## Mean :0.4773 Mean :0.5214
## 3rd Qu.:0.5019 3rd Qu.:0.5453
## Max. :0.6771 Max. :0.6433
##
# There are around 1214 player in NBA (obs) with 26 features
dim(nba_df)
## [1] 1214 26
# Check for missing values (We have totally 212 missing values )
colSums((is.na(nba_df)))
## player_id player_name gp min fgm fga
## 0 0 0 0 0 0
## fg_pct fg3m fg3a fg3_pct ftm fta
## 0 212 212 212 0 0
## ft_pct oreb dreb reb ast stl
## 0 119 119 0 0 119
## blk tov pf pts ast_tov stl_tov
## 119 187 0 0 187 187
## efg_pct ts_pct
## 0 0
# Visualize the mean of each row with missing value
hist(rowMeans(is.na(nba_df)),
main="Mean of the NA values in the dataset",
col="chocolate",
border="brown",
)

# Remove missing values
nba_df_cleaned <- na.omit(nba_df)
# There are around 1002 player (212 players have been removed)
dim(nba_df_cleaned)
## [1] 1002 26
# Visualize the mean of each column having missing values
# as we can see we have couple of columns that have missing values
barplot(colMeans(is.na(nba_df)), las=3)

# So let's remove these column and Convert our columns to be numeric
nba_df_cleaned = nba_df_cleaned[,3:ncol(nba_df_cleaned)]
nba_df_cleaned = as.data.frame(sapply(nba_df_cleaned, as.numeric ))
names = nba_df[,2]
# There are around 1002 player (2 features have been removed)
dim(nba_df_cleaned)
## [1] 1002 24
Let’s do some descriptive analysis for the 26 features
- Dimension 1: Doing Univariate analysis for all the features which is
the process of analyzing one variable at a time.
- As we saw we can’t work with the data without scaling so let’s do
scaling
- and do visualization again
- We can observe that we have a semi normal distribution for the data
with small outliers
boxplot(scale(nba_df_cleaned), las=3, col="orange")

hist(scale(nba_df_cleaned), freq = T)

So We need DATA dimension reduction technologies to see relations
easier than on correlation matrix
- Let’s try MDS
- Analysis for products
dist.what<-dist(t(nba_df_cleaned)) # as a main input we need distance between units
as.matrix(dist.what)[1:10, 1:10] # let’s see the distance matrix
## gp min fgm fga fg_pct fg3m
## gp 0.00 23067.4236 23723.56316 23574.6746 23840.014136 23840.65253
## min 23067.42 0.0000 682.58015 528.1055 805.171910 805.22060
## fgm 23723.56 682.5801 0.00000 162.4425 128.332821 129.19903
## fga 23574.67 528.1055 162.44246 0.0000 289.939602 289.37336
## fg_pct 23840.01 805.1719 128.33282 289.9396 0.000000 17.91694
## fg3m 23840.65 805.2206 129.19903 289.3734 17.916937 0.00000
## fg3a 23816.03 779.7538 112.48055 265.7809 52.425666 38.22599
## fg3_pct 23846.48 811.6170 134.33726 295.7714 8.420854 15.89355
## ftm 23787.38 749.4590 72.06724 231.9452 63.488319 65.61668
## fta 23766.19 727.6807 54.15986 211.0759 86.308283 88.77398
## fg3a fg3_pct ftm fta
## gp 23816.03423 23846.480966 23787.38235 23766.19362
## min 779.75379 811.616975 749.45897 727.68068
## fgm 112.48055 134.337258 72.06724 54.15986
## fga 265.78094 295.771420 231.94516 211.07594
## fg_pct 52.42567 8.420854 63.48832 86.30828
## fg3m 38.22599 15.893550 65.61668 88.77398
## fg3a 0.00000 53.125779 61.88112 80.54533
## fg3_pct 53.12578 0.000000 69.10670 92.32359
## ftm 61.88112 69.106700 0.00000 24.69012
## fta 80.54533 92.323586 24.69012 0.00000
mds1<-cmdscale(dist.what, k=2) #k - the maximum dimension of the space
summary(mds1) # we get coordinates of new points
## V1 V2
## Min. :-1086.7 Min. :-30.648
## 1st Qu.:-1077.5 1st Qu.:-28.383
## Median :-1046.7 Median :-17.480
## Mean : 0.0 Mean : 0.000
## 3rd Qu.: -986.7 3rd Qu.: -3.413
## Max. :22759.8 Max. :187.537
plot(mds1[,1], mds1[,2], type = "n", xlab = "Dimension 1", ylab = "Dimension 2")
text(mds1[,1], mds1[,2], labels = nba_df$player_name, col = "blue", cex = 0.7)

Let’s try With PCA to reduce the features to 2 dimensions
PCA with eigen decomposition
pca_ed = prcomp(nba_df_cleaned, scale=T)
summary(pca_ed)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.8682 2.5256 1.51142 1.23009 1.1645 0.88156 0.84972
## Proportion of Variance 0.3428 0.2658 0.09518 0.06305 0.0565 0.03238 0.03008
## Cumulative Proportion 0.3428 0.6085 0.70373 0.76677 0.8233 0.85566 0.88574
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.72053 0.70570 0.65472 0.6500 0.6060 0.41614 0.30767
## Proportion of Variance 0.02163 0.02075 0.01786 0.0176 0.0153 0.00722 0.00394
## Cumulative Proportion 0.90737 0.92812 0.94598 0.9636 0.9789 0.98611 0.99005
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0.30271 0.23671 0.19467 0.17058 0.11126 0.06634 0.05948
## Proportion of Variance 0.00382 0.00233 0.00158 0.00121 0.00052 0.00018 0.00015
## Cumulative Proportion 0.99387 0.99620 0.99778 0.99899 0.99951 0.99969 0.99984
## PC22 PC23 PC24
## Standard deviation 0.04715 0.03983 1.207e-06
## Proportion of Variance 0.00009 0.00007 0.000e+00
## Cumulative Proportion 0.99993 1.00000 1.000e+00
# PCA with SVD
pca_SVD = princomp(nba_df_cleaned, cor=T)
summary(pca_SVD)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 2.8682141 2.5255533 1.51141800 1.23009037 1.1644937
## Proportion of Variance 0.3427772 0.2657675 0.09518268 0.06304676 0.0565019
## Cumulative Proportion 0.3427772 0.6085447 0.70372734 0.76677410 0.8232760
## Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
## Standard deviation 0.88155692 0.84972077 0.72053177 0.70569521 0.65472139
## Proportion of Variance 0.03238094 0.03008439 0.02163192 0.02075024 0.01786084
## Cumulative Proportion 0.85565695 0.88574134 0.90737326 0.92812350 0.94598433
## Comp.11 Comp.12 Comp.13 Comp.14 Comp.15
## Standard deviation 0.64999065 0.60602359 0.416144178 0.307674754 0.30270579
## Proportion of Variance 0.01760366 0.01530269 0.007215666 0.003944323 0.00381795
## Cumulative Proportion 0.96358799 0.97889069 0.986106351 0.990050674 0.99386862
## Comp.16 Comp.17 Comp.18 Comp.19
## Standard deviation 0.23670589 0.194673058 0.170577179 0.1112640219
## Proportion of Variance 0.00233457 0.001579067 0.001212357 0.0005158201
## Cumulative Proportion 0.99620319 0.997782261 0.998994618 0.9995104379
## Comp.20 Comp.21 Comp.22 Comp.23
## Standard deviation 0.0663436601 0.0594830912 4.715408e-02 3.982794e-02
## Proportion of Variance 0.0001833951 0.0001474266 9.264613e-05 6.609437e-05
## Cumulative Proportion 0.9996938329 0.9998412595 9.999339e-01 1.000000e+00
## Comp.24
## Standard deviation 1.207352e-06
## Proportion of Variance 6.073741e-14
## Cumulative Proportion 1.000000e+00
Create a plot of variance explained for each principal
component.
pr.var <- pca_ed$sdev ^ 2
pve <- pr.var/sum(pr.var)
plot(pve, xlab = "Principal Component",
ylab = "Proportion of Variance Explained",
ylim = c(0, 1), type = "b")

Plot cumulative proportion of variance explained
plot(cumsum(pve), xlab = "Principal Component",
ylab = "Cumulative Proportion of Variance Explained",
ylim = c(0, 1), type = "b")

First Component
fviz_contrib(pca_ed, choice = "var", axes = 1)

Let’s show the best 5 players in NBA and numbers of played
games
calculateScore <- function(data,col) {
return(sum((pca_ed$rotation[, 1]*data)^2))
}
col <- 1
nba_df$player_name[sort.int(apply(nba_df_cleaned, 1, calculateScore), decreasing = T, index.return = T)$ix[1:5]]
## [1] "Billy Knight" "Oscar Robertson" "Bob Davies" "Jerry West"
## [5] "Antawn Jamison"
col <- 2
nba_df$gp[sort.int(apply(nba_df_cleaned, 1, calculateScore), decreasing = T, index.return = T)$ix[1:5]]
## [1] 671 1040 462 932 1083
How much each variable contributes to the second component.
fviz_contrib(pca_ed, choice = "var", axes = 2)

# Now we visualize top 5 players with a contributions to first component
names_top_5 = names[order(get_pca_ind(pca_ed)$contrib[,1],decreasing=T)]
fviz_contrib(pca_ed, choice = "ind", axes = 1, top=5)+scale_x_discrete(labels=names_top_5)

let’s try T-SNE
try to normalize the data
nba_norm <- scale(nba_df_cleaned)
tsne_results <- Rtsne(nba_norm, dims = 2, perplexity = 7, verbose = TRUE)
## Performing PCA
## Read the 1002 x 24 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 2, perplexity = 7.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 0.06 seconds (sparsity = 0.029737)!
## Learning embedding...
## Iteration 50: error is 86.561426 (50 iterations in 0.14 seconds)
## Iteration 100: error is 78.250430 (50 iterations in 0.09 seconds)
## Iteration 150: error is 78.017078 (50 iterations in 0.08 seconds)
## Iteration 200: error is 78.010397 (50 iterations in 0.08 seconds)
## Iteration 250: error is 78.000621 (50 iterations in 0.08 seconds)
## Iteration 300: error is 1.905840 (50 iterations in 0.08 seconds)
## Iteration 350: error is 1.591351 (50 iterations in 0.08 seconds)
## Iteration 400: error is 1.491249 (50 iterations in 0.08 seconds)
## Iteration 450: error is 1.449161 (50 iterations in 0.09 seconds)
## Iteration 500: error is 1.419951 (50 iterations in 0.09 seconds)
## Iteration 550: error is 1.402818 (50 iterations in 0.08 seconds)
## Iteration 600: error is 1.391999 (50 iterations in 0.08 seconds)
## Iteration 650: error is 1.383686 (50 iterations in 0.08 seconds)
## Iteration 700: error is 1.376738 (50 iterations in 0.08 seconds)
## Iteration 750: error is 1.370636 (50 iterations in 0.08 seconds)
## Iteration 800: error is 1.363942 (50 iterations in 0.09 seconds)
## Iteration 850: error is 1.359771 (50 iterations in 0.08 seconds)
## Iteration 900: error is 1.356238 (50 iterations in 0.08 seconds)
## Iteration 950: error is 1.351958 (50 iterations in 0.08 seconds)
## Iteration 1000: error is 1.348208 (50 iterations in 0.08 seconds)
## Fitting performed in 1.71 seconds.
tsne_df <- data.frame(x = tsne_results$Y[,1], y = tsne_results$Y[,2])
ggplot(tsne_df, aes(x, y)) +
geom_point() +
ggtitle("t-SNE Visualization of NBA Dataset")

comment