Dimension Reduction Project for National Basketball Association (NBA) Players

Name: Hamed Ahmed Hamed Ahmed

Student Id: 454827

I will apply different dimension reduction techniques like PCA, MDS and T-SNE

Dataset: collection of relevant, historical, statistics for National Basketball Association (NBA) Players

To provide more context, here are the meanings of the columns in an NBA dataset:

gp: Games played
min: Minutes played
fgm: Field goals made
fga: Field goals attempted
fg_pct: Field goal percentage
fg3m: Three-point field goals made
fg3a: Three-point field goals attempted
fg3_pct: Three-point field goal percentage
ftm: Free throws made
fta: Free throws attempted
ft_pct: Free throw percentage
oreb: Offensive rebounds
dreb: Defensive rebounds
reb: Total rebounds
ast: Assists
stl: Steals
blk: Blocks
tov: Turnovers
pf: Personal fouls
pts: Total points
ast_tov: Assist-to-turnover ratio
stl_tov: Steal-to-turnover ratio
efg_pct: Effective field goal percentage
ts_pct: True shooting percentage

Dataset source: from Kaggle

Problem Statement:

Determining the best all-time NBA players based on basketball skills
requires analyzing various statistics that reflect different aspects of the game,
such as rebounds, assists, steals, blocks, and field goal percentage,
as well as advanced metrics like Player Efficiency Rating

Import the libraries

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 1.0.0 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

library(factoextra)

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

library(corrplot)

## corrplot 0.92 loaded

library(ggplot2) # for data visualization
library(Rtsne)   # for t-SNE

Load and prepare dataset

setwd("D:/Data Science/winter semester/UL/Final Project/layoff/dimensionReducction")
nba_df <- read.csv("nba.csv",header=T, sep =";")

#Showing the Structure of the data
str(nba_df)

## 'data.frame':    1214 obs. of  26 variables:
##  $ player_id  : int  893 76375 76127 201142 2544 78497 947 77847 76804 600015 ...
##  $ player_name: chr  "Michael Jordan" "Wilt Chamberlain" "Elgin Baylor" "Kevin Durant" ...
##  $ gp         : int  1072 1045 846 703 1061 932 914 792 791 1040 ...
##  $ min        : num  38.3 45.8 40 37.4 38.9 ...
##  $ fgm        : num  11.37 12.13 10.28 9.19 9.82 ...
##  $ fga        : num  22.9 22.5 23.8 18.9 19.6 ...
##  $ fg_pct     : num  0.497 0.54 0.431 0.488 0.501 0.474 0.425 0.436 0.511 0.485 ...
##  $ fg3m       : num  0.542 NA NA 1.792 1.383 ...
##  $ fg3a       : num  1.66 NA NA 4.72 4.05 ...
##  $ fg3_pct    : num  0.327 NA NA 0.379 0.342 NA 0.313 NA 0.297 NA ...
##  $ ftm        : num  6.83 5.8 6.81 7.02 6.1 ...
##  $ fta        : num  8.18 11.35 8.74 7.96 8.25 ...
##  $ ft_pct     : num  0.835 0.511 0.78 0.882 0.74 0.814 0.78 0.761 0.844 0.838 ...
##  $ oreb       : num  1.556 NA NA 0.787 1.215 ...
##  $ dreb       : num  4.67 NA NA 6.37 6.05 ...
##  $ reb        : num  6.22 22.89 13.55 7.16 7.26 ...
##  $ ast        : num  5.25 4.44 4.31 3.79 7.03 ...
##  $ stl        : num  2.35 NA NA 1.19 1.65 ...
##  $ blk        : num  0.833 NA NA 1.05 0.77 ...
##  $ tov        : num  2.73 NA NA 3.16 3.41 ...
##  $ pf         : num  2.6 1.99 3.07 1.89 1.86 ...
##  $ pts        : num  30.1 30.1 27.4 27.2 27.1 ...
##  $ ast_tov    : num  1.93 NA NA 1.2 2.06 ...
##  $ stl_tov    : num  0.86 NA NA 0.378 0.483 ...
##  $ efg_pct    : num  0.509 0.54 0.431 0.535 0.536 ...
##  $ ts_pct     : num  0.569 0.547 0.494 0.608 0.584 ...

# View the rows
head(nba_df)

##   player_id      player_name   gp      min      fgm      fga fg_pct    fg3m
## 1       893   Michael Jordan 1072 38.25653 11.37313 22.88899  0.497 0.54198
## 2     76375 Wilt Chamberlain 1045 45.79809 12.13493 22.48517  0.540      NA
## 3     76127     Elgin Baylor  846 40.02719 10.27541 23.84279  0.431      NA
## 4    201142     Kevin Durant  703 37.37980  9.19346 18.85349  0.488 1.79232
## 5      2544     LeBron James 1061 38.90009  9.82375 19.60697  0.501 1.38266
## 6     78497       Jerry West  932 39.23927  9.67382 20.42060  0.474      NA
##      fg3a fg3_pct     ftm      fta ft_pct    oreb    dreb      reb     ast
## 1 1.65858   0.327 6.83489  8.18284  0.835 1.55597 4.66791  6.22388 5.25466
## 2      NA      NA 5.79617 11.35120  0.511      NA      NA 22.89378 4.44306
## 3      NA      NA 6.81206  8.73641  0.780      NA      NA 13.54965 4.31442
## 4 4.72404   0.379 7.01991  7.96017  0.882 0.78663 6.36984  7.15647 3.78805
## 5 4.04807   0.342 6.10179  8.24882  0.740 1.21489 6.04807  7.26296 7.03205
## 6      NA      NA 7.68240  9.44313  0.814 0.96774 2.77419  5.76824 6.69313
##       stl     blk     tov      pf      pts ast_tov stl_tov efg_pct  ts_pct
## 1 2.34515 0.83302 2.72761 2.59608 30.12313 1.92647 0.85978 0.50872 0.56859
## 2      NA      NA      NA 1.98565 30.06603      NA      NA 0.53969 0.54706
## 3      NA      NA      NA 3.06856 27.36288      NA      NA 0.43097 0.49415
## 4 1.19488 1.04979 3.15932 1.89189 27.19915 1.19901 0.37821 0.53516 0.60832
## 5 1.64844 0.77003 3.41093 1.86334 27.13195 2.06162 0.48328 0.53629 0.58382
## 6 2.61290 0.74194      NA 2.61266 27.03004      NA      NA 0.47373 0.54994

# Summary of the dataset
summary(nba_df)

##    player_id        player_name              gp              min        
##  Min.   :     2.0   Length:1214        Min.   : 400.0   Min.   : 8.947  
##  1st Qu.:   960.5   Class :character   1st Qu.: 522.2   1st Qu.:20.834  
##  Median : 76248.5   Mode  :character   Median : 675.5   Median :25.428  
##  Mean   : 60550.1                      Mean   : 707.0   Mean   :25.514  
##  3rd Qu.: 77959.5                      3rd Qu.: 845.0   3rd Qu.:30.326  
##  Max.   :600015.0                      Max.   :1611.0   Max.   :45.798  
##                                                                         
##       fgm               fga             fg_pct            fg3m        
##  Min.   : 0.6232   Min.   : 1.531   Min.   :0.2720   Min.   :0.00000  
##  1st Qu.: 2.8580   1st Qu.: 6.313   1st Qu.:0.4310   1st Qu.:0.00626  
##  Median : 3.8835   Median : 8.621   Median :0.4560   Median :0.13778  
##  Mean   : 4.1978   Mean   : 9.190   Mean   :0.4575   Mean   :0.43017  
##  3rd Qu.: 5.2848   3rd Qu.:11.608   3rd Qu.:0.4840   3rd Qu.:0.79338  
##  Max.   :12.1349   Max.   :23.843   Max.   :0.6770   Max.   :3.33972  
##                                                      NA's   :212      
##       fg3a            fg3_pct            ftm              fta         
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.2141   Min.   : 0.2441  
##  1st Qu.:0.04473   1st Qu.:0.1520   1st Qu.:1.2723   1st Qu.: 1.7757  
##  Median :0.47784   Median :0.2945   Median :1.9217   Median : 2.5686  
##  Mean   :1.22112   Mean   :0.2492   Mean   :2.2011   Mean   : 2.9206  
##  3rd Qu.:2.20637   3rd Qu.:0.3510   3rd Qu.:2.8106   3rd Qu.: 3.7288  
##  Max.   :7.62892   Max.   :1.0000   Max.   :7.8056   Max.   :11.3512  
##  NA's   :212       NA's   :212                                        
##      ft_pct            oreb             dreb             reb        
##  Min.   :0.4140   Min.   :0.0000   Min.   :0.5294   Min.   : 1.007  
##  1st Qu.:0.7033   1st Qu.:0.6642   1st Qu.:1.9791   1st Qu.: 2.851  
##  Median :0.7600   Median :1.1662   Median :2.7810   Median : 4.176  
##  Mean   :0.7471   Mean   :1.3223   Mean   :3.1343   Mean   : 4.756  
##  3rd Qu.:0.8030   3rd Qu.:1.8467   3rd Qu.:3.9862   3rd Qu.: 6.135  
##  Max.   :0.9050   Max.   :5.0647   Max.   :9.7748   Max.   :22.894  
##                   NA's   :119      NA's   :119                      
##       ast               stl               blk              tov        
##  Min.   : 0.1975   Min.   :0.08197   Min.   :0.0000   Min.   :0.1734  
##  1st Qu.: 1.1933   1st Qu.:0.54012   1st Qu.:0.1788   1st Qu.:1.0703  
##  Median : 2.0158   Median :0.77330   Median :0.3286   Median :1.5042  
##  Mean   : 2.4697   Mean   :0.85038   Mean   :0.5104   Mean   :1.5930  
##  3rd Qu.: 3.3460   3rd Qu.:1.08312   3rd Qu.:0.6667   3rd Qu.:2.0228  
##  Max.   :11.1932   Max.   :2.71117   Max.   :3.5017   Max.   :3.9222  
##                    NA's   :119       NA's   :119      NA's   :187     
##        pf              pts            ast_tov          stl_tov      
##  Min.   :0.7944   Min.   : 1.729   Min.   :0.0000   Min.   :0.1154  
##  1st Qu.:1.9845   1st Qu.: 7.399   1st Qu.:0.9081   1st Qu.:0.4058  
##  Median :2.3795   Median :10.069   Median :1.3041   Median :0.5207  
##  Mean   :2.4150   Mean   :10.952   Mean   :1.4638   Mean   :0.5582  
##  3rd Qu.:2.8274   3rd Qu.:13.970   3rd Qu.:1.9191   3rd Qu.:0.6709  
##  Max.   :4.2415   Max.   :30.123   Max.   :4.6936   Max.   :1.9345  
##                                    NA's   :187      NA's   :187     
##     efg_pct           ts_pct      
##  Min.   :0.2716   Min.   :0.3089  
##  1st Qu.:0.4570   1st Qu.:0.5000  
##  Median :0.4805   Median :0.5230  
##  Mean   :0.4773   Mean   :0.5214  
##  3rd Qu.:0.5019   3rd Qu.:0.5453  
##  Max.   :0.6771   Max.   :0.6433  
##

# There are around 1214 player in NBA (obs) with 26 features
dim(nba_df)

## [1] 1214   26

# Check for missing values (We have totally 212 missing values )
colSums((is.na(nba_df)))

##   player_id player_name          gp         min         fgm         fga 
##           0           0           0           0           0           0 
##      fg_pct        fg3m        fg3a     fg3_pct         ftm         fta 
##           0         212         212         212           0           0 
##      ft_pct        oreb        dreb         reb         ast         stl 
##           0         119         119           0           0         119 
##         blk         tov          pf         pts     ast_tov     stl_tov 
##         119         187           0           0         187         187 
##     efg_pct      ts_pct 
##           0           0

# Visualize the mean of each row with missing value

hist(rowMeans(is.na(nba_df)),
main="Mean of the NA values in the dataset",
col="chocolate",
border="brown",
)

# Remove missing values
nba_df_cleaned <- na.omit(nba_df)

# There are around 1002 player (212 players have been removed)
dim(nba_df_cleaned)

## [1] 1002   26

# Visualize the mean of each column having missing values
# as we can see we have couple of columns that have missing values 
barplot(colMeans(is.na(nba_df)), las=3)

# So let's remove these column and Convert our columns to be numeric
nba_df_cleaned = nba_df_cleaned[,3:ncol(nba_df_cleaned)]
nba_df_cleaned = as.data.frame(sapply(nba_df_cleaned, as.numeric ))
names = nba_df[,2]


# There are around 1002 player (2 features have been removed)
dim(nba_df_cleaned)

## [1] 1002   24

Let’s do some descriptive analysis for the 26 features

Dimension 1: Doing Univariate analysis for all the features which is the process of analyzing one variable at a time.
As we saw we can’t work with the data without scaling so let’s do scaling
and do visualization again
We can observe that we have a semi normal distribution for the data with small outliers

boxplot(scale(nba_df_cleaned), las=3, col="orange")

hist(scale(nba_df_cleaned), freq = T)

are the variables related to each other?

analyzing the relationship between two variables.
In this case, we have 14 variables, so we need to analyze the relationship between each pair of variables,
We can see that many variables have high positive correlations with each other
and some of same are not

nba_df_cleaned_cor<-cor(nba_df_cleaned, method="pearson") # nominal values
print(nba_df_cleaned_cor, digits=2)

##             gp    min    fgm    fga fg_pct   fg3m   fg3a fg3_pct    ftm     fta
## gp       1.000  0.487  0.399  0.378  0.140  0.088  0.090  0.0654  0.367  0.3684
## min      0.487  1.000  0.865  0.873  0.086  0.295  0.314  0.2086  0.760  0.7476
## fgm      0.399  0.865  1.000  0.982  0.193  0.178  0.194  0.1683  0.836  0.8191
## fga      0.378  0.873  0.982  1.000  0.021  0.295  0.316  0.2567  0.817  0.7841
## fg_pct   0.140  0.086  0.193  0.021  1.000 -0.533 -0.555 -0.4944  0.187  0.2698
## fg3m     0.088  0.295  0.178  0.295 -0.533  1.000  0.995  0.6457  0.113  0.0315
## fg3a     0.090  0.314  0.194  0.316 -0.555  0.995  1.000  0.6508  0.133  0.0523
## fg3_pct  0.065  0.209  0.168  0.257 -0.494  0.646  0.651  1.0000  0.088  0.0014
## ftm      0.367  0.760  0.836  0.817  0.187  0.113  0.133  0.0884  1.000  0.9811
## fta      0.368  0.748  0.819  0.784  0.270  0.032  0.052  0.0014  0.981  1.0000
## ft_pct   0.117  0.290  0.333  0.400 -0.367  0.456  0.452  0.5282  0.324  0.1648
## oreb     0.220  0.246  0.243  0.137  0.588 -0.483 -0.485 -0.4976  0.274  0.3865
## dreb     0.327  0.510  0.441  0.365  0.445 -0.182 -0.182 -0.2698  0.440  0.5238
## reb      0.307  0.443  0.395  0.307  0.510 -0.298 -0.299 -0.3610  0.402  0.4982
## ast      0.246  0.569  0.456  0.508 -0.204  0.292  0.322  0.3128  0.418  0.3592
## stl      0.267  0.615  0.512  0.557 -0.158  0.236  0.269  0.2471  0.445  0.4127
## blk      0.192  0.163  0.140  0.055  0.450 -0.322 -0.328 -0.3874  0.160  0.2520
## tov      0.314  0.776  0.787  0.786  0.108  0.093  0.121  0.0925  0.781  0.7779
## pf       0.220  0.348  0.336  0.271  0.408 -0.342 -0.341 -0.3603  0.347  0.4141
## pts      0.403  0.877  0.985  0.980  0.132  0.276  0.293  0.2225  0.895  0.8687
## ast_tov  0.105  0.200  0.049  0.119 -0.396  0.353  0.369  0.4450 -0.011 -0.0907
## stl_tov -0.036 -0.113 -0.241 -0.194 -0.302  0.210  0.215  0.2507 -0.291 -0.3212
## efg_pct  0.154  0.140  0.165  0.054  0.668  0.179  0.141  0.0124  0.111  0.1421
## ts_pct   0.222  0.269  0.288  0.188  0.605  0.183  0.152  0.0518  0.418  0.4070
##         ft_pct  oreb   dreb     reb    ast     stl    blk      tov     pf
## gp       0.117  0.22  0.327  0.3067  0.246  0.2673  0.192  0.31403  0.220
## min      0.290  0.25  0.510  0.4429  0.569  0.6150  0.163  0.77624  0.348
## fgm      0.333  0.24  0.441  0.3952  0.456  0.5120  0.140  0.78681  0.336
## fga      0.400  0.14  0.365  0.3065  0.508  0.5569  0.055  0.78570  0.271
## fg_pct  -0.367  0.59  0.445  0.5105 -0.204 -0.1583  0.450  0.10839  0.408
## fg3m     0.456 -0.48 -0.182 -0.2976  0.292  0.2361 -0.322  0.09294 -0.342
## fg3a     0.452 -0.48 -0.182 -0.2985  0.322  0.2693 -0.328  0.12148 -0.341
## fg3_pct  0.528 -0.50 -0.270 -0.3610  0.313  0.2471 -0.387  0.09250 -0.360
## ftm      0.324  0.27  0.440  0.4016  0.418  0.4449  0.160  0.78136  0.347
## fta      0.165  0.39  0.524  0.4982  0.359  0.4127  0.252  0.77788  0.414
## ft_pct   1.000 -0.48 -0.268 -0.3514  0.410  0.2821 -0.422  0.24386 -0.243
## oreb    -0.484  1.00  0.824  0.9172 -0.331 -0.1122  0.628  0.18133  0.624
## dreb    -0.268  0.82  1.000  0.9792 -0.107  0.0473  0.655  0.37065  0.599
## reb     -0.351  0.92  0.979  1.0000 -0.185 -0.0029  0.669  0.32117  0.633
## ast      0.410 -0.33 -0.107 -0.1847  1.000  0.7373 -0.295  0.69539 -0.053
## stl      0.282 -0.11  0.047 -0.0029  0.737  1.0000 -0.163  0.62856  0.115
## blk     -0.422  0.63  0.655  0.6689 -0.295 -0.1632  1.000  0.09677  0.498
## tov      0.244  0.18  0.371  0.3212  0.695  0.6286  0.097  1.00000  0.423
## pf      -0.243  0.62  0.599  0.6330 -0.053  0.1149  0.498  0.42285  1.000
## pts      0.384  0.20  0.423  0.3655  0.482  0.5251  0.110  0.80092  0.302
## ast_tov  0.459 -0.61 -0.427 -0.5040  0.782  0.5123 -0.495  0.18922 -0.425
## stl_tov  0.078 -0.31 -0.314 -0.3260  0.072  0.4343 -0.285 -0.32682 -0.347
## efg_pct -0.058  0.20  0.259  0.2439 -0.112 -0.1212  0.210  0.00065  0.065
## ts_pct   0.215  0.17  0.266  0.2401  0.028 -0.0093  0.149  0.18731  0.116
##            pts ast_tov stl_tov  efg_pct  ts_pct
## gp       0.403   0.105  -0.036  0.15374  0.2220
## min      0.877   0.200  -0.113  0.13960  0.2687
## fgm      0.985   0.049  -0.241  0.16518  0.2884
## fga      0.980   0.119  -0.194  0.05355  0.1884
## fg_pct   0.132  -0.396  -0.302  0.66815  0.6053
## fg3m     0.276   0.353   0.210  0.17872  0.1830
## fg3a     0.293   0.369   0.215  0.14054  0.1516
## fg3_pct  0.223   0.445   0.251  0.01241  0.0518
## ftm      0.895  -0.011  -0.291  0.11092  0.4178
## fta      0.869  -0.091  -0.321  0.14206  0.4070
## ft_pct   0.384   0.459   0.078 -0.05828  0.2154
## oreb     0.197  -0.610  -0.310  0.19999  0.1746
## dreb     0.423  -0.427  -0.314  0.25885  0.2661
## reb      0.365  -0.504  -0.326  0.24395  0.2401
## ast      0.482   0.782   0.072 -0.11241  0.0282
## stl      0.525   0.512   0.434 -0.12121 -0.0093
## blk      0.110  -0.495  -0.285  0.21027  0.1485
## tov      0.801   0.189  -0.327  0.00065  0.1873
## pf       0.302  -0.425  -0.347  0.06505  0.1160
## pts      1.000   0.074  -0.231  0.17281  0.3437
## ast_tov  0.074   1.000   0.400 -0.13469 -0.0795
## stl_tov -0.231   0.400   1.000 -0.07400 -0.1486
## efg_pct  0.173  -0.135  -0.074  1.00000  0.8775
## ts_pct   0.344  -0.080  -0.149  0.87747  1.0000

corrplot(round(cor(nba_df_cleaned), 2), diag = FALSE, method="color", order="FPC", tl.srt = 90)

ggpairs(as.data.frame(nba_df_cleaned))

ggcorr(nba_df_cleaned, nbreaks = 4, palette = "RdGy")

comment

The more dimensions we have the more possible bivariate relationships that can be analyzed.
However, analyzing all of these relationships individually would be extremely time-consuming and
may not provide us with meaning full insights. This is where analytical tools such as PCA, MDS and t-SNE
can be useful in reducing the dimensionality of the data while retaining the most important information.
By identifying patterns and relationships among the variables,
these tools can help us gain a better understanding of the underlying structure of the data and identify
the key factors that contribute to overall performance or success.
This can ultimately lead to more informed decision-making and better outcomes in fields such as sports, business, and finance.

So We need DATA dimension reduction technologies to see relations easier than on correlation matrix

Let’s try MDS
Analysis for products

dist.what<-dist(t(nba_df_cleaned)) # as a main input we need distance between units
as.matrix(dist.what)[1:10, 1:10] # let’s see the distance matrix

##               gp        min         fgm        fga       fg_pct        fg3m
## gp          0.00 23067.4236 23723.56316 23574.6746 23840.014136 23840.65253
## min     23067.42     0.0000   682.58015   528.1055   805.171910   805.22060
## fgm     23723.56   682.5801     0.00000   162.4425   128.332821   129.19903
## fga     23574.67   528.1055   162.44246     0.0000   289.939602   289.37336
## fg_pct  23840.01   805.1719   128.33282   289.9396     0.000000    17.91694
## fg3m    23840.65   805.2206   129.19903   289.3734    17.916937     0.00000
## fg3a    23816.03   779.7538   112.48055   265.7809    52.425666    38.22599
## fg3_pct 23846.48   811.6170   134.33726   295.7714     8.420854    15.89355
## ftm     23787.38   749.4590    72.06724   231.9452    63.488319    65.61668
## fta     23766.19   727.6807    54.15986   211.0759    86.308283    88.77398
##                fg3a      fg3_pct         ftm         fta
## gp      23816.03423 23846.480966 23787.38235 23766.19362
## min       779.75379   811.616975   749.45897   727.68068
## fgm       112.48055   134.337258    72.06724    54.15986
## fga       265.78094   295.771420   231.94516   211.07594
## fg_pct     52.42567     8.420854    63.48832    86.30828
## fg3m       38.22599    15.893550    65.61668    88.77398
## fg3a        0.00000    53.125779    61.88112    80.54533
## fg3_pct    53.12578     0.000000    69.10670    92.32359
## ftm        61.88112    69.106700     0.00000    24.69012
## fta        80.54533    92.323586    24.69012     0.00000

mds1<-cmdscale(dist.what, k=2) #k - the maximum dimension of the space
summary(mds1)   # we get coordinates of new points

##        V1                V2         
##  Min.   :-1086.7   Min.   :-30.648  
##  1st Qu.:-1077.5   1st Qu.:-28.383  
##  Median :-1046.7   Median :-17.480  
##  Mean   :    0.0   Mean   :  0.000  
##  3rd Qu.: -986.7   3rd Qu.: -3.413  
##  Max.   :22759.8   Max.   :187.537

plot(mds1[,1], mds1[,2], type = "n", xlab = "Dimension 1", ylab = "Dimension 2")
text(mds1[,1], mds1[,2], labels = nba_df$player_name, col = "blue", cex = 0.7)

Let’s try With PCA to reduce the features to 2 dimensions

PCA with eigen decomposition

pca_ed = prcomp(nba_df_cleaned, scale=T)
summary(pca_ed)

## Importance of components:
##                           PC1    PC2     PC3     PC4    PC5     PC6     PC7
## Standard deviation     2.8682 2.5256 1.51142 1.23009 1.1645 0.88156 0.84972
## Proportion of Variance 0.3428 0.2658 0.09518 0.06305 0.0565 0.03238 0.03008
## Cumulative Proportion  0.3428 0.6085 0.70373 0.76677 0.8233 0.85566 0.88574
##                            PC8     PC9    PC10   PC11   PC12    PC13    PC14
## Standard deviation     0.72053 0.70570 0.65472 0.6500 0.6060 0.41614 0.30767
## Proportion of Variance 0.02163 0.02075 0.01786 0.0176 0.0153 0.00722 0.00394
## Cumulative Proportion  0.90737 0.92812 0.94598 0.9636 0.9789 0.98611 0.99005
##                           PC15    PC16    PC17    PC18    PC19    PC20    PC21
## Standard deviation     0.30271 0.23671 0.19467 0.17058 0.11126 0.06634 0.05948
## Proportion of Variance 0.00382 0.00233 0.00158 0.00121 0.00052 0.00018 0.00015
## Cumulative Proportion  0.99387 0.99620 0.99778 0.99899 0.99951 0.99969 0.99984
##                           PC22    PC23      PC24
## Standard deviation     0.04715 0.03983 1.207e-06
## Proportion of Variance 0.00009 0.00007 0.000e+00
## Cumulative Proportion  0.99993 1.00000 1.000e+00

# PCA with SVD
pca_SVD = princomp(nba_df_cleaned, cor=T)
summary(pca_SVD)

## Importance of components:
##                           Comp.1    Comp.2     Comp.3     Comp.4    Comp.5
## Standard deviation     2.8682141 2.5255533 1.51141800 1.23009037 1.1644937
## Proportion of Variance 0.3427772 0.2657675 0.09518268 0.06304676 0.0565019
## Cumulative Proportion  0.3427772 0.6085447 0.70372734 0.76677410 0.8232760
##                            Comp.6     Comp.7     Comp.8     Comp.9    Comp.10
## Standard deviation     0.88155692 0.84972077 0.72053177 0.70569521 0.65472139
## Proportion of Variance 0.03238094 0.03008439 0.02163192 0.02075024 0.01786084
## Cumulative Proportion  0.85565695 0.88574134 0.90737326 0.92812350 0.94598433
##                           Comp.11    Comp.12     Comp.13     Comp.14    Comp.15
## Standard deviation     0.64999065 0.60602359 0.416144178 0.307674754 0.30270579
## Proportion of Variance 0.01760366 0.01530269 0.007215666 0.003944323 0.00381795
## Cumulative Proportion  0.96358799 0.97889069 0.986106351 0.990050674 0.99386862
##                           Comp.16     Comp.17     Comp.18      Comp.19
## Standard deviation     0.23670589 0.194673058 0.170577179 0.1112640219
## Proportion of Variance 0.00233457 0.001579067 0.001212357 0.0005158201
## Cumulative Proportion  0.99620319 0.997782261 0.998994618 0.9995104379
##                             Comp.20      Comp.21      Comp.22      Comp.23
## Standard deviation     0.0663436601 0.0594830912 4.715408e-02 3.982794e-02
## Proportion of Variance 0.0001833951 0.0001474266 9.264613e-05 6.609437e-05
## Cumulative Proportion  0.9996938329 0.9998412595 9.999339e-01 1.000000e+00
##                             Comp.24
## Standard deviation     1.207352e-06
## Proportion of Variance 6.073741e-14
## Cumulative Proportion  1.000000e+00

comment

As we can see there is no differece between SVD and eigen decomposition so it doesn’t matter which one we use
The first principal component explains the most variability, followed by the second principal component,
and so on. By using principal components, we can reduce the dimensionality of the data set while
still retaining most of the variability.
The first principal component explained 34% of the variance,
but the second and third principal components explained 26.6% and 9.45

fviz_screeplot(pca_ed, addlabels = TRUE, hjust = -0.3,
               barfill="white", barcolor ="darkblue",
               linecolor ="red")

fviz_pca_ind(pca_ed, col.ind="contrib") +
  scale_color_gradient2(low="white", mid="darkblue",
                        high="red", midpoint=0.6)

Create a plot of variance explained for each principal component.

pr.var <- pca_ed$sdev ^ 2
pve <- pr.var/sum(pr.var)

plot(pve, xlab = "Principal Component", 
     ylab = "Proportion of Variance Explained", 
     ylim = c(0, 1), type = "b")

Plot cumulative proportion of variance explained

plot(cumsum(pve), xlab = "Principal Component", 
     ylab = "Cumulative Proportion of Variance Explained", 
     ylim = c(0, 1), type = "b")

comment

The sum of the eigenvalues for the first six principal components is equal to 89% of the total variation
in the data set. This indicates that the first six principal components are capturing a
significant amount of the variation in the data, and can be considered as a good representation of the data.

First Component

fviz_contrib(pca_ed, choice = "var", axes = 1)

comment

The red dashed line on the graph represents the expected average contribution
Of each variable to the principal component analysis (PCA)
If the contribution of the variables were uniform, meaning that each variable contributed equally to the PCA, then the expected value of each variable’s contribution would be 1/length(variables),
which in this case would be 1/14 or approximately 7%.
By ranking the players by their first PC scores, we can identify the best historical players in terms of performance. The first PC score represents the degree to which each player excels across all the variables
Being analyzed. Players with higher first PC scores are considered to be better performers overall,
While those with lower scores are considered to be relatively weaker performers. However,
It’s important to note that without more information about the specific variables and context,
It may not be meaningful to compare the players based solely on their first PC scores.

Let’s show the best 5 players in NBA and numbers of played games

calculateScore <- function(data,col) {
  return(sum((pca_ed$rotation[, 1]*data)^2))
} 

col <- 1
nba_df$player_name[sort.int(apply(nba_df_cleaned, 1, calculateScore), decreasing = T, index.return = T)$ix[1:5]]

## [1] "Billy Knight"    "Oscar Robertson" "Bob Davies"      "Jerry West"     
## [5] "Antawn Jamison"

col <- 2
nba_df$gp[sort.int(apply(nba_df_cleaned, 1, calculateScore), decreasing = T, index.return = T)$ix[1:5]]

## [1]  671 1040  462  932 1083

How much each variable contributes to the second component.

fviz_contrib(pca_ed, choice = "var", axes = 2)

# Now we visualize top 5 players with a contributions to first component
names_top_5 = names[order(get_pca_ind(pca_ed)$contrib[,1],decreasing=T)]
fviz_contrib(pca_ed, choice = "ind", axes = 1, top=5)+scale_x_discrete(labels=names_top_5)

let’s try T-SNE

try to normalize the data

nba_norm <- scale(nba_df_cleaned)

tsne_results <- Rtsne(nba_norm, dims = 2, perplexity = 7, verbose = TRUE)

## Performing PCA
## Read the 1002 x 24 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 2, perplexity = 7.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 0.06 seconds (sparsity = 0.029737)!
## Learning embedding...
## Iteration 50: error is 86.561426 (50 iterations in 0.14 seconds)
## Iteration 100: error is 78.250430 (50 iterations in 0.09 seconds)
## Iteration 150: error is 78.017078 (50 iterations in 0.08 seconds)
## Iteration 200: error is 78.010397 (50 iterations in 0.08 seconds)
## Iteration 250: error is 78.000621 (50 iterations in 0.08 seconds)
## Iteration 300: error is 1.905840 (50 iterations in 0.08 seconds)
## Iteration 350: error is 1.591351 (50 iterations in 0.08 seconds)
## Iteration 400: error is 1.491249 (50 iterations in 0.08 seconds)
## Iteration 450: error is 1.449161 (50 iterations in 0.09 seconds)
## Iteration 500: error is 1.419951 (50 iterations in 0.09 seconds)
## Iteration 550: error is 1.402818 (50 iterations in 0.08 seconds)
## Iteration 600: error is 1.391999 (50 iterations in 0.08 seconds)
## Iteration 650: error is 1.383686 (50 iterations in 0.08 seconds)
## Iteration 700: error is 1.376738 (50 iterations in 0.08 seconds)
## Iteration 750: error is 1.370636 (50 iterations in 0.08 seconds)
## Iteration 800: error is 1.363942 (50 iterations in 0.09 seconds)
## Iteration 850: error is 1.359771 (50 iterations in 0.08 seconds)
## Iteration 900: error is 1.356238 (50 iterations in 0.08 seconds)
## Iteration 950: error is 1.351958 (50 iterations in 0.08 seconds)
## Iteration 1000: error is 1.348208 (50 iterations in 0.08 seconds)
## Fitting performed in 1.71 seconds.

tsne_df <- data.frame(x = tsne_results$Y[,1], y = tsne_results$Y[,2])
ggplot(tsne_df, aes(x, y)) +
  geom_point() +
  ggtitle("t-SNE Visualization of NBA Dataset")

Dimension Reduction Project for National Basketball Association (NBA) Players

Name: Hamed Ahmed Hamed Ahmed

Student Id: 454827

I will apply different dimension reduction techniques like PCA, MDS and T-SNE

Dataset: collection of relevant, historical, statistics for National Basketball Association (NBA) Players

To provide more context, here are the meanings of the columns in an NBA dataset:

Dataset source: from Kaggle

Problem Statement:

Import the libraries

Load and prepare dataset

Let’s do some descriptive analysis for the 26 features

are the variables related to each other?

comment

So We need DATA dimension reduction technologies to see relations easier than on correlation matrix

Let’s try With PCA to reduce the features to 2 dimensions

PCA with eigen decomposition

comment

Create a plot of variance explained for each principal component.

Plot cumulative proportion of variance explained

comment

First Component

comment

Let’s show the best 5 players in NBA and numbers of played games

How much each variable contributes to the second component.

let’s try T-SNE

try to normalize the data

Final Conclusion