The file ppg2008.csv contains data about basketball players from the year 2008. It has various statistics on players in the NBA. You might not know what each of the metrics means (I don’t), but they are just different dimensions of data.

Find and interpret clusters in this dataset.

First, we start by looking at the un-scaled data.

# read the NBA data into a data frame, using player names as row labels
nba2 <- read.csv("ppg2008.csv", header = TRUE, row.names = 1)
library(pheatmap)
# named nba_heatmap so it does not mask the base R heatmap() function
nba_heatmap <- pheatmap(nba2)

# Heatmap with clusters of total NBA dataset
pdf("nba_pheatmap.pdf")
print(nba_heatmap)
dev.off()
## quartz_off_screen 
##                 2
# make a subset of the data with select columns that
# we think have high variability
# (indexing by column keeps the original column names, unlike cbind with a bare vector)
data_subset <- nba2[, c(1:6, 20)]
nbasubset <- pheatmap(data_subset)

# heatmap with clusters of the NBA data subset
pdf("nbasubset_pheatmap.pdf")
print(nbasubset)
dev.off()
## quartz_off_screen 
##                 2

Now we look into scaling data

ppg <- read.csv("ppg2008.csv", header = TRUE, sep = ",")

Re-scaling the data for use with heatmap

ppg.scale <- as.data.frame(scale(ppg[-1]))
row.names(ppg.scale) <- ppg[,1]
library(pheatmap)
pheatmap(ppg.scale)
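
As a quick check on what the re-scaling step does, here is a minimal sketch on a made-up three-player table. The player names and numbers are hypothetical and are not taken from ppg2008.csv. scale() centers each column to mean 0 and divides it by its standard deviation, so columns measured on very different ranges become comparable.

```r
# Toy data: hypothetical players and values, for illustration only
toy <- data.frame(
  Name = c("A", "B", "C"),
  G    = c(60, 70, 80),
  BLK  = c(0.5, 1.0, 1.5)
)

# Same steps as above: drop the name column, scale, restore row names
toy.scale <- as.data.frame(scale(toy[-1]))
row.names(toy.scale) <- toy[, 1]

# After scaling, each column has mean 0 and standard deviation 1
colMeans(toy.scale)
apply(toy.scale, 2, sd)
toy.scale
```

In this toy table, G and BLK differ by two orders of magnitude before scaling, yet both come out as -1, 0, 1 afterwards, which is why the scaled heatmap can show variation in every column.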

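library(Rtsne)
# t-SNE is stochastic; fix a seed (the value is arbitrary) so the layout is reproducible
set.seed(1)
ppg.tsne <- Rtsne(ppg.scale, perplexity = 5)
plot(ppg.tsne$Y)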

library(tsne)
library(plotly)
## Loading required package: ggplot2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
ppg.tsne2 <- tsne(ppg[,-1], initial_dims = 2)
## Warning in if (class(X) == "dist") {: the condition has length > 1 and only the
## first element will be used
## sigma summary: Min. : 0.632077571294868 |1st Qu. : 0.729327686661631 |Median : 0.79839585611719 |Mean : 0.81090235325052 |3rd Qu. : 0.879390940940098 |Max. : 1.08837674559523 |
## Epoch: Iteration #100 error is: 14.4836948041459
## Epoch: Iteration #200 error is: 1.41291909497627
## Epoch: Iteration #300 error is: 1.15892719883642
## Epoch: Iteration #400 error is: 1.02977478743371
## Epoch: Iteration #500 error is: 0.960767602546145
## Epoch: Iteration #600 error is: 0.865345505053643
## Epoch: Iteration #700 error is: 0.785557206132158
## Epoch: Iteration #800 error is: 0.726750792598151
## Epoch: Iteration #900 error is: 0.63884053519752
## Epoch: Iteration #1000 error is: 0.567095839912868
ppg.tsne2 <- data.frame(ppg.tsne2)
ppg.pdata <- cbind(ppg.tsne2,ppg)
# single scatter trace; hovering over a point shows the player's name, PTS, and FGA
fig <- plot_ly(
  data = ppg.pdata, x = ~X1, y = ~X2,
  type = 'scatter', mode = 'markers',
  marker = list(size = 3),
  text = ~paste0("Name: ", Name, " PTS: ", PTS, " FGA: ", FGA)
)

fig

Comments:

Looking at the un-scaled data, column G, which stands for “games played”, shows a lot of variability, as top players would likely have more play time; Dwyane Wade, Kevin Durant, and LeBron James are examples of top players we often hear about. The middle chunk appears to be quite uniform, which makes sense: these statistics likely do not depend on a player being a top player and should not vary much between players, since basketball is a team sport. For example, BLK stands for “blocks” and STL stands for “steals”, which would likely not vary much from player to player. When we narrowed the dataset to a subset of columns via the “nbasubset_pheatmap”, the right section of the heatmap also appears uniform compared to the left section, which includes the most variable column. Overall, the un-scaled data provides a unique perspective, but because every column shares one color scale, scaled data is needed for a more nuanced analysis.

For the scaled data, as far as clustering and visualization go, and considering the task, “think of it as your job, as a reporter for NY times, to make a single graphic that highlights something about this data,” the heatmap is the most interesting to the reader. While we can do dimensionality reduction, and have shown that, reducing the dimensionality does not provide exciting, new, contextualized information for the reader. PCA or tSNE values aren’t inherently meaningful or valuable outside of clustering the higher-dimensional data into a 2D projection. On the other hand, the figure produced using pheatmap remains interpretable and information-rich in a condensed format. Its hierarchical clustering can show which players your favorite player performs most similarly to, because they are close in distance. The interactive tSNE plot may not seem as comprehensive as the heatmap, but it could enhance the user experience and provide additional insight into individual players’ stats through its hover text.
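
The “close in distance” reading of the heatmap can be sketched with base R alone. This is a minimal illustration on hypothetical players and values, not the actual ppg2008.csv data, and the helper nearest() is introduced here only for the example. It uses Euclidean distance and complete linkage, which are pheatmap's defaults for row clustering.

```r
# Toy stats: hypothetical players and values, for illustration only
toy <- data.frame(
  PTS = c(30, 29, 12, 11),
  AST = c(7, 8, 2, 3),
  row.names = c("P1", "P2", "P3", "P4")
)
toy.scale <- scale(toy)

# Pairwise Euclidean distances between players on the scaled stats
d <- as.matrix(dist(toy.scale))

# Hypothetical helper: the other player closest to a chosen one
nearest <- function(player) {
  others <- d[player, colnames(d) != player]
  names(others)[which.min(others)]
}
nearest("P1")

# Complete-linkage hierarchical clustering, cut into two groups
groups <- cutree(hclust(dist(toy.scale), method = "complete"), k = 2)
groups
```

Here P1 and P2 land in one group and P3 and P4 in the other, mirroring how neighbouring rows in the heatmap's dendrogram correspond to players with similar stat lines.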