Data Preparation

Data preparation was an essential step before studying the dataset. To get the data ready, we first merged the player statistics with the player salary data. Then, we created a function that can be used to clean future datasets as well as the one currently being studied. In this function, “nba_pre_processing”, we remove players who haven’t played in any games, players whose salary is not reported, and duplicate rows. Next, we handle traded players by keeping only their total season data. We then set missing shooting percentages to 0 for players who have no shot attempts. Finally, we create a new column that bins the salary data into groups.

# Loading dplyr for left_join(), mutate(), select(), and %>%

library(dplyr)

# Reading in the data, and using a left join to match each player with their salary

nba_data <- read.csv("nba2020-21.csv")
nba_salary <- read.csv("nba_salaries_21.csv")
nba <- left_join(nba_data, nba_salary, by = "Player")
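
Because a left join keeps every row of the statistics table, any player with no match in the salary table ends up with a missing salary, which is what the salary filter in the function below relies on. A minimal sketch of that behavior, using invented rows and base R's merge() as a stand-in for left_join() so the snippet needs no packages:

```r
# Toy tables; player names and numbers are invented for illustration only
stats  <- data.frame(Player = c("A", "B", "C"), PTS = c(10, 20, 5))
salary <- data.frame(Player = c("A", "B"), X2020.21 = c(1e6, 5e6))

# all.x = TRUE keeps every row of `stats`, like a left join
joined <- merge(stats, salary, by = "Player", all.x = TRUE)

# Players present in the statistics but missing from the salary table
unmatched <- joined$Player[is.na(joined$X2020.21)]
print(unmatched)
```

Checking the unmatched names before dropping them can reveal spelling or encoding differences between the two files.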

# Creating the function to prepare the dataset

nba_pre_processing <- function(data){
  
  data <- data[data$G != 0,] # Excluding players who haven't played in a game
  data <- data[!is.na(data$X2020.21),] # Excluding players w/o salary info
  data <- unique(data) # Removing repeats
  
  # For players who are traded, we only want the total season numbers
  
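  # Assumes exactly two team rows directly follow each "TOT" row
  trade_indexes <- which(data$Tm == "TOT")
  trade_indexes_remove <- c(trade_indexes+1, trade_indexes+2)
  if (length(trade_indexes_remove) > 0) { # an empty negative index would drop every row
    data <- data[-trade_indexes_remove,]
  }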
  
  # Getting rid of NAs for players w/o shot attempts
  
  # Indexing the column directly also works when a column has no NAs
  data$FG.[is.na(data$FG.)] <- 0
  data$eFG.[is.na(data$eFG.)] <- 0
  data$X3P.[is.na(data$X3P.)] <- 0
  data$X2P.[is.na(data$X2P.)] <- 0
  data$FT.[is.na(data$FT.)] <- 0
  
  # Creating bins for salary and a column for it
  
  breaks <- c(0, 3000000, 13000000, 43006362)
  tags <- c("[0-3,000,000)", "[3,000,000-13,000,000)", "[13,000,000-43,006,362]")
  groups <- cut(data$X2020.21,
                breaks = breaks,
                include.lowest = TRUE,
                right = FALSE,
                labels = tags)
  
  # cut() already returns a factor, and the pipe supplies `data` to mutate()
  data <- data %>%
    mutate(SalaryGroup = groups)
  
  return(data)
}
nba <- nba_pre_processing(nba)
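
The index arithmetic in the function assumes that exactly two team rows directly follow each "TOT" row. The sketch below shows an alternative that filters by player name instead, so it would also cover a player with three or more teams. It uses invented rows and is not the method used to produce the results in this report:

```r
# Toy data: player "B" has a season-total row plus three team rows
toy <- data.frame(
  Player = c("A", "B", "B", "B", "B", "C"),
  Tm     = c("ATL", "TOT", "NYK", "PHI", "ORL", "BOS"),
  PTS    = c(100, 90, 30, 20, 40, 50)
)

# Players who have a season-total row
traded <- unique(toy$Player[toy$Tm == "TOT"])

# Drop the per-team rows of traded players, keep everything else
keep <- !(toy$Player %in% traded & toy$Tm != "TOT")
toy_totals <- toy[keep, ]
print(toy_totals)
```

Matching on the name assumes that no two players share the same name; a unique player ID would be safer if the data provides one.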

Data

Here is a data table showing the variables being studied.

datatable(nba, options = list(pageLength = 5))

Clustering

To prepare the data for clustering, we must standardize the variables by scaling them. After standardizing the numeric variables, we cluster all of them together, except salary, to assess overall performance. The data is clustered with three centers to match the three salary groups being studied.

# Scaling the data

nba_scaled <- nba %>%
  select(Age, G:PTS) %>%
  scale() %>%
  data.frame()
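
As a quick illustration of what scale() does here: each column is centered on its mean and divided by its standard deviation, so that large-magnitude statistics such as points do not dominate the distance calculation. A small sketch on invented numbers:

```r
# Two invented columns on very different scales
toy <- data.frame(PTS = c(100, 500, 900, 1300), STL = c(10, 20, 40, 50))

toy_scaled <- data.frame(scale(toy))

# Each column now has mean 0 and standard deviation 1
print(round(colMeans(toy_scaled), 10))
print(apply(toy_scaled, 2, sd))
```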

# Creating the cluster data

clust_data <- nba_scaled

set.seed(1)
kmeans_obj <- kmeans(clust_data, centers = 3,
                     algorithm = "Lloyd")

clusters <- as.factor(kmeans_obj$cluster)

nba <- nba %>%
  mutate(Cluster = clusters)

Next, we will evaluate the clusters.

num <- kmeans_obj$betweenss
denom <- kmeans_obj$totss
(var_exp <- num / denom)

## [1] 0.4607729

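Here, the between-cluster sum of squares accounts for about 46% of the total sum of squares, so we consider the cluster quality good enough to proceed.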
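
The ratio above is the share of total variance that lies between clusters. One way to put a single value in context is to compute it for several values of k. The sketch below uses R's built-in iris measurements purely as a stand-in dataset; the loop and the choice of nstart are assumptions, not part of the original analysis:

```r
# Stand-in data: scaled numeric columns of the built-in iris dataset
x <- scale(iris[, 1:4])

set.seed(1)
ks <- 2:6
var_exp_by_k <- sapply(ks, function(k) {
  fit <- kmeans(x, centers = k, nstart = 25)
  fit$betweenss / fit$totss   # between-cluster SS over total SS
})
names(var_exp_by_k) <- ks
print(round(var_exp_by_k, 3))
```

The ratio generally rises as k grows, so it is more useful for comparing candidate values of k than as an absolute score.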

Visualizations

First, we need to assess correlations to salary to know which variables should be included in visualizations.

cor(nba$X2020.21, nba[,5:30])

##              G        GS        MP        FG       FGA       FG.       X3P
## [1,] 0.2081611 0.5590377 0.4950764 0.6097227 0.6022275 0.1433155 0.4446676
##           X3PA      X3P.       X2P      X2PA       X2P.      eFG.        FT
## [1,] 0.4417321 0.1528258 0.5653079 0.5771586 0.06709028 0.1452944 0.5843189
##            FTA       FT.       ORB       DRB       TRB      AST       STL
## [1,] 0.5734956 0.1980282 0.2125967 0.4805635 0.4295558 0.599349 0.4666237
##            BLK       TOV       PF      PTS X2020.21
## [1,] 0.2404494 0.5954018 0.319019 0.619596        1

After running the correlations, we see that points (PTS), assists (AST), steals (STL), and turnovers (TOV) are all at least moderately and positively associated with salary, with correlations between roughly 0.47 and 0.62.
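
Reading the strongest associations off a wide printout is error-prone, so it may help to sort the correlations first. A minimal sketch, using the built-in mtcars data as a stand-in, with mpg playing the role of salary (the variable names here are not from this report):

```r
# Stand-in data: built-in mtcars, with mpg playing the role of salary
target <- mtcars$mpg
others <- mtcars[, setdiff(names(mtcars), "mpg")]

# cor() returns a 1-row matrix; take that row and sort it
r <- cor(target, others)[1, ]
ranked <- sort(r, decreasing = TRUE)
print(round(ranked, 2))
```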

The series of graphs that follows visualizes the cluster results. The variables on the x- and y-axes will be two of the four variables found to be associated with salary. In every graph, each point is labeled with the player’s cluster, with 3 representing higher overall performance and 1 representing lower performance. The points are colored by salary group: red is the lowest group, from 0 to 3 million; green is the middle group, from 3 million to 13 million; and blue is the highest group, 13 million and above.

This first graph plots steals versus turnovers. It shows how often a player takes the ball from the other team compared to how often he gives it up.

As we can see, these variables don’t have a strongly linear relationship. The points are somewhat scattered, especially in the higher steals range. In addition, steals and turnovers are not the best indicators of performance. They are important, but they are not the be-all and end-all of winning games, which is why they are not as correlated with salary as some of the other variables.

The second graph plots assists versus turnovers, a comparison commonly studied among passers in the NBA. The assist-to-turnover ratio is an advanced statistic that is often overlooked, but it is something that defines great passers. Being able to pass a lot without losing the ball is a pivotal skill in the NBA that translates to wins.
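
For reference, the assist-to-turnover ratio is simply assists divided by turnovers. A small sketch with invented season totals for three hypothetical players:

```r
# Invented season totals for three hypothetical players
toy <- data.frame(
  Player = c("Guard A", "Guard B", "Forward C"),
  AST    = c(400, 250, 90),
  TOV    = c(100, 125, 60)
)

# Assist-to-turnover ratio; NA when a player has no turnovers
toy$AST_TOV <- ifelse(toy$TOV > 0, toy$AST / toy$TOV, NA)
print(toy[order(-toy$AST_TOV), ])
```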

Here, we see a much more linear relationship, showing association. However, since the two variables are highly associated, we see very few outliers who could be potential targets to sign. No one sticks out in this graph as under- or over-performing besides a few players, who will be addressed later.

The third graph plots points versus assists. These are two of the most classic offensive statistics to study, and both are very important to overall performance and winning, which means higher salary.

Again, we see a fairly linear relationship, which is good. However, the data points are a little less tightly clustered, leaving room for players to be drastically over-performing their salaries. With this in mind, we will proceed with points and assists to evaluate players we want to target.

Analysis

Players such as the ones circled in the image below are clearly over-performing their salary groups, not only in terms of the clustered performance data, but also in points and assists. Therefore, we have to go in and pick out some of these players to help our team.

Player Selection

Here, we are creating an interactive plot, where we can see which player represents each point, as well as their salary and cluster.

# Removing players with special characters that mess up plot_ly

nba_3d <- nba[-c(20, 65, 68, 82, 100, 222, 249, 267, 311, 312, 313, 386, 390, 
                 392, 421),]

# Creating plot_ly graph

plot_ly(nba_3d, type = "scatter", mode="markers", x = ~PTS, 
        y = ~AST, color = ~SalaryGroup, 
        colors = c("red","forestgreen", "darkblue"),
        text = ~paste('Player:', Player, '\n', 'Salary:', X2020.21, '\n', 
                      'Cluster:', Cluster))
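
The hard-coded row indices above only apply to this exact dataset. If the problem rows are the ones whose names contain non-ASCII characters (an assumption; the report does not say exactly which characters cause trouble), they could be found programmatically instead. A minimal sketch with invented names:

```r
# Invented names; the \u00e9 escape stands for an accented character
players <- data.frame(
  Player = c("Player One", "Jos\u00e9 Example", "Player Three"),
  stringsAsFactors = FALSE
)

# iconv() returns NA when a string cannot be represented in ASCII
non_ascii <- is.na(iconv(players$Player, from = "UTF-8", to = "ASCII"))

players_ascii <- players[!non_ascii, , drop = FALSE]
print(players_ascii$Player)
```

This keeps working if rows are added or reordered, although it drops those players from the plot just as the index-based approach does.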

Recommendations

From this graph, we can give recommendations on who to sign. The first two players we recommend are long-term investments that involve building around a young star. Trae Young and Luka Doncic are both only in their third year, yet they are dominating the league in overall performance, especially points and assists, as can be seen in the top right of the graph. However, both players are still on their rookie contracts, which explains why their salaries are so low at the moment. Both players will certainly want max contracts once their current contracts are over. Despite this, we feel it is worth going after one of these young stars, as we could bring him into our team, help him develop even more, and then have one of the best players in the league for years to come. Winning teams always have at least one superstar, so getting one of these players could be our biggest chance at long-term success.

However, if we are looking to go the other route and acquire players who are over-performing now, we should look at players in the lower salary group who are performing above the rest. When we look at the graph, we see two red points that especially stick out from the others. Both are in cluster three, the highest-performing cluster, and are making under three million in salary. These players, Devonte Graham and Kevin Huerter, are valuable assets to a team while making little in salary. Granted, these players have found newfound success over the last year or two, which could lead them to ask for a larger contract in the future. But if we want success right now with the most bang for our buck, we should go with Graham or Huerter.
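
The same selection can be made without reading points off the graph, by filtering for the top-performing cluster within the lowest salary group. A minimal sketch on invented rows that mirror the columns used in this report; it assumes, as above, that cluster 3 is the highest-performing cluster, which should be checked after each run because k-means labels are arbitrary:

```r
# Invented rows mirroring the columns used in this report
toy <- data.frame(
  Player      = c("P1", "P2", "P3", "P4"),
  Cluster     = factor(c(3, 3, 1, 2)),
  SalaryGroup = c("[0-3,000,000)", "[3,000,000-13,000,000)",
                  "[0-3,000,000)", "[3,000,000-13,000,000)"),
  PTS         = c(900, 1500, 200, 600),
  AST         = c(350, 500, 40, 300)
)

# Top-performing cluster combined with the lowest salary group
bargains <- toy[toy$Cluster == "3" & toy$SalaryGroup == "[0-3,000,000)", ]
print(bargains[order(-bargains$PTS), c("Player", "PTS", "AST")])
```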

One more player I recommend is TJ McConnell. McConnell is in the second cluster, but he is one of the league’s top assisters. He has more assists than points, which could really help our team, assuming we surround him with scorers. In addition, he is only making 3.5 million, which is barely over the threshold for the second salary group. McConnell has been making a big impact with the Pacers this season, so I don’t think we could go wrong here.

NBA Clustering Lab

John Hope