Business Problem

The Wizards over the past few years have been mediocre at best and abysmal at worst. Thankfully, machine learning can be used to save the team from irrelevance. The data set used in this process is available on basketball-reference.com and has statistics ranging from games played to effective field goal percentage. The first step in assembling the model was data cleaning: the statistics from basketball-reference and the accompanying salary data were merged, columns were renamed, and duplicate player entries were removed. After this processing, all numeric variables were scaled so they could be included in the k-means clustering model.
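The cleaning steps described above can be sketched in base R. This is a minimal illustration on toy data, not the original cleaning script: the data frames `stats` and `salary`, their column names, and every value are hypothetical. Leaving salary out of the scaled matrix is also an assumption made here, so that a later salary comparison stays independent of the clustering input.

```r
# Toy stand-ins for the two sources; names and values are hypothetical.
stats = data.frame(
  Player = c("A", "A", "B", "C"),
  Tm     = c("TOT", "WAS", "ATL", "DAL"),
  G      = c(30, 12, 28, 25),
  PTS    = c(10.2, 8.1, 13.5, 28.6)
)
salary = data.frame(
  Player    = c("A", "B", "C"),
  `2020-21` = c(1.5e6, 2.8e6, 8.0e6),
  check.names = FALSE
)

# Merge the statistics with the salary data on the player name.
nba = merge(stats, salary, by = "Player")

# Rename the awkward salary column.
names(nba)[names(nba) == "2020-21"] = "Salary"

# Remove duplicate player entries, keeping one row per player.
nba = nba[!duplicated(nba$Player), ]

# Scale the numeric performance columns (mean 0, standard deviation 1).
stat_cols = setdiff(names(nba)[sapply(nba, is.numeric)], "Salary")
scaled_nba = scale(nba[, stat_cols])
```

Scaling matters for k-means because the algorithm works on Euclidean distances, so an unscaled column with large values (minutes, for example) would otherwise dominate columns measured as percentages.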

The model features three clusters of players, generally classifying players by the quality of their play according to the data. When the points in the clusters were color coded by salary level (low salaries under 3 million dollars, medium salaries between 3 million and 13 million dollars, and high salaries over 13 million dollars), it became visually apparent that salary closely followed the three clusters. The best representation of this is the Points vs. Assists graph (see below), which comes closest to a linear relationship.
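A rough sketch of this step, assuming synthetic data in place of the real player table: `toy`, `toy_salary`, and the handling of salaries that fall exactly on 3 or 13 million are all assumptions for illustration. It fits a three-cluster k-means model, bins salaries into the three levels described above, and cross-tabulates the two.

```r
set.seed(1)

# Hypothetical scaled statistics for 60 synthetic players.
toy = scale(matrix(rnorm(120), ncol = 2,
                   dimnames = list(NULL, c("PTS", "AST"))))

# Hypothetical salaries in dollars.
toy_salary = runif(60, min = 0.5e6, max = 30e6)

# Three-cluster k-means; nstart reruns the algorithm from several
# random starts and keeps the best solution.
km = kmeans(toy, centers = 3, nstart = 25)

# Salary levels: low (up to 3 million), medium (3 to 13 million),
# high (over 13 million).
salary_level = cut(toy_salary,
                   breaks = c(-Inf, 3e6, 13e6, Inf),
                   labels = c("low", "medium", "high"))

# Cross-tabulate cluster membership against salary level.
table(cluster = km$cluster, salary = salary_level)
```

On the real data, a table like this gives a numeric check on what the color-coded plots show visually: if salary follows the clusters, most of the counts concentrate in one cell per row.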

Plots

Turnovers vs. Steals

ggplot

Plotly

Minutes Played vs. Games

ggplot

Plotly

Effective FG% vs. FG Attempts

ggplot

Plotly

Points vs. Assists

ggplot

Plotly

Confirming Model Accuracy

# Evaluate the quality of the clustering

# Between-cluster sum of squares: "betweenss" is the sum of squared
# distances from each cluster center to the overall center of the data,
# weighted by the number of points in the cluster.
num_NBA = kmeansObjNBA$betweenss

# Total sum of squares: "totss" is the sum of squared distances
# from every point to the overall center of the data.
denom_NBA = kmeansObjNBA$totss

# Share of the total variance accounted for by the clusters.
(var_exp_NBA = num_NBA / denom_NBA)
## [1] 0.4677652
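The ratio above works because k-means splits the total sum of squares into a between-cluster part and a within-cluster part. The sketch below checks that decomposition on synthetic data; `toy` and `km` are hypothetical stand-ins, not the NBA objects.

```r
set.seed(1)

# Hypothetical scaled data: 100 points, 2 variables.
toy = scale(matrix(rnorm(200), ncol = 2))
km = kmeans(toy, centers = 3, nstart = 10)

# Total sum of squares computed by hand: squared distance of every
# point from the overall column means.
manual_totss = sum(scale(toy, scale = FALSE)^2)

# The total splits into between-cluster and within-cluster parts.
all.equal(km$totss, manual_totss)
all.equal(km$totss, km$betweenss + km$tot.withinss)

# Explained variance is the between-cluster share of the total.
km$betweenss / km$totss
```

A ratio near 1 means the points sit close to their cluster centers; a ratio near 0 means the clusters barely differ from the data set as a whole.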

This model explains 46.8 percent of the variance present. The rest is noise the model is unable to explain, which may result from a number of factors, including limited sample sizes for certain players. While there are nearly 400 players to model, many do not play enough minutes to generate useful values for statistics like three point percentage. A player who makes two three pointers on two total attempts has a three point percentage of 100, but this may not be representative of the player’s genuine quality. Furthermore, this is only partial season data, which means players’ true averages have yet to take shape during this particular season. This may be one reason why the model has not yet seen true separation between clusters: the truly valuable players have yet to stand out from the rest, and the less valuable players may be experiencing a good run of form.
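The small-sample problem is easy to reproduce. The sketch below uses two made-up players and a hypothetical cutoff, `min_attempts`. The filter is one possible way to mask unreliable percentages, shown for illustration only. It is not a step taken in this analysis.

```r
# Hypothetical shooting data: made three pointers and attempts.
shooting = data.frame(
  Player = c("X", "Y"),
  X3P    = c(2, 95),
  X3PA   = c(2, 250)
)

# Raw three point percentage: player X is a "perfect" shooter on 2 attempts.
shooting$X3P_pct = shooting$X3P / shooting$X3PA

# Hypothetical minimum number of attempts before trusting the percentage.
min_attempts = 20

# Mask percentages that rest on too few attempts.
shooting$X3P_pct_filtered = ifelse(shooting$X3PA >= min_attempts,
                                   shooting$X3P_pct, NA)
shooting
```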

# Use the function we created to evaluate several different numbers of clusters

# The function explained_variance wraps our code for calculating 
# the variance explained by clustering.
explained_variance = function(data_in, k){
  
  # Running the kmeans algorithm.
  set.seed(1)
  kmeans_obj = kmeans(data_in, centers = k, algorithm = "Lloyd", iter.max = 30)
  
  # Variance accounted for by clusters:
  # var_exp = intercluster variance / total variance
  var_exp = kmeans_obj$betweenss / kmeans_obj$totss
  var_exp  
}

# sapply() calls explained_variance once for each k in 1:10 and
# simplifies the results into a numeric vector
# (lapply() would return a list instead).
explained_var_NBA = sapply(1:10, explained_variance, data_in = scaled_nba)

# Data for ggplot2.
elbow_data_NBA = data.frame(k = 1:10, explained_var_NBA)
# elbow_data_NBA is already a data frame, so it can be printed directly.
elbow_data_NBA
##     k explained_var_NBA
## 1   1     -3.560363e-15
## 2   2      3.475107e-01
## 3   3      4.677652e-01
## 4   4      5.024044e-01
## 5   5      5.688930e-01
## 6   6      6.049012e-01
## 7   7      6.174351e-01
## 8   8      6.300188e-01
## 9   9      6.412504e-01
## 10 10      6.561308e-01

As the number of clusters increases, the amount of variance explained by the model increases as well. While it may seem wise to simply increase the number of clusters, the likelihood that there are ten truly different levels of NBA player is low. With a single cluster the model explains none of the variance (the tiny negative value for k = 1 is floating-point rounding of zero), so it would be completely inadequate. Two clusters is certainly better, but only explains 34.8 percent of the variance in the dataset. Adding one more cluster explains an additional 12 percentage points. Beyond three, each additional cluster adds at most about 6.6 percentage points, and usually far less. This is ultimately why 3 clusters were chosen. The choice is also consistent with the idea that there are around 3 different contract sizes in the NBA: one for the superstars, one for the role players, and one for the players who fill the rest of the bench. These salaries are then easily mapped onto each cluster, and if a player is outperforming the other players in his cluster, it is likely he provides value beyond his contract.
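The marginal gain from each additional cluster can be recomputed directly from the printed table. In the sketch below the values are copied from that output, with the k = 1 entry treated as exactly zero; the names `explained` and `gain` are introduced only for this example.

```r
# Explained variance for k = 1..10, copied from the table above
# (the k = 1 value is floating-point noise around zero).
explained = c(0, 0.3475107, 0.4677652, 0.5024044, 0.5688930,
              0.6049012, 0.6174351, 0.6300188, 0.6412504, 0.6561308)

# Percentage points gained by moving from k - 1 to k clusters.
gain = diff(explained) * 100

data.frame(k = 2:10, gain_pct_points = round(gain, 1))
```

The gain drops from 12.0 points at k = 3 to 3.5 points at k = 4, which is the "elbow" that motivates stopping at three clusters.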

Under these conditions, there are a few players who stand out. In order to return the destitute Wizards to glory, offering slightly larger contracts to players currently on low and medium contracts could prove beneficial, alongside an effective draft strategy. In particular, Kevin Huerter, who is on a low contract, averages just over 32 minutes per game in 25 starts and scores approximately 13.5 points per game (PPG). Admittedly, these are not All-Star numbers, but at 22 years old Huerter has plenty of time to grow his game. Another player with serious potential is Devonte’ Graham. While older than Huerter at 25, he could be an upgrade at point guard or a solid man off the bench. He may also come cheap, given that breakout star LaMelo Ball recently usurped Graham as the Hornets’ starting point guard while Graham was recovering from injury.

In terms of players who will command more expensive contracts, Trae Young and Luka Doncic stand out. Young provides his team over 26 PPG as well as 9.4 assists per game. At 22 years old, he is nearly averaging a double-double. Similarly, Doncic averages nearly 29 PPG and with one more assist per game would also average a double-double. Doncic also has the highest effective field goal percentage of the three at 54.1 percent, which is higher than the league average of 52.9 percent. This means he is a more than capable shooter, with good range, and is also able to distribute. If the Wizards decide to offer a max contract, it should be to Luka Doncic, who at 21 years old has many years left to be the face of the franchise.