The ability to make a good decision when taking a shot is a crucial part of a basketball player’s success. A player’s position usually defines their role and objectives on the court. For the purposes of this study, positions were simplified into two categories: guards (G) and forwards (F). Basic basketball knowledge suggests that guards generally take shots from longer distances, whereas forwards take shots closer to the rim. But basketball has changed, and so have the shots players take; that change is what this study is about.
Two research questions were proposed. First: does a player’s position affect their preference for shot distance? Second, and connected to the first: can we predict a player’s position from their “shot profile”? A shot profile is the percentage of a player’s shots taken from each distance range.
The dataset used to answer these questions comes from the 2023/2024 NBA season and covers all 572 players who competed. Each player has a few variables: Position, a binary categorical variable with values F for forward and G for guard, and the share of the player’s shot attempts taken from each of five distance ranges from the rim: 0-3 feet, 3-10 feet, 10-16 feet, 16 feet to the 3-point line, and 3-point (3P) attempts. Each share is recorded as a proportion of all shot attempts, a number between 0 and 1. The data were collected by high-tech cameras set up in NBA arenas, similar to the Hawk-Eye cameras used in tennis; they track each player’s position on the court, which provides the shot distance.
In order to answer the research questions, nominal (multinomial) logistic regression was used. It is suited to studying the relationship between a dependent variable with three or more categories (ShotDistance) and one or more independent variables (here, Position). It is used when the response variable is nominal, meaning that the categories are treated as having no inherent order. The technique estimates the probability of each category of the response variable as a function of the predictors, and the output shows a set of coefficients for each category of the dependent variable, relative to the reference category. The intercept is the log-odds of a category versus the reference category at the baseline level of the predictor, and each predictor coefficient is the change in those log-odds associated with that predictor. The reference category is chosen either automatically by the software or manually, depending on which is most meaningful in the context of the study. In this study, 0-3 feet was the reference category, as it is the closest and the most efficient shot in basketball.
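As a small illustration of how the reference category can be set in R, the sketch below uses a made-up factor (`zone` and its values are hypothetical, not taken from the project data). `multinom` treats the first level of the response factor as the baseline, and `relevel()` moves a chosen level to the front.

```r
# Hypothetical dominant shot zones for six players (illustration only)
zone <- factor(c("3P", "0-3 ft", "3-10 ft", "3P", "0-3 ft", "10-16 ft"))

# Levels are sorted by default; the first level acts as the baseline
levels(zone)

# Make 3P the reference category instead
zone_3p <- relevel(zone, ref = "3P")
levels(zone_3p)
```

Setting the levels explicitly with `factor(..., levels = ...)` has the same effect and keeps the remaining categories in a chosen order.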
When nominal logistic regression is compared with other techniques in the context of this research, its advantage is clear. Binary logistic regression could have been used, but it would require a separate model for each pair of categories (for example, 0-3 feet vs. 10-16 feet), which takes more time and does not estimate all categories jointly. Furthermore, nominal logistic regression makes no strict distributional assumptions about the predictors, which makes it applicable to real-world problems where distributions do not always follow standard patterns.
The model assumes independence of observations, no multicollinearity between independent variables, a linear relationship between any continuous predictors and the logit of the response, and no outliers or highly influential points, in addition to the requirements on the response and explanatory variables already mentioned.
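The main purpose of this study was to investigate the relationship between a player’s position and their shot distance preferences. Each player was assigned the distance range in which they took the largest share of their attempts, and that category (ShotDistance) was modelled in R as follows: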
library(nnet) #"nnet" package needed for multinom
data <- read.csv("ProjectData - Sheet1.csv", header = TRUE) #loading dataset
data$Position <- as.factor(data$Position) #convert variable position into factor, so we can use it later
# For each player, pick the distance range with the largest share of attempts
data$ShotDistance <- apply(
  data[, c("X03ftpctAtt", "X310ftpctAtt", "X1016ftpctAtt", "X163PpctAtt", "X3PAtt")],
  1, function(row) names(row)[which.max(row)])
# Relabel the factor levels so they are easier to read; "0-3 ft" comes first and is the reference
data$ShotDistance <- factor(data$ShotDistance,
  levels = c("X03ftpctAtt", "X310ftpctAtt", "X1016ftpctAtt", "X163PpctAtt", "X3PAtt"),
  labels = c("0-3 ft", "3-10 ft", "10-16 ft", "16ft-3P", "3P"))
# Fit the multinomial logistic regression model, ShotDistance as the dependent variable, Position as the predictor
model <- multinom(ShotDistance ~ Position, data = data)
## # weights: 15 (8 variable)
## initial value 920.598486
## iter 10 value 488.331193
## iter 20 value 487.699000
## iter 30 value 487.696246
## final value 487.692765
## converged
summary(model)
## Call:
## multinom(formula = ShotDistance ~ Position, data = data)
##
## Coefficients:
## (Intercept) PositionG
## 3-10 ft -1.5043524 0.8763084
## 10-16 ft -3.3763564 1.0750520
## 16ft-3P -13.7168604 11.0106203
## 3P 0.4581677 1.3829704
##
## Std. Errors:
## (Intercept) PositionG
## 3-10 ft 0.2168352 0.3779769
## 10-16 ft 0.5085836 0.7906278
## 16ft-3P 87.9990814 88.0021081
## 3P 0.1181186 0.2293334
##
## Residual Deviance: 975.3855
## AIC: 991.3855
Executing this R code addresses the first question: does a player’s position affect their preference for shot distance? The reference category for ShotDistance was 0-3 ft and the reference level for Position was F. Each intercept is therefore the log-odds, for Forwards, of having a given distance range rather than 0-3 ft as the most frequent one, and each PositionG coefficient is the amount by which those log-odds differ for Guards.
What do these numbers mean? For the 3-10 feet category, the intercept is -1.5043524, suggesting that Forwards are less likely to have 3-10 feet than 0-3 feet as their most frequent range. The PositionG coefficient is positive (0.8763084), suggesting that, relative to 0-3 feet, Guards are more likely than Forwards to favor this range. In general, a negative intercept means that Forwards favor that distance less than the reference category (0-3 feet), and a positive PositionG coefficient means that Guards lean toward that distance more strongly than Forwards do. The standard errors for 16ft-3P (about 88) show that the model cannot estimate this category reliably, probably because of a lack of data: in modern basketball this is one of the rarest shots players take, so very few players have it as their most frequent range.
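Log-odds are easier to read once they are turned into probabilities. The sketch below copies the coefficients printed by `summary(model)` and applies the softmax transformation by hand; the helper `softmax` and the objects `coefs`, `eta_F`, `eta_G` and `probs` are names introduced here for illustration, and the result is only as reliable as the rounded coefficients it starts from.

```r
# Coefficients copied from summary(model); rows are log-odds versus 0-3 ft
coefs <- rbind(
  "3-10 ft"  = c(-1.5043524,  0.8763084),
  "10-16 ft" = c(-3.3763564,  1.0750520),
  "16ft-3P"  = c(-13.7168604, 11.0106203),
  "3P"       = c(0.4581677,   1.3829704)
)
colnames(coefs) <- c("(Intercept)", "PositionG")

# Log-odds for each position; the reference category has log-odds 0
eta_F <- c("0-3 ft" = 0, coefs[, "(Intercept)"])
eta_G <- c("0-3 ft" = 0, coefs[, "(Intercept)"] + coefs[, "PositionG"])

# Softmax: exponentiate and normalise so the probabilities sum to 1
softmax <- function(eta) exp(eta) / sum(exp(eta))
probs <- cbind(Forward = softmax(eta_F), Guard = softmax(eta_G))
round(probs, 3)
```

Under these coefficients, 3P is the most frequent range for roughly 56% of Forwards and 79% of Guards, while 0-3 ft is the most frequent range for roughly 35% of Forwards and 12% of Guards. With the fitted model available, `predict(model, newdata = data.frame(Position = c("F", "G")), type = "probs")` should give the same table.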
Let’s move on to the second research question: can we predict a player’s position based on their “shot profile”? This R chunk helps with that:
# Swap the roles of the variables: predict Position from ShotDistance
model_position <- multinom(Position ~ ShotDistance, data = data)
## # weights: 6 (5 variable)
## initial value 396.480187
## iter 10 value 366.310123
## final value 366.294641
## converged
summary(model_position)
## Call:
## multinom(formula = Position ~ ShotDistance, data = data)
##
## Coefficients:
## Values Std. Err.
## (Intercept) -1.3606746 0.2046286
## ShotDistance3-10 ft 0.8752678 0.3779312
## ShotDistance10-16 ft 1.0726403 0.7907184
## ShotDistance16ft-3P 7.3281384 14.0107158
## ShotDistance3P 1.3820347 0.2292799
##
## Residual Deviance: 732.5893
## AIC: 742.5893
# accuracy of predictions
data$PredictedPosition <- predict(model_position, newdata = data)
accuracy <- mean(data$Position == data$PredictedPosition)
cat("Prediction Accuracy: ", accuracy)
## Prediction Accuracy: 0.5909091
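An accuracy figure is easier to judge next to a confusion matrix and a naive baseline. The sketch below shows the calculation on made-up vectors (`actual` and `predicted` are hypothetical, not the project data); the baseline is the accuracy of always guessing the more common position.

```r
# Hypothetical true and predicted positions (made up for illustration)
actual    <- factor(c("F", "F", "F", "G", "G", "F", "G", "F", "G", "F"))
predicted <- factor(c("F", "G", "F", "G", "F", "F", "G", "G", "G", "F"),
                    levels = levels(actual))

# Confusion matrix: rows are true positions, columns are predictions
conf_mat <- table(Actual = actual, Predicted = predicted)
conf_mat

# Overall accuracy, and the accuracy of always guessing the larger group
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
baseline <- max(table(actual)) / length(actual)
c(accuracy = accuracy, baseline = baseline)
```

With the project data, `actual` and `predicted` would be replaced by `data$Position` and `data$PredictedPosition`.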
The output of this code gives results similar to the first model. Again, the reference category was 0-3 feet, and the intercept of -1.3606746 suggests that a player whose most frequent range is 0-3 feet is more likely to be a Forward than a Guard. The coefficients for all other categories are positive, so the odds of a player being a Guard are higher in every other category than in the 0-3 feet category. Again, the high standard error for 16ft-3P is expected, and it is consistent with Forwards rarely favoring that distance. The estimates for 3-10 feet and especially 3P are large relative to their standard errors, while those for 10-16 feet and 16ft-3P are not, so a player’s shot profile carries information about their position mainly through the contrast between close-range and 3-point shooting.
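`summary()` for a `multinom` fit prints estimates and standard errors but no p-values. One common approximation is a Wald test, sketched below from the printed values; the object names `est`, `se`, `z` and `p` are introduced here for illustration, and the normal approximation is an assumption that is least trustworthy for the sparse 16ft-3P category.

```r
# Estimates and standard errors copied from summary(model_position)
est <- c("(Intercept)" = -1.3606746, "3-10 ft" = 0.8752678,
         "10-16 ft" = 1.0726403, "16ft-3P" = 7.3281384, "3P" = 1.3820347)
se  <- c(0.2046286, 0.3779312, 0.7907184, 14.0107158, 0.2292799)

# Wald z statistic and two-sided p-value under a standard normal reference
z <- est / se
p <- 2 * pnorm(-abs(z))
round(cbind(estimate = est, std_error = se, z = z, p_value = p), 4)

# Probability of being a Guard when the most frequent range is 0-3 ft
plogis(est[["(Intercept)"]])
```

On these numbers, the 3-10 ft and 3P coefficients fall below the usual 0.05 threshold, while the 10-16 ft and 16ft-3P coefficients do not.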
Overall, the analysis shows that shot profile gives useful information about a player’s position. Relative to shots at the rim, Guards favor mid-range and long-distance shots more than Forwards do, whereas Forwards lean more toward shots close to the rim, which agrees with basketball strategy and the trend of modern basketball. An important finding is that both Forwards and Guards rely on 3-point shots, which we can see in the first output: the log-odds of 3P versus 0-3 feet are positive for both, 0.458 for Forwards and 0.458 + 1.383 = 1.841 for Guards.
For the second question, where we tried to predict a player’s position from the shots they take, the results were broadly similar. The accuracy of the model was just under 60% (0.591), which shows that shot profile carries some information about position. This figure is measured on the same data used to fit the model, and it should be judged against the share of the more common position in the dataset.
However, there were issues with the model. The most obvious was the large standard error for the 16 feet to 3P distance, most likely because shots from this distance are among the rarest in basketball, which leaves very few players in that category. Moreover, dunk attempts were excluded from this data and were not counted in the 0-3 feet category; since Forwards attempt many more dunks than Guards, including them could change the model and its log-odds.
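A quick way to check for this kind of sparsity is to cross-tabulate the two variables before fitting. The sketch below uses a small made-up data frame (`toy` is hypothetical and much smaller than the project data) in which no Forward has 16ft-3P as the most frequent range, a pattern that typically produces an extreme coefficient with a very large standard error.

```r
# Hypothetical data: no Forward has 16ft-3P as the most frequent range
toy <- data.frame(
  Position = factor(c("F", "F", "F", "F", "G", "G", "G", "G", "G")),
  ShotDistance = factor(
    c("0-3 ft", "3P", "0-3 ft", "3P", "3P", "16ft-3P", "3P", "0-3 ft", "3P"),
    levels = c("0-3 ft", "16ft-3P", "3P")
  )
)

# Cross-tabulate most frequent range by position
counts <- table(Zone = toy$ShotDistance, Position = toy$Position)
counts

# Cells with zero players are the ones that produce unstable estimates
zero_cells <- which(counts == 0, arr.ind = TRUE)
zero_cells
```

On the project data, the same check would be `table(data$ShotDistance, data$Position)`.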
Despite these issues, the model was a reasonable fit for the data, given the variables used. Nominal logistic regression was appropriate for the categorical nature of the variables, which is why it suited the research questions. Apart from the 16ft-3P category, the standard errors were low, and the results matched real-life basketball style and the expectations we had before conducting this study.