The data is imported from https://www.basketball-reference.com/leagues/NBA_2021_totals.html. It contains season totals for NBA players in the 2020-2021 season: each player's age, team, shot attempts, turnovers, personal fouls, steals, blocks, and rebounds.

My questions are: which factors affect how many points a basketball player scores, and can I predict the number of personal fouls a player will get?

This chunk loads the needed libraries and imports the data I scraped from the website. The glimpse function then lets me see what is inside the data frame.

library(readr)
library(tidyverse)
library(tidymodels)
library(stats)
library(cluster)
library(factoextra)
library(magrittr)
library(ISLR)
library("tidyr")
nba_stats <- read.csv("/home/chhengy/Downloads/nba_stats.csv")
glimpse(nba_stats)

I created a histogram of the points of the basketball players in the data to see the distribution of points.

ggplot(nba_stats, aes(x = points)) +
  geom_histogram(binwidth = 150)

I plotted the distribution of points scored for each position. Five positions (C, PF, PG, SF, and SG) clearly account for most of the points, while the combined positions (such as PF-C or SG-PG) account for very few. Forwards score the most points, which makes sense since their job is to go forward and score. Point guards also score a lot, which fits their role as the best ball handlers and passers. Shooting guards also score a lot since they are strong three-point shooters and can get good open shots. Centers also score a lot since they are versatile and get a lot of rebounds. Source: https://dunkorthree.com/basketball-positions-explained/

 ggplot(nba_stats, aes(x = points ,color = position, fill=position)) +
  geom_histogram(binwidth = 150) + facet_wrap(~position, nrow = 5)

I removed the playername and rank columns since they are not useful for the regression models.

nba_stats2 <- subset (nba_stats, select = -playername)
nba_stats2 <- subset (nba_stats2, select = -rank)

This code fits a linear regression that models the points a basketball player scores as a function of field goal attempts, turnovers, personal fouls, free throw attempts, and position. I chose these since I think they are the factors most correlated with how many points a player scores.

In the output, the position coefficients are measured relative to the baseline position C, which is the level that does not appear in the list. The largest position effects belong to the combined positions: PF-C and PG-SG have the largest positive coefficients, while SG-PG and PF-SF have the largest negative ones. Each additional field goal attempt is associated with about 1.12 more points, and each additional free throw attempt with about 0.85 more points.

attempts_turnovers_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(points ~ + field_goals_attempts + turnovers + personal_fouls + free_throw_attempts + position, data = nba_stats2)
attempts_turnovers_fit
## parsnip model object
## 
## Fit time:  7ms 
## 
## Call:
## stats::lm(formula = points ~ +field_goals_attempts + turnovers + 
##     personal_fouls + free_throw_attempts + position, data = data)
## 
## Coefficients:
##          (Intercept)  field_goals_attempts             turnovers  
##              5.41389               1.12494              -0.38153  
##       personal_fouls   free_throw_attempts          positionC-PF  
##              0.07996               0.84899              10.69671  
##           positionPF          positionPF-C         positionPF-SF  
##            -15.38652              37.58982             -35.86375  
##           positionPG         positionPG-SG            positionSF  
##            -21.61032              37.26856             -13.91065  
##        positionSF-PF         positionSF-SG            positionSG  
##             -6.36173              -8.38614             -22.93461  
##        positionSG-PG         positionSG-SF  
##            -41.11600             -17.11661
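
To make the coefficients concrete, here is a small worked prediction using the printed values. The stat line is hypothetical (a point guard with 500 field goal attempts, 100 turnovers, 80 personal fouls, and 150 free throw attempts), and `predict_points_lm` is an illustrative helper, not part of the analysis.

```r
# Coefficients copied from the lm output above (rounded as printed).
coefs <- c(
  intercept            = 5.41389,
  field_goals_attempts = 1.12494,
  turnovers            = -0.38153,
  personal_fouls       = 0.07996,
  free_throw_attempts  = 0.84899,
  positionPG           = -21.61032
)

# Hypothetical helper: prediction for a point guard (PG).
predict_points_lm <- function(fga, tov, pf, fta) {
  unname(
    coefs["intercept"] +
      coefs["field_goals_attempts"] * fga +
      coefs["turnovers"] * tov +
      coefs["personal_fouls"] * pf +
      coefs["free_throw_attempts"] * fta +
      coefs["positionPG"]
  )
}

pred <- predict_points_lm(fga = 500, tov = 100, pf = 80, fta = 150)
round(pred, 1)  # about 641.9 points
```

For any other position, the positionPG term would be swapped for that position's coefficient, or dropped entirely for the baseline C.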

Generating a general decision tree and regression tree for the data

library(rpart)
library(rpart.plot)
tree_spec <- decision_tree( ) %>%
  set_engine("rpart")
reg_tree_spec <- tree_spec %>%
  set_mode("regression")

Splitting the data into 25% testing and 75% training data.

set.seed(1122)
nba_stats2 <- nba_stats2 %>% drop_na()
nba_df_split <-initial_split(nba_stats2, prop = 0.75)
train_data <- training(nba_df_split)
test_data <- testing(nba_df_split)
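
As a rough illustration of what a 75/25 split does, the sketch below partitions a toy data frame with base R. It is only an analogue of initial_split(), not its implementation; `toy` and the index names are made up for the example.

```r
set.seed(1122)

# Toy data standing in for the player table (hypothetical values).
toy <- data.frame(id = 1:20, points = round(runif(20, 0, 1500)))

# Draw 75% of the row indices for training; the rest are for testing.
train_idx <- sample(nrow(toy), size = floor(0.75 * nrow(toy)))
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

nrow(toy_train)  # 15
nrow(toy_test)   # 5

# No row appears in both sets.
length(intersect(toy_train$id, toy_test$id))  # 0
```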

This creates the model specification for a linear regression that uses the lm engine.

nba_mod <- linear_reg() %>%
  set_engine("lm")

This creates a recipe that gives age the role "id", so it is kept in the data but not used as a predictor. The recipe also converts nominal variables to dummy variables and removes all zero-variance predictors.

nba_rec <- recipe(points ~ ., data = train_data) %>%
  update_role(age, new_role = "id") %>%
  step_dummy(all_nominal(), -age) %>%
  step_zv(all_predictors())
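
The dummy-variable step can be pictured with base R's model.matrix(). This is only an analogue of what step_dummy() produces (the column names it generates may differ), shown on a made-up position column.

```r
# Toy data with a nominal column (hypothetical values).
toy <- data.frame(
  points   = c(300, 900, 650, 120),
  position = factor(c("C", "PG", "SG", "PG"))
)

# One indicator column per non-baseline level; "C" is the baseline.
dummies <- model.matrix(~ position, data = toy)[, -1, drop = FALSE]
dummies
colnames(dummies)  # "positionPG" "positionSG"
```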

Create a workflow to fit the model using the recipe built earlier.

nba_wflow <- workflow() %>%
  add_model(nba_mod) %>%
  add_recipe(nba_rec)

Fit the model using the workflow made earlier and use the train data as the dataset.

nba_fit <- nba_wflow %>%
  fit(data = train_data)

tidy(nba_fit)

Create a 5 fold cross validation of the train data.

set.seed(644)
folds <- vfold_cv(train_data, v = 5)
folds
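
How 5-fold cross-validation partitions the rows can be sketched in base R. This illustrates the idea rather than how vfold_cv() is implemented; the row count of 485 matches the training data, and `fold_id` is a made-up name.

```r
set.seed(644)

n <- 485  # number of rows in the training data
v <- 5    # number of folds

# Randomly assign each row to one of v folds of nearly equal size.
fold_id <- sample(rep(seq_len(v), length.out = n))
table(fold_id)  # 97 rows in each fold

# For fold k, the model is fit on the other folds and assessed on fold k.
k <- 1
analysis_rows   <- which(fold_id != k)
assessment_rows <- which(fold_id == k)
length(analysis_rows)    # 388
length(assessment_rows)  # 97
```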

Apply the nba_wflow workflow to the folds.

set.seed(354)
nba_fit_rs <- nba_wflow %>%
  fit_resamples(folds)

This reports the cross-validated \(RMSE\) and \(R^{2}\). I am satisfied with the result: the model fits most of the data well, which suggests its predictions are trustworthy.

collect_metrics(nba_fit_rs)
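
For reference, both metrics can be computed by hand. The sketch below uses made-up truth and prediction vectors. As far as I know, yardstick's rsq is the squared correlation between truth and estimate, so that is the definition assumed here.

```r
# Hypothetical observed and predicted points for five players.
truth <- c(100, 400, 650, 900, 1500)
pred  <- c(120, 380, 700, 850, 1450)

# RMSE: square root of the mean squared prediction error.
rmse_val <- sqrt(mean((truth - pred)^2))

# R^2 as the squared correlation between observed and predicted values.
rsq_val <- cor(truth, pred)^2

rmse_val
rsq_val
```

RMSE is in the same units as the outcome (points), so it should be judged against the typical size of the outcome.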

This predicts points for the test data; age is carried along as an id column rather than used as a predictor. Since the RMSE estimate is really small, the model fits the test data really well.

nba_test_pred <- predict(nba_fit, new_data = test_data) %>%
  bind_cols(test_data %>% select(points, age))

rmse(nba_test_pred, truth = points, estimate = .pred)

Decision tree with rpart engine and a regression tree.

nba_tree_spec <-decision_tree() %>%
  set_engine("rpart")
nba_reg_tree_spec <- nba_tree_spec %>%
  set_mode("regression")

This workflow marks cost_complexity with tune() so that it can be tuned later.

nba_reg_tree_wf <- workflow() %>%
  add_model(nba_reg_tree_spec %>% set_args(cost_complexity = tune())) %>%
  add_formula(points ~ . )

Decision tree using the attempts, the rebounds, the turnovers, personal fouls and position of the player.

nba_reg_tree_fit <- fit(nba_reg_tree_spec, points ~ three_pointers_attempts + two_pointer_attempts + free_throw_attempts + offensive_rebounds + turnovers + personal_fouls + position, data = train_data)
nba_reg_tree_fit
## parsnip model object
## 
## Fit time:  18ms 
## n= 485 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 485 76408780.0  421.34850  
##    2) two_pointer_attempts< 261 358 12466790.0  234.43580  
##      4) turnovers< 29.5 208  1662392.0  114.79330  
##        8) two_pointer_attempts< 52.5 122   303013.0   60.22131 *
##        9) two_pointer_attempts>=52.5 86   480632.2  192.20930 *
##      5) turnovers>=29.5 150  3698382.0  400.34000  
##       10) three_pointers_attempts< 218 113   958979.2  331.33630 *
##       11) three_pointers_attempts>=218 37   558114.8  611.08110 *
##    3) two_pointer_attempts>=261 127 16178220.0  948.23620  
##      6) two_pointer_attempts< 530.5 80  3060626.0  741.17500  
##       12) three_pointers_attempts< 173 32   303445.5  562.62500 *
##       13) three_pointers_attempts>=173 48  1056908.0  860.20830 *
##      7) two_pointer_attempts>=530.5 47  3849434.0 1300.68100  
##       14) three_pointers_attempts< 374.5 32  1392659.0 1166.90600 *
##       15) three_pointers_attempts>=374.5 15   662438.9 1586.06700 *

Visualize the regression tree. Although seven predictors were offered, the tree splits only on two-point attempts, three-point attempts, and turnovers, so it treats these as the most important factors in deciding how many points a player will score. Each leaf predicts the average points of the training players that fall into it. Players with at least 531 two-point attempts and at least 375 three-point attempts are predicted to score about 1586 points; this group is the top 3% of players in the training data for the 20-21 season. Players with at least 531 two-point attempts average about 1301 points overall, and those among them with fewer than 375 three-point attempts are predicted to score about 1167 points.

nba_reg_tree_fit %>%
  extract_fit_engine() %>%
  rpart.plot(roundint=FALSE)
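
The printed splits can be read as nested if/else rules. The helper below, `predict_points_tree`, is a hand transcription of the tree output above for illustration only; in practice predict() on the fitted model does this.

```r
# Hand-coded version of the printed tree; leaf values are the yval column.
predict_points_tree <- function(two_pa, three_pa, turnovers) {
  if (two_pa < 261) {
    if (turnovers < 29.5) {
      if (two_pa < 52.5) 60.22131 else 192.20930
    } else {
      if (three_pa < 218) 331.33630 else 611.08110
    }
  } else if (two_pa < 530.5) {
    if (three_pa < 173) 562.62500 else 860.20830
  } else {
    if (three_pa < 374.5) 1166.90600 else 1586.06700
  }
}

predict_points_tree(two_pa = 600, three_pa = 400, turnovers = 150)  # 1586.067
predict_points_tree(two_pa = 600, three_pa = 100, turnovers = 150)  # 1166.906
predict_points_tree(two_pa = 40,  three_pa = 10,  turnovers = 5)    # 60.22131

# Share of training players in the highest leaf (15 of 485).
round(100 * 15 / 485)  # 3
```

Note that turnovers only matter for players with fewer than 261 two-point attempts; above that threshold the tree never consults them.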

# Create an 80% training and 20% testing split
set.seed(934)

nba_split <- initial_split(nba_stats2,prop = 0.8)
nba_train <- training(nba_split)
nba_test <- testing(nba_split)

This plots personal fouls against points, color coded by the player's position. The number of personal fouls a player has is positively correlated with the number of points the player scores.

ggplot(data = nba_train, aes(x = personal_fouls, y = points)) + 
  geom_point(aes(color=position)) + 
  geom_smooth(method = "lm") + 
  labs( 
    title = "Points vs. Personal Fouls", 
    subtitle = "NBA Stats 2020-2021 Season", 
    x = "Personal Fouls", 
    y = "Number of Points Scored" 
  )

Plotting the number of fouls for each position in the NBA 2020-2021 season. This histogram faceted by position looks really similar to the earlier graph of points by position. This might be because the players in these positions handle the ball a lot and are more likely to score points, so they are also involved in more plays where fouls are called.

 ggplot(nba_stats, aes(x = personal_fouls ,color = position, fill=position)) +
  geom_histogram(binwidth = 20) + facet_wrap(~position, nrow = 5)

Creating a regression tree fit that utilizes all variables in the training dataset to predict the number of personal fouls a player will get.

reg_tree_fit <- fit(reg_tree_spec, personal_fouls~. , nba_train)
reg_tree_fit
## parsnip model object
## 
## Fit time:  25ms 
## n= 517 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 517 1388306.000  70.73694  
##    2) minutes_played< 790.5 265  145608.200  31.30566  
##      4) minutes_played< 426 175   25572.430  18.32571  
##        8) defensive_rebounds< 29.5 103    4617.204  10.57282 *
##        9) defensive_rebounds>=29.5 72    5907.500  29.41667 *
##      5) minutes_played>=426 90   33222.320  56.54444 *
##    3) minutes_played>=790.5 252  397384.700 112.20240  
##      6) total_rebounds< 186 90   55067.290  82.91111 *
##      7) total_rebounds>=186 162  222200.400 128.47530  
##       14) games_started< 55.5 109  100872.400 116.33940  
##         28) blocks< 35.5 63   40376.980 104.98410 *
##         29) blocks>=35.5 46   41246.460 131.89130 *
##       15) games_started>=55.5 53   72259.020 153.43400  
##         30) offensive_rebounds< 81.5 32   37591.880 139.43750 *
##         31) offensive_rebounds>=81.5 21   18845.810 174.76190 *

Visualize the tree. The root node shows that all players in the training data (100%) average about 71 personal fouls. The 51% of players who played fewer than 791 minutes average about 31 fouls, and those among them who played fewer than 426 minutes average about 18. The 49% who played at least 791 minutes average about 112 fouls, and those among them with at least 186 total rebounds (31% of all players) average about 128 fouls.

reg_tree_fit %>%
  extract_fit_engine() %>%
  rpart.plot(roundint=FALSE)
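
The percentages on the plot are each node's share of the training rows. They can be reproduced from the n column of the printed tree; the vector names below are labels I added for readability.

```r
# Node sizes copied from the printed tree (root n = 517).
n_root <- 517
n_node <- c(
  "minutes_played < 790.5"  = 265,
  "minutes_played >= 790.5" = 252,
  "minutes_played < 426"    = 175,
  "total_rebounds >= 186"   = 162
)

# Share of all training players that reach each node, in percent.
share <- round(100 * n_node / n_root)
share
```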

However, the number of minutes played shouldn't be used as one of the variables to predict the number of fouls a player will get, since it is very obvious that the more minutes a player plays, the more fouls they will get. This new fit leaves out minutes played, age, number of games, and free throw attempts. It predicts fouls from three-point attempts, two-point attempts, offensive rebounds, turnovers, position, and blocks.

reg_tree_fit2 <- fit(reg_tree_spec, personal_fouls ~ three_pointers_attempts + two_pointer_attempts + offensive_rebounds + turnovers + position + blocks, data = train_data)
reg_tree_fit2
## parsnip model object
## 
## Fit time:  8ms 
## n= 485 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 485 1277615.000  72.47629  
##    2) turnovers< 39.5 255  215231.200  36.54510  
##      4) offensive_rebounds< 12.5 145   32895.450  20.10345  
##        8) three_pointers_attempts< 42.5 102    7121.373  12.92157 *
##        9) three_pointers_attempts>=42.5 43    8033.163  37.13953 *
##      5) offensive_rebounds>=12.5 110   91468.760  58.21818  
##       10) offensive_rebounds< 46.5 95   53823.960  51.97895  
##         20) three_pointers_attempts< 83.5 54   12113.330  41.11111 *
##         21) three_pointers_attempts>=83.5 41   26932.490  66.29268 *
##       11) offensive_rebounds>=46.5 15   10524.930  97.73333 *
##    3) turnovers>=39.5 230  368163.500 112.31300  
##      6) offensive_rebounds< 39.5 117  116160.900  93.97436  
##       12) offensive_rebounds< 24.5 53   44779.890  78.33962  
##         24) three_pointers_attempts< 218 24    9882.000  58.00000 *
##         25) three_pointers_attempts>=218 29   16752.140  95.17241 *
##       13) offensive_rebounds>=24.5 64   47696.610 106.92190 *
##      7) offensive_rebounds>=39.5 113  171913.800 131.30090  
##       14) turnovers< 83.5 64   67403.110 116.82810  
##         28) blocks< 72.5 57   43920.560 111.24560 *
##         29) blocks>=72.5 7    7241.429 162.28570 *
##       15) turnovers>=83.5 49   73595.960 150.20410  
##         30) offensive_rebounds< 78.5 29   37053.030 134.41380 *
##         31) offensive_rebounds>=78.5 20   18827.800 173.10000 *

From this fit, we see that the number of fouls a player will get is based on turnovers, offensive rebounds, blocks, and three-point attempts. Players with at least 40 turnovers make up 47% of the training data and average about 112 personal fouls; those among them who also have at least 40 offensive rebounds average about 131. At the top, the roughly 4% of players with at least 84 turnovers and at least 79 offensive rebounds average about 173 personal fouls.

reg_tree_fit2 %>%
  extract_fit_engine() %>%
  rpart.plot(roundint=FALSE)

Compute the RMSE, which is the square root of the mean squared residual, on the training data. The high estimate shows that this model is not very reliable in fitting the data, although it is the computer's best guess of the important factors affecting the number of fouls a player will get.

augment(reg_tree_fit, new_data = nba_train) %>%
  rmse(truth = personal_fouls, estimate = .pred)

The new fit is also not a very reliable predictor, and it doesn't fit the data well.

# reg_tree_fit2 was fit on train_data, so score it on the same rows
augment(reg_tree_fit2, new_data = train_data) %>%
  rmse(truth = personal_fouls, estimate = .pred)

We mark cost_complexity with tune() in a workflow so that a better value for it can be found by cross-validation.

reg_tree_wf <- workflow() %>%
  add_model(reg_tree_spec %>%
  set_args(cost_complexity = tune())) %>%
  add_formula(personal_fouls ~ . )
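
To see what cost_complexity controls, the sketch below fits rpart trees directly on the built-in mtcars data with a large and a small cp value. The data set and the helper `count_leaves` are stand-ins chosen for illustration, not part of this analysis.

```r
library(rpart)

# Count the terminal nodes of a fitted rpart tree.
count_leaves <- function(fit) sum(fit$frame$var == "<leaf>")

# A large cp prunes aggressively; a small cp allows more splits.
fit_large_cp <- rpart(mpg ~ ., data = mtcars, method = "anova", cp = 0.1)
fit_small_cp <- rpart(mpg ~ ., data = mtcars, method = "anova", cp = 0.0001)

count_leaves(fit_large_cp)
count_leaves(fit_small_cp)  # never fewer leaves than with the larger cp
```

A split is only kept if it improves the overall fit by at least the cp fraction, which is why tuning this one number trades tree size against fit.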

Create 10 folds on the training data.

set.seed(1234)
#Split data into 10 folds
nba_fold <- vfold_cv(nba_train)
nba_fold

I use a regular grid since I am tuning one hyperparameter.

param_grid <- grid_regular(cost_complexity(range = c(-6, -1)), levels = 10)
tune_res <- tune_grid(
  reg_tree_wf, 
  resamples = nba_fold, 
  grid = param_grid
)
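
cost_complexity() is tuned on a log10 scale, so range = c(-6, -1) with 10 levels should correspond to ten values whose exponents are evenly spaced between -6 and -1. A base-R sketch of the values I would expect in param_grid:

```r
# Ten exponents evenly spaced from -6 to -1, converted back to cp values.
exponents <- seq(-6, -1, length.out = 10)
cp_values <- 10^exponents

signif(cp_values, 3)
range(cp_values)  # 1e-06 to 1e-01
```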

The plot shows that the higher cost-complexity values in the grid give lower errors, which is preferred.

autoplot(tune_res)

We select the best performing value using rmse and update the cost complexity.

best_complexity <- select_best(tune_res, metric = "rmse")
reg_tree_final <- finalize_workflow(reg_tree_wf, best_complexity)
reg_tree_final_fit <- fit(reg_tree_final, data = nba_train)
reg_tree_final_fit
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Formula
## Model: decision_tree()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## personal_fouls ~ .
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## n= 517 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 517 1388306.000  70.73694  
##    2) minutes_played< 790.5 265  145608.200  31.30566  
##      4) minutes_played< 426 175   25572.430  18.32571  
##        8) defensive_rebounds< 29.5 103    4617.204  10.57282 *
##        9) defensive_rebounds>=29.5 72    5907.500  29.41667 *
##      5) minutes_played>=426 90   33222.320  56.54444 *
##    3) minutes_played>=790.5 252  397384.700 112.20240  
##      6) total_rebounds< 186 90   55067.290  82.91111 *
##      7) total_rebounds>=186 162  222200.400 128.47530  
##       14) games_started< 55.5 109  100872.400 116.33940  
##         28) blocks< 35.5 63   40376.980 104.98410 *
##         29) blocks>=35.5 46   41246.460 131.89130 *
##       15) games_started>=55.5 53   72259.020 153.43400  
##         30) offensive_rebounds< 81.5 32   37591.880 139.43750 *
##         31) offensive_rebounds>=81.5 21   18845.810 174.76190 *

Visualize the model. The tuned tree is the same as reg_tree_fit at the beginning: the splits and node values in the two printouts match, so tuning the cost complexity did not change the complexity of the tree. From here we can make conclusions based on the values shown on the visualization.

reg_tree_final_fit %>%
  extract_fit_engine() %>%
  rpart.plot(roundint=FALSE)

The RMSE of the final regression tree is lower than that of reg_tree_fit2, which shows that the final tree fits the training data better. Because the final tree has the same splits as reg_tree_fit, its training RMSE equals that of reg_tree_fit.

augment(reg_tree_final_fit, new_data = nba_train) %>%
  rmse(truth = personal_fouls, estimate = .pred)
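
The training RMSE of each tree can also be recovered from the printed output, because for a regression tree the deviance of a leaf is the sum of squared residuals inside it. The sketch below does the arithmetic with the leaf deviances copied from the two printouts; reg_tree_fit2 was trained on 485 rows, so its figure refers to those rows.

```r
# Leaf deviances copied from the printed trees (terminal nodes marked *).
dev_final <- c(4617.204, 5907.500, 33222.320, 55067.290,
               40376.980, 41246.460, 37591.880, 18845.810)
dev_fit2  <- c(7121.373, 8033.163, 12113.330, 26932.490, 10524.930,
               9882.000, 16752.140, 47696.610, 43920.560, 7241.429,
               37053.030, 18827.800)

# Training RMSE = sqrt(total squared residual / number of rows).
rmse_final <- sqrt(sum(dev_final) / 517)
rmse_fit2  <- sqrt(sum(dev_fit2) / 485)

round(rmse_final, 1)  # 21.4
round(rmse_fit2, 1)   # 22.5
```

Both figures are training errors, so they are optimistic compared with what the held-out test set would show.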

Overall, the computer's choice of turnovers and two- and three-point attempts as the main predictors of how many points a player will score is very reasonable. Looking at basketball, players who have the ball a lot both turn it over more and get more chances to score. Also, those who make more attempts at two- and three-pointers are more likely to have better ball handling or a lot of game awareness, knowing where to be and how to successfully make an attempt at the basket. The regression tree is a good predictor of how many points a player will score.

The techniques I used to predict points include a linear regression, which shows the effect of each variable such as the position and the shot attempts of each player. This was just a general prediction; among the positions, the combined positions PF-C and PG-SG had the largest positive coefficients. The next technique was a regression tree, which is fit on the training data and picks, out of all the variables it is given, the ones that matter most in deciding the number of points a player will score. This results in:

- Players with at least 531 two-point attempts and at least 375 three-point attempts are predicted to score about 1586 points. This is the top 3% of players in the 20-21 season.
- Players with at least 531 two-point attempts average about 1301 points overall. Those among them with fewer than 375 three-point attempts are predicted to score about 1167 points, and so on based on the numbers on the visualized tree.

The final fit for the number of fouls a player will get is not that good, since it includes minutes played, and it is very obvious that the more a player plays, the more fouls the player will get. reg_tree_fit2 is a more reasonable regression tree: it splits first on turnovers and then on offensive rebounds, three-point attempts, and blocks. The role of offensive rebounds is reasonable, since it is easy to foul on a rebound when no team is in possession of the ball and players are just jumping to get the loose ball. Three-point attempts are harder to explain, because personal fouls count the fouls a player commits rather than the fouls drawn while shooting; a likely reason is that players who take many three-pointers also spend more time on the court. None of the fits represents the data really well.