At the end of every NBA season, one player is announced as the most improved player (MIP). To win this award, the player must show the most progress during the regular season compared to their play in previous seasons.
While the award is based on the prior metrics compared to the current season metrics, we wanted to dive deeper to help truly establish what makes a player the MIP. To do so, we want to explore the impact specific metrics play in the decision of the MIP in the NBA. To do so we decided to look at the last 10 MIP Award Winners, and how the season they won the Most Improved Player Award compares to the season before.
The data we obtained is from the hoopR package, version 2.1.0. It was published on 2023-11-25.
Below is the initial task of retrieving the data. The data we collected is MIP Award winner of the last 10 NBA seasons, and their game statistics from the season they won the award, and the season before they won the award. Thus, the award is determined by the level of play the player competes at, and how that compares to their previous seasons play.
To create a functional data set, we used the load_nba_player_box() function to get player data from each game of a specified time range. Since we wanted the last ten MIPs, we used the regular season data from the 2013 season to the 2023 season. By creating our own vectors with the MIP winners, and the corresponding season that they won ordered by their index. By doing this, we were able to create a functional data frame that includes the average of many important player statistics from both seasons, which we believe will help lead us to analyzing how a player won the award.
We don’t suspect any sample bias because the hoopR package contains raw play-by-play data sourced by ESPN and NBA Stats API. Therefore, the data we are using is pure and not influenced/skewed by any means. As for the questions or measurements, there may be a slight bias because we only analyze the MIPs from the last 10 years. This limits how accurate of a conclusion we can come to in terms of what characteristics a play needs to have to become the MIP.
# Creating a vector that contains the last 10 MIP's, from 2023 - 2013 (in order)
last_ten_mip <- c("Lauri Markkanen", "Ja Morant", "Julius Randle", "Brandon Ingram", "Pascal Siakam", "Victor Oladipo", "Giannis Antetokounmpo", "CJ McCollum", "Jimmy Butler", "Goran Dragic")
# Creating a vector that contains the last 10 seasons, in the same order as when the last_ten_mip players won
# 2023 represents the 2022-2023 season, 2022 represents the 2021-2022 season, etc.
last_ten_seasons <- c(2023, 2022, 2021, 2020, 2019, 2018, 2017, 2016, 2015, 2014, 2013)
# Using a function from the hoopR package to load in NBA player data
mip_season_data <- load_nba_player_box(
seasons = c(2014:2023), # The seasons are based on the ending season (so 2023 is the 2022 - 2023 season)
dbConnection = NULL,
tablename = NULL
) |>
# Filtering only for regular season games
#season_type = 3 is what determines the games to be playoffs, we removed that to gather the regular season data
filter(season_type != 3) |>
# Selecting the columns of importance
select(season,
athlete_display_name,
minutes,
field_goals_made,
field_goals_attempted,
three_point_field_goals_made,
three_point_field_goals_attempted,
free_throws_made,
free_throws_attempted,
rebounds,
assists,
steals,
blocks,
turnovers,
points,
did_not_play,
plus_minus,
athlete_position_abbreviation,
athlete_position_name,
athlete_headshot_href
) |>
# Filtering the data so that we only get the players that won MIP, and the specific statistics from when they won
filter((athlete_display_name == last_ten_mip[1] & season == last_ten_seasons[1]) |
(athlete_display_name == last_ten_mip[2] & season == last_ten_seasons[2]) |
(athlete_display_name == last_ten_mip[3] & season == last_ten_seasons[3]) |
(athlete_display_name == last_ten_mip[4] & season == last_ten_seasons[4]) |
(athlete_display_name == last_ten_mip[5] & season == last_ten_seasons[5]) |
(athlete_display_name == last_ten_mip[6] & season == last_ten_seasons[6]) |
(athlete_display_name == last_ten_mip[7] & season == last_ten_seasons[7]) |
(athlete_display_name == last_ten_mip[8] & season == last_ten_seasons[8]) |
(athlete_display_name == last_ten_mip[9] & season == last_ten_seasons[9]) |
(athlete_display_name == last_ten_mip[10] & season == last_ten_seasons[10])) |>
# Removing any games where they did not play
filter(did_not_play == "FALSE") |>
# Grouping the data by their name, season, and adding position and headshot so they stay with the data
group_by(athlete_display_name,
season,
athlete_position_abbreviation,
athlete_headshot_href) |>
# Calculating the average of each stat to get a comprehensive value from the season
summarize(
avg_mins = mean(minutes),
avg_fg_made = mean(field_goals_made),
avg_fg_attempted = mean(field_goals_attempted),
avg_three_made = mean(three_point_field_goals_made),
avg_three_attempted = mean(three_point_field_goals_attempted),
avg_free_throw_made = mean(free_throws_made),
avg_free_throw_attempted = mean(free_throws_attempted),
avg_points = mean(points),
avg_rebounds = mean(rebounds),
avg_assists = mean(assists),
avg_steals = mean(steals),
avg_blocks = mean(blocks),
avg_turnovers = mean(turnovers))
# Using a function from the hoopR package to load in NBA data
before_mip_season <- load_nba_player_box(
seasons = c(2013:2022), # The seasons are based on the ending season (so 2023 is the 2022 - 2023 season)
dbConnection = NULL,
tablename = NULL
) |>
# Filtering out so we only have regular season data
#season_type = 3 is what determines the games to be playoffs, we removed that to gather the regular season data
filter(season_type != 3) |>
# Selecting only the columns of importance
select(season,
athlete_display_name,
minutes,
field_goals_made,
field_goals_attempted,
three_point_field_goals_made,
three_point_field_goals_attempted,
free_throws_made,
free_throws_attempted,
rebounds,
assists,
steals,
blocks,
turnovers,
points,
did_not_play,
#plus_minus
athlete_position_abbreviation,
#athlete_position_name,
athlete_headshot_href
) |>
# Filtering so that we get the player that won MIP, but the season before they won MIP, so we can compare both seasons
filter((athlete_display_name == last_ten_mip[1] & season == last_ten_seasons[2]) |
(athlete_display_name == last_ten_mip[2] & season == last_ten_seasons[3]) |
(athlete_display_name == last_ten_mip[3] & season == last_ten_seasons[4]) |
(athlete_display_name == last_ten_mip[4] & season == last_ten_seasons[5]) |
(athlete_display_name == last_ten_mip[5] & season == last_ten_seasons[6]) |
(athlete_display_name == last_ten_mip[6] & season == last_ten_seasons[7]) |
(athlete_display_name == last_ten_mip[7] & season == last_ten_seasons[8]) |
(athlete_display_name == last_ten_mip[8] & season == last_ten_seasons[9]) |
(athlete_display_name == last_ten_mip[9] & season == last_ten_seasons[10]) |
(athlete_display_name == last_ten_mip[10] & season == last_ten_seasons[11])) |>
# Removing any games where they didn't play
filter(did_not_play == "FALSE") |>
# Grouping the data by their name, season, and keeping the position and headshot link
group_by(athlete_display_name,
season,
athlete_position_abbreviation,
athlete_headshot_href) |>
# Calculating the mean for each measure, to get a comprehensive value of each stat from the season
summarize(
avg_mins = mean(minutes),
avg_fg_made = mean(field_goals_made),
avg_fg_attempted = mean(field_goals_attempted),
avg_three_made = mean(three_point_field_goals_made),
avg_three_attempted = mean(three_point_field_goals_attempted),
avg_free_throw_made = mean(free_throws_made),
avg_free_throw_attempted = mean(free_throws_attempted),
avg_points = mean(points),
avg_rebounds = mean(rebounds),
avg_assists = mean(assists),
avg_steals = mean(steals),
avg_blocks = mean(blocks),
avg_turnovers = mean(turnovers))
# Unioning the data together, since they have all of the same columns -> This allows the data to stack on each other
mip_final <- rbind(before_mip_season, mip_season_data) |>
# Changing the fg made and the three pointer made into percentages
mutate(fg_percent = (avg_fg_made/avg_fg_attempted)*100,
three_percent = (avg_three_made/avg_three_attempted)*100,
ft_percent = (avg_free_throw_made/avg_free_throw_attempted)*100) |>
# Adding a column that can differentiate if it is the season the player won MIP, or the season before
mutate(mip_season = ifelse((athlete_display_name == last_ten_mip[1] & season == last_ten_seasons[1]) |
(athlete_display_name == last_ten_mip[2] & season == last_ten_seasons[2]) |
(athlete_display_name == last_ten_mip[3] & season == last_ten_seasons[3]) |
(athlete_display_name == last_ten_mip[4] & season == last_ten_seasons[4]) |
(athlete_display_name == last_ten_mip[5] & season == last_ten_seasons[5]) |
(athlete_display_name == last_ten_mip[6] & season == last_ten_seasons[6]) |
(athlete_display_name == last_ten_mip[7] & season == last_ten_seasons[7]) |
(athlete_display_name == last_ten_mip[8] & season == last_ten_seasons[8]) |
(athlete_display_name == last_ten_mip[9] & season == last_ten_seasons[9]) |
(athlete_display_name == last_ten_mip[10] & season == last_ten_seasons[10]),
"MIP",
"Before")) |>
# Removing the columns that are no longer needed in the data frame
select(-c(avg_fg_attempted, avg_fg_made, avg_three_made, avg_three_attempted, avg_free_throw_attempted, avg_free_throw_made))
# Displaying a tibble of the new dataset
tibble(mip_final)
## # A tibble: 20 × 15
## athlete_display_name season athlete_position_abbrevi…¹ athlete_headshot_href
## <chr> <int> <chr> <chr>
## 1 Brandon Ingram 2019 SF https://a.espncdn.co…
## 2 CJ McCollum 2015 SG https://a.espncdn.co…
## 3 Giannis Antetokounmpo 2016 PF https://a.espncdn.co…
## 4 Goran Dragic 2013 PG https://a.espncdn.co…
## 5 Ja Morant 2021 PG https://a.espncdn.co…
## 6 Jimmy Butler 2014 SF https://a.espncdn.co…
## 7 Julius Randle 2020 PF https://a.espncdn.co…
## 8 Lauri Markkanen 2022 PF https://a.espncdn.co…
## 9 Pascal Siakam 2018 PF https://a.espncdn.co…
## 10 Victor Oladipo 2017 SG https://a.espncdn.co…
## 11 Brandon Ingram 2020 SF https://a.espncdn.co…
## 12 CJ McCollum 2016 SG https://a.espncdn.co…
## 13 Giannis Antetokounmpo 2017 PF https://a.espncdn.co…
## 14 Goran Dragic 2014 PG https://a.espncdn.co…
## 15 Ja Morant 2022 PG https://a.espncdn.co…
## 16 Jimmy Butler 2015 SF https://a.espncdn.co…
## 17 Julius Randle 2021 PF https://a.espncdn.co…
## 18 Lauri Markkanen 2023 PF https://a.espncdn.co…
## 19 Pascal Siakam 2019 PF https://a.espncdn.co…
## 20 Victor Oladipo 2018 SG https://a.espncdn.co…
## # ℹ abbreviated name: ¹athlete_position_abbreviation
## # ℹ 11 more variables: avg_mins <dbl>, avg_points <dbl>, avg_rebounds <dbl>,
## # avg_assists <dbl>, avg_steals <dbl>, avg_blocks <dbl>, avg_turnovers <dbl>,
## # fg_percent <dbl>, three_percent <dbl>, ft_percent <dbl>, mip_season <chr>
To create this data set there was a lot of data wrangling and formatting that took place. This was a crucial step to take before the analysis process: it is vital that all the data we are using is looking at the right time periods, parts of the season (regular season), the player, and how the data is measured, which in our case we wanted to summarize the game data to get overall measures of season long data.
To get a better understanding of the basic make-up of the MIP winners, we decided that it would be best to display the season and what position each player plays. Thus, giving us a better understanding of what role a player may be expected to play etc. Before creating the graph, it is important to understand what role each position plays. In basketball there are five positions that a player can be:
Point Guard (PG): Traditionally, point guards are expected to run the team’s offense by controlling the ball and making sure that it gets to the right player at the right time.
Shooting Guard (SG): Traditionally, guard’s main objective is to score points for their team. Shooting guards are often the most efficient shooter from three-point range, and are expected defend the perimeter on defense
Small Forward (SF): Traditionally, small forwards are responsible for scoring points and defending. The styles of small forwards varies greatly, since they are normally larger than point guards and shoot guards, but smaller than power forwards and centers.
Power Forward (PF): Traditionally, the power forward position has a variety of responsibilities, including rebounding, screen setting, rim protecting, and scoring. Power forwards play a similar role to centers, but are smaller.
Center (C): Traditionally, centers are valued for their ability to protect their own goal from high-percentage close attempts on defense, while scoring and rebounding with high efficiency on offense. Centers are likely the tallest player on the team, since the center often needs a large presence in the paint (the rectangular lane underneath the hoop on a basketball court).
With a better understanding of the roles each position has, we can now create the visualization:
# Filtering the data by only the seasons the player won MIP
only_mip <- mip_final |>
filter(mip_season == "MIP")
# Saving the graph
mip_season_graph <- ggplot(data = only_mip,
mapping = aes(x = season,
y = athlete_position_abbreviation,
image = athlete_headshot_href)) +
# Adding the headshots of each player
geom_image(size = .3) +
# Changing the theme
theme_bw() +
# Adding labels for the axis and a title
labs(x = "MIP Season",
y = "Position",
title = "The Last 10 MIP's") +
# Centering the title
theme(plot.title = element_text(hjust = 0.5)) +
# Changing the breaks and the labels for the x axis
scale_x_continuous(breaks = (2014:2023),
labels = c("2013-2014", "2014-2015", "2015-2016", "2016-2017", "2017-2018", "2018-2019", "2019-2020", "2020-2021", "2021-2022", "2022-2023"),
minor_breaks = NULL) +
scale_y_discrete(limits = c("PF","SF", "SG", "PG")) +
# Making the title bold
theme(plot.title = element_text(face = "bold"))
# Displaying the graph
mip_season_graph
From the graph, we can see that in the last 10 years the most common
position to win the Most Improved Player Award is Power Forwards (PF).
With 4 Power Forwards winning the award in the last 10 years, they are
double the next closest position for number of times won. Point Guards
(PG), Shooting Guards (SG), and Small Forwards (SF) have each won 2
times in the last 10 years.
An important thing to note is that in the last 10 years no Center has won the MIP Award. This is a key takeaway, especially since the Center plays a very different roll than the other positions; where they are expected to protect the rim and get rebound more than any other position.
Now that we have a better understanding of the MIP winner data, we can now dive into looking at the player stats from the players MIP season and the season before. To do this, we will make box plots for each important player statistic, and how the statistic differs between the 2 seasons.
# Creating long data to use for facet_wrap()
# Removing columns that are not needed for this analysis
long_mip_data <- mip_final |>
# Pivoting the columns to make long data
pivot_longer(cols = c(avg_mins:ft_percent),
names_to = "Measure",
values_to = "Value")
# Saving the graphs as facet_point
facet_point <- ggplot(data = long_mip_data,
mapping = aes(x = Value,
y = mip_season,
fill = mip_season)) +
# Removing the legend
geom_boxplot(show.legend = FALSE) +
scale_fill_manual(values=c("turquoise","violetred1")) +
# Creating a facet wrap to look individually by the measure
facet_wrap(. ~ Measure,
ncol = 2, # Making it so there are 2 columns of graphs
scales="free_x") + # Making independent x-axis for each feature
# Changing the theme
theme_bw() +
labs(x = NULL,
y = NULL,
title = "MIP Statistics") +
# Centering the title
theme(plot.title = element_text(hjust = 0.5)) +
# Making the title bold
theme(plot.title = element_text(face = "bold"))
# Displaying the graphs
facet_point
By looking at the previous box plots, we can see that some player variables see a more significant change from the season before to the season the player wins the Most Improved Player Award. The three metrics that stand out the most from all of the boxplots are: avg_points, ft_percent, and three_percent.
After looking at the various box plots, it is clear that some player metrics from the 2 seasons differ more than others. To help us establish what the most important variable is for classifying the Most Improved Player, we decided to create a Classification Tree to help up identify what variables are most important for winning the MIP award.
First, we will start by creating the full decision tree, and displaying the CP table to help us establish where to prune.
# Building the full tree
mip_tree_full <-
rpart(
# For the formula, including each variable that could be relevant towards winning the MIP award
formula = mip_season ~ avg_mins + avg_points + avg_rebounds + avg_assists + avg_steals + avg_blocks + avg_turnovers + fg_percent + three_percent + ft_percent,
# Using the mip_final data we created earlier
data = mip_final,
method = "class",
parms = list(split = "information"),
minsplit = 2,
minbucket = 1,
cp = -1
)
# Displaying the cptable of the full tree as a data frame
mip_tree_full$cptable |>
data.frame()
## CP nsplit rel.error xerror xstd
## 1 0.9 0 1.0 1.8 0.1341641
## 2 0.1 1 0.1 0.5 0.1936492
## 3 -1.0 2 0.0 0.5 0.1936492
Next, we will find the appropriate x-error cutoff and the CP prune value.
mip_tree_full$cptable |>
data.frame() |> # dplyr verbs can only be used on data frames
slice_min(xerror, n = 1) |> # Finding the row with the smallest xerror using slice_min()
slice_head(n = 1) |> # For some reason the same value appeared twice, so I used slice_head to only select one of them
mutate(xcutoff = xerror + xstd) |>
pull(xcutoff) -> xcutoff # Gettting xcutoff as just a single numeric object
mip_tree_full$cptable |>
data.frame()|>
# Keeping rows that have xerror < xcutoff
filter(xerror < xcutoff) |>
slice(1) |> # Using slice to pick the first row
pull(CP) -> cp_cutoff # Storing cp cutoff
# Labeling the xerror and cp value for the output
c("xerror cutoff" = xcutoff,
"cp prune value" = cp_cutoff)
## xerror cutoff cp prune value
## 0.6936492 0.1000000
Finally, we can prune the full tree and identify the each variables importance for classifying the Most Improved Player.
# Pruning the tree
prune(tree = mip_tree_full,
cp = cp_cutoff) -> mip_pruned
# Plotting the pruned tree
rpart.plot(
x = mip_pruned,
type = 5,
extra = 101,
box.palette = "Purples"
)
# Looking at the importance of each variable
varImp(object = mip_pruned) |>
arrange(desc(Overall))
## Overall
## avg_points 10.511947
## ft_percent 8.456209
## three_percent 5.507922
## fg_percent 4.315231
## avg_assists 4.100761
## avg_mins 0.000000
## avg_rebounds 0.000000
## avg_steals 0.000000
## avg_blocks 0.000000
## avg_turnovers 0.000000
From looking at the pruned classification tree, and the calculated variable importance, it is very clear that average points is the most important factor when identifying who will win the Most Improved Player Award. In addition, the next 3 highest ranked variables by variable importance (free throw percentage, three point percentage, and field goal percentage) are all directly correlated with a players average points. These 3 variables are all a measure of shooting, just each from a different form:
Free Throw %: The percent of shots that are made from free throws.
Three Point %: The percent of shots that are made from beyond the 3-point line.
Field Goal %: The percent of all shots from the field (This includes three pointers, but excludes free throws).
So, the next variables that help classify an MIP is related to points, which truly shows how important points are when identifying what player will win the MIP award.
But, when looking at the pruned classification tree, we can see that the tree failed to predict one prediction. It is shown that 1 player scored less than 20 points in their MIP season. Since this is an outlier compared to the other values, we can dive further into the data to identify what player this was, and what could’ve made them MIP.
By creating a dumbell plot, we can identify what player averaged less than 20 points in their MIP season, and draw further insights about the change in average points from ever MIP winner in the last 10 seasons.
# Saving the graph
mip_dumbell <- ggplot(data = mip_final,
mapping = aes(
x = avg_points,
y = athlete_display_name,
color = mip_season # Making the color be assigned to if the player was the MIP or the season before
)) +
# Connecting the points
geom_line(color = "black") +
# Adding points for the average points
geom_point(size = 2) +
scale_color_manual(values=c("turquoise","violetred1")) +
# Changing the labels of the graph
labs(x = "Average Points Per Game",
y = NULL,
title = "Points Per Game By Season",
color = "Season"
) +
theme_bw() +
# Centering the title
theme(plot.title = element_text(hjust = 0.5)) +
# Making the title bold
theme(plot.title = element_text(face = "bold")) +
# Used to sort the players
scale_y_discrete(limits = c("Julius Randle", "Ja Morant", "Brandon Ingram", "Giannis Antetokounmpo", "Victor Oladipo", "Lauri Markkanen", "Goran Dragic", "Jimmy Butler", "Pascal Siakam", "CJ McCollum"))
# Displaying the graph
mip_dumbell
From the dumbbell plot, we can see that each player has a significant
increase in points from the previous season to their MIP season. Using
the graph we are also able to identify who the player who was not
classified was: Pascal Siakam. Although Pascal Siakam was not correctly
identified, we still see that he has a significant increase in average
points over the 2 seasons.
Although we’ve established that Pascal Siakam still has a significant increase in average point per game, we still want to see if there are any other factors towards why he wasn’t classified as an MIP:
player_list <- unique(mip_final$athlete_display_name) #put player names into a list
for (player in player_list) { #iterate through list, create a dataset for each player of their before and mip seaons - named accurately
player_data <- subset(long_mip_data, athlete_display_name == player)
assign(paste0("player_", sub(" ", "_", player)), player_data)
}
player_comparison_bar <- function(player_data){ # a function that creates a bar plot of each statistic when given player data
# Creating a bar graph
ggplot(data = player_data,
aes(x = Measure,
y = Value,
fill = mip_season)) +
geom_bar(position = "dodge2", stat = "identity") +
scale_fill_manual(values=c("turquoise","violetred1")) +
# Grouping my the measure for the graphs
facet_wrap(~Measure,
nrow = 2,
scales = "free") +
# Adding labels to the grpah
labs(
title=paste(unique(player_data$athlete_display_name),",",unique(player_data$athlete_position_abbreviation)),
x=NULL,
y=NULL,
fill = "Season:") +
# Changing the theme
theme_minimal() +
theme(axis.text.x = element_blank()
) +
# Editing the formatting of the legend, and title
theme(plot.title = element_text(hjust = 0.5)) +
theme(legend.position = "bottom") +
theme(plot.title=element_text(face="bold"))
}
player_data_list <- lapply(player_list, function(player) { # applying a function to player list to create a list of player data when the name matches
subset(long_mip_data, athlete_display_name == player)
})
for (i in seq_along(player_data_list)) { # looping through player data list and using previous function to create a bar plot for each player
player_plot <- player_comparison_bar(player_data_list[[i]])
assign(paste0("player_plot_", i), player_plot)
}
player_plot_9
Looking at all of Pascal Siakam’s stats from the before MIP season and his MIP season, there is still a significant increase in his points. While his average points is below 20, his points from 7.3 to 16.9, which is almost a 10 point jump. So, it is clear that the average points still plays an important role in winning the MIP. In addition to an increase in average points, Pascal Siakam saw a large increase in three point percentage, which would likely be correlated with the increase in average points. Other than shooting, we see that Pascal Siakam has a clear increase in rebounds, turnovers, and assists.
Below is a linear model of the correlations between each statistic and the statistic with the highest variable importance in determining MIP: Average points.
Before creating the linear model, a correlation matrix was created, and showed that multiple statistics had issues with multicolinearity. The multicolinearity was a result of having multiple point related metrics, specifically measures related to field goals, threes, and free throws. Each of these metrics have direct relationships with points as they are all point scoring opportunities.
By removing average field goals made and attempted, average threes attempted, average free throw attempted, average rebounds, and average assist, the variance inflation factor of each of the remaining variables was reduced to below 10. Next, looking at a plot of the residuals it was found that the average of the residuals was close enough to 0 and the model was appropriate to use for analysis.
# Previous steps taken that have been commented out:
# selecting the truly numeric rows
#mip_final |>
# select(avg_points, fg_percent, three_percent, ft_percent, avg_mins, avg_rebounds, avg_assists, avg_steals, avg_blocks, avg_turnovers) |>
# # formatting the data
# pivot_longer( cols = c(fg_percent, three_percent, ft_percent, avg_mins, avg_rebounds, avg_assists, avg_steals, avg_blocks, avg_turnovers),
# names_to = "variable",
# values_to = "percentage") |>
# #creating a plot to investigate the relationships between each explanatory variable and avg_ponts (response)
# ggplot(mapping = aes(x=percentage,
# y = avg_points)) +
# geom_point() +
# geom_smooth(method = "loess",
# formula = y~x,
# se=F) +
# facet_wrap(facets = ~variable,
# scales = "free_x")
#Commented Out because not wanted in knitted document
# looking for multicolinearity using a correlation matrix
#mip_final |>
#
# selecting the truly numeric rows
# select(avg_points, fg_percent, three_percent, ft_percent, avg_mins, avg_rebounds, avg_assists, avg_steals, avg_blocks, avg_turnovers) |>
#creating the correlation matrix
# ggcorr(low ="red",
# mid = "white",
# high = "blue",
# label = T,
# label_round = 2)
# creating a dataset to use in the regression
player_stats_reg <- mip_final |>
select(avg_points,
fg_percent,
three_percent,
ft_percent,
avg_mins,
avg_rebounds,
avg_assists,
avg_steals,
avg_blocks,
avg_turnovers)
#creating the regrression - subtracted rows above 10 from VIF
player_stats_lm <- lm(formula = avg_points ~ .
- fg_percent
- three_percent
- ft_percent,
data = player_stats_reg)
#Commented Out because not wanted in knitted document
#looking at regression statistics
#summary(player_stats_lm)
#Commented Out because not wanted in knitted document
#more regression statistics
stats_estimates <- tidy(player_stats_lm)
#making the data more legible
stats_estimates |>
mutate(across(.cols = where(is.numeric),
.fns = round,
digits = 3))
## # A tibble: 20 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -9.14e+3 2364. -3.87 0.031
## 2 athlete_display_nameCJ McCollum 1.92e+1 4.51 4.25 0.024
## 3 athlete_display_nameGiannis Antetokounm… 7.33e+0 7.34 0.999 0.391
## 4 athlete_display_nameGoran Dragic 2.87e+1 7.24 3.97 0.029
## 5 athlete_display_nameJa Morant -3.93e+0 3.91 -1.00 0.389
## 6 athlete_display_nameJimmy Butler 1.83e+1 12.5 1.47 0.239
## 7 athlete_display_nameJulius Randle -9.01e+0 5.70 -1.58 0.212
## 8 athlete_display_nameLauri Markkanen -1.55e+1 6.44 -2.40 0.096
## 9 athlete_display_namePascal Siakam -1.36e+0 4.82 -0.281 0.797
## 10 athlete_display_nameVictor Oladipo 9.35e+0 6.18 1.51 0.227
## 11 season 4.53e+0 1.17 3.86 0.031
## 12 athlete_position_abbreviationPG NA NA NA NA
## 13 athlete_position_abbreviationSF NA NA NA NA
## 14 athlete_position_abbreviationSG NA NA NA NA
## 15 avg_mins 2.85e-1 0.22 1.30 0.286
## 16 avg_rebounds 1.30e+0 1.32 0.979 0.4
## 17 avg_assists -4.29e-1 0.637 -0.674 0.549
## 18 avg_steals -1.08e-1 4.37 -0.025 0.982
## 19 avg_blocks 2.02e+0 6.25 0.323 0.768
## 20 avg_turnovers 9.29e-1 3.61 0.257 0.814
#Commented Out because not wanted in knitted document
# using variance inflation factor to determine which variables are causing problems
#VIF(player_stats_lm)
#Commented Out because not wanted in knitted document
#Calculating the fit statistics
#Creating a function to calculate mae and mae reducation
#MAE <-function(actual, predicted){
# mae <- abs(actual-predicted) |> mean()
# mae_red <- 1 - mae/mean(abs(actual - mean(actual)))
# return(list(mae = mae,
# reduction = mae_red))
#}
#Commented Out because not wanted in knitted document
#adding model data to original data set
#player_stats2 <- augment_columns(x = player_stats_lm,
# data = player_stats_reg)
#Commented Out because not wanted in knitted document
#finding fit statistics
#player_stats2 |>
# summarize(
# R2 = cor(avg_points, .fitted)^2,
# MAE = MAE(actual = player_stats_reg$avg_points,
# predicted = player_stats_lm$fit)$mae,
# MAER = MAE(actual = player_stats_reg$avg_points,
# predicted = player_stats_lm$fit)$reduction
# )
# creating the plot for the linear model
resid_plot <- augment_columns(
x = player_stats_lm,
data = player_stats_reg
) |>
ggplot(mapping = aes(x=.fitted,
y=.resid)) +
geom_point() +
labs(x="Predicted Average Points",
y="Residual") +
geom_hline(yintercept = 0,
color = "red",
linetype="dashed",
linewidth=1) +
geom_smooth(method = "loess",
formula = y~x,
se=F) +
theme_bw() +
# Centering the title
theme(plot.title = element_text(hjust = 0.5)) +
# Making the title bold
theme(plot.title = element_text(face = "bold")) +
labs(title = "Residual Plot of Last 10 MIPs")
resid_plot
#diagnose any variable causing nonlinearity in the model
#gg_diagnose(fitted.lm = player_stats_lm,
# plot.all = F)
The above residual plot shows no clear pattern, and therefore the model is not comprehensive or reliable for making predictions about a player’s average points. The lack of pattern is due to the lack of data points the model is based on.
#Creating a new data set to perform a linear regression
player_stats2 <- load_nba_player_box(
seasons = c(2022),
dbConnection = F,
tableNames = F
) |>
# Selecting the columns of importance
select(season,
athlete_display_name,
minutes,
field_goals_made,
field_goals_attempted,
three_point_field_goals_made,
three_point_field_goals_attempted,
free_throws_made,
free_throws_attempted,
rebounds,
assists,
steals,
blocks,
turnovers,
points,
did_not_play,
plus_minus,
athlete_position_abbreviation,
athlete_position_name,
athlete_headshot_href
) |>
# Removing any games where they did not play
filter(did_not_play == "FALSE") |>
# Grouping the data by their name, season, and adding position and headshot so they stay with the data
group_by(athlete_display_name,
season,
athlete_position_abbreviation,
athlete_headshot_href) |>
# Calculating the average of each stat to get a comprehensive value from the season
summarize(
avg_mins = mean(minutes),
avg_fg_made = mean(field_goals_made),
avg_fg_attempted = mean(field_goals_attempted),
avg_three_made = mean(three_point_field_goals_made),
avg_three_attempted = mean(three_point_field_goals_attempted),
avg_free_throw_made = mean(free_throws_made),
avg_free_throw_attempted = mean(free_throws_attempted),
avg_points = mean(points),
avg_rebounds = mean(rebounds),
avg_assists = mean(assists),
avg_steals = mean(steals),
avg_blocks = mean(blocks),
avg_turnovers = mean(turnovers)) |>
filter(avg_steals < 2.2)
#Commented out as to not display in knintted doc
# selecting the truly numeric rows
#player_stats2 |>
# select(avg_points, avg_fg_made, avg_fg_attempted, avg_three_made, avg_three_attempted, avg_free_throw_made, avg_free_throw_attempted, avg_mins, avg_rebounds, avg_assists, avg_steals, avg_blocks, avg_blocks, avg_turnovers) |>
# formatting the data
# pivot_longer( cols = c(avg_fg_made, avg_fg_attempted, avg_three_made, avg_three_attempted, avg_free_throw_made, avg_free_throw_attempted, avg_mins, avg_rebounds, avg_assists, avg_steals, avg_blocks, avg_blocks, avg_turnovers),
# names_to = "variable",
# values_to = "percentage") |>
#creating a plot to investigate the relationships between each explanatory variable and avg_ponts (response)
# ggplot(mapping = aes(x=percentage,
# y = avg_points)) +
# geom_point() +
# geom_smooth(method = "loess",
# formula = y~x,
# se=F) +
# facet_wrap(facets = ~variable,
# scales = "free_x")
# looking for multicolinearity using a correlation matrix
#player_stats2 |>
#
# # selecting the truly numeric rows
# select(avg_points, avg_fg_made, avg_fg_attempted, avg_three_made, avg_three_attempted, avg_free_throw_made, avg_free_throw_attempted, avg_mins, avg_rebounds, avg_assists, avg_steals, avg_blocks, avg_blocks, avg_turnovers) |>
#creating the correlation matrix
# ggcorr(low ="red",
# mid = "white",
# high = "blue",
# label = T,
# label_round = 2)
# creating a dataset to use in the regression
player_stats_reg2 <- player_stats2 |>
column_to_rownames(var = "athlete_display_name") |>
select(avg_points,
avg_fg_made,
avg_fg_attempted,
avg_three_made,
avg_three_attempted,
avg_free_throw_made,
avg_free_throw_attempted,
avg_mins,
avg_rebounds,
avg_assists,
avg_steals,
avg_blocks,
avg_blocks,
avg_turnovers)
#creating the regression - subtracted rows above 10 from VIF
player_stats_lm2 <- lm(formula = avg_points ~ .
- avg_fg_made
- avg_fg_attempted
- avg_three_attempted
- avg_free_throw_attempted
- avg_rebounds - avg_assists,
data = player_stats_reg2)
#Commented out as to not display in knintted doc
#looking at regression statistics
#summary(player_stats_lm2)
#more regression statistics
stats_estimates2 <- tidy(player_stats_lm2)
#making the data more legible
stats_estimates2 |>
mutate(across(.cols = where(is.numeric),
.fns = round,
digits = 3))
## # A tibble: 7 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -0.748 0.131 -5.72 0
## 2 avg_three_made 1.71 0.105 16.3 0
## 3 avg_free_throw_made 2.25 0.077 29.3 0
## 4 avg_mins 0.19 0.014 13.9 0
## 5 avg_steals -0.425 0.21 -2.03 0.043
## 6 avg_blocks 0.824 0.192 4.29 0
## 7 avg_turnovers 0.938 0.138 6.78 0
#Commented out as to not display in knintted doc
# using variance inflation factor to determine which variables are causing problems
#VIF(player_stats_lm2)
#Calculating the fit statistics
#Creating a function to calculate mae and mae reduction
MAE <-function(actual, predicted){
mae <- abs(actual-predicted) |> mean()
mae_red <- 1 - mae/mean(abs(actual - mean(actual)))
return(list(mae = mae,
reduction = mae_red))
}
#adding model data to original data set
player_stats2 <- augment_columns(x = player_stats_lm2,
data = player_stats_reg2)
#finding fit statistics
fit_stats2 <- player_stats2 |>
summarize(
R2 = cor(avg_points, .fitted)^2,
MAE = MAE(actual = player_stats_reg2$avg_points,
predicted = player_stats_lm2$fit)$mae,
MAER = MAE(actual = player_stats_reg2$avg_points,
predicted = player_stats_lm2$fit)$reduction
)
# creating the plot for the linear model
resid_plot2 <- augment_columns(
x = player_stats_lm2,
data = player_stats_reg2
) |>
ggplot(mapping = aes(x=.fitted,
y=.resid)) +
geom_point() +
labs(x="Predicted Average Points",
y="Residual") +
geom_hline(yintercept = 0,
color = "red",
linetype="dashed",
linewidth=1) +
geom_smooth(method = "loess",
formula = y~x,
se=F) +
theme_bw() +
# Centering the title
theme(plot.title = element_text(hjust = 0.5)) +
# Making the title bold
theme(plot.title = element_text(face = "bold")) +
labs(title = "Residual Plot of 2022-2023 Season")
player_stats_lm2
##
## Call:
## lm(formula = avg_points ~ . - avg_fg_made - avg_fg_attempted -
## avg_three_attempted - avg_free_throw_attempted - avg_rebounds -
## avg_assists, data = player_stats_reg2)
##
## Coefficients:
## (Intercept) avg_three_made avg_free_throw_made
## -0.7478 1.7116 2.2519
## avg_mins avg_steals avg_blocks
## 0.1899 -0.4246 0.8238
## avg_turnovers
## 0.9380
fit_stats2
## # A tibble: 1 × 3
## R2 MAE MAER
## <dbl> <dbl> <dbl>
## 1 0.952 0.990 0.798
resid_plot2
#diagnose any variable causing nonlinearity in the model
#gg_diagnose(fitted.lm = player_stats_lm,
# plot.all = F)
The results of the linear model can be interpreted as follows. For each increase in a player’s average points by one, the value of an explanatory variable changes by the amount produced by the model. For example, for an increase in a player’s average points by one, that player’s average threes made will increase by about 1.7.
The measure of fit statistics describe a 95% increase in prediction accuracy using R squared, or about 80% using the Mean Average Error Reduction.
Finally, the residual plot is approximately linear which is an indication that is an appropriate model for making predictions about a player’s average points
Since we have established that a players average points per season is the most important variable in classifying a MIP Award winner, we want to use our data to try and predict the next MIP Award winner. To do this we need to calculate the average points for each player in the NBA from the 2022-2023 season, and the ongoing 2023-2024 season:
# Using a function from the hoopR package to load in NBA player data
nba_2023_2024 <- load_nba_player_box(
seasons = c(2023:2024), # The seasons are based on the ending season (so 2023 is the 2022 - 2023 season)
dbConnection = NULL,
tablename = NULL
) |>
# Filtering only for regular season games
#season_type = 3 is what determines the games to be playoffs, we removed that to gather the regular season data
filter(season_type != 3)|>
# Removing any games where they did not play
filter(did_not_play == "FALSE") |>
select(season, athlete_display_name, points)|>
group_by(athlete_display_name,
season) |>
summarize(avg_points = mean(points))
nba_2023 <- nba_2023_2024 |>
filter(season == 2023) |> # Only keeping the before
select(athlete_display_name, season, avg_points, ) |> # Only keeping certain columns
mutate(avg_points_2023 = avg_points) # Changing the name
nba_2024 <- nba_2023_2024 |>
filter(season == 2024) |> # Only keeping values from MIP season
select(athlete_display_name, season, avg_points) |> # Only keeping certain columns
mutate(avg_points_2024 = avg_points) # Changing the name
mip_guess_data <- inner_join(x = nba_2023, y = nba_2024, by = "athlete_display_name") |> # Joining the data by the name (Any joing would've worked in this situation)
# Keeping only the relevant columns
select(athlete_display_name, avg_points_2023, avg_points_2024)
tibble(mip_guess_data)
## # A tibble: 441 × 3
## athlete_display_name avg_points_2023 avg_points_2024
## <chr> <dbl> <dbl>
## 1 A.J. Lawson 3.73 3.81
## 2 AJ Green 4.4 4.39
## 3 AJ Griffin 8.88 2.06
## 4 Aaron Gordon 16.3 13.8
## 5 Aaron Holiday 3.92 7.09
## 6 Aaron Nesmith 10.1 12.7
## 7 Aaron Wiggins 6.84 6.2
## 8 Admiral Schofield 4.19 1.15
## 9 Al Horford 9.78 7.85
## 10 Alec Burks 12.8 12.1
## # ℹ 431 more rows
Now that we found the average points for the last two seasons for each NBA player, we need to change the MIP data so we can use it with the newly created data frame:
# Sadly, could not just pivot wider, since the players have two unique identifiers: their name and the season
mip_wide_before <- mip_final |>
# Making sure it only keeps the season before they won MIP
filter(mip_season == "Before") |> # Only keeping the before
select(athlete_display_name, avg_points, mip_season) |> # Only keeping certain columns
mutate(avg_points_before = avg_points) # Changing the name
# Saving a data frame for the MIP season values
mip_wide_after <- mip_final |>
filter(mip_season == "MIP") |> # Only keeping values from MIP season
select(athlete_display_name, avg_points, mip_season) |> # Only keeping certain columns
mutate(avg_points_mip = avg_points) # Changing the name
mip_wide <- inner_join(x = mip_wide_before, y = mip_wide_after, by = "athlete_display_name") |> # Joining the data by the name (Any joing would've worked in this situation)
# Keeping only the relevant columns
select(athlete_display_name, avg_points_before, avg_points_mip)
# Displaying mip_wide
mip_wide
## # A tibble: 10 × 3
## # Groups: athlete_display_name [10]
## athlete_display_name avg_points_before avg_points_mip
## <chr> <dbl> <dbl>
## 1 Brandon Ingram 18.3 23.5
## 2 CJ McCollum 6.84 20.8
## 3 Giannis Antetokounmpo 16.9 23.0
## 4 Goran Dragic 14.7 20.3
## 5 Ja Morant 19.1 27.1
## 6 Jimmy Butler 13.1 19.8
## 7 Julius Randle 19.5 23.8
## 8 Lauri Markkanen 14.8 25.4
## 9 Pascal Siakam 7.27 16.9
## 10 Victor Oladipo 15.9 22.9
With confirmed MIP data and potential MIP data, we can calculate the total distance each current NBA player has from every MIP winner. The “distance” would be measured using Euclidean distance, where we are looking at the difference of the intersection of a player’s last seasons points and the current season points versus the last 10 MIP winners.
The formla for Euclidean Distance is: √Σ(xi-yi)^2
For our data, the calculation is: √Σ((player last season - MIP Player previous season)^2 + (player current season - MIP Player award winning season))^2 Must do the calculation for each MIP winner.
Using this formula, we can calculate the total Euclidean distance a player has from the last 10 MIP winners:
# Creating a function for calculating Euclidean distance
distance_calc <- function(player_s1, player_s2){
# Creating a function to calculate the sum of the euclidean distance from each MIP player
distance <- sqrt( (player_s1 - mip_wide[1, 2])^2 + (player_s2 - mip_wide[1, 3])^2 ) +
sqrt( (player_s1 - mip_wide[2, 2])^2 + (player_s2 - mip_wide[2, 3])^2 ) +
sqrt( (player_s1 - mip_wide[3, 2])^2 + (player_s2 - mip_wide[3, 3])^2 ) +
sqrt( (player_s1 - mip_wide[4, 2])^2 + (player_s2 - mip_wide[4, 3])^2 ) +
sqrt( (player_s1 - mip_wide[5, 2])^2 + (player_s2 - mip_wide[5, 3])^2 ) +
sqrt( (player_s1 - mip_wide[6, 2])^2 + (player_s2 - mip_wide[6, 3])^2 ) +
sqrt( (player_s1 - mip_wide[7, 2])^2 + (player_s2 - mip_wide[7, 3])^2 ) +
sqrt( (player_s1 - mip_wide[8, 2])^2 + (player_s2 - mip_wide[8, 3])^2 ) +
sqrt( (player_s1 - mip_wide[9, 2])^2 + (player_s2 - mip_wide[9, 3])^2 ) +
sqrt( (player_s1 - mip_wide[10, 2])^2 + (player_s2 - mip_wide[10, 3])^2 )
# Returning the value
return(distance)
}
# Creating an empty storage to keep all the distance values
dis_storage <- rep(-1, length(mip_guess_data))
# Creating a loop for every player in the data frame
for(i in 1:411){
dis_storage[i] <- distance_calc(mip_guess_data[i, 2], mip_guess_data[i, 3])
}
# Pivoting the data from dis_storage to get all of the player total distances
dis_test <- dis_storage |>
data.frame() |>
pivot_longer(
cols = everything(), # Using everything to get all of the columns
names_to = "name",
values_to = "distance"
) |>
# Only keeping the distance column
select(distance)
# Using bin_cols to connect the columns together -> since the data is still in the correct order, bind_cols works
#final_distance_data <- bind_cols(n = mip_guess_data$athlete_display_name, SE = dis_test)
# Displaying a tibble of the data
#tibble(final_distance_data)
Now that we have calculated the total distance for each player, we can now find the smallest value to determine who we predict to win the MIP for the 2023-2024 NBA season.
# Using slice_min() to find which player has the smallest total sum of distance from all the MIP winners
#mip_guess <- final_distance_data |> slice_min(distance)
# Displaying the value
#mip_guess
Through comparing the average points from the previous season to the current season to past MIP winners, we found that Alperen Sengun has the smallest sum of distance from the last 10 MIP winners.
An important thing to note is that we only used the average points of a player to calculate this. In the future, if we were to try and predict the MIP again we would try to include other variables in our calculation to find a more accurate prediction of the MIP.
Although there are other factors to help determine MIP, since the variable importance of average points is so high, we are comfortable predicting Alperen Sengun as the 2023-2024 Most Improved Player.
Overall, we found that there were specific variables that are more significant in selecting the Most Improved Player Award. Using a exploratory/visual analysis, and machine learning through a classification tree, we were able to establish that a players increase in average points per season is the most important factor in identifying if a player will win the MIP award. By looking at the season before, compared to the MIP season, we could see a clear change in average points per season. In addition to points, we saw that a players free throw, field goal, and three point percentage also played a role, which is correlated with the points scored. Aside from points, we witnessed that in the last 10 years no center has won the award. This may be in part that defensive measures did not play a large roll in winning the MIP award, which would impact centers due to their defensive responsibilities. In conclusion, finding these patterns, we were able to show how significant the average points play a role in winning MIP.
Although the data we manipulated and analyzed let us create efficient visuals and tables, we know that with more data we would be able to provide more accurate and in depth analysis of our machine learning methods. Due to our initial difficulties retrieving and manipulating the data, we were only able to use the last 10 players to win the Most Improved Player Award to create our project. With more time, a larger more comprehensive data set, and a better understanding of the hoopR package, we believe that we would’ve included more data to support our analysis and findings. A key part of our project that took the hardest hit from the lack of data was our machine learning methods. For our classification tree, we only had 20 points of data to classify, where the average points alone correctly classified 95% of the options. While this could be the case if we added more data, we believe that with a larger sample we would’ve been able to go more in depth. This problem was also relevant when looking at our Linear Regression, due to the small sample size we weren’t able to create a linear model appropriate for predicting a players average points based on other non-point related variables (free throw, three, and field goal %) and find a reliable correlation between the variables.
In the future, the primary thing we would recommend would be to use all of the NBA MIP Award Winners, this would allow for a much more comprehensive analysis, and would allow us to see the trends in the MIP Award Winners over the years. Next, we would first recommend to find a fully comprehensive data set that includes if a player has won MIP. Doing this would save a significant amount of time and computing power, since we had to create our own vectors with the player name and year they won to help identify if a player has won the MIP and in what season they did it. Finally, if possible, a data set with summarized values would save a lot of time and computing power, since we spent most of the first part of the project trying to make a viable data source using the dplyr verb summarize.