Introduction

This project involves analyzing and answering questions based on NBA player and team datasets, including playoff games over the years. Each question is answered using tidyverse to ensure clarity and avoid long printouts.

Note:

Throughout this document, any season column represents the year each season started. For example, the 2015-16 season will be in the dataset as 2015. For most of the rest of the project, we will refer to a season by just this number (e.g. 2015) instead of the full text (e.g. 2015-16).

Setup and Data

library(tidyverse)
library(ggplot2)
library(dplyr)
library(caret)
library(plotly)

# Note, you will likely have to change these paths. If your data is in the same folder as this project, 
player_data <- read_csv("./player_game_data.csv")
team_data <- read_csv("./team_game_data.csv")

Part 1 – Data Cleaning

In this section, the task is to clean and preprocess the data to answer various questions using team and player statistics. All provided stats are on the game level.

Question 1

QUESTION: What was the Warriors’ Team offensive and defensive eFG% in the 2015-16 regular season? Remember that this is in the data as the 2015 season.

# Here and for all future questions, feel free to add as many code chunks as you like. Do NOT put echo = F though, we'll want to see your code.

# Filter data for Warriors offensive stats 2015
warriors_offense_data <- team_data %>%
  filter(off_team_name == "Golden State Warriors",
         season == 2015)

# Calculate offensive efG% 
warriors_offensive_efg <- warriors_offense_data %>%
  summarise(
    total_fg_made = sum(fgmade),
    total_3_fg_made = sum(fg3made),
    total_fga = sum(fgattempted)
  ) %>%
  mutate(eFG = (total_fg_made + 0.5 * total_3_fg_made) / total_fga) %>%
  pull(eFG)

# Filter data for Warriors defensive stats 2015
warriors_defense_data <- team_data %>%
  filter(def_team_name == "Golden State Warriors",
         season == 2015)

## Calculate defensive eFG%
warriors_defensive_efg <- warriors_defense_data %>%
  summarise(
    total_fg_made = sum(fgmade),
    total_3_fg_made = sum(fg3made),
    total_fga = sum(fgattempted)
  ) %>%
  mutate(eFG = (total_fg_made + 0.5 * total_3_fg_made) / total_fga) %>%
  pull(eFG)

## Plot 
efg_data <- data.frame(
  Category = c("Offensive eFG%", "Defensive eFG%"),
  eFG = c(warriors_offensive_efg, warriors_defensive_efg) * 100
)

ggplot(efg_data, aes(x = Category, y = eFG, fill = Category)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = sprintf("%.2f%%", eFG)), 
            vjust = -0.2, 
            size = 5) +
  scale_y_continuous(labels = scales::percent_format(scale = 1)) +
  labs(title = "Warriors 2015 Season",
       y = "eFG%",
       x = "") +
  theme_minimal() +
  theme(
    legend.position = "none", 
    plot.title = element_text(hjust = 0.5),
    panel.grid.major = element_blank(),  
    panel.grid.minor = element_blank(),
  )

ANSWER 1:

Offensive: 55.6% eFG
Defensive: 47.9% eFG

Question 2

QUESTION: What percent of the time does the team with the higher eFG% in a given game win that game? Use games from the 2014-2023 regular seasons. If the two teams have an exactly equal eFG%, remove that game from the calculation.

filtered_games <- team_data %>%
  filter(season >= 2014, season <= 2023, gametype == 2)

filtered_games <- filtered_games %>%
  mutate(efg = (fgmade + 0.5 * fg3made) / fgattempted)

# verify that each game as 2 records 
game_counts <- filtered_games %>% 
  group_by(nbagameid) %>%
  summarise(count = n()) %>%
  filter(count == 2)

filtered_games <- filtered_games %>%
  filter(nbagameid %in% game_counts$nbagameid)

# create dataframe for first team game by game
team1_data <- filtered_games %>%
  group_by(nbagameid) %>%
  slice(1) %>%
  select(nbagameid, off_team_name, efg, off_win) %>%
  rename(team1_name = off_team_name, team1_efg = efg, team1_win = off_win)

# create dataframe for second team game by game
team2_data <- filtered_games %>%
  group_by(nbagameid) %>%
  slice(2) %>%
  select(nbagameid, off_team_name, efg, off_win) %>%
  rename(team2_name = off_team_name, team2_efg = efg, team2_win = off_win)

# merge on game ID
merged_data <- team1_data %>%
  inner_join(team2_data, by = "nbagameid")

# remove games where eFG% is exactly equal
merged_data <- merged_data %>%
  filter(team1_efg != team2_efg)

# determine if team with higher  eFG% won
merged_data <- merged_data %>%
  mutate(higher_efg_win = (team1_efg > team2_efg & team1_win) | (team1_efg < team2_efg & team2_win))

# calculate percentage
percentage_higher_efg_win <- mean(merged_data$higher_efg_win) * 100

cat(percentage_higher_efg_win, "%\n")
## 81.60141 %

ANSWER 2:

81.6%

Question 3

QUESTION: What percent of the time does the team with more offensive rebounds in a given game win that game? Use games from the 2014-2023 regular seasons. If the two teams have an exactly equal number of offensive rebounds, remove that game from the calculation.

# verify that each game as 2 records 
game_counts <- filtered_games %>% 
  group_by(nbagameid) %>%
  summarise(count = n()) %>%
  filter(count == 2)

filtered_games <- filtered_games %>%
  filter(nbagameid %in% game_counts$nbagameid)

# create dataframe for first team game by game
team1_data <- filtered_games %>%
  group_by(nbagameid) %>%
  slice(1) %>%
  select(nbagameid, off_team_name, reboffensive, off_win) %>%
  rename(team1_name = off_team_name, team1_reboff = reboffensive, team1_win = off_win)

# create dataframe for second team game by game
team2_data <- filtered_games %>%
  group_by(nbagameid) %>%
  slice(2) %>%
  select(nbagameid, off_team_name, reboffensive, off_win) %>%
  rename(team2_name = off_team_name, team2_reboff = reboffensive, team2_win = off_win)

# merge on game ID
merged_data <- team1_data %>%
  inner_join(team2_data, by = "nbagameid")

# remove games where eFG% is exactly equal
merged_data <- merged_data %>%
  filter(team1_reboff != team2_reboff)

# determine if team with higher  eFG% won
merged_data <- merged_data %>%
  mutate(higher_reboff_win = (team1_reboff > team2_reboff & team1_win) | (team1_reboff < team2_reboff & team2_win))

# calculate percentage
percentage_higher_reboff_win <- mean(merged_data$higher_reboff_win) * 100

cat(percentage_higher_reboff_win, "%\n")
## 46.21415 %

ANSWER 3:

46.2%

Question 4

QUESTION: Do you have any theories as to why the answer to question 3 is lower than the answer to question 2? Try to be clear and concise with your answer.

ANSWER 4: There could be several reasons for this :-

  1. Higher eFG% means a team has higher scoring efficiency. It means that a team is scoring more points per shot attempt which has a direct impact on outcome of the game. This means that teams with higher eFG% are likely converting their shot attempts including three-pointers which are worth more points
  2. Furthermore, offensive rebounds give teams more possession and opportunities to score but that doesn’t always translate to points. A team can secure more offensive rebounds but still miss subsequent shot attempts.
  3. Defensive rebounds also play a role. Teams with higher eFG% might also have better defensive capabilities reducing opponent scoring opportunities.

Question 5

QUESTION: Look at players who played at least 25% of their possible games in a season and scored at least 25 points per game played. Of those player-seasons, what percent of games were they available for on average? Use games from the 2014-2023 regular seasons.

For example:

  • Ja Morant does not count in the 2023-24 season, as he played just 9 out of 82 games this year, even though he scored 25.1 points per game.
  • Chet Holmgren does not count in the 2023-24 season, as he played all 82 games this year but scored 16.5 points per game.
  • LeBron James does count in the 2023-24 season, as he played 71 games and scored 25.7 points per game.
## total games played per season 
## we can assume that each season has 82 games but using team data
## we can figure out the total games per season 
total_games_per_season <- filtered_games %>%
  group_by(season, offensivenbateamid) %>%
  summarise(total_games = n_distinct(nbagameid)) %>%
  ungroup() %>%
  rename(nbateamid = offensivenbateamid)
## `summarise()` has grouped output by 'season'. You can override using the
## `.groups` argument.
total_games_per_season <- total_games_per_season %>%
  distinct(season, nbateamid, .keep_all = TRUE)

## get player data for regular season 2014 - 2023
filtered_data <- player_data %>%
  filter(season >= 2014, season <= 2023, gametype==2)

# merge total games with player data
filtered_data <- filtered_data %>%
  left_join(total_games_per_season, by=c('season', 'nbateamid'))

## remove players that have missed game but are available 
filtered_data <- filtered_data %>%
  filter(missed == 0)

# calculate number of games plaued and points per game for each player by season 
player_stats <- filtered_data %>%
  group_by(season, nbapersonid, player_name, total_games) %>%
  summarise(
    games_played = n_distinct(nbagameid),
    total_points = sum(points)
  ) %>%
  ungroup() %>%
  mutate(
    points_per_game = total_points / games_played,
    games_played_percentage = games_played / total_games
  )
## `summarise()` has grouped output by 'season', 'nbapersonid', 'player_name'. You
## can override using the `.groups` argument.
# filter players who played more thatn 25% and scored atleast 25 points per game 
eligible_players <- player_stats %>%
  filter(games_played_percentage >= 0.25, points_per_game >= 25)

# calculate avereage percent of games these players were available for 
average_games_played_percentage <- eligible_players %>%
  summarise(avg_percentage = mean(games_played_percentage) * 100) %>%
  pull(avg_percentage)

cat(average_games_played_percentage, "%\n")
## 83.12175 %
# plot 

ggplot(eligible_players, aes(x = player_name, y = games_played_percentage * 100)) +
  geom_bar(stat = "identity", fill = "darkred") +
  labs(title = "Eligible Player Games % (2014-2023)",
       x = "Player",
       y = "Games Played Percentage (%)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1), 
        panel.grid.major = element_blank(),  
    panel.grid.minor = element_blank())

ANSWER 5:

83.1% of games

Question 6

QUESTION: What % of playoff series are won by the team with home court advantage? Give your answer by round. Use playoffs series from the 2014-2022 seasons. Remember that the 2023 playoffs took place during the 2022 season (i.e. 2022-23 season).

## filtering data for the question
playoff_data <- team_data %>%
  filter(season >= 2014, season <= 2022, gametype == 4)

# identify the playoff round based on the nbagameid 
identify_round <- function(nbagameid) {
  round_digit <- (nbagameid %% 1000) %/% 100
  if(round_digit == 1){
    return("Round 1")
  } else if(round_digit == 2){
    return("Round 2")
  } else if(round_digit == 3){
    return("Conference Finals")
  } else if(round_digit == 4){
    return("Finals")
  } else {
    return("Unknown")
  }
}

## apply the function 
playoff_data <- playoff_data %>%
  mutate(round = sapply(nbagameid, identify_round))

## find home court advantage by series
home_court_advantage <- playoff_data %>%
  group_by(season, round, nbagameid) %>%
  summarise(home_court_team = first(off_team_name[off_home == 1])) %>%
  ungroup()
## `summarise()` has grouped output by 'season', 'round'. You can override using
## the `.groups` argument.
## winner of each game 
playoff_data <- playoff_data %>%
  mutate(winner = ifelse(off_win == 1, off_team_name, def_team_name))

## now find the series winner 
series_winner <- playoff_data %>%
  group_by(season, round, nbagameid, winner) %>%
  summarise(win_count = n()) %>%
  ungroup() %>%
  group_by(season, round, nbagameid) %>%
  slice(which.max(win_count)) %>%
  ungroup()
## `summarise()` has grouped output by 'season', 'round', 'nbagameid'. You can
## override using the `.groups` argument.
## merge home court advantage data with series winner 
merged_data <- series_winner %>%
  left_join(home_court_advantage, by = c("season", "round", "nbagameid"))

## check if series winner had home advantage
merged_data <- merged_data %>%
  mutate(home_court_win = winner == home_court_team)

## calculate results 
results <- merged_data %>%
  group_by(round) %>%
  summarise(percentage = mean(home_court_win) * 100)

## change order
results$round <- factor(results$round, levels = c('Round 1', 'Round 2', 'Conference Finals', 'Finals'))

## print 
print(results)
## # A tibble: 4 × 2
##   round             percentage
##   <fct>                  <dbl>
## 1 Conference Finals       58.4
## 2 Finals                  52.9
## 3 Round 1                 60.1
## 4 Round 2                 60.2
## visualize
ggplot(results, aes(x = round, y = percentage, fill = round)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.7) +
  geom_text(aes(label = sprintf("%.1f%%", percentage)), 
            position = position_dodge(width = 0.7), 
            vjust = -0.2, 
            size = 3) +  
  labs(title = "% series won by team with home court advantage",
       x = "Round",
       y = "Percentage") +
  theme_minimal() +
  theme(
    legend.position = "none", 
    plot.title = element_text(hjust = 0.5),  
    panel.grid.major = element_blank(), 
    panel.grid.minor = element_blank()  
  )

ANSWER 6:

Round 1: 60.1%
Round 2: 60.2%
Conference Finals: 58.4%
Finals: 52.9%

Question 7

QUESTION: Among teams that had at least a +5.0 net rating in the regular season, what percent of them made the second round of the playoffs the following year? Among those teams, what percent of their top 5 total minutes played players (regular season) in the +5.0 net rating season played in that 2nd round playoffs series? Use the 2014-2021 regular seasons to determine the +5 teams and the 2015-2022 seasons of playoffs data.

For example, the Thunder had a better than +5 net rating in the 2023 season. If we make the 2nd round of the playoffs next season (2024-25), we would qualify for this question. Our top 5 minutes played players this season were Shai Gilgeous-Alexander, Chet Holmgren, Luguentz Dort, Jalen Williams, and Josh Giddey. If three of them play in a hypothetical 2nd round series next season, it would count as 3/5 for this question.

Hint: The definition for net rating is in the data dictionary.

## PART 1: Among teams that had at least a +5.0 net rating in the regular season, what percent of them made the second round of the playoffs the **following** year?
# filter data 
regular_season_data = team_data %>%
  filter(season >= 2014 & season <= 2021 & gametype == 2)

# calculate offensive rating 
regular_season_data <- regular_season_data %>%
  mutate(off_rtg = points / (possessions / 100))

# merge data with iteself to find the defensive rating 
merged_data <- regular_season_data %>%
  inner_join(regular_season_data, by = c("season", "nbagameid"), suffix = c("_off", "_def"), relationship =
  "many-to-many")

# now filter to remove rows where off team is same as def team 
merged_data <- merged_data %>%
  filter(off_team_name_off != off_team_name_def)

merged_data <- merged_data %>%
  mutate(def_rtg = points_def / (possessions_def / 100))

merged_data <- merged_data %>%
  mutate(net_rtg = off_rtg_off - def_rtg)

# average net rating for each team per season 
avg_net_rtg <- merged_data %>%
  group_by(season, off_team_name_off) %>%
  summarise(net_rtg = mean(net_rtg, na.rm = TRUE)) %>%
  rename(team_name = off_team_name_off)
## `summarise()` has grouped output by 'season'. You can override using the
## `.groups` argument.
# keep teams with avg net rtg >= 5
teams_with_high_net_rtg <- avg_net_rtg %>%
  filter(net_rtg >= 5)

# get playoff data
playoff_data <- team_data %>%
  filter(season >= 2015 & season <= 2022 & gametype == 4)

# identify round 
playoff_data <- playoff_data %>%
  mutate(round = sapply(nbagameid, identify_round))


# get second playoff games
second_round_playoff_data <- playoff_data %>%
  filter(round == 'Round 2')

# check if teams with +5 net rtg made it to second round 
teams_with_high_net_rtg <- teams_with_high_net_rtg %>%
  mutate(next_season = season + 1)

teams_in_second_round <- second_round_playoff_data %>%
  select(season, off_team_name) %>%
  distinct() %>%
  rename(next_season = season, team_name = off_team_name)

# merge to check if team made to second round 
merged_check <- left_join(teams_with_high_net_rtg, teams_in_second_round, by = c("next_season", "team_name"), keep = TRUE)

merged_check <- merged_check %>%
  mutate(made_second_round = !is.na(team_name.y))


# calculate % of teams that made it to second round 
percent_made_second_round <- mean(merged_check$made_second_round, na.rm = TRUE) * 100

cat(paste("Percent of +5.0 net", percent_made_second_round), "\n")
## Percent of +5.0 net 63.6363636363636
## PART 2: Among those teams, what percent of their top 5 total minutes played players (regular season) in the +5.0 net rating season played in that 2nd round playoffs series

# filter for teams with high net rtg and made it to second round 
teams_high_rtg_2nd_round <- merged_check %>%
  filter(made_second_round == TRUE)

# get regular season player 
regular_season_player_data <- player_data %>%
  filter(season >= 2014 & season <= 2021, gametype == 2)

# calculate total seconds played
seconds_played <- regular_season_player_data %>%
  group_by(season, team_name, player_name) %>%
  summarise(seconds = sum(seconds, na.rm = TRUE)) %>%
  ungroup()
## `summarise()` has grouped output by 'season', 'team_name'. You can override
## using the `.groups` argument.
# find top 5 players per team per season 
top_5_players_per_team_per_season <- seconds_played %>%
  group_by(season, team_name) %>%
  top_n(5, wt = seconds) %>%
  ungroup()

# merge with teams_high_rtg_2nd_round to get top players for those teams 
merged_data <- left_join(teams_high_rtg_2nd_round, top_5_players_per_team_per_season, by = c("team_name.x" = "team_name", "season" = "season"))

# get playoff player data
player_playoff_data <- player_data %>%
  filter(season >= 2015 & season <= 2022, gametype == 4)

# apply identify round 
player_playoff_data <- player_playoff_data %>%
  mutate(round = sapply(nbagameid, identify_round))

# filter for second round playoff games
second_round_playoff_player_data <- player_playoff_data %>%
  filter(round == 'Round 2')

# total seconds played by each player in seconds round playoff
second_round_playoff_player_data <- second_round_playoff_player_data %>%
  group_by(season, team_name, player_name) %>%
  summarise(seconds = sum(seconds, na.rm = TRUE)) %>%
  ungroup()
## `summarise()` has grouped output by 'season', 'team_name'. You can override
## using the `.groups` argument.
merged_data <- merged_data %>%
  mutate(next_season = season + 1)

# merge dataframes 
final_data <- inner_join(merged_data, second_round_playoff_player_data, 
                         by = c("player_name", "next_season" = "season"))



size_final_data <- nrow(final_data)
size_merged_data <- nrow(merged_data)
cat(paste("Percent of top 5 minutes", (size_final_data / size_merged_data)), "\n")
## Percent of top 5 minutes 0.838095238095238

ANSWER 7:

Percent of +5.0 net rating teams making the 2nd round next year: 63.6%
Percent of top 5 minutes played players who played in those 2nd round series: 83.8%

Part 2 – Playoffs Series Modeling

In this section, the goal is to fit a model that predicts the winner and the number of games in a playoff series between any given two teams. This involves using data from the regular seasons and playoffs to create predictive models.

  1. The final output includes the probability of each team winning the series. For example, “Team A has a 30% chance to win and Team B has a 70% chance.” It also predicts the number of games in the series, which can be probabilistic or a point estimate.

  2. Only data provided in this project is used, with no external sources of data.

  3. Data available prior to the start of the series is used. For example, a team’s stats from the 2016-17 season can’t be used to predict a playoffs series from the 2015-16 season.

  4. The best models are explainable and lead to actionable insights around team and roster construction. The focus is on thought process and critical thinking rather than specific modeling techniques. Using smart features is more important than using complex mathematical machinery.

  5. As part of the project:

  • A brief written overview of how the model works is provided, targeted towards a decision maker in the front office without a strong statistical background.
  • An assessment of the strengths and weaknesses of the model is included.
  • Suggestions on how to address the weaknesses if more time and/or more data were available are given.
  • The model is applied to the 2024 NBA playoffs (2023 season) with a high-quality visual (a table, a plot, or a plotly) showing the 16 teams’ (that made the first round) chances of advancing to each round.
# filter regular season data
regular_season_data <- team_data %>% filter(gametype == 2)

# aggregate data by team and season 
team_stats <- regular_season_data %>%
  group_by(season, off_team_name) %>%
  summarise(
    points = sum(points),
    possessions = sum(possessions),
    fgmade = sum(fgmade),
    fgattempted = sum(fgattempted),
    fg3made = sum(fg3made),
    fg3attempted = sum(fg3attempted),
    ftmade = sum(ftmade),
    ftattempted = sum(ftattempted),
    reboffensive = sum(reboffensive),
    rebdefensive = sum(rebdefensive),
    turnovers = sum(turnovers),
    shotattemptpoints = sum(shotattemptpoints),
    shotattempts = sum(shotattempts)
  )
## `summarise()` has grouped output by 'season'. You can override using the
## `.groups` argument.
# calculate some additional metrics for additional features 
team_stats <- team_stats %>%
  mutate(
    ORTG = (points / possessions) * 100,
    TOVP = turnovers / shotattempts,
    OREBP = reboffensive / shotattempts,
    DREBP = rebdefensive / shotattempts
  )

playoff_data <- team_data %>% filter(gametype == 4)


identify_round <- function(nbagameid) {
  round_digit <- (nbagameid %% 1000) %/% 100
  if (round_digit == 1) {
    return('Round 1')
  } else if (round_digit == 2) {
    return('Round 2')
  } else if (round_digit == 3) {
    return('Conference Finals')
  } else if (round_digit == 4) {
    return('Finals')
  } else {
    return('Unknown')
  }
}

playoff_data$round <- sapply(playoff_data$nbagameid, identify_round)

# function to identify series outcome 
identify_series_outcome <- function(df) {
  series_outcome <- df %>%
    group_by(season, offensivenbateamid, defensivenbateamid) %>%
    summarise(
      games = n(),
      team1_wins = sum(off_win),
      team2_wins = games - team1_wins,
      team1_win = team1_wins > team2_wins,
      off_team_name = first(off_team_name),
      def_team_name = first(def_team_name)
    ) %>%
    ungroup() %>%
    rename(team1 = off_team_name, team2 = def_team_name) %>%
    select(season, team1, team2, team1_win, games)
  
  return(series_outcome)
}

playoff_series_outcomes <- identify_series_outcome(playoff_data)
## `summarise()` has grouped output by 'season', 'offensivenbateamid'. You can
## override using the `.groups` argument.
# merge playoff series outcomes with team stats
get_team_stats <- function(season_filter, team_filter) {
  stats <- team_stats %>%
    filter(season == season_filter, off_team_name == team_filter)
  if (nrow(stats) > 0) {
    return(stats[1, ])
  } else {
    return(NULL)
  }
}
# extracting features for team1 and team2 in each series
playoff_series_features <- lapply(1:nrow(playoff_series_outcomes), function(i) {
  row <- playoff_series_outcomes[i, ]
  season <- row$season
  team1 <- row$team1
  team2 <- row$team2
  
  team1_stats <- get_team_stats(season, team1)
  team2_stats <- get_team_stats(season, team2)
  if (!is.null(team1_stats) && !is.null(team2_stats)) {
    series_features <- data.frame(
      season = season,
      team1 = team1,
      team2 = team2,
      team1_win = row$team1_win,
      num_games = row$games,
      team1_ORTG = team1_stats$ORTG,
      team1_TOVP = team1_stats$TOVP,
      team1_OREBP = team1_stats$OREBP,
      team1_DREBP = team1_stats$DREBP,
      team2_ORTG = team2_stats$ORTG,
      team2_TOVP = team2_stats$TOVP,
      team2_OREBP = team2_stats$OREBP,
      team2_DREBP = team2_stats$DREBP
    )
    return(series_features)
  } else {
    return(NULL)
  }
})

playoff_series_features <- do.call(rbind, playoff_series_features)

set.seed(42)
# splitting data into training and test sets
train_indices <- createDataPartition(playoff_series_features$team1_win, p = 0.8, list = FALSE)
train_data <- playoff_series_features[train_indices, ]
test_data <- playoff_series_features[-train_indices, ]

required_columns <- c('team1_ORTG', 'team1_TOVP', 'team1_OREBP', 'team1_DREBP',
                      'team2_ORTG', 'team2_TOVP', 'team2_OREBP', 'team2_DREBP')

# training the Logistic Regression model to predict series winner 
X_train <- train_data %>% select(all_of(required_columns))
y_train_classification <- train_data$team1_win

train_data_classification <- data.frame(team1_win = y_train_classification, X_train)

clf <- glm(team1_win ~ ., data = train_data_classification, family = binomial())


# train the regression model ( Linear Regression) to predict number of games 
X_train_reg <- X_train
y_train_regression <- train_data$num_games

train_data_regression <- data.frame(num_games = y_train_regression, X_train_reg)

reg <- glm(num_games ~ ., data = train_data_regression)

# predictions and evaluations
X_test <- test_data %>% select(all_of(required_columns))
y_test_classification <- test_data$team1_win
y_test_regression <- test_data$num_games

y_pred_proba_classification <- predict(clf, newdata = X_test, type = "response")
y_pred_classification <- ifelse(y_pred_proba_classification > 0.5, 1, 0)
y_pred_regression <- predict(reg, X_test)

classification_accuracy <- mean(y_pred_classification == y_test_classification)

regression_mae <- mean(abs(y_pred_regression - y_test_regression))

# the offical 2024 matches for prediction 
playoff_matchups_2024 <- data.frame(
  team1 = c('Boston Celtics', 'Cleveland Cavaliers', 'Milwaukee Bucks', 'New York Knicks',
            'Oklahoma City Thunder', 'Dallas Mavericks', 'Minnesota Timberwolves', 'Denver Nuggets'),
  team2 = c('Miami Heat', 'Orlando Magic', 'Indiana Pacers', 'Philadelphia 76ers',
            'New Orleans Pelicans', 'LA Clippers', 'Phoenix Suns', 'Los Angeles Lakers')
)

# the 2024 playoff series data using stats from 2023
playoff_series_2024 <- lapply(1:nrow(playoff_matchups_2024), function(i) {
  matchup <- playoff_matchups_2024 %>% slice(i)
  team1 <- matchup$team1
  team2 <- matchup$team2
  team1_stats <- get_team_stats(2023, team1)
  team2_stats <- get_team_stats(2023, team2)
  
  if (!is.null(team1_stats) && !is.null(team2_stats)) {
    series_features <- data.frame(
      team1_ORTG = team1_stats$ORTG,
      team1_TOVP = team1_stats$TOVP,
      team1_OREBP = team1_stats$OREBP,
      team1_DREBP = team1_stats$DREBP,
      team2_ORTG = team2_stats$ORTG,
      team2_TOVP = team2_stats$TOVP,
      team2_OREBP = team2_stats$OREBP,
      team2_DREBP = team2_stats$DREBP
    )
    return(series_features)
  } else {
    return(NULL)
  }
})

playoff_series_2024 <- do.call(rbind, playoff_series_2024)

playoff_series_2024_matrix <- playoff_series_2024[required_columns]

# predict win probabilities and series length
win_probabilities <- predict(clf, playoff_series_2024_matrix, type = "response")
predicted_games <- predict(reg, playoff_series_2024_matrix)

results_df <- data.frame(
  Team1 = playoff_matchups_2024$team1,
  Team2 = playoff_matchups_2024$team2,
  Team1_Win_Probability = win_probabilities,
  Predicted_Series_Length = predicted_games
)

# displaying the results as a table 
print(results_df)
##                    Team1                Team2 Team1_Win_Probability
## 1         Boston Celtics           Miami Heat             0.9406674
## 2    Cleveland Cavaliers        Orlando Magic             0.5445493
## 3        Milwaukee Bucks       Indiana Pacers             0.3532428
## 4        New York Knicks   Philadelphia 76ers             0.7701264
## 5  Oklahoma City Thunder New Orleans Pelicans             0.7280913
## 6       Dallas Mavericks          LA Clippers             0.2909491
## 7 Minnesota Timberwolves         Phoenix Suns             0.2856642
## 8         Denver Nuggets   Los Angeles Lakers             0.5517083
##   Predicted_Series_Length
## 1                5.883728
## 2                5.626243
## 3                5.618680
## 4                5.935660
## 5                5.588482
## 6                5.646257
## 7                5.562150
## 8                5.655684
# get plot 
fig <- plot_ly()

for (i in 1:nrow(results_df)) {
  row <- results_df[i, ]
  fig <- fig %>%
    add_trace(
      x = list(row$Team1, row$Team2),
      y = list(row$Team1_Win_Probability, 1 - row$Team1_Win_Probability),
      type = 'bar',
      name = paste('Series', i),
      text = c(sprintf("%.2f%%", row$Team1_Win_Probability * 100),
               sprintf("%.2f%%", (1 - row$Team1_Win_Probability) * 100)),
      textposition = 'auto',
      hoverinfo = 'text'
    )
}


fig <- fig %>%
  layout(
    title = '2024 NBA Playoff Win Chances',
    xaxis = list(title = 'Teams'),
    yaxis = list(title = 'Win Probability', tickformat = ".2%"),
    barmode = 'group',
    template = 'plotly_dark',
    showlegend = FALSE
  )

fig

Overview of the model

The objective of the model is to:

  1. Predict the probability of each team winning a playoff series.
  2. Predict the number of games in each series.

Data from regular seasons is used to calculate advanced stats like Offensive Rating (ORTG), Turnover Percentage (TOV%), Offensive Rebound Percentage (OREB%), and Defensive Rebound Percentage (DREB%). Playoff data is then used to determine the outcome of each series. These features provide a comprehensive view of a team’s offensive and defensive capabilities.

Both are linear models and work by fitting a line to either A) separate a data point into a win or not(classification for team winning or not) via a decision boundary or B) fit a line on data points so that we get the numerical outcome(number of games in series) to then get where the next data point will likely fall.

Strenghts and Weaknesses

Strengths - Explainability: The models use well understood stats measures(ORTG, TOV%, OREB%, DREB%) making the predictions transparent and explainable to decision makers - Accuracy: The logistic regression model performed very well accurate based on the data and very closely match the actual results of the playoffs on 2024. It predicted 6 out of 8 match ups correctly.

Weaknesses - Limited Data: The regular season data used spans from 2014 to 2023, while playoff data extends to 2004, this means I had to drop the older series from the training which limits the training data - Model Assumptions: The models are linear thus assuming linear relationship between features, which may not be true

Addressing Weaknesses Given more time and data, several improvements can be made:

  • Incorporate Player-Level Data: Including data on player injuries, individual performance metrics, and rest days would improve the accuracy of predictions by accounting for the availability and health of key players.
  • Advanced Metrics and Qualitative Data: Adding advanced metrics such as player efficiency ratings, coaching statistics, and qualitative data on team dynamics would provide a comprehensive view of the factors influencing game outcomes.
  • Real-Time Updates: Developing a system for real-time updates on player status, team news, and other relevant information would make the predictions more responsive to current team conditions.

Table and Plot

Please refer to table and plot applied to 2024 first round playoffs.

Part 3 – Finding Insights from Your Model

This part focuses on identifying two teams that had a competitive window of two or more consecutive seasons making the playoffs and underperformed the model’s expectations. One underperformance is classified as due to bad luck and the other as relating to a cause not currently accounted for in the model. Recommendations are provided on how to improve the model with more time and data.

# Predict win probabilities for the entire dataset
X_all <- playoff_series_features %>% select(team1_ORTG, team1_TOVP, team1_OREBP, team1_DREBP, team2_ORTG, team2_TOVP, team2_OREBP, team2_DREBP)

playoff_series_features$win_prob <- predict(clf, newdata = X_all, type = "response")


playoff_series_features <- playoff_series_features %>% arrange(team1, season)

# identify consecutive playoffs
playoff_series_features <- playoff_series_features %>%
  group_by(team1) %>%
  mutate(consecutive_playoffs = cumsum(c(1, diff(season) != 1))) %>%
  ungroup()


# teams with 2 or more consecutive seasons
competitive_teams <- playoff_series_features %>% filter(consecutive_playoffs >= 2)

# determine Underperformance
competitive_teams <- competitive_teams %>%
  mutate(predicted_win = ifelse(win_prob > 0.5, 1, 0))

underperformers <- competitive_teams %>% filter(predicted_win == 1 & team1_win == 0)

# reaons for underperformance, bad luck or other cause based on model's prediction 
underperformers <- underperformers %>%
  mutate(reason = ifelse(win_prob > 0.85, "Bad Luck", "Other Cause"))

cat("Teams that underperformed:\n")
## Teams that underperformed:
bad_luck_teams <- unique(underperformers %>% filter(reason == "Bad Luck") %>% pull(team1))
other_cause_teams <- unique(underperformers %>% filter(reason == "Other Cause") %>% pull(team1))
cat("Teams with bad luck:", paste(bad_luck_teams, collapse = ", "), "\n")
## Teams with bad luck: Golden State Warriors, Houston Rockets, Utah Jazz
cat("Teams with other cause:", paste(other_cause_teams, collapse = ", "), "\n")
## Teams with other cause: Boston Celtics, Brooklyn Nets, Cleveland Cavaliers, Denver Nuggets, Golden State Warriors, LA Clippers, Miami Heat, Milwaukee Bucks, Minnesota Timberwolves, New York Knicks, Philadelphia 76ers, Phoenix Suns, Toronto Raptors, Utah Jazz, Washington Wizards
cat("\nDetails of underperforming teams:\n")
## 
## Details of underperforming teams:
print(underperformers)
## # A tibble: 32 × 17
##    season team1      team2 team1_win num_games team1_ORTG team1_TOVP team1_OREBP
##     <dbl> <chr>      <chr> <lgl>         <int>      <dbl>      <dbl>       <dbl>
##  1   2022 Boston Ce… Miam… FALSE             7       118.      0.139       0.128
##  2   2020 Brooklyn … Milw… FALSE             7       119.      0.143       0.123
##  3   2021 Brooklyn … Bost… FALSE             4       114.      0.148       0.143
##  4   2022 Cleveland… New … FALSE             5       117.      0.144       0.136
##  5   2020 Denver Nu… Phoe… FALSE             4       117.      0.142       0.147
##  6   2021 Denver Nu… Gold… FALSE             5       115.      0.156       0.134
##  7   2015 Golden St… Clev… FALSE             7       114.      0.161       0.134
##  8   2018 Golden St… Toro… FALSE             6       116.      0.148       0.128
##  9   2022 Golden St… Los … FALSE             6       116.      0.169       0.141
## 10   2016 Houston R… San … FALSE             6       115.      0.157       0.146
## # ℹ 22 more rows
## # ℹ 9 more variables: team1_DREBP <dbl>, team2_ORTG <dbl>, team2_TOVP <dbl>,
## #   team2_OREBP <dbl>, team2_DREBP <dbl>, win_prob <dbl>,
## #   consecutive_playoffs <dbl>, predicted_win <dbl>, reason <chr>

ANSWER :

Based on the model’s prediction, there are man teams with competitve window of 2 or more consecutive seasons making the playoffs under performed relative to expectations. Two teams I would pick as examples(from the results printed above) are:

Bad Luck: Golden State Warriors, In their 2015 game against Cleveland Cavaliers they had high win probability (over 85%) according to our model but still lost. In this 2015 - 2016 season and hence 2016 finals Draymond Green was suspended for Game 5 and Andrew Bogut got injured hence leading to their loss.

Other Cause: Cleveland Cavaliers In there playoff game in 2022 playoffs(played in 2023) they lost 4-1 against New York Knicks where model had predicted there chance as 58%. Since this was a big victory for knicks this cannot be attributed to bad luck. This might have happened due to factors unaccounted for in the model, such as team chemistry, coaching decisions and fatigue and rest of the players.

Suggested Improvment and Feedback

Given more time and data, there are several improvements I would make to the model:

  1. Incorporate Player-Level Data: Including data on player injuries, individual performance metrics, and rest days would improve the accuracy of predictions. This adjustment would allow the model to account for the availability and health of key players.

  2. Advanced Metrics and Qualitative Data: Adding advanced metrics such as player efficiency ratings, coaching statistics, and qualitative data on team dynamics would provide a comprehensive view of the factors influencing game outcomes.

  3. Real-Time Updates: Developing a system for real-time updates on player status, team news, and other relevant information would make the predictions more responsive to the current conditions of the teams.