Introduction

The purpose of this project is to gauge your technical skills and problem solving ability by working through something similar to a real NBA data science project. You will work your way through this R Markdown document, answering questions as you go along. Please begin by adding your name to the “author” key in the YAML header. When you’re finished with the document, come back and type your answers into the answer key at the top. Please leave all your work below and have your answers where indicated below as well. Please note that we will be reviewing your code so make it clear, concise and avoid long printouts. Feel free to add in as many new code chunks as you’d like.

Remember that we will be grading the quality of your code and visuals alongside the correctness of your answers. Please try to use the tidyverse as much as possible (instead of base R and explicit loops). Please do not bring in any outside data.

Note:

Throughout this document, any season column represents the year each season started. For example, the 2015-16 season will be in the dataset as 2015. For most of the rest of the project, we will refer to a season by just this number (e.g. 2015) instead of the full text (e.g. 2015-16).

Answers

Part 1

Question 1:

  • Offensive: 56.5% eFG
  • Defensive: 47.9% eFG

Question 2: 81.6%

Question 3: 46.2%

Question 4: This is a written question. Please leave your response in the document under Question 5.

Question 5: 80.9% of games

Question 6:

  • Round 1: 80.6%
  • Round 2: 58.3%
  • Conference Finals: 55.6%
  • Finals: 77.8%

Question 7:

  • Percent of +5.0 net rating teams making the 2nd round next year: 63.6%
  • Percent of top 5 minutes played players who played in those 2nd round series: 82.9%

Part 2

Please show your work in the document, you don’t need anything here.

Part 3

Please write your response in the document, you don’t need anything here.

Setup and Data

# LOAD IN PACKAGES
library(tidyverse)
library(dplyr)
library(stringi)
library(stringr)
library(purrr)
library(ggplot2)
library(plotly)
library(pROC)
library(caret)

#---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
# LOAD DATA
team_game_data <- read.csv("C:/Users/fishm/Documents/OKC Thunder Application/Datasets/team_game_data.csv", TRUE, ",") #only post szn data from 2004-2013
player_game_data <- read.csv("C:/Users/fishm/Documents/OKC Thunder Application/Datasets/player_game_data.csv", TRUE, ",")
#---------------------------------------------------------------------------------------------------------------------------------------------------------------------------


#---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
# CLEAN THE TEAM GAME DATA  (merge the two rows for each game together with result metrics)
# add home_team_won column (boolean)
team_game_data$home_team_won <- ifelse(team_game_data$off_home == 1 & team_game_data$off_win == 1, TRUE, FALSE)

# Split the data into two separate DataFrames for offense and defense and merge them so all the data is on one row per game
merged_df <- team_game_data %>%
  filter(off_win == 1) %>%
  rename_with(~paste0(., '_a'), -nbagameid) %>%
  inner_join(
    team_game_data %>% filter(def_win == 1) %>% rename_with(~paste0(., '_b'), -nbagameid),
    by = 'nbagameid'
  )

# Specify the home and away columns
home_away_df <- merged_df %>%
  mutate(
    home_team = ifelse(off_home_a == 1, off_team_a, def_team_a),
    away_team = ifelse(off_home_a == 0, off_team_a, def_team_a),
    home_points = ifelse(off_home_a == 1, points_a, points_b),
    away_points = ifelse(off_home_a == 0, points_a, points_b),
    home_fg2made = ifelse(off_home_a == 1, fg2made_a, fg2made_b),
    away_fg2made = ifelse(off_home_a == 0, fg2made_a, fg2made_b),
    home_fg2attempted = ifelse(off_home_a == 1, fg2attempted_a, fg2attempted_b),
    away_fg2attempted = ifelse(off_home_a == 0, fg2attempted_a, fg2attempted_b),
    home_fg3made = ifelse(off_home_a == 1, fg3made_a, fg3made_b),
    away_fg3made = ifelse(off_home_a == 0, fg3made_a, fg3made_b),
    home_fg3attempted = ifelse(off_home_a == 1, fg3attempted_a, fg3attempted_b),
    away_fg3attempted = ifelse(off_home_a == 0, fg3attempted_a, fg3attempted_b),
    home_fgmade = ifelse(off_home_a == 1, fgmade_a, fgmade_b),
    away_fgmade = ifelse(off_home_a == 0, fgmade_a, fgmade_b),
    home_fgattempted = ifelse(off_home_a == 1, fgattempted_a, fgattempted_b),
    away_fgattempted = ifelse(off_home_a == 0, fgattempted_a, fgattempted_b),
    home_shotattempts = ifelse(off_home_a == 1, shotattempts_a, shotattempts_b),
    away_shotattempts = ifelse(off_home_a == 0, shotattempts_a, shotattempts_b),
    home_shotattemptpoints = ifelse(off_home_a == 1, shotattemptpoints_a, shotattemptpoints_b),
    away_shotattemptpoints = ifelse(off_home_a == 0, shotattemptpoints_a, shotattemptpoints_b),
    home_possessions = ifelse(off_home_a == 1, possessions_a, possessions_b),
    away_possessions = ifelse(off_home_a == 0, possessions_a, possessions_b),
    home_defpos = ifelse(off_home_a == 0, possessions_a, possessions_b),
    away_defpos = ifelse(off_home_a == 1, possessions_a, possessions_b),
    home_assists = ifelse(off_home_a == 1, assists_a, assists_b),
    away_assists = ifelse(off_home_a == 0, assists_a, assists_b),
    home_orebs = ifelse(off_home_a == 1, reboffensive_a, reboffensive_b),
    away_orebs = ifelse(off_home_a == 0, reboffensive_a, reboffensive_b),
    home_drebs = ifelse(off_home_a == 1, rebdefensive_a, rebdefensive_b),
    away_drebs = ifelse(off_home_a == 0, rebdefensive_a, rebdefensive_b),
    home_rbs = ifelse(off_home_a == 1, reboundchance_a, reboundchance_b),
    away_rbs = ifelse(off_home_a == 0, reboundchance_a, reboundchance_b),
    home_blocks = ifelse(off_home_a == 1, blocksagainst_a, blocksagainst_b),
    away_blocks = ifelse(off_home_a == 0, blocksagainst_a, blocksagainst_b),
    home_steals = ifelse(off_home_a == 1, stealsagainst_a, stealsagainst_b),
    away_steals = ifelse(off_home_a == 0, stealsagainst_a, stealsagainst_b),
    home_turnovers = ifelse(off_home_a == 1, turnovers_a, turnovers_b),
    away_turnovers = ifelse(off_home_a == 0, turnovers_a, turnovers_b),
    home_ftmade = ifelse(off_home_a == 1, ftmade_a, ftmade_b),
    away_ftmade = ifelse(off_home_a == 0, ftmade_a, ftmade_b),
    home_ftattempted = ifelse(off_home_a == 1, ftattempted_a, ftattempted_b),
    away_ftattempted = ifelse(off_home_a == 0, ftattempted_a, ftattempted_b)
  ) %>%
  rename(
    gametype = gametype_a,  # Rename const columns
    season = season_a,
    date = gamedate_a,
    home_team_won = home_team_won_a
  )  %>%
  select(-matches("_a$"), -matches("_b$"))    # remove unused columns post-merge

# Create calculations for advanced stats
home_away_df <- home_away_df %>%
  mutate(home_ppa =  home_shotattemptpoints / home_shotattempts,
         away_ppa =  away_shotattemptpoints / away_shotattempts,
         home_OREB_perc = home_orebs / (home_orebs + away_orebs), 
         home_DREB_perc = home_drebs / (home_drebs + away_drebs), 
         away_OREB_perc = away_orebs / (home_orebs + away_orebs), 
         away_DREB_perc = away_drebs / (home_drebs + away_drebs), 
         home_TO_perc = home_turnovers / (home_shotattempts + home_turnovers),
         away_TO_perc = away_turnovers / (away_shotattempts + away_turnovers),
         home_STL_perc = home_steals / home_defpos,
         away_STL_perc = away_steals / away_defpos,
         home_BLK_perc = home_blocks / away_fg2attempted,
         away_BLK_perc = away_blocks / home_fg2attempted,
         home_oRTG = home_points / (home_possessions / 100),
         away_oRTG = away_points / (away_possessions / 100),
         home_dRTG = away_points / (away_possessions / 100),
         away_dRTG = home_points / (home_possessions / 100),
         home_NET_RTG = home_oRTG - home_dRTG,
         away_NET_RTG = away_oRTG - away_dRTG,
         home_eFG = (home_fgmade + 0.5 * home_fg3made) / home_fgattempted,
         away_eFG = (away_fgmade + 0.5 * away_fg3made) / away_fgattempted
  ) %>%
  select(nbagameid, gametype, date, season, home_team, away_team, home_team_won, home_points, away_points, home_fgmade, home_fg3made, home_fgattempted, away_fgmade, away_fg3made, away_fgattempted,
         home_assists,away_assists, home_orebs, away_orebs, home_drebs, away_drebs, home_steals, away_steals, home_blocks, away_blocks, home_turnovers, away_turnovers, home_eFG, away_eFG, home_NET_RTG, away_NET_RTG, 
         home_ppa, away_ppa, home_TO_perc, away_TO_perc, home_STL_perc, away_STL_perc, home_BLK_perc, away_BLK_perc) %>%
  arrange(nbagameid)
#---------------------------------------------------------------------------------------------------------------------------------------------------------------------------


#---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
# CLEAN THE PLAYER GAME DATA  (calculate advanced stats, design simulated/expected metrics, aggregate to game level)
# Add advanced stats columns based off data_dictionary
player_game_data <- player_game_data %>%
  mutate(ppa =  shotattemptpoints / shotattempts,
         USG =  (shotattempts + turnovers) / (teamshotattempts + teamturnovers),
         eFG = (fgmade + 0.5 * fg3made) / fgattempted,
         assist_perc = assists / (teamfgmade - (fg3made + fg2made)) , 
         oreb_perc = reboffensive / offensivereboundchances, 
         dreb_perc = rebdefensive / defensivereboundchances, 
         turnover_per = turnovers / (shotattempts + turnovers),
         steal_perc = steals / defensivepossessions,
         bklock_perc = blocks/opponentteamfg2attempted
  )

# Calculated simulated scoring and expected points scored metrics
simulated_metrics <- player_game_data %>%
  filter(seconds > 0) %>%
  group_by(season, nbagameid, team, player_name) %>%
  mutate(
    minutes = seconds / 60,
    FG_perc = ifelse(is.na(fgattempted) | fgattempted == 0, 0, fgmade / fgattempted),
    FG2_perc = ifelse(is.na(fg2attempted) | fg2attempted == 0, 0, fg2made / fg2attempted),
    FG3_perc = ifelse(is.na(fg3attempted) | fg3attempted == 0, 0, fg3made / fg3attempted),
    FT_perc = ifelse(is.na(ftattempted) | ftattempted == 0, 0, ftmade / ftattempted),
    simulated_fg_attempts = minutes * USG * fgattempted * .1,  #scaling vaiance .1
    simulated_fg2_attempts = simulated_fg_attempts * FG2_perc,
    simulated_fg3_attempts = ifelse(is.na(FG3_perc) | simulated_fg_attempts == 0 & FG3_perc == 0, 0, simulated_fg_attempts * FG3_perc),
    simulated_ft_attempts = minutes * USG * ftattempted * .1,
    simulated_fg2_points = (FG2_perc * simulated_fg2_attempts) * 2,
    simulated_fg3_points = (FG3_perc * simulated_fg3_attempts)  * 3,
    simulated_ft_points = (FT_perc * simulated_ft_attempts) * 1,
    simulated_eFG = (simulated_fg2_points + 0.5 * simulated_fg3_points) / simulated_fg_attempts,
    simulated_fg_points = simulated_fg_attempts * simulated_eFG * 2,
    expected_points_eFG = ifelse(is.na(simulated_ft_points) | simulated_ft_points == 0, 0, simulated_fg_points + simulated_ft_points),
    expected_points_usg = simulated_fg2_points + simulated_fg3_points + simulated_ft_points,
    expected_points = ifelse(is.nan((simulated_fg_points + simulated_ft_points + expected_points_usg) / 2), 0, (simulated_fg_points + simulated_ft_points + expected_points_usg) / 2)  # Blend eFG and usg, handle NaN
  ) %>%
  ungroup() %>%
  group_by(nbagameid, season, team) %>%
  slice_head(n = 15) %>%  # Select the top 15 players for each team
  mutate(
    starter = if_else(row_number() <= 5, 1, 0)
  ) %>%
  ungroup() %>%
  select(nbagameid, season, player_name, team, starter, seconds, minutes,
         simulated_eFG, simulated_fg_points, simulated_fg_attempts, simulated_fg2_attempts, simulated_fg3_attempts, simulated_ft_attempts, simulated_fg2_points, simulated_fg3_points, simulated_ft_points, expected_points_eFG, expected_points_usg, expected_points) %>%
  arrange(nbagameid, team, desc(minutes))

# Aggregate to game level
game_level_metrics <- simulated_metrics %>%
  group_by(nbagameid, season, team) %>%
  summarize(
    simulated_eFG = mean(simulated_eFG, na.rm = TRUE),
    simulated_fg_attempts = sum(simulated_fg_attempts, na.rm = TRUE),
    simulated_fg2_attempts = sum(simulated_fg2_attempts, na.rm = TRUE),
    simulated_fg3_attempts = sum(simulated_fg3_attempts, na.rm = TRUE),
    simulated_ft_attempts = sum(simulated_ft_attempts, na.rm = TRUE),
    simulated_fg_points = sum(simulated_fg_points, na.rm = TRUE),
    simulated_fg2_points = sum(simulated_fg2_points, na.rm = TRUE),
    simulated_fg3_points = sum(simulated_fg3_points, na.rm = TRUE),
    simulated_ft_points = sum(simulated_ft_points, na.rm = TRUE),
    expected_points_eFG = sum(expected_points_eFG, na.rm = TRUE),
    expected_points_usg = sum(expected_points_usg, na.rm = TRUE),
    expected_points = sum(expected_points, na.rm = TRUE),    .groups = 'drop'
  )
#---------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Part 1 – Data Cleaning

In this section, you’re going to work to answer questions using data from both team and player stats. All provided stats are on the game level.

Question 1

QUESTION: What was the Warriors’ Team offensive and defensive eFG% in the 2015-16 regular season? Remember that this is in the data as the 2015 season.

# Filter the data for the 2015 regular season
warriors_2015_eFG_avg <- home_away_df %>%
  filter(season == 2015, gametype == 2) %>%
  mutate(             
    gsw_off_eFG = ifelse(home_team == "GSW", home_eFG, ifelse(away_team == "GSW", away_eFG, NA)),     # Define off and def GSW eFG 
    gsw_def_eFG = ifelse(home_team == "GSW", away_eFG, ifelse(away_team == "GSW", home_eFG, NA))
  ) %>%
  summarize(
    warriors_2015_off_eFG_avg = mean(gsw_off_eFG, na.rm = TRUE),   # Get the average eFG for both off and def
    warriors_2015_def_eFG_avg = mean(gsw_def_eFG, na.rm = TRUE)
  )

# Visualize Results --------------------------------------------------------------------------------------------------------------------------------------------------------
barplot_data <- bind_rows(
  data.frame(Aspect = "Offensive eFG%", Percentage = warriors_2015_eFG_avg$warriors_2015_off_eFG_avg * 100),
  data.frame(Aspect = "Defensive eFG%", Percentage = warriors_2015_eFG_avg$warriors_2015_def_eFG_avg * 100)
)

# Create grouped bar chart with percentages
bar_chart <- ggplot(barplot_data, aes(x = Aspect, y = Percentage, fill = Aspect)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.7) +
  geom_text(aes(label = paste0(round(Percentage, 1), "%")), 
            position = position_dodge(width = 0.7), 
            vjust = -0.5, 
            size = 4) +  # Add percentages above bars with slight offset
  labs(title = "2015 Warriors eFG% Comparison",
       x = "Aspect",
       y = "Percentage") +
  theme_minimal() +
  theme(legend.position = "none")  # Remove legend since it's redundant

# Print bar chart
print(bar_chart)

ANSWER 1:

Offensive: 56.5% eFG
Defensive: 47.9% eFG

Question 2

QUESTION: What percent of the time does the team with the higher eFG% in a given game win that game? Use games from the 2014-2023 regular seasons. If the two teams have an exactly equal eFG%, remove that game from the calculation.

# Create a column for determining if the home team has a higher eFG 
merged_df <- home_away_df %>%
  mutate(higher_home_eFG = home_eFG > away_eFG)

# Create the "higher_efg_won" column based on the conditions (if home team won and had a higher efg OR is away team won and away team had higher efg, then 1, else 0)
merged_df$higher_efg_won <- ifelse((merged_df$higher_home_eFG & merged_df$home_team_won) | 
                                     (!merged_df$higher_home_eFG & !merged_df$home_team_won), 1, 0)

# Filter the dataframe for the desired season range and game type
filtered_df <- merged_df %>%
  filter(season >= 2014 & season <= 2023, gametype == 2) %>%
  filter(home_eFG != away_eFG)  # Remove rows where home_eFG is equal to away_eFG

# Calculate the percentage of games where the team with higher eFG% won
percentage_higher_efg_won <- round(mean(filtered_df$higher_efg_won) * 100, 1)

# Visualize the results ----------------------------------------------------------------------------------------------------------------------
# Create a summary data frame indicating if the team with higher eFG% won or lost
summary_data <- filtered_df %>%
  mutate(result = ifelse(higher_efg_won == 1, "Won", "Lost")) %>%
  count(result) %>%
  mutate(percentage = round(n / sum(n) * 100, 1),
         label = paste0(result, "\n", percentage, "%"))

# Create the pie chart
ggplot(summary_data, aes(x = "", y = n, fill = result)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") +
  geom_text(aes(label = label), position = position_stack(vjust = 0.5)) +
  labs(title = "Games Won/Lost by Team with Higher eFG%") +
  theme_minimal() +
  theme_void() +
  theme(legend.position = "none")

ANSWER 2:

81.6%

Question 3

QUESTION: What percent of the time does the team with more offensive rebounds in a given game win that game? Use games from the 2014-2023 regular seasons. If the two teams have an exactly equal number of offensive rebounds, remove that game from the calculation.

# Filter the dataframe for the desired season range and game type
filtered_df <- home_away_df %>%
  filter(season >= 2014 & season <= 2023, gametype == 2) %>%
  filter(home_orebs != away_orebs)  # Remove rows where home_oreb is equal to away_oreb

# Test if any equal oreb numbers still persist
#has_same_oreb <- any(filtered_df$home_oreb == filtered_df$away_oreb)

# Calculate "home_higher_orebs" and "most_oreb_won" columns
filtered_oreb <- filtered_df %>%
  mutate(home_higher_orebs = home_orebs > away_orebs,
         most_oreb_won = (home_team_won == home_higher_orebs))

# Calculate the percentage
percentage_most_oreb_won <- mean(filtered_oreb$most_oreb_won) * 100

# Visualize Results ---------------------------------------------------------------------------------------------------------
# Count the number of wins and losses
wins_losses_count <- filtered_oreb %>%
  group_by(most_oreb_won) %>%
  summarise(count = n())

# Modify labels for readability
wins_losses_count$most_oreb_won <- factor(wins_losses_count$most_oreb_won, labels = c("Lost", "Won"))

# Reorder the levels of the factor variable
wins_losses_count$most_oreb_won <- factor(wins_losses_count$most_oreb_won, levels = c("Won", "Lost"))

# Calculate total number of games (sum of wins and losses)
total_games <- sum(wins_losses_count$count)

# Calculate the percentage of wins
wins_losses_count <- wins_losses_count %>%
  mutate(percentage = count / total_games * 100)

# Plot the bar chart
bar_chart <- ggplot(wins_losses_count, aes(x = most_oreb_won, y = count, fill = most_oreb_won)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = paste0(round(percentage, 1), "%")), vjust = -0.5, color = "black", size = 3) +  # Add text labels for percentages
  labs(title = "Win/Loss Record for Teams with Most Offensive Rebounds",
       x = "Result",
       y = "Games") +
  scale_fill_manual(values = c("green", "red"), labels = c("Won", "Lost")) +  # Customizing fill colors and labels
  theme_minimal() +
  theme(legend.position = "none")  # Removing the legend since it's redundant

# Print the bar chart
print(bar_chart)

ANSWER 3:

46.2%

Question 4

QUESTION: Do you have any theories as to why the answer to question 3 is lower than the answer to question 2? Try to be clear and concise with your answer.

ANSWER 4:

The difference in the percentages between question 2 (teams with a higher eFG% winning the game) and question 3 (teams with more offensive rebounds winning the game) could be attributed to the varying impact of these two statistics on the outcome of a basketball game.

  1. Efficiency vs. Quantity: Effective Field Goal Percentage (eFG%) measures the efficiency of a team’s shooting by accounting for the extra value of three-point shots. Teams with a higher eFG% are generally more efficient at scoring points, which often correlates with winning games. Conversely, offensive rebounds (OREB) represent a team’s ability to secure missed shots and gain additional possessions. While offensive rebounds can lead to second-chance opportunities and extended possessions, they do not necessarily guarantee higher efficiency in scoring.

  2. Opponent Adjustments: Teams facing opponents with a higher eFG% might prioritize defensive strategies to limit their opponent’s shooting efficiency, thus potentially affecting the outcome. In contrast, rebounding battles are more influenced by physicality, positioning, and hustle, which may vary from game to game regardless of opponent strengths.

  3. Game Dynamics: The influence of offensive rebounds on game outcomes can fluctuate depending on various factors such as team playing styles, matchups, pace of play, and game situations. In some games, offensive rebounding dominance might be a decisive factor, while in others, efficient shooting might play a more significant role.

  4. Statistical Dependency: It’s also essential to consider the statistical dependency between eFG% and offensive rebounds. While there may be some correlation between the two metrics, they represent different facets of the game and may not always align in their impact on game outcomes.

In summary, the lower percentage for teams with more offensive rebounds winning the game compared to teams with a higher eFG% winning the game could be due to the nuanced interplay of shooting efficiency, rebounding dynamics, opponent strategies, and game-specific factors influencing basketball outcomes.

Question 5

QUESTION: Look at players who played at least 25% of their possible games in a season and scored at least 25 points per game played. Of those player-seasons, what percent of games were they available for on average? Use games from the 2014-2023 regular seasons.

For example:

  • Ja Morant does not count in the 2023-24 season, as he played just 9 out of 82 games this year, even though he scored 25.1 points per game.
  • Chet Holmgren does not count in the 2023-24 season, as he played all 82 games this year but scored 16.5 points per game.
  • LeBron James does count in the 2023-24 season, as he played 71 games and scored 25.7 points per game.
# Filter, clean, and summarize data
all_players_summary <- player_game_data %>%
  filter(season >= 2014 & season <= 2023, gametype == 2) %>%
  group_by(player_name, season) %>%
  summarize(
    games_played = sum(starter == 1 | (starter == 0 & missed == 0)),  # games played = started or didn't start but didn't miss
    games_missed = sum(missed == 1),
    total_points = sum(points, na.rm = TRUE),
    total_fgm = sum(fgmade, na.rm = TRUE),
    total_fga = sum(fgattempted, na.rm = TRUE),
    ppg = total_points / games_played,              # calculate PPG
    availability = (games_played / 82) * 100
  ) %>%
  mutate(
    player_25p = games_played / 82 > 0.25,          # player_25p = players playing at least 25% of games
    ppg_o25 = ppg >= 25.0                           # ppg_o25 = players with 25 or more ppg
  ) %>%
  filter(player_25p & ppg_o25)
## `summarise()` has grouped output by 'player_name'. You can override using the
## `.groups` argument.
# Calculate the mean availability (ANSWER)
mean_availability <- mean(all_players_summary$availability)

# Calculate average availability per player
average_availability <- all_players_summary %>%
  group_by(player_name) %>%
  summarize(
    average_availability = mean(availability)
  )

# Calculate the total average availability
total_average_availability <- mean(average_availability$average_availability)

# Visualize Results ----------------------------------------------------------------------------------------------------------------------------------------------
# Plot the bar chart
ggplot(average_availability, aes(x = player_name, y = average_availability)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(title = "Average Availability per Qualifying Player",
       x = "Players Who Played in 25% of Games & Scored +25.0 ppg",
       y = "Average Availability (%)") +
  geom_hline(yintercept = total_average_availability, color = "red", linetype = "dashed", size = 1) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

ANSWER 5:

80.9% of games

Question 6

QUESTION: What % of playoff series are won by the team with home court advantage? Give your answer by round. Use playoffs series from the 2014-2022 seasons. Remember that the 2023 playoffs took place during the 2022 season (i.e. 2022-23 season).

# Filter for post-season games between 2014 and 2022 and select necessary columns
post_season_2014_2022 <- home_away_df %>%
  filter(season >= 2014 & season <= 2022, gametype == 4) %>%
  mutate(
    home_win = ifelse(home_team_won == TRUE, 1, 0),
    winningteam = ifelse(home_team_won == TRUE, home_team, away_team)
  ) %>%
  select(
    nbagameid, date, season, home_team, away_team, home_win, winningteam,
    home_points, away_points
  ) 


# Function to assign rounds (split the game ID to get the 3rd to least digit, which represents playoff round)
assign_round <- function(nbagameid) {
  # Extract the last three digits
  last_three_digits <- nbagameid %% 1000
  
  # Get the first digit of the last three digits
  round_digit <- last_three_digits %/% 100
  
  # Determine the round based on the first digit
  if (round_digit == 1) {
    return(1)  # Round 1
  } else if (round_digit == 2) {
    return(2)  # Round 2
  } else if (round_digit == 3) {
    return(3)  # Round 3 (Conference Finals)
  } else if (round_digit == 4) {
    return(4)  # Round 4 (Finals)
  } else {
    return(NA)  # Unknown round
  }
}

# Add round column to the dataset
post_season_2014_2022 <- post_season_2014_2022 %>%
  mutate(round = sapply(nbagameid, assign_round))


# Initialize series_num as 1
post_season_2014_2022$series_num <- 1

# Loop through each row starting from the second row
for (i in 2:nrow(post_season_2014_2022)) {
  # Check if the current row belongs to the same series as the previous row
  if ((post_season_2014_2022$home_team[i] == post_season_2014_2022$home_team[i - 1] &&
       post_season_2014_2022$away_team[i] == post_season_2014_2022$away_team[i - 1]) ||
      (post_season_2014_2022$home_team[i] == post_season_2014_2022$away_team[i - 1] &&
       post_season_2014_2022$away_team[i] == post_season_2014_2022$home_team[i - 1])) {
    # If yes, assign the same series_num as the previous row
    post_season_2014_2022$series_num[i] <- post_season_2014_2022$series_num[i - 1]
  } else {
    # If no, increment the series_num
    post_season_2014_2022$series_num[i] <- post_season_2014_2022$series_num[i - 1] + 1
  }
}

# Count the number of wins for each team in each series
series_wins <- post_season_2014_2022 %>%
  group_by(series_num, winningteam) %>%
  summarize(num_wins = sum(ifelse(winningteam == home_team, home_win, 1 - home_win))) %>%
  ungroup()
## `summarise()` has grouped output by 'series_num'. You can override using the
## `.groups` argument.
# Find the series winner for each series
series_winners <- series_wins %>%
  group_by(series_num) %>%
  slice(which.max(num_wins)) %>%
  ungroup()

# Merge series_winners with post_season_2014_2022 based on series_num
post_season_2014_2022 <- merge(post_season_2014_2022, series_winners, by = "series_num", all.x = TRUE)

# Create series_winner column and fill it with the winningteam values
post_season_2014_2022 <- post_season_2014_2022 %>%
  mutate(series_winner = ifelse(!is.na(num_wins), winningteam.y, NA)) %>%
  select(-num_wins, -winningteam.y)

# Group by series_num and select relevant columns for the summary
summary_df <- post_season_2014_2022 %>%
  group_by(round, series_num) %>%
  summarize(
    nbagameid = first(nbagameid),
    date = first(date),
    season = first(season),
    homecourt_adv = first(home_team),
    opponent = first(away_team),
    series_winner = first(series_winner)
  )
## `summarise()` has grouped output by 'round'. You can override using the
## `.groups` argument.
# Sort the summary_df dataframe by season
summary_df <- summary_df %>%
  arrange(season)

# Calculate the percentage
percentage <- summary_df %>%
  mutate(
    winner_is_home_adv = as.integer(series_winner == homecourt_adv)
  ) %>%
  group_by(round) %>%
  summarize(
    percentage = mean(winner_is_home_adv) * 100
  )

# Visualize Results ----------------------------------------------------------------------------------------------------------------------
# Plot the bar chart with different colors for each bar and percentages at the top of each bar
ggplot(percentage, aes(x = factor(round), y = percentage, fill = factor(round))) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = paste0(round(percentage), "%")), 
            position = position_stack(vjust = 0.5), 
            size = 3, color = "white") +  # Add percentages to the top of each bar
  labs(title = "Homecourt Advantage Series Win % per Round",
       x = "Round",
       y = "Percentage of Wins (%)") +
  scale_x_discrete(labels = c("Round 1", "Round 2", "Round 3", "Round 4")) +  # Customize x-axis labels
  scale_fill_manual(values = c("#1f78b4", "#33a02c", "#e31a1c", "#ff7f00")) +  # Assign colors to bars
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 

ANSWER 6:

Round 1: 80.6%
Round 2: 58.3%
Conference Finals: 55.6%
Finals: 77.8%

Question 7

QUESTION: Among teams that had at least a +5.0 net rating in the regular season, what percent of them made the second round of the playoffs the following year? Among those teams, what percent of their top 5 total minutes played players (regular season) in the +5.0 net rating season played in that 2nd round playoffs series? Use the 2014-2021 regular seasons to determine the +5 teams and the 2015-2022 seasons of playoffs data.

For example, the Thunder had a better than +5 net rating in the 2023 season. If we make the 2nd round of the playoffs next season (2024-25), we would qualify for this question. Our top 5 minutes played players this season were Shai Gilgeous-Alexander, Chet Holmgren, Luguentz Dort, Jalen Williams, and Josh Giddey. If three of them play in a hypothetical 2nd round series next season, it would count as 3/5 for this question.

Hint: The definition for net rating is in the data dictionary.

# QUALIFYING TEAMS SECTION: teams with a +5.0 NRG and made the second round of the playoffs the following season

# ORTG = points/(possessions/100)
# DRTG = points allowed/(defensive possessions/100) [Same as ORTG calculation but for the defensive team]
# NET RTG = ORTG - DRTG

# Calculate ORTG for each team
team_game_data <- team_game_data %>%
  mutate(off_rating = points / (possessions / 100))

# Calculate DRTG for each team
team_game_data <- team_game_data %>%
  mutate(def_rating = points / (possessions / 100))

# Filter for REG SZN 2014-2021
team_off <- team_game_data %>%
  #filter(off_team  == "PHI") %>%
  filter(season >= 2014 & season <= 2021) %>%
  #filter(season == 2014) %>%
  # 2 = reg szn
  filter(gametype == 2) %>%                         
  select(nbagameid, season, off_team, points, possessions, off_rating)


# Filter for REG SZN 2014-2021
team_def <- team_game_data %>%
  #filter(def_team  == "PHI") %>%
  filter(season >= 2014 & season <= 2021) %>%
  #filter(season == 2014) %>%
  # 2 = reg szn
  filter(gametype == 2) %>%                         
  select(nbagameid, season, def_team, points, possessions, def_rating)

team_off <- team_off %>%
  rename(team = off_team)

team_def <- team_def %>%
  rename(team = def_team)

# Combine the dataframes on nbagameid, season, and team
combined_df <- team_off %>%
  inner_join(team_def, by = c("nbagameid", "season", "team"))

# Calculate the net rating for each game
combined_df <- combined_df %>%
  mutate(net_rating = off_rating - def_rating)


# GET TEAMS WITH A NET RATING OF +5.0 ----------------------------------------------------------------------------------------------------
# Compute the average net rating for each team for the season
team_net_ratings <- combined_df %>%
  group_by(season, team) %>%
  summarise(avg_net_rating = mean(net_rating, na.rm = TRUE))
## `summarise()` has grouped output by 'season'. You can override using the
## `.groups` argument.
# Print the result
#print(team_net_ratings, n = 2410)

# Filter for net ratings >= 5.0
plus5_net_ratings <- team_net_ratings %>%
  filter(avg_net_rating >= 5.0) %>%
  arrange(season, desc(avg_net_rating))
# ----------------------------------------------------------------------------------------------------------------------------------------


# GET UNIQUE TEAMS THAT MADE THE SECOND ROUND FOR GIVEN SEASON RANGE ---------------------------------------------------------------------
# Filter for POST SZN 2015-2022 and round 2 games
post_season_2015_2022_round_2 <- team_game_data %>%
  filter(season >= 2015 & season <= 2022) %>%
  #filter(season == 2015) %>%
  filter(gametype == 4) %>%  # 4 = playoffs
  filter(substr(nbagameid, nchar(nbagameid) - 2, nchar(nbagameid) - 2) == "2") %>% #2nd round filter
  select(nbagameid, season, off_team)

# make sure only second round games are present
#print(post_season_2015_2022_round_2)

# Filter unique off_teams
unique_off_teams <- post_season_2015_2022_round_2 %>%
  distinct(season, off_team) %>%
  arrange(season)
# ----------------------------------------------------------------------------------------------------------------------------------------


# Merge the dataframes -------------------------------------------------------------------------------------------------------------------
merged_df <- merge(unique_off_teams, plus5_net_ratings, by.x = c("off_team", "season"), by.y = c("team", "season"))

# Iterate through each row of the first dataframe
result <- map2_df(
  plus5_net_ratings$team,  # Iterate through each team
  plus5_net_ratings$season,  # Iterate through each season
  ~ {
    team <- .x
    current_season <- .y
    
    # Check if the team is found in the second dataframe next to the current season + 1
    next_season <- current_season + 1
    is_found <- unique_off_teams %>%
      filter(season == next_season, off_team == team) %>%
      nrow() > 0  # Check if any rows are found
    
    # Return the result
    tibble(
      team = team,
      current_szn = current_season,
      made_2nd_round_following_szn = is_found,
      next_szn = next_season
    )
  }
)
# ----------------------------------------------------------------------------------------------------------------------------------------
  

# GET THE PERCENTAGE OF TEAMS THAT MADE THE SECOND ROUND W/ +5 NRG PAST SZN  -------------------------------------------------------------
# Filter the team_net_ratings dataframe to get the average net rating for each team and season
team_net_ratings_filtered <- plus5_net_ratings %>%
  rename(avg_net_rating = avg_net_rating) %>%
  select(team, season, avg_net_rating)

# Merge the filtered team_net_ratings dataframe with the result dataframe
result_with_net_rating <- result %>%
  left_join(team_net_ratings_filtered, by = c("team" = "team", "current_szn" = "season"))

# Move the net rating column next to the current_szn column
result_with_net_rating <- result_with_net_rating %>%
  select(team, current_szn, avg_net_rating, everything())

# Count the number of TRUE values in the made_2nd_round_following_szn column
made_2nd_rd_count <- sum(result_with_net_rating$made_2nd_round_following_szn, na.rm = TRUE)

# Count the total number of rows in the result_with_net_rating dataframe
total_count <- nrow(result_with_net_rating)

# Calculate the percentage of teams with a +5 NRG that made the second round the following season (ANSWER)
percentage_true <- (made_2nd_rd_count / total_count) * 100
# ----------------------------------------------------------------------------------------------------------------------------------------


# Visualize Results ----------------------------------------------------------------------------------------------------------------------
# Create the plot using ggplot2
p <- ggplot(result_with_net_rating, aes(x = factor(current_szn), y = avg_net_rating, color = team)) +
  geom_point(aes(shape = as.factor(made_2nd_round_following_szn)), size = 4, stroke = 1.5) +
  scale_shape_manual(values = c("TRUE" = 21, "FALSE" = 4)) + # 21 is circle, 4 is X
  labs(title = "Teams with +5.0 NRG and Making Second Round of Playoffs",
       x = "Season",
       y = "Average Net Rating",
       shape = "Made 2nd Round Following Season",
       color = "Team") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Convert to an interactive plot using plotly
interactive_plot <- ggplotly(p)

# Print the interactive plot
interactive_plot
# GET TOP 5 SECONDS PLAYED PLAYERS FOR EVERY TEAM ---------------------------------------------------------------------------------------
# Get a list of qualifying teams for this question
result_with_net_rating <- result_with_net_rating %>% filter(made_2nd_round_following_szn == TRUE)

# Select only the desired columns
qualifying_teams <- result_with_net_rating %>%
  select(team, current_szn)

# Filter for the desired seasons and regular season games
filtered_data <- player_game_data %>%
  filter(season >= 2014 & season <= 2023) %>%
  filter(gametype == 2) %>%
  select(nbagameid, season, team, player_name, seconds)

# Summarize total seconds played by each player for each team and each season
total_seconds_played <- filtered_data %>%
  group_by(season, team, player_name) %>%
  summarise(total_seconds = sum(seconds, na.rm = TRUE)) %>%
  arrange(season, team, desc(total_seconds))
## `summarise()` has grouped output by 'season', 'team'. You can override using
## the `.groups` argument.
# Take the top 5 players for each team in each season
top_5_players_per_team_per_season <- total_seconds_played %>%
  group_by(season, team) %>%
  slice_max(order_by = total_seconds, n = 5) %>%
  ungroup()
# ----------------------------------------------------------------------------------------------------------------------------------------


# COMBINE THE DATASETS -------------------------------------------------------------------------------------------------------------------
result_with_net_rating <- result_with_net_rating %>% filter(made_2nd_round_following_szn == TRUE)

# Select only the desired columns
qualifying_teams <- result_with_net_rating %>%
  select(team, current_szn)

# Perform a left join to merge the two dataframes
merged_data <- qualifying_teams %>%
  left_join(top_5_players_per_team_per_season, by = c("team" = "team", "current_szn" = "season"))

# Print the result
#print(merged_data, n = 50)

# Select only the desired columns
players_to_check <- merged_data %>%
  select(team, current_szn, player_name)
# ----------------------------------------------------------------------------------------------------------------------------------------


# GET PLAYERS WHO PLAYED IN THE 2ND ROUND OF THE PLAYOFFS FROM 2014-22 -------------------------------------------------------------------
# Filter for POST SZN 2015-2022 and round 2 games
post_season_player_2015_2022_round_2 <- player_game_data %>%
  filter(season >= 2015 & season <= 2022) %>%
  filter(gametype == 4) %>%  # 4 = playoffs
  filter(substr(nbagameid, nchar(nbagameid) - 2, nchar(nbagameid) - 2) == "2") %>% # 2nd round filter
  filter(seconds > 0) %>% # Filter for seconds greater than 0
  group_by(season, team, player_name) %>% # Group by season, team, and player_name
  summarise(total_seconds_played = sum(seconds)) # Sum the seconds played for each player
## `summarise()` has grouped output by 'season', 'team'. You can override using
## the `.groups` argument.
#post_season_player_2015_2022_round_2

played_in_2nd_rnd <- post_season_player_2015_2022_round_2 %>%
  select(team, season, player_name)
# ----------------------------------------------------------------------------------------------------------------------------------------


# CALCULATE % OF PLAYERS ON +5 NRG TEAMS THAT PLAYED IN THE 2ND ROUND THE FOLLOWING SEASON -----------------------------------------------
# Create the 'next_season' column in players_to_check
players_to_check <- players_to_check %>%
  mutate(next_season = current_szn + 1)

# Merge the dataframes on player_name and next_season = season
merged_df <- players_to_check %>%
  inner_join(post_season_player_2015_2022_round_2, by = c("player_name" = "player_name", "next_season" = "season"))

# Select relevant columns
result_df <- merged_df %>%
  select(team.x, current_szn, player_name, next_season, team.y) %>%
  rename(team = team.x, season = current_szn, next_szn = next_season, current_team = team.y)

# Calculate the number of rows in each dataframe
num_rows_players_to_check <- nrow(players_to_check)
num_rows_result_df <- nrow(result_df)

player_percentage <- num_rows_result_df / num_rows_players_to_check * 100

DNP_nxt_szn = num_rows_players_to_check - num_rows_result_df
DNP_percentage = DNP_nxt_szn / num_rows_players_to_check * 100
# ----------------------------------------------------------------------------------------------------------------------------------------


# Visualize ----------------------------------------------------------------------------------------------------------------------
# Create a vector of percentages
percentages <- c(player_percentage, DNP_percentage)
labels <- c("Played In 2nd Round Following Season", "Did Not Play in 2nd Round Following Season")
colors <- c("skyblue", "lightgreen")

# Create the pie chart
pie(percentages, labels = paste(round(percentages, 1), "%", sep = ""), col = colors, main = "Players on +5.0 NRG Teams")
legend("topright", legend = labels, fill = colors, bty = "n")

# REULTS
print(paste("Teams with a +5.0 NRG or higher made the second round the following season", percentage_true, "% of the time."))
## [1] "Teams with a +5.0 NRG or higher made the second round the following season 63.6363636363636 % of the time."
print(paste("Players on teams with a +5.0 NRG or higher played in the second round of the playoffs the following season", player_percentage, "% of the time."))
## [1] "Players on teams with a +5.0 NRG or higher played in the second round of the playoffs the following season 82.8571428571429 % of the time."

ANSWER 7:

Percent of +5.0 net rating teams making the 2nd round next year: 63.6%
Percent of top 5 minutes played players who played in those 2nd round series: 82.9%

Part 2 – Playoffs Series Modeling

For this part, you will work to fit a model that predicts the winner and the number of games in a playoffs series between any given two teams.

This is an intentionally open ended question, and there are multiple approaches you could take. Here are a few notes and specifications:

  1. Your final output must include the probability of each team winning the series. For example: “Team A has a 30% chance to win and team B has a 70% chance.” instead of “Team B will win.” You must also predict the number of games in the series. This can be probabilistic or a point estimate.

  2. You may use any data provided in this project, but please do not bring in any external sources of data.

  3. You can only use data available prior to the start of the series. For example, you can’t use a team’s stats from the 2016-17 season to predict a playoffs series from the 2015-16 season.

  4. The best models are explainable and lead to actionable insights around team and roster construction. We’re more interested in your thought process and critical thinking than we are in specific modeling techniques. Using smart features is more important than using fancy mathematical machinery.

  5. Include, as part of your answer:

  • A brief written overview of how your model works, targeted towards a decision maker in the front office without a strong statistical background.
  • What you view as the strengths and weaknesses of your model.
  • How you’d address the weaknesses if you had more time and/or more data.
  • Apply your model to the 2024 NBA playoffs (2023 season) and create a high quality visual (a table, a plot, or a plotly) showing the 16 teams’ (that made the first round) chances of advancing to each round.
# Merge game level player data to team game data ---------------------------------------------------------------------------------
# Rename columns in game_level_metrics for home and away teams
home_metrics <- game_level_metrics %>%
  rename_with(~ paste0("home_", .), -nbagameid, -team)

away_metrics <- game_level_metrics %>%
  rename_with(~ paste0("away_", .), -nbagameid, -team)

# Merge the data frames
merged_df <- home_away_df %>%
  left_join(home_metrics, by = c("nbagameid", "home_team")) %>%
  left_join(away_metrics, by = c("nbagameid", "away_team"))
# -------------------------------------------------------------------------------------------------------------------------------


# Create a new_game df that mocks a real game to predict the winner of ----------------------------------------------------------
# Calculate average metrics for each team (home and away)
team_metrics_home <- merged_df %>%
  group_by(home_team) %>%
  summarise(
    home_avg_points = mean(home_points, na.rm = TRUE),
    home_avg_fgmade = mean(home_fgmade, na.rm = TRUE),
    home_avg_fg3made = mean(home_fg3made, na.rm = TRUE),
    home_avg_fgattempted = mean(home_fgattempted, na.rm = TRUE),
    home_avg_assists = mean(home_assists, na.rm = TRUE),
    home_avg_orebs = mean(home_orebs, na.rm = TRUE),
    home_avg_drebs = mean(home_drebs, na.rm = TRUE),
    home_avg_steals = mean(home_steals, na.rm = TRUE),
    home_avg_blocks = mean(home_blocks, na.rm = TRUE),
    home_avg_turnovers = mean(home_turnovers, na.rm = TRUE),
    # Add other necessary metrics
    home_eFG = mean(home_eFG, na.rm = TRUE),
    home_NET_RTG = mean(home_NET_RTG, na.rm = TRUE),
    home_ppa = mean(home_ppa, na.rm = TRUE),
    home_TO_perc = mean(home_TO_perc, na.rm = TRUE),
    home_STL_perc = mean(home_STL_perc, na.rm = TRUE),
    home_BLK_perc = mean(home_BLK_perc, na.rm = TRUE),
    home_simulated_eFG = mean(home_simulated_eFG, na.rm = TRUE),
    home_simulated_fg_attempts = mean(home_simulated_fg_attempts, na.rm = TRUE),
    home_simulated_fg2_attempts = mean(home_simulated_fg2_attempts, na.rm = TRUE),
    home_simulated_fg3_attempts = mean(home_simulated_fg3_attempts, na.rm = TRUE),
    home_simulated_ft_attempts = mean(home_simulated_ft_attempts, na.rm = TRUE),
    home_simulated_fg_points = mean(home_simulated_fg_points, na.rm = TRUE),
    home_simulated_fg2_points = mean(home_simulated_fg2_points, na.rm = TRUE),
    home_simulated_fg3_points = mean(home_simulated_fg3_points, na.rm = TRUE),
    home_simulated_ft_points = mean(home_simulated_ft_points, na.rm = TRUE),
    home_expected_points_eFG = mean(home_expected_points_eFG, na.rm = TRUE),
    home_expected_points_usg = mean(home_expected_points_usg, na.rm = TRUE),
    home_expected_points = mean(home_expected_points, na.rm = TRUE)
  )

team_metrics_away <- merged_df %>%
  group_by(away_team) %>%
  summarise(
    away_avg_points = mean(away_points, na.rm = TRUE),
    away_avg_fgmade = mean(away_fgmade, na.rm = TRUE),
    away_avg_fg3made = mean(away_fg3made, na.rm = TRUE),
    away_avg_fgattempted = mean(away_fgattempted, na.rm = TRUE),
    away_avg_assists = mean(away_assists, na.rm = TRUE),
    away_avg_orebs = mean(away_orebs, na.rm = TRUE),
    away_avg_drebs = mean(away_drebs, na.rm = TRUE),
    away_avg_steals = mean(away_steals, na.rm = TRUE),
    away_avg_blocks = mean(away_blocks, na.rm = TRUE),
    away_avg_turnovers = mean(away_turnovers, na.rm = TRUE),
    # Add other necessary metrics
    away_eFG = mean(away_eFG, na.rm = TRUE),
    away_NET_RTG = mean(away_NET_RTG, na.rm = TRUE),
    away_ppa = mean(away_ppa, na.rm = TRUE),
    away_TO_perc = mean(away_TO_perc, na.rm = TRUE),
    away_STL_perc = mean(away_STL_perc, na.rm = TRUE),
    away_BLK_perc = mean(away_BLK_perc, na.rm = TRUE),
    away_simulated_eFG = mean(away_simulated_eFG, na.rm = TRUE),
    away_simulated_fg_attempts = mean(away_simulated_fg_attempts, na.rm = TRUE),
    away_simulated_fg2_attempts = mean(away_simulated_fg2_attempts, na.rm = TRUE),
    away_simulated_fg3_attempts = mean(away_simulated_fg3_attempts, na.rm = TRUE),
    away_simulated_ft_attempts = mean(away_simulated_ft_attempts, na.rm = TRUE),
    away_simulated_fg_points = mean(away_simulated_fg_points, na.rm = TRUE),
    away_simulated_fg2_points = mean(away_simulated_fg2_points, na.rm = TRUE),
    away_simulated_fg3_points = mean(away_simulated_fg3_points, na.rm = TRUE),
    away_simulated_ft_points = mean(away_simulated_ft_points, na.rm = TRUE),
    away_expected_points_eFG = mean(away_expected_points_eFG, na.rm = TRUE),
    away_expected_points_usg = mean(away_expected_points_usg, na.rm = TRUE),
    away_expected_points = mean(away_expected_points, na.rm = TRUE)
  )

# Extract metrics for PHI and BOS
home_team = "PHI"
away_team = "OKC"

home_data <- team_metrics_home %>% filter(home_team == "PHI")
away_data <- team_metrics_away %>% filter(away_team == "BOS")

# Create new game data frame with all the (features) used in the model
new_game <- data.frame(
  nbagameid = "new_game",
  gametype = 2,
  date = as.Date("2024-06-01"),
  season = 2024,
  home_team = home_team,
  away_team = away_team,
  home_points = NA,  # These would be filled post-game
  away_points = NA,  # These would be filled post-game
  home_fgmade = home_data$home_avg_fgmade,
  home_fg3made = home_data$home_avg_fg3made,
  home_fgattempted = home_data$home_avg_fgattempted,
  away_fgmade = away_data$away_avg_fgmade,
  away_fg3made = away_data$away_avg_fg3made,
  away_fgattempted = away_data$away_avg_fgattempted,
  home_assists = home_data$home_avg_assists,
  away_assists = away_data$away_avg_assists,
  home_orebs = home_data$home_avg_orebs,
  away_orebs = away_data$away_avg_orebs,
  home_drebs = home_data$home_avg_drebs,
  away_drebs = away_data$away_avg_drebs,
  home_steals = home_data$home_avg_steals,
  away_steals = away_data$away_avg_steals,
  home_blocks = home_data$home_avg_blocks,
  away_blocks = away_data$away_avg_blocks,
  home_turnovers = home_data$home_avg_turnovers,
  away_turnovers = away_data$away_avg_turnovers,
  home_eFG = home_data$home_eFG,
  away_eFG = away_data$away_eFG,
  home_NET_RTG = home_data$home_NET_RTG,
  away_NET_RTG = away_data$away_NET_RTG,
  home_ppa = home_data$home_ppa,
  away_ppa = away_data$away_ppa,
  home_TO_perc = home_data$home_TO_perc,
  away_TO_perc = away_data$away_TO_perc,
  home_STL_perc = home_data$home_STL_perc,
  away_STL_perc = away_data$away_STL_perc,
  home_BLK_perc = home_data$home_BLK_perc,
  away_BLK_perc = away_data$away_BLK_perc,
  home_simulated_eFG = home_data$home_simulated_eFG,
  away_simulated_eFG = away_data$away_simulated_eFG,
  home_simulated_fg_attempts = home_data$home_simulated_fg_attempts, 
  home_simulated_fg2_attempts = home_data$home_simulated_fg2_attempts, 
  home_simulated_fg3_attempts = home_data$home_simulated_fg3_attempts, 
  home_simulated_ft_attempts = home_data$home_simulated_ft_attempts, 
  home_simulated_fg_points = home_data$home_simulated_fg_points, 
  home_simulated_fg2_points = home_data$home_simulated_fg2_points, 
  home_simulated_fg3_points = home_data$home_simulated_fg3_points, 
  home_simulated_ft_points = home_data$home_simulated_ft_points,
  away_simulated_fg_attempts = away_data$away_simulated_fg_attempts, 
  away_simulated_fg2_attempts = away_data$away_simulated_fg2_attempts, 
  away_simulated_fg3_attempts = away_data$away_simulated_fg3_attempts, 
  away_simulated_ft_attempts = away_data$away_simulated_ft_attempts, 
  away_simulated_fg_points = away_data$away_simulated_fg_points, 
  away_simulated_fg2_points = away_data$away_simulated_fg2_points, 
  away_simulated_fg3_points = away_data$away_simulated_fg3_points, 
  away_simulated_ft_points = away_data$away_simulated_ft_points, 
  home_expected_points_eFG = home_data$home_expected_points_eFG,
  away_expected_points_eFG = away_data$away_expected_points_eFG,
  home_expected_points_usg = home_data$home_expected_points_usg,
  away_expected_points_usg = away_data$away_expected_points_usg,
  home_expected_points = home_data$home_expected_points,
  away_expected_points = away_data$away_expected_points
)
# -------------------------------------------------------------------------------------------------------------------------------


# CREATE THE MODEL (define features for the model, split the data, build model, test performance) -------------------------------
# Select relevant features for the model
features <- c("home_fgmade", "home_fg3made", "home_fgattempted", "away_fgmade", "away_fg3made", "away_fgattempted",
              "home_assists", "home_orebs", "away_orebs", "home_drebs", "away_drebs", "home_steals", "away_steals",
              "home_blocks", "away_blocks", "home_turnovers", "away_turnovers", "home_eFG", "away_eFG",
              "home_NET_RTG", "away_NET_RTG", "home_ppa", "away_ppa", "home_TO_perc", "away_TO_perc",
              "home_STL_perc", "away_STL_perc", "home_BLK_perc", "away_BLK_perc", "home_simulated_eFG",
              "away_simulated_eFG", "home_simulated_fg_attempts", "away_simulated_fg_attempts",
              "home_simulated_fg2_attempts", "away_simulated_fg2_attempts", "home_simulated_fg3_attempts",
              "away_simulated_fg3_attempts", "home_simulated_ft_attempts", "away_simulated_ft_attempts",
              "home_simulated_fg_points", "away_simulated_fg_points", "home_simulated_fg2_points",
              "away_simulated_fg2_points", "home_simulated_fg3_points", "away_simulated_fg3_points",
              "home_simulated_ft_points", "away_simulated_ft_points", "home_expected_points_eFG",
              "away_expected_points_eFG", "home_expected_points_usg", "away_expected_points_usg",
              "home_expected_points", "away_expected_points")

# Prepare the data
merged_df <- merged_df %>%
  select(nbagameid, gametype, date, season, home_team, away_team, home_team_won, all_of(features))


# Split the data into training and test sets
set.seed(123)
train_index <- createDataPartition(merged_df$home_team_won, p = 0.8, list = FALSE)
train_data <- merged_df[train_index, ]
test_data <- merged_df[-train_index, ]

# Train logistic regression model
model <- glm(home_team_won ~ ., data = train_data[, c("home_team_won", features)], family = binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Make predictions on the test set
test_data$predicted_prob <- predict(model, test_data[, features], type = "response")
test_data$predicted_winner <- ifelse(test_data$predicted_prob > 0.5, TRUE, FALSE)

# Evaluate the model
confusion_matrix <- confusionMatrix(factor(test_data$predicted_winner), factor(test_data$home_team_won))

# Print the confusion matrix
print(confusion_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction FALSE TRUE
##      FALSE  1148    8
##      TRUE      5 1553
##                                           
##                Accuracy : 0.9952          
##                  95% CI : (0.9918, 0.9974)
##     No Information Rate : 0.5752          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9902          
##                                           
##  Mcnemar's Test P-Value : 0.5791          
##                                           
##             Sensitivity : 0.9957          
##             Specificity : 0.9949          
##          Pos Pred Value : 0.9931          
##          Neg Pred Value : 0.9968          
##              Prevalence : 0.4248          
##          Detection Rate : 0.4230          
##    Detection Prevalence : 0.4259          
##       Balanced Accuracy : 0.9953          
##                                           
##        'Positive' Class : FALSE           
## 
# Calculate and plot the ROC curve
roc_curve <- roc(test_data$home_team_won, test_data$predicted_prob)
## Setting levels: control = FALSE, case = TRUE
## Setting direction: controls < cases
plot(roc_curve, main = "ROC Curve")

# -------------------------------------------------------------------------------------------------------------------------------


# RUN THE MODEL -----------------------------------------------------------------------------------------------------------------
prediction <- predict(model, new_game, type = "response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
# Determine the win probabilities for the home team and away team
home_win_probability <- prediction
away_win_probability <- 1 - prediction

# Print the win probabilities
print(paste("Home Win Probability:", home_win_probability))
## [1] "Home Win Probability: 0.999861337910423"
print(paste("Away Win Probability:", away_win_probability))
## [1] "Away Win Probability: 0.000138662089576913"

Overview:

This model is designed to predict the outcome of an NBA game based on various statistical features. I’ve cleaned and manipulated the data provided on team and player performance metrics, such as PPG, turnover%, USG%, and offsnive/defensive ratings, among others. Using this data, I’ve trained a logistic regression model to predict which team is likely to win a game based on these features.

Strengths and Weaknesses:

Strengths:

  • Advanced Statistics: The model considers a wide range of performance metrics, providing a holistic view of team capabilities.
  • Statistical Rigor: By using logistic regression, we can leverage statistical techniques to make predictions based on historical data.
  • Interpretability: The model’s output, including win probabilities and model performance features, can be easily understood and communicated to decision-makers.

Weaknesses:

  • Limited Predictive Power: While this model considers many factors, it may not capture all nuances of team dynamics or game situations.
  • Result Quality: Although the models accuracy is at 99.5% (via the confusion-matrix), the probabilities discrepancy for each team is too noticeable.
  • Assumptions: Logistic regression assumes a linear relationship between features and the log odds of winning, which may not always hold true in complex sports environments.

Addressing Weaknesses:

  • More Data: Gathering more data, especially on game situations, player injuries, and other contextual factors, could enhance the model’s predictive power.
  • Advanced Techniques: Exploring advanced machine learning techniques, such as neural networks or ensemble methods, may capture nonlinear relationships more effectively.
  • Feature Engineering: Continuously refining and selecting relevant features based on domain knowledge and experimentation could improve model performance. Adding weight to recent game data and past matchup history would be my next steps.

Application to the 2024 NBA Playoffs:

Unfortunately, I couldn’t refine my model adequately within the given time frame to project future outcomes for this challenge. Nevertheless, this experience has been invaluable, and I’m eager to further refine my model. As someone relatively new to the data field with a background in software development, Python has been my primary tool. However, delving into R for this project has been a rewarding experience. Here are a few reflections from working on this challenge:

  • Addressing irregularities such as the COVID-affected seasons, like the 2020-21 season only consisting of 72 games and starting in December, is crucial for model accuracy.
  • Initially, I didn’t extensively preprocess the data until I began developing the predictive model. However, merging the rows for each game in a coherent manner greatly facilitated answering questions 1-4. In the future, I won’t underestimate the power of cleaning and prepping the data.
  • While I’ve previously explored predictive modeling for NBA player point forecasts in Python, achieving a fully functional model has remained elusive. Moving forward, I’m committed to refining my approach and creating a rigorously tested and precise model. (If I have to travel this weekend I’d continue working on this up till the last second.

Part 3 – Finding Insights from Your Model

Find two teams that had a competitive window of 2 or more consecutive seasons making the playoffs and that under performed your model’s expectations for them, losing series they were expected to win. Why do you think that happened? Classify one of them as bad luck and one of them as relating to a cause not currently accounted for in your model. If given more time and data, how would you use what you found to improve your model?

ANSWER :

Thank you sincerely for this opportunity—it’s been an incredible experience playing with this data. Regardless if I’m the right fit for the job, I’ve gained valuable insights and sharpened my skills working on this challenge, which I’m grateful for. It’s frustrating I couldn’t complete Question 7 after all the time I put into it but I’m committed to further developing my model for personal growth. As a passionate NBA student with expertise in software and data, I’d love to connect with someone from the OKC organization for feedback regarding my model and where I went wrong. That being said, this project aligns perfectly with my interests and skills, and I’m determined to pursue this line of work further. Thanks again.