NBA 2023 Champion Prediction

Conclusion

According to my prediction, the 2023 NBA championship will be the Boston Celtics. Milwaukee Bucks and Denver Nuggets are second and third title contenders

EDA Conclusion:

Not surprisingly, playoff teams have a much higher field goal percentage than non-playoff teams on average.
We can see that both 3-Point Field Goals and 3-Point Field Goals Attempt have increased over time, and teams that make it to the playoffs tend to make and attempt more 3-point shots during the regular season compared to teams that do not make the playoffs. Besides, Several peaks in the 3-Point Field Goals Percentage chart were created by the Golden State Warriors, including the regular season before the Warriors won the championship in 15, 17, 18 and 22 years. Insane Field goal percentage!
Interestingly, the number of 2-point attempts by teams decreased significantly over time, and teams that made the playoffs took fewer 2-point shots than teams that did not make the playoffs. This should be because strong teams focus their offense on shooting three-pointers. The lowest point for two-point field goal attempts among championship teams was the Warriors’ performance during the 2021 season.

Possible future improvements :

Take the number of star players into consideration
According to the schedule, the winning percentage when the two teams meet is taken as the ouput, and when it is calculated all the way to the championship game, the team with the highest winning percentage between the two teams is the champion

Executive Summary
Data Introduction
Libraries
Read file into R
Data Cleaning
Data Preparation

6.1 Merging Data

6.2 Factorizing variables
Exploratory Data Analysis

7.1 The response variable : wins_playoff

7.2 Some interesting facts in regular seasons from 2000 to 2023
Preparing data for modeling

8.1 Feature Selection
Modeling

9.1 Random Forest

9.2 XGBoost
Error Analysis
NBA 2023 Champion Prediction

1. Executive Summary

I’m working on a fun project called “NBA 2023 Champion Prediction”. Basically, I’m using historical data and machine learning to try and figure out which NBA team is most likely to win the championship in 2023.

The NBA is so popular, and people love making predictions about which team will come out on top. So, I’m using historical regular season team stats, game results, and other data to train machine learning models that can predict the number of wins of each team in the playoffs. The team winning the most games in the 2023 playoff will be the championship.

It’s interesting to see how accurate my predictions are, and I will buy sports lottery tickets according to my prediction.

2. Data Introduction

For this project, I scraped NBA regular season data from the website https://www.basketball-reference.com/. The website provides a comprehensive collection of NBA statistics, game results, and player information from past seasons.

I collected data for the 2000 - 2023 NBA regular seasons. The data includes information on team statistics, game results, and other relevant metrics such as team standings, player awards, and playoff results.

Overall, the data collected from basketball-reference.com provides a rich and diverse set of features that are essential for accurate prediction of NBA team performance.

3. Libraries

library(rvest)
library(tidyverse)
library(stringr)
library(DataExplorer)
library(corrplot)
library(ranger)
library(janitor)
library(caret)
library(randomForest)
library(xgboost)

4. Read file into R

path = "C:/Users/Jonathan/Documents/BA/R Data Science/More Cases/2023 NBA Prediction/"
#Read in eaah dataset
division_total_E <- read_csv(paste0(path, "division_total_E.csv"))
division_total_W <- read_csv(paste0(path, "division_total_W.csv"))
pergame_stats_total <- read_csv(paste0(path, "pergame_stats_total.csv"))
advanced_stats_total <- read_csv(paste0(path, "advanced_stats_total.csv"))
WL_playoff_total <- read_csv(paste0(path, "WL_playoff_total.csv"))

Make a copy of each dataset and I will edit on the copies.

division_E <- division_total_E
division_W <- division_total_W
pergame <- pergame_stats_total
advanced <- advanced_stats_total
playoff <- WL_playoff_total

5. Data Cleaning

division_E & division_W

Remove the row with unuseful value, including “Central Division”, “Atlantic Division”, “Southeast Division”, “Midwest Division”, ““Pacific Division”, “Southwest Division”, and “Northwest Division”.
GB :“Games behind” is a term used in the NBA standings to indicate the difference in the number of games won and lost between two teams. It represents the distance between a team’s position in the standings and the team that is currently leading the division or conference. The teams with “-” value of “GB” means they ranked #1 of their divisions in the regular season that year so I will replace the value with 0.
There are some values in the “Eastern_Conference” column contain “star”. The teams with “star” mean they made for the playoff that year. However, I don’t need that information so I am going to remove “*“.
Drop the row id column.

#Remove the row with unuseful value
division_E <- division_E %>% 
  filter(!Eastern_Conference %in% c("Central Division", "Atlantic Division", "Southeast Division"))
division_W <- division_W %>% 
  filter(!Western_Conference %in% c("Midwest Division", "Pacific Division", "Southwest Division", "Northwest Division"))
#Replacing "-" with 0 in GB column
division_E$GB <- ifelse(division_E$GB == "—", 0, division_E$GB)
division_W$GB <- ifelse(division_W$GB == "—", 0, division_W$GB)
#Remove "*" in Eastern_Conference and Western_Conference
Eastern_Conference_clean <- division_E$Eastern_Conference %>%
  str_remove_all(fixed("*"))
division_E <- division_E %>% 
  mutate(Eastern_Conference = Eastern_Conference_clean)

Western_Conference_clean <- division_W$Western_Conference %>% 
  str_remove_all(fixed("*"))
division_W <- division_W %>% 
  mutate(Western_Conference = Western_Conference_clean)

#I don't need the row id column
division_E <- division_E[,-1]
division_W <- division_W[,-1]

head(division_E)

## # A tibble: 6 × 9
##   Eastern_Conference W     L     `W/L_percent` GB    `PS/G` `PA/G` SRS    Year
##   <chr>              <chr> <chr> <chr>         <chr> <chr>  <chr>  <chr> <dbl>
## 1 Miami Heat         52    30    .634          0     94.4   91.3   2.75   2000
## 2 New York Knicks    50    32    .610          2.0   92.1   90.7   1.30   2000
## 3 Philadelphia 76ers 49    33    .598          3.0   94.8   93.4   1.02   2000
## 4 Orlando Magic      41    41    .500          11.0  100.1  99.4   0.43   2000
## 5 Boston Celtics     35    47    .427          17.0  99.3   100.1  -1.00  2000
## 6 New Jersey Nets    31    51    .378          21.0  98.0   99.0   -1.18  2000

head(division_W)

## # A tibble: 6 × 9
##   Western_Conference   W     L     `W/L_percent` GB    `PS/G` `PA/G` SRS    Year
##   <chr>                <chr> <chr> <chr>         <chr> <chr>  <chr>  <chr> <dbl>
## 1 Utah Jazz            55    27    .671          0     96.5   92.0   4.52   2000
## 2 San Antonio Spurs    53    29    .646          2.0   96.2   90.2   5.92   2000
## 3 Minnesota Timberwol… 50    32    .610          5.0   98.5   96.0   2.67   2000
## 4 Dallas Mavericks     40    42    .488          15.0  101.4  102.0  -0.29  2000
## 5 Denver Nuggets       35    47    .427          20.0  99.0   101.1  -1.76  2000
## 6 Houston Rockets      34    48    .415          21.0  99.5   100.3  -0.57  2000

pergame

Drop the “League Average” value in “Team” column.
There are some values in the “Team” column contain “star”. The teams with “star” mean they made for the playoff that year. However, I don’t need that information so I am going to remove “*“.
Drop the row id column.

#Remove the row with unuseful value
pergame <- pergame %>% 
  filter(!Team %in% "League Average")
#Remove "*" in Team
team_clean <- pergame$Team %>% 
  str_remove_all(fixed("*"))
pergame <- pergame %>% 
  mutate(Team = team_clean)

#I don't need the row id column
pergame <- pergame[,-1]

head(pergame)

## # A tibble: 6 × 26
##      Rk Team       G    MP    FG   FGA FG_Percent  `3P` `3PA` `3P_Percent`  `2P`
##   <dbl> <chr>  <dbl> <dbl> <dbl> <dbl>      <dbl> <dbl> <dbl>        <dbl> <dbl>
## 1     1 Sacra…    82  242.  40    88.9      0.45    6.5  20.2        0.322  33.4
## 2     2 Detro…    82  242.  37.1  80.9      0.459   5.4  14.9        0.359  31.8
## 3     3 Dalla…    82  241.  39    85.9      0.453   6.3  16.2        0.391  32.6
## 4     4 India…    82  241.  37.2  81        0.459   7.1  18.1        0.392  30  
## 5     5 Milwa…    82  242.  38.7  83.3      0.465   4.8  13          0.369  33.9
## 6     6 Los A…    82  242.  38.3  83.4      0.459   4.2  12.8        0.329  34.1
## # ℹ 15 more variables: `2PA` <dbl>, `2P_Percent` <dbl>, FT <dbl>, FTA <dbl>,
## #   FT_Percent <dbl>, ORB <dbl>, DRB <dbl>, TRB <dbl>, AST <dbl>, STL <dbl>,
## #   BLK <dbl>, TOV <dbl>, PF <dbl>, PTS <dbl>, Year <dbl>

advanced

The table format is incorrect in some columns so I have to fix it first.
Remove the row with unuseful value.
There are some values in the “Team” column contain “star”. The teams with “star” mean they made for the playoff that year. However, I don’t need that information so I am going to remove “*“.
Drop the row id column.

#Fixing the columns
advanced <- subset(advanced, select = -c(eFG_percent_offense, TOV_percent_defense, `Attend/G`))
colnames(advanced)[19:29] <- c("eFG_percent_offense", "TOV_percent_offense", "ORB_percent", "FT/FGA_offense", "eFG_percent_defense", "TOV_percent_defense", "DRB_percent", "FT/FGA_defense", "Arena", "Attend", "Attend/G")

#Remove the row with unuseful value
advanced <- advanced %>% 
  filter(!Team %in% "Team")

#Remove "*" in Team
team_clean <- advanced$Team %>% 
  str_remove_all(fixed("*"))
advanced <- advanced %>% 
  mutate(Team = team_clean)

#I don't need the row id column
advanced <- advanced[,-1]

#Removing comma in "Attend" and "Attend/G"
clean_strings_attend <- gsub(",", "", advanced$Attend)
clean_strings_attendG <- gsub(",", "", advanced$`Attend/G`)
advanced <- advanced %>% 
  mutate(Attend = clean_strings_attend,
         `Attend/G` = clean_strings_attendG)
head(advanced)

## # A tibble: 6 × 29
##   Rk    Team   Age   W     L     PW    PL    MOV   SOS   SRS   OPtg  DRtg  NRtg 
##   <chr> <chr>  <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1     Los A… 29.2  67    15    64    18    8.55  -0.14 8.41  107.3 98.2  +9.1 
## 2 2     Portl… 29.6  59    23    59    23    6.40  -0.04 6.36  107.9 100.8 +7.1 
## 3 3     San A… 30.9  53    29    58    24    5.94  -0.02 5.92  105.0 98.6  +6.4 
## 4 4     Phoen… 28.6  53    29    56    26    5.22  0.02  5.24  104.6 99.0  +5.6 
## 5 5     Utah … 31.5  55    27    54    28    4.46  0.05  4.52  107.3 102.3 +5.0 
## 6 6     India… 30.4  56    26    54    28    4.60  -0.45 4.15  108.5 103.6 +4.9 
## # ℹ 16 more variables: Pace <chr>, FTr <chr>, `3PAr` <chr>, TS_percent <chr>,
## #   eFG_percent_offense <chr>, TOV_percent_offense <chr>, ORB_percent <chr>,
## #   `FT/FGA_offense` <chr>, eFG_percent_defense <chr>,
## #   TOV_percent_defense <chr>, DRB_percent <chr>, `FT/FGA_defense` <chr>,
## #   Arena <chr>, Attend <chr>, `Attend/G` <chr>, Year <dbl>

playoff

Only keep the columns I need. “W”(wins in playoffs) as my dependent variable(Y). “Team” and “Year” as foregin keys.
Remove the row with unuseful value.
Drop the values in 2023 because this is what we going to predict.

#Keeping "W", "Team" and "Year.
playoff <- playoff[,c(3,5,27)]

#Remove the row with unuseful value
playoff <- playoff %>% 
  filter(!Team %in% c("Tm", "League Average", "Team"))

##Drop 2023
playoff <- playoff %>% 
  filter(!Year %in% 2023)

head(playoff)

## # A tibble: 6 × 3
##   Team                   W      Year
##   <chr>                  <chr> <dbl>
## 1 Portland Trail Blazers 10     2000
## 2 Miami Heat             6      2000
## 3 Indiana Pacers         13     2000
## 4 Los Angeles Lakers     15     2000
## 5 Milwaukee Bucks        2      2000
## 6 New York Knicks        9      2000

More cleaning…

#Some strange number in values of "Eastern_Conference" and "Western_Conference" in the year 2023.
division_E %>% 
  filter(Year %in% 2023) %>% 
  head()

## # A tibble: 6 × 9
##   Eastern_Conference   W     L     `W/L_percent` GB    `PS/G` `PA/G` SRS    Year
##   <chr>                <chr> <chr> <chr>         <chr> <chr>  <chr>  <chr> <dbl>
## 1 Boston Celtics (2)   57    25    .695          0     117.9  111.4  6.38   2023
## 2 Philadelphia 76ers … 54    28    .659          3.0   115.2  110.9  4.37   2023
## 3 New York Knicks (5)  47    35    .573          10.0  116.0  113.1  2.99   2023
## 4 Brooklyn Nets (6)    45    37    .549          12.0  113.4  112.5  1.03   2023
## 5 Toronto Raptors (9)  41    41    .500          16.0  112.9  111.4  1.59   2023
## 6 Milwaukee Bucks (1)  58    24    .707          0     116.9  113.3  3.61   2023

division_W %>% 
  filter(Year %in% 2023) %>% 
  head()

## # A tibble: 6 × 9
##   Western_Conference   W     L     `W/L_percent` GB    `PS/G` `PA/G` SRS    Year
##   <chr>                <chr> <chr> <chr>         <chr> <chr>  <chr>  <chr> <dbl>
## 1 Denver Nuggets (1)   53    29    .646          0     115.8  112.5  3.04   2023
## 2 Minnesota Timberwol… 42    40    .512          11.0  115.8  115.8  -0.22  2023
## 3 Oklahoma City Thund… 40    42    .488          13.0  117.5  116.4  0.96   2023
## 4 Utah Jazz (12)       37    45    .451          16.0  117.1  118.0  -1.03  2023
## 5 Portland Trail Blaz… 33    49    .402          20.0  113.4  117.4  -3.96  2023
## 6 Sacramento Kings (3) 48    34    .585          0     120.7  118.1  2.30   2023

#Remove the pattern "(number)" using gsub()
clean_strings_E <- gsub("\\s*\\(\\d+\\)\\s*", "", division_E$Eastern_Conference)
division_E <- division_E %>% 
  mutate(Eastern_Conference = clean_strings_E)

clean_strings_W <- gsub("\\s*\\(\\d+\\)\\s*", "", division_W$Western_Conference)
division_W <- division_W %>% 
  mutate(Western_Conference = clean_strings_W)
#Remove trail space
division_E$Eastern_Conference <- str_trim(division_E$Eastern_Conference)
division_W$Western_Conference <- str_trim(division_W$Western_Conference)


table(division_E$Eastern_Conference)

## 
##       Atlanta Hawks      Boston Celtics       Brooklyn Nets   Charlotte Bobcats 
##                  24                  24                  11                  10 
##   Charlotte Hornets       Chicago Bulls Cleveland Cavaliers     Detroit Pistons 
##                  12                  24                  24                  24 
##      Indiana Pacers          Miami Heat     Milwaukee Bucks     New Jersey Nets 
##                  24                  24                  24                  13 
## New Orleans Hornets     New York Knicks       Orlando Magic  Philadelphia 76ers 
##                   2                  24                  24                  24 
##     Toronto Raptors  Washington Wizards 
##                  24                  24

length(table(division_E$Eastern_Conference))

## [1] 18

table(division_E$Year)

## 
## 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 
##   15   15   15   15   15   15   15   15   15   15   15   15   15   15   15   15 
## 2016 2017 2018 2019 2020 2021 2022 2023 
##   15   15   15   15   15   15   15   15

table(division_W$Western_Conference)

## 
##                  Dallas Mavericks                    Denver Nuggets 
##                                24                                24 
##             Golden State Warriors                   Houston Rockets 
##                                24                                24 
##              Los Angeles Clippers                Los Angeles Lakers 
##                                24                                24 
##                 Memphis Grizzlies            Minnesota Timberwolves 
##                                22                                24 
##               New Orleans Hornets              New Orleans Pelicans 
##                                 7                                10 
## New Orleans/Oklahoma City Hornets             Oklahoma City Thunder 
##                                 2                                15 
##                      Phoenix Suns            Portland Trail Blazers 
##                                24                                24 
##                  Sacramento Kings                 San Antonio Spurs 
##                                24                                24 
##               Seattle SuperSonics                         Utah Jazz 
##                                 9                                24 
##               Vancouver Grizzlies 
##                                 2

length(table(division_W$Western_Conference))

## [1] 19

table(division_W$Year)

## 
## 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 
##   14   14   14   14   14   15   15   15   15   15   15   15   15   15   15   15 
## 2016 2017 2018 2019 2020 2021 2022 2023 
##   15   15   15   15   15   15   15   15

There were 15 teams every year in the NBA East division and 14 or 15 teams in the West division but there were 18 teams and 19 teams in history, separately. The reason is that some teams changed their names over time.

6. Data Prearation

6.1 Merging Data

Change the names of “W” column in the playoff dataset.
Create a “performance” column in the playoff dataset.
Create a “division” column in division_E and division_W
Change the name of “Western_Conference” column and “Eastern_Conference”

#Give the dependent variable a new name
colnames(playoff)[2] <- "wins_playoff"

#Create a column to distinguish teams that made to the playoff, teams that did not make the playoffs and teams that won the champion. 
playoff$performance <- ""
playoff$wins_playoff <- as.numeric(playoff$wins_playoff)
playoff$performance <- ifelse((playoff$wins_playoff == 15 & playoff$Year == 2000:2002) | (playoff$wins_playoff == 16), "champion", "playoff")

## Warning in playoff$Year == 2000:2002: 較長的物件長度並非較短物件長度的倍數

head(playoff)

## # A tibble: 6 × 4
##   Team                   wins_playoff  Year performance
##   <chr>                         <dbl> <dbl> <chr>      
## 1 Portland Trail Blazers           10  2000 playoff    
## 2 Miami Heat                        6  2000 playoff    
## 3 Indiana Pacers                   13  2000 playoff    
## 4 Los Angeles Lakers               15  2000 champion   
## 5 Milwaukee Bucks                   2  2000 playoff    
## 6 New York Knicks                   9  2000 playoff

#Create a column of division
division_E$division <- "E"
division_W$division <- "W"

#Change the names of columns
colnames(division_E)[1] <- "conference"
colnames(division_W)[1] <- "conference"

create a “division” column in division_E and division_W

division_E$division <- "E"
division_W$division <- "W"

Start merging datasets

#binding division_E and division_W 
division_total <- rbind(division_E, division_W)
#changing the column name "conference" into "Team"
colnames(division_total)[1] <- "Team"
#Drop the duplicate columns that another dataset already have
division_total <- subset(division_total, select = -c(W, L, SRS))
#joining advanced with pergame
regular_season <- pergame %>% 
  left_join(advanced, by = c("Team", "Year"), suffix = c("_pergame", "_advanced")) %>% 
  left_join(division_total, by = c("Team", "Year"))
#joining playoff with regular_season
data <- regular_season %>% 
  left_join(playoff, by = c("Team", "Year"))
#Give those who fail to make to the playoff that year a "fail" value in the column "performance" 
data$performance <- ifelse(is.na(data$performance) == T, "fail", data$performance)
#Give a better name to column "W" and "L", which mean wins and loses in regular seasons
colnames(data)[29] <- "wins_regular"
colnames(data)[30] <- "loses_regular"

dim(data)

## [1] 715  60

6.2 Factorizing variables

Columns that should be numeric

#Columns that should remain character
which(names(data) %in% c("Team", "Arena", "division", "performance"))

## [1]  2 51 58 60

# specify the column indices to convert to numeric
numeric_cols <- setdiff(seq_along(data), c(2, 51, 58, 60))
# convert to numeric
data[, numeric_cols] <- apply(data[, numeric_cols], 2, as.numeric)

Columns that should be factors

data$division <- as.factor(data$division)
data$performance <- as.factor(data$performance)

The goal is to predict the 2023 NBA champion.

nba_2023 <- data %>% 
  filter(Year == 2023) %>% 
  select(-performance)

data <- data %>% 
  filter(!Year %in% 2023)

7. Exploratory Data Analysis

#create_report(data)

Note : I will keep the “performance” column for EDA but I won’t use it later in modeling because we only know the champion “after” the whole season is over.

7.1 The response variable : wins_playoff

summary(data$wins_playoff)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   1.000   3.500   5.181   7.250  16.000     321

data %>% 
  count(wins_playoff) %>% 
  ggplot(aes(wins_playoff, n)) +
  geom_col(fill = "slateblue") +
  labs(title = "Distribution of wins in playoff from 2000 - 2023", 
       x = "wins", y = "count") +
  theme_minimal() +
  theme(plot.title = element_text(size = 13), 
        axis.title=element_text(size=8)) +
   theme(panel.background = element_rect(fill='white'),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank())+
   geom_text(aes(label = n), size = 3, vjust = -0.8, color = "black") +
  ylim(0,60)

## Warning: Removed 1 rows containing missing values (`position_stack()`).

## Warning: Removed 1 rows containing missing values (`geom_text()`).

We can see that the distribution is skewed. This is simply because most of the teams lose in the first round of 16 to 8 matches.

7.2 Some interesting facts in regular seasons from 2000 to 2023

FG_Percent : Field goals percentage per game

data %>% 
  group_by(Year, performance) %>% 
  summarise(mean_FG_percent = mean(FG_Percent)) %>% 
  ggplot(aes(Year, mean_FG_percent, group = performance, color =performance)) +
  geom_point() +
  geom_line()+
  labs(title = "Evolution of Average Field Goals Percentage per game in Regular Season", 
       x = "Year", y = "Field Goals Percentage") +
  theme_minimal() +
  theme(plot.title = element_text(size = 13), 
        axis.title=element_text(size=8)) +
   theme(panel.background = element_rect(fill='white'),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank())+
  theme(legend.position = c(0.13, 0.8),
        legend.background = element_rect(fill = "white", color = "black"))

## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.

Not surprisingly, playoff teams have a much higher field goal percentage than non-playoff teams on average.

3P : 3-Point field goals
3PA : 3-Point field goals attempt
3P_Percent : 3-Point field foals percentage

data %>% 
  group_by(Year, performance) %>% 
  summarise(mean_3P = mean(`3P`)) %>% 
  ggplot(aes(Year, mean_3P, group = performance, color =performance)) +
  geom_point() +
  geom_line()+
  labs(title = "Evolution of Average 3-Point Field Goals per game in Regular Season", 
       x = "Year", y = "3-Point Field Goals") +
  theme_minimal() +
  theme(plot.title = element_text(size = 12), 
        axis.title=element_text(size=8)) +
   theme(panel.background = element_rect(fill='white'),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank())+
  theme(legend.position = c(0.13, 0.8),
        legend.background = element_rect(fill = "white", color = "black"))

## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.

data %>% 
  group_by(Year, performance) %>% 
  summarise(mean_3PA = mean(`3PA`)) %>% 
  ggplot(aes(Year, mean_3PA, group = performance, color =performance)) +
  geom_point() +
  geom_line()+
  labs(title = "Evolution of Average 3-Point Field Goals Attempts per game in Regular Season", 
       x = "Year", y = "3-Point Field Goals Attempts") +
  theme_minimal() +
  theme(plot.title = element_text(size = 12), 
        axis.title=element_text(size=8)) +
   theme(panel.background = element_rect(fill='white'),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank())+
  theme(legend.position = c(0.13, 0.8),
        legend.background = element_rect(fill = "white", color = "black"))

## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.

data %>% 
  group_by(Year, performance) %>% 
  summarise(mean_3P_Percent = mean(`3P_Percent`)) %>% 
  ggplot(aes(Year, mean_3P_Percent, group = performance, color =performance)) +
  geom_point() +
  geom_line()+
  labs(title = "Evolution of Average 3-Point Field Goals Percentage per game in Regular Season", 
       x = "Year", y = "3-Point Field Goals Percentage") +
  theme_minimal() +
  theme(plot.title = element_text(size = 12), 
        axis.title=element_text(size=8)) +
   theme(panel.background = element_rect(fill='white'),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank())+
  theme(legend.position = c(0.13, 0.8),
        legend.background = element_rect(fill = "white", color = "black"))

## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.

We can see that both 3-Point Field Goals and 3-Point Field Goals Attempt have increased over time, and teams that make it to the playoffs tend to make and attempt more 3-point shots during the regular season compared to teams that do not make the playoffs. Besides, Several peaks in the 3-Point Field Goals Percentage chart were created by the Golden State Warriors, including the regular season before the Warriors won the championship in 15, 17, 18 and 22 years. Insane Field goal percentage!

2PA : 2-Point field goals attempt

data %>% 
  group_by(Year, performance) %>% 
  summarise(mean_2PA = mean(`2PA`)) %>% 
  ggplot(aes(Year, mean_2PA, group = performance, color =performance)) +
  geom_point() +
  geom_line()+
  labs(title = "Evolution of Average 2-Point Field Goals Attempt per game in Regular Season", 
       x = "Year", y = "2-Point Field Goals Attempt") +
  theme_minimal() +
  theme(plot.title = element_text(size = 12), 
        axis.title=element_text(size=8)) +
   theme(panel.background = element_rect(fill='white'),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank())+
  theme(legend.position = c(0.13, 0.3),
        legend.background = element_rect(fill = "white", color = "black"))

## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.

Interestingly, the number of 2-point attempts by teams decreased significantly over time, and teams that made the playoffs took fewer 2-point shots than teams that did not make the playoffs. This should be because strong teams focus their offense on shooting three-pointers. The lowest point for two-point field goal attempts among championship teams was the Warriors’ performance during the 2021 season.

PTS : Points

data %>% 
  group_by(Year, performance) %>% 
  summarise(mean_PTS = mean(`PTS`)) %>% 
  ggplot(aes(Year, mean_PTS, group = performance, color =performance)) +
  geom_point() +
  geom_line()+
  labs(title = "Evolution of Points per game in Regular Season", 
       x = "Year", y = "Points") +
  theme_minimal() +
  theme(plot.title = element_text(size = 12), 
        axis.title=element_text(size=8)) +
   theme(panel.background = element_rect(fill='white'),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank())+
  theme(legend.position = c(0.13, 0.8),
        legend.background = element_rect(fill = "white", color = "black"))

## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.

8. Preparing data for modeling

Goal : Predict 2023 NBA champion using “nba_2023” as input. The “wins_playoff” will be the dependent variable(Y), and the teams with most wins will be the champion in prediction. The final independent variables(X) will be selected later using various methods.

nba_2023 <- as.data.frame(nba_2023)
#Convert column "Team" to row index.
rownames(nba_2023) <- nba_2023$Team
#We don't need column "Team" and "Arena"
nba_2023 <- nba_2023 %>% select(-c(Team,Arena))

There are NAs in “Attend” and “Attend/G” column, and because the correlation between those features and “wins_playoff” are low, I decide to drop them.

cor(data$Attend, data$wins_playoff,use="pairwise.complete.obs")

## [1] 0.09790018

cor(data$`Attend/G`, data$wins_playoff,use="pairwise.complete.obs")

## [1] 0.1002934

nba_2023 <- nba_2023 %>% select(-c(Attend, `Attend/G`))
data <- data %>% select(-c(Attend, `Attend/G`))

I only need the teams that made to the playoffs each year, and I don’t need the “performance”, “Team” and “Arena”(Having too much levels) column.

data <- data %>% 
  filter(performance %in% c("champion", "playoff")) %>% 
  select(-c(performance,Arena))

Cleaning columns’ names

data <- clean_names(data)
nba_2023 <- clean_names(nba_2023)

8.1 Feature Selection

Create a copy of data before feature selection

data <- as.data.frame(data)
rownames(data) <- paste0(data$team, data$year)
data_2 <- data

Finding variable importance with a quick Random Forest

set.seed(1234)
quick_rf <- randomForest(x=data[,-56], y=data$wins_playoff, ntree=500, importance = T)
important <- round(importance(quick_rf), 2)

imp_DF <- data.frame(Variables = row.names(important), MSE = important[,1])
imp_DF <- imp_DF[order(imp_DF$MSE, decreasing = TRUE),]
ggplot(imp_DF[1:20,], aes(x=reorder(Variables, MSE), y=MSE, fill=MSE)) + geom_bar(stat = 'identity') + labs(x = 'Variables', y= '% increase MSE if variable is randomly permuted') + coord_flip() + theme(legend.position="none")

Choose features having percent increase MSE higher than 5

rf_features <- imp_DF %>% filter(MSE >5)
rf_features <- rf_features$Variables

Find variables having higher than 0.5 correlation with “wins_playoff”

numericVars <- which(map(data, is.numeric) == T)#index vector numeric variables
numericVarNames <- names(numericVars)#saving names vector for use later on
cat('There are', length(numericVars), 'numeric variables')

## There are 54 numeric variables

#find correlations of all numeric variables
data_numVar <- data[, numericVars]
cor_numVar <- cor(data_numVar, use="pairwise.complete.obs")#pairwise.complete.obs uses the non-NA values when calculating the correlation. 

#sort on decreasing correlations with SalePrice
cor_sorted <- as.matrix(sort(cor_numVar[,'wins_playoff'], decreasing = TRUE))

#select only high correlations with SalePrice(correlation>0.5)
CorHigh <- names(which(apply(cor_sorted, 1, function(x) abs(x)>0.5)))                             
cor_numVar <- cor_numVar[CorHigh, CorHigh]

#visiualization
corrplot.mixed(cor_numVar, tl.col = "black", tl.pos = "lt")

It also becomes clear that multicollinearity is an issue when using some algorithms.

data <- data %>% 
  select(c(rf_features, CorHigh, wins_playoff, year))

## Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
## ℹ Please use `all_of()` or `any_of()` instead.
##   # Was:
##   data %>% select(rf_features)
## 
##   # Now:
##   data %>% select(all_of(rf_features))
## 
## See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
## ℹ Please use `all_of()` or `any_of()` instead.
##   # Was:
##   data %>% select(CorHigh)
## 
##   # Now:
##   data %>% select(all_of(CorHigh))
## 
## See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

nba_2023 <- nba_2023 %>% 
  select(c(rf_features, CorHigh, wins_playoff))

9. Modeling

I will use different training data to train models and test them on the same test set(year 2022).

9.1 Randomforest with training data in year 2000 to 2021

Composing training and testing sets

training = data %>% filter(!year %in% 2022)
testing = data %>% filter(year %in% 2022)
#I don't need the "year" column when building models
training <- training[,-14]
testing <- testing[,-14]
dim(training)

## [1] 348  13

dim(testing)

## [1] 16 13

9.1.1 rf1

rf1 : Basic RF(for baseline)

set.seed(1234)
rf1 <- randomForest(wins_playoff~., data = training, importance = T)
rf1

## 
## Call:
##  randomForest(formula = wins_playoff ~ ., data = training, importance = T) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 11.4862
##                     % Var explained: 48.01

plot(rf1)

#The number of trees that minimize the MSE.
which.min(rf1$mse)

## [1] 98

#The RMSE of the most suitable random forest is (the error value of the average Sale_Price):
sqrt(rf1$mse[which.min(rf1$mse)])

## [1] 3.358748

save(rf1, file = "model.Rdata.rf1")

load(file = "model.Rdata.rf1")

Finding Variable importance

round(importance(rf1), 2)

##               %IncMSE IncNodePurity
## w_l_percent     19.40       1384.17
## gb              17.21        958.86
## wins_regular    14.20        608.53
## loses_regular   14.09        912.59
## sos             14.80        596.62
## n_rtg            9.47        562.44
## mov             12.73        724.51
## rk_advanced     10.76        388.30
## pl               9.09        289.68
## division        13.57        163.97
## srs              8.75        481.99
## pw               9.44        259.14

important <- importance(rf1)
imp_DF <- data.frame(Variables = row.names(important), MSE = important[,1])
imp_DF <- imp_DF[order(imp_DF$MSE, decreasing = TRUE),]
ggplot(imp_DF, aes(x=reorder(Variables, MSE), y=MSE, fill=MSE)) + geom_bar(stat = 'identity') + labs(x = 'Variables', y= '% increase MSE if variable is randomly permuted') + coord_flip() + theme(legend.position="none")

rf1 prediction

prediction_rf1 <- predict(rf1, testing[, -12])
rf1_MSE <- mean((prediction_rf1 - testing$wins_playoff)^2)
rf1_RMSE <- sqrt(rf1_MSE)

9.1.2 rf2

Tuning

tuneRF

Finding best mtry

# Filter out predictor variable names
features <- setdiff(x = names(training), y = "wins_playoff")
set.seed(123)
RF_tune <- tuneRF(x = training[features], y = training$wins_playoff, mtryStart = 1, ntreeTry = 5000, trace = FALSE)

## -0.0206462 0.05

RF_tune

##   mtry OOBError
## 1    1 11.16308
## 2    2 11.39356

Best mtry is 1.

ranger

# hyperparameter grid search
hyper_grid <- expand.grid(
  mtry = seq(1, 10, by = 1),
  node_size = seq(1, 9, by = 1),
  sample_size = c(0.55, 0.632, 0.7, 0.8),
  OOB_RMSE = 0
)

# total number of combinations
nrow(hyper_grid)

## [1] 360

for (i in 1:nrow(hyper_grid)) {
  # train model
  ranger_tune <- ranger(
    formula = wins_playoff ~ .,
    data = training,
    num.trees = 500, 
    mtry = hyper_grid$mtry[i],
    min.node.size = hyper_grid$node_size[i], 
    sample.fraction = hyper_grid$sample_size[i],
    seed = 123
  )


  hyper_grid$OOB_RMSE[i] <- sqrt(ranger_tune$prediction.error)
}


hyper_grid %>% 
  dplyr::arrange(OOB_RMSE) %>% 
  head(10)

##    mtry node_size sample_size OOB_RMSE
## 1     7         8        0.55 3.297908
## 2     7         7        0.55 3.298633
## 3    10         9        0.55 3.300885
## 4    10         8        0.55 3.301806
## 5     5         9        0.55 3.302102
## 6     7         9        0.55 3.304781
## 7     9         9        0.55 3.305087
## 8    10         7        0.55 3.305618
## 9     9         9        0.70 3.307140
## 10    5         8        0.55 3.307395

OOB_RMSE falls roughly around 3 games.
The optimal value of mtry falls in all 5~10 range intervals. Indicates that mtry does not have much impact on OOB_RMSE in this interval.
The number of optimal minimum node observations falls in the range of 7 to 9.
The optimal sampling ratio is about 0.55.

We repeat the model of this parameter setting 100 times to calculate the expected value of the error rate of this model.

OOB_RMSE_ranger <- vector(mode = "numeric", length = 100)

for(i in 1:length(OOB_RMSE_ranger)){
  optimal_ranger <- ranger(
    formula         = wins_playoff ~ ., 
    data            = training, 
    num.trees       = 500,
    mtry            = 7,
    min.node.size   = 8,
    sample.fraction = 0.55,
    importance      = 'impurity'
  )

  OOB_RMSE_ranger[i] <- sqrt(optimal_ranger$prediction.error)
}

hist(OOB_RMSE_ranger, breaks = 20)

Build rf2 using parameters above

rf2 <- ranger(
  formula = wins_playoff ~ .,
    data = training,
    num.trees = 500,
    mtry = 7,
    min.node.size = 8,
    sample.fraction = 0.55,
    seed = 123
)
save(rf2, file = "model.Rdata.rf2")

load(file = "model.Rdata.rf2")

rf2 prediction

prediction_rf2 <- predict(rf2, testing[,-12])
prediction_rf2 <- prediction_rf2$predictions
rf2_MSE <- mean((prediction_rf2 - testing$wins_playoff)^2)
rf2_RMSE <-  sqrt(rf2_MSE)

9.1.3 rf3

try default mtry

OOB_RMSE_ranger <- vector(mode = "numeric", length = 100)

for(i in 1:length(OOB_RMSE_ranger)){
  optimal_ranger <- ranger(
    formula         = wins_playoff ~ ., 
    data            = training, 
    num.trees       = 500,
    min.node.size   = 8,
    sample.fraction = 0.55,
    importance      = 'impurity'
  )

  OOB_RMSE_ranger[i] <- sqrt(optimal_ranger$prediction.error)
}

hist(OOB_RMSE_ranger, breaks = 20)

rf3 <- ranger(
  formula = wins_playoff ~ .,
    data = training,
    num.trees = 500,
    min.node.size   = 8,
    sample.fraction = 0.55,
    seed = 123
)
save(rf3, file = "model.Rdata.rf3")

load(file = "model.Rdata.rf3")

rf3 prediction

prediction_rf3 <- predict(rf3, testing[,-12])
prediction_rf3 <- prediction_rf3$predictions
rf3_MSE <- mean((prediction_rf3 - testing$wins_playoff)^2)
rf3_RMSE <-  sqrt(rf3_MSE)

9.2 Randomforest with training data in 2003 to 2022

The reason I pick this range is because the playoff structure was changed in 2004 so that an NBA Championship must consist of winning 4 best of 7 game series since then.

data = data %>% filter(!year %in% c(2000:2002))
training2 = data %>% filter(!year %in% 2022)
#I don't need the "year" column when building models
training2 <- training2[,-14]
dim(training2)

## [1] 302  13

9.2.1 rf4

rf4 <- ranger(
  formula = wins_playoff ~ .,
    data = training2,
    num.trees = 500,
    mtry = 7,
    min.node.size = 8,
    sample.fraction = 0.55,
    seed = 123
)
save(rf4, file = "model.Rdata.rf4")

load(file = "model.Rdata.rf4")

rf4 prediction

prediction_rf4 <- predict(rf4, testing[,-12])
prediction_rf4 <- prediction_rf4$predictions
rf4_MSE <- mean((prediction_rf4 - testing$wins_playoff)^2)
rf4_RMSE <-  sqrt(rf4_MSE)

9.3 Randomforest with training data in 2014 to 2022

The reason I pick this range is because teams focus more on 3-points field goals since then.

data = data %>% filter(!year %in% c(2000:2013))
training3 = data %>% filter(!year %in% 2022)
#I don't need the "year" column when building models
training3 <- training3[,-14]
dim(training3)

## [1] 127  13

9.3.1 rf5

rf5 <- ranger(
  formula = wins_playoff ~ .,
    data = training3,
    num.trees = 500,
    mtry = 7,
    min.node.size = 8,
    sample.fraction = 0.55,
    seed = 123
)
save(rf5, file = "model.Rdata.rf5")

load(file = "model.Rdata.rf5")

rf5 prediction

prediction_rf5 <- predict(rf5, testing[,-12])
prediction_rf5 <- prediction_rf5$predictions
rf5_MSE <- mean((prediction_rf5 - testing$wins_playoff)^2)
rf5_RMSE <-  sqrt(rf5_MSE)

9.4 XGBoost with training data in 2003 to 2022

# variable names
features <- setdiff(names(training2), "wins_playoff")
# Create the treatment plan from the training data
treatplan <- vtreat::designTreatmentsZ(training2, features, verbose = FALSE)

# Get the "clean" variable names from the scoreFrame
new_vars <- treatplan %>%
  magrittr::use_series(scoreFrame) %>%        
  dplyr::filter(code %in% c("clean", "lev")) %>% 
  magrittr::use_series(varName)     

# Prepare the training data
features_train <- vtreat::prepare(treatplan, training2, varRestriction = new_vars) %>% as.matrix()
response_train <- training2$wins_playoff

# Prepare the test data
features_test <- vtreat::prepare(treatplan, testing, varRestriction = new_vars) %>% as.matrix()
response_test <- testing$wins_playoff

dim(features_train)

## [1] 302  13

dim(features_test)

## [1] 16 13

xgboost cv

To find the best nrounds

set.seed(123)
xgb.fit1 <- xgb.cv(
  data = features_train,
  label = response_train,
  nrounds = 1000,
  nfold = 10,
  objective = 'reg:squarederror',  # for regression models
  verbose = 0  # silent
)

print(xgb.fit1,verbose = TRUE)

## ##### xgb.cv 10-folds
## call:
##   xgb.cv(data = features_train, nrounds = 1000, nfold = 10, label = response_train, 
##     verbose = 0, objective = "reg:squarederror")
## params (as set within xgb.cv):
##   objective = "reg:squarederror", silent = "1"
## callbacks:
##   cb.evaluation.log()
## niter: 1000
## evaluation_log:
##     iter train_rmse_mean train_rmse_std test_rmse_mean test_rmse_std
##        1     5.145431127    0.084181931       5.311948     0.9151569
##        2     4.014819961    0.059591623       4.530873     0.9453641
##        3     3.232333047    0.068063746       4.080418     0.8940909
##        4     2.654793427    0.079155921       3.868666     0.8192614
##        5     2.231607191    0.064028583       3.709725     0.7347971
## ---                                                                 
##      996     0.001576238    0.000121107       3.726573     0.4827094
##      997     0.001576238    0.000121107       3.726573     0.4827094
##      998     0.001576238    0.000121107       3.726573     0.4827094
##      999     0.001576238    0.000121107       3.726573     0.4827094
##     1000     0.001576238    0.000121107       3.726573     0.4827094

xgb.fit1$evaluation_log %>%
  dplyr::summarise(
    ntrees.train = which(train_rmse_mean == min(train_rmse_mean))[1],
    rmse.train   = min(train_rmse_mean),
    ntrees.test  = which(test_rmse_mean == min(test_rmse_mean))[1],
    rmse.test   = min(test_rmse_mean)
  )

##   ntrees.train  rmse.train ntrees.test rmse.test
## 1          424 0.001576238           9  3.577176

The training error keeps decreasing and approaches zero (0.00015) at about 424 trees. However, the cross-validation error reaches the minimum RMSE (3.58) around 9 trees.

The detailed xgboost model iteration information is drawn as follows, the red line represents the training error, and the blue line represents the cross-validation error.

# plot error vs number trees
ggplot(xgb.fit1$evaluation_log) +
  geom_line(aes(iter, train_rmse_mean), color = "red") +
  geom_line(aes(iter, test_rmse_mean), color = "blue")