According to my prediction, the 2023 NBA champion will be the Boston Celtics, with the Milwaukee Bucks and Denver Nuggets as the second and third title contenders.
Not surprisingly, playoff teams have a much higher field goal percentage than non-playoff teams on average.
We can see that both 3-point field goals made and attempted have increased over time, and teams that make the playoffs tend to make and attempt more 3-point shots during the regular season than teams that do not. In addition, several peaks in the 3-Point Field Goals Percentage chart were created by the Golden State Warriors, including the regular seasons of their championship years 2015, 2017, 2018, and 2022. An insane 3-point percentage!
Interestingly, the number of 2-point attempts decreased significantly over time, and teams that made the playoffs took fewer 2-point shots than teams that did not. This is likely because strong teams focus their offense on 3-pointers. The lowest 2-point attempt mark among championship teams was the Warriors’ 2022 title season.
Take the number of star players into consideration
According to the schedule, the predicted winning percentage when two teams meet is taken as the output; carrying this forward round by round to the championship game, the team with the higher winning percentage in the final matchup is the predicted champion.
Executive Summary
Data Introduction
Libraries
Read file into R
Data Cleaning
Data Preparation
6.1 Merging Data
6.2 Factorizing variables
Exploratory Data Analysis
7.1 The response variable : wins_playoff
7.2 Some interesting facts in regular seasons from 2000 to 2023
Preparing data for modeling
8.1 Feature Selection
Modeling
9.1 Random Forest
9.2 XGBoost
Error Analysis
NBA 2023 Champion Prediction
I’m working on a fun project called “NBA 2023 Champion Prediction”. Basically, I’m using historical data and machine learning to try to figure out which NBA team is most likely to win the championship in 2023.
The NBA is hugely popular, and people love making predictions about which team will come out on top. So I’m using historical regular-season team stats, game results, and other data to train machine learning models that predict the number of playoff wins for each team. The team predicted to win the most games in the 2023 playoffs will be the predicted champion.
It will be interesting to see how accurate my predictions are, and I will buy sports lottery tickets according to them.
For this project, I scraped NBA regular season data from the website https://www.basketball-reference.com/. The website provides a comprehensive collection of NBA statistics, game results, and player information from past seasons.
I collected data for the 2000 - 2023 NBA regular seasons. The data includes information on team statistics, game results, and other relevant metrics such as team standings, player awards, and playoff results.
Overall, the data collected from basketball-reference.com provides a rich and diverse set of features that are essential for accurate prediction of NBA team performance.
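The scraping code itself is omitted from this report. For reference, a minimal sketch of the approach with rvest might look like the following (assuming rvest and the pipe are loaded, as in the next section); the season URL pattern is real, but which table index holds which standings table is an assumption.
#Sketch of the scraping approach (table index is an assumption)
url <- "https://www.basketball-reference.com/leagues/NBA_2023.html"
season_tables <- read_html(url) %>% html_table()
head(season_tables[[1]]) #e.g. one of the conference standings tables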
library(rvest)
library(tidyverse)
library(stringr)
library(DataExplorer)
library(corrplot)
library(ranger)
library(janitor)
library(caret)
library(randomForest)
library(xgboost)
path = "C:/Users/Jonathan/Documents/BA/R Data Science/More Cases/2023 NBA Prediction/"
#Read in each dataset
division_total_E <- read_csv(paste0(path, "division_total_E.csv"))
division_total_W <- read_csv(paste0(path, "division_total_W.csv"))
pergame_stats_total <- read_csv(paste0(path, "pergame_stats_total.csv"))
advanced_stats_total <- read_csv(paste0(path, "advanced_stats_total.csv"))
WL_playoff_total <- read_csv(paste0(path, "WL_playoff_total.csv"))
Make a copy of each dataset; I will edit the copies.
division_E <- division_total_E
division_W <- division_total_W
pergame <- pergame_stats_total
advanced <- advanced_stats_total
playoff <- WL_playoff_total
Remove the rows with unuseful values, including “Central Division”, “Atlantic Division”, “Southeast Division”, “Midwest Division”, “Pacific Division”, “Southwest Division”, and “Northwest Division”.
GB : “Games behind” is a term used in the NBA standings to indicate the gap in wins and losses between two teams. It represents the distance between a team’s position in the standings and the team currently leading the division or conference. Teams with a “—” value in “GB” ranked #1 in their division in the regular season that year, so I will replace the value with 0.
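As a quick worked example of the formula GB = ((leader wins - team wins) + (team losses - leader losses)) / 2, using the 2000 standings shown below: the Miami Heat led the division at 52-30 and the New York Knicks were 50-32, so the Knicks’ GB is ((52 - 50) + (32 - 30)) / 2 = 2.0, matching the table.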
Some values in the “Eastern_Conference” column contain an asterisk (“*”), which marks the teams that made the playoffs that year. I don’t need that information, so I am going to remove the “*”.
Drop the row id column.
#Remove the row with unuseful value
division_E <- division_E %>%
filter(!Eastern_Conference %in% c("Central Division", "Atlantic Division", "Southeast Division"))
division_W <- division_W %>%
filter(!Western_Conference %in% c("Midwest Division", "Pacific Division", "Southwest Division", "Northwest Division"))
#Replacing "-" with 0 in GB column
division_E$GB <- ifelse(division_E$GB == "—", 0, division_E$GB)
division_W$GB <- ifelse(division_W$GB == "—", 0, division_W$GB)
#Remove "*" in Eastern_Conference and Western_Conference
Eastern_Conference_clean <- division_E$Eastern_Conference %>%
str_remove_all(fixed("*"))
division_E <- division_E %>%
mutate(Eastern_Conference = Eastern_Conference_clean)
Western_Conference_clean <- division_W$Western_Conference %>%
str_remove_all(fixed("*"))
division_W <- division_W %>%
mutate(Western_Conference = Western_Conference_clean)
#I don't need the row id column
division_E <- division_E[,-1]
division_W <- division_W[,-1]
head(division_E)
## # A tibble: 6 × 9
## Eastern_Conference W L `W/L_percent` GB `PS/G` `PA/G` SRS Year
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 Miami Heat 52 30 .634 0 94.4 91.3 2.75 2000
## 2 New York Knicks 50 32 .610 2.0 92.1 90.7 1.30 2000
## 3 Philadelphia 76ers 49 33 .598 3.0 94.8 93.4 1.02 2000
## 4 Orlando Magic 41 41 .500 11.0 100.1 99.4 0.43 2000
## 5 Boston Celtics 35 47 .427 17.0 99.3 100.1 -1.00 2000
## 6 New Jersey Nets 31 51 .378 21.0 98.0 99.0 -1.18 2000
head(division_W)
## # A tibble: 6 × 9
## Western_Conference W L `W/L_percent` GB `PS/G` `PA/G` SRS Year
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 Utah Jazz 55 27 .671 0 96.5 92.0 4.52 2000
## 2 San Antonio Spurs 53 29 .646 2.0 96.2 90.2 5.92 2000
## 3 Minnesota Timberwol… 50 32 .610 5.0 98.5 96.0 2.67 2000
## 4 Dallas Mavericks 40 42 .488 15.0 101.4 102.0 -0.29 2000
## 5 Denver Nuggets 35 47 .427 20.0 99.0 101.1 -1.76 2000
## 6 Houston Rockets 34 48 .415 21.0 99.5 100.3 -0.57 2000
Drop the “League Average” rows in the “Team” column.
Some values in the “Team” column contain an asterisk (“*”), which marks the teams that made the playoffs that year. I don’t need that information, so I am going to remove the “*”.
Drop the row id column.
#Remove the row with unuseful value
pergame <- pergame %>%
filter(!Team %in% "League Average")
#Remove "*" in Team
team_clean <- pergame$Team %>%
str_remove_all(fixed("*"))
pergame <- pergame %>%
mutate(Team = team_clean)
#I don't need the row id column
pergame <- pergame[,-1]
head(pergame)
## # A tibble: 6 × 26
## Rk Team G MP FG FGA FG_Percent `3P` `3PA` `3P_Percent` `2P`
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 Sacra… 82 242. 40 88.9 0.45 6.5 20.2 0.322 33.4
## 2 2 Detro… 82 242. 37.1 80.9 0.459 5.4 14.9 0.359 31.8
## 3 3 Dalla… 82 241. 39 85.9 0.453 6.3 16.2 0.391 32.6
## 4 4 India… 82 241. 37.2 81 0.459 7.1 18.1 0.392 30
## 5 5 Milwa… 82 242. 38.7 83.3 0.465 4.8 13 0.369 33.9
## 6 6 Los A… 82 242. 38.3 83.4 0.459 4.2 12.8 0.329 34.1
## # ℹ 15 more variables: `2PA` <dbl>, `2P_Percent` <dbl>, FT <dbl>, FTA <dbl>,
## # FT_Percent <dbl>, ORB <dbl>, DRB <dbl>, TRB <dbl>, AST <dbl>, STL <dbl>,
## # BLK <dbl>, TOV <dbl>, PF <dbl>, PTS <dbl>, Year <dbl>
The table format is incorrect in some columns, so I have to fix it first.
Remove the rows with unuseful values.
Some values in the “Team” column contain an asterisk (“*”), which marks the teams that made the playoffs that year. I don’t need that information, so I am going to remove the “*”.
Drop the row id column.
#Fixing the columns
advanced <- subset(advanced, select = -c(eFG_percent_offense, TOV_percent_defense, `Attend/G`))
colnames(advanced)[19:29] <- c("eFG_percent_offense", "TOV_percent_offense", "ORB_percent", "FT/FGA_offense", "eFG_percent_defense", "TOV_percent_defense", "DRB_percent", "FT/FGA_defense", "Arena", "Attend", "Attend/G")
#Remove the row with unuseful value
advanced <- advanced %>%
filter(!Team %in% "Team")
#Remove "*" in Team
team_clean <- advanced$Team %>%
str_remove_all(fixed("*"))
advanced <- advanced %>%
mutate(Team = team_clean)
#I don't need the row id column
advanced <- advanced[,-1]
#Removing comma in "Attend" and "Attend/G"
clean_strings_attend <- gsub(",", "", advanced$Attend)
clean_strings_attendG <- gsub(",", "", advanced$`Attend/G`)
advanced <- advanced %>%
mutate(Attend = clean_strings_attend,
`Attend/G` = clean_strings_attendG)
head(advanced)
## # A tibble: 6 × 29
## Rk Team Age W L PW PL MOV SOS SRS OPtg DRtg NRtg
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1 Los A… 29.2 67 15 64 18 8.55 -0.14 8.41 107.3 98.2 +9.1
## 2 2 Portl… 29.6 59 23 59 23 6.40 -0.04 6.36 107.9 100.8 +7.1
## 3 3 San A… 30.9 53 29 58 24 5.94 -0.02 5.92 105.0 98.6 +6.4
## 4 4 Phoen… 28.6 53 29 56 26 5.22 0.02 5.24 104.6 99.0 +5.6
## 5 5 Utah … 31.5 55 27 54 28 4.46 0.05 4.52 107.3 102.3 +5.0
## 6 6 India… 30.4 56 26 54 28 4.60 -0.45 4.15 108.5 103.6 +4.9
## # ℹ 16 more variables: Pace <chr>, FTr <chr>, `3PAr` <chr>, TS_percent <chr>,
## # eFG_percent_offense <chr>, TOV_percent_offense <chr>, ORB_percent <chr>,
## # `FT/FGA_offense` <chr>, eFG_percent_defense <chr>,
## # TOV_percent_defense <chr>, DRB_percent <chr>, `FT/FGA_defense` <chr>,
## # Arena <chr>, Attend <chr>, `Attend/G` <chr>, Year <dbl>
Only keep the columns I need: “W” (wins in the playoffs) as my dependent variable (Y), with “Team” and “Year” as foreign keys.
Remove the rows with unuseful values.
Drop the 2023 rows, because that is what we are going to predict.
#Keep "W", "Team" and "Year"
playoff <- playoff[,c(3,5,27)]
#Remove the row with unuseful value
playoff <- playoff %>%
filter(!Team %in% c("Tm", "League Average", "Team"))
##Drop 2023
playoff <- playoff %>%
filter(!Year %in% 2023)
head(playoff)
## # A tibble: 6 × 3
## Team W Year
## <chr> <chr> <dbl>
## 1 Portland Trail Blazers 10 2000
## 2 Miami Heat 6 2000
## 3 Indiana Pacers 13 2000
## 4 Los Angeles Lakers 15 2000
## 5 Milwaukee Bucks 2 2000
## 6 New York Knicks 9 2000
More cleaning…
#Some strange number in values of "Eastern_Conference" and "Western_Conference" in the year 2023.
division_E %>%
filter(Year %in% 2023) %>%
head()
## # A tibble: 6 × 9
## Eastern_Conference W L `W/L_percent` GB `PS/G` `PA/G` SRS Year
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 Boston Celtics (2) 57 25 .695 0 117.9 111.4 6.38 2023
## 2 Philadelphia 76ers … 54 28 .659 3.0 115.2 110.9 4.37 2023
## 3 New York Knicks (5) 47 35 .573 10.0 116.0 113.1 2.99 2023
## 4 Brooklyn Nets (6) 45 37 .549 12.0 113.4 112.5 1.03 2023
## 5 Toronto Raptors (9) 41 41 .500 16.0 112.9 111.4 1.59 2023
## 6 Milwaukee Bucks (1) 58 24 .707 0 116.9 113.3 3.61 2023
division_W %>%
filter(Year %in% 2023) %>%
head()
## # A tibble: 6 × 9
## Western_Conference W L `W/L_percent` GB `PS/G` `PA/G` SRS Year
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 Denver Nuggets (1) 53 29 .646 0 115.8 112.5 3.04 2023
## 2 Minnesota Timberwol… 42 40 .512 11.0 115.8 115.8 -0.22 2023
## 3 Oklahoma City Thund… 40 42 .488 13.0 117.5 116.4 0.96 2023
## 4 Utah Jazz (12) 37 45 .451 16.0 117.1 118.0 -1.03 2023
## 5 Portland Trail Blaz… 33 49 .402 20.0 113.4 117.4 -3.96 2023
## 6 Sacramento Kings (3) 48 34 .585 0 120.7 118.1 2.30 2023
#Remove the pattern "(number)" using gsub()
clean_strings_E <- gsub("\\s*\\(\\d+\\)\\s*", "", division_E$Eastern_Conference)
division_E <- division_E %>%
mutate(Eastern_Conference = clean_strings_E)
clean_strings_W <- gsub("\\s*\\(\\d+\\)\\s*", "", division_W$Western_Conference)
division_W <- division_W %>%
mutate(Western_Conference = clean_strings_W)
#Remove trailing spaces
division_E$Eastern_Conference <- str_trim(division_E$Eastern_Conference)
division_W$Western_Conference <- str_trim(division_W$Western_Conference)
table(division_E$Eastern_Conference)
##
## Atlanta Hawks Boston Celtics Brooklyn Nets Charlotte Bobcats
## 24 24 11 10
## Charlotte Hornets Chicago Bulls Cleveland Cavaliers Detroit Pistons
## 12 24 24 24
## Indiana Pacers Miami Heat Milwaukee Bucks New Jersey Nets
## 24 24 24 13
## New Orleans Hornets New York Knicks Orlando Magic Philadelphia 76ers
## 2 24 24 24
## Toronto Raptors Washington Wizards
## 24 24
length(table(division_E$Eastern_Conference))
## [1] 18
table(division_E$Year)
##
## 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
## 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15
## 2016 2017 2018 2019 2020 2021 2022 2023
## 15 15 15 15 15 15 15 15
table(division_W$Western_Conference)
##
## Dallas Mavericks Denver Nuggets
## 24 24
## Golden State Warriors Houston Rockets
## 24 24
## Los Angeles Clippers Los Angeles Lakers
## 24 24
## Memphis Grizzlies Minnesota Timberwolves
## 22 24
## New Orleans Hornets New Orleans Pelicans
## 7 10
## New Orleans/Oklahoma City Hornets Oklahoma City Thunder
## 2 15
## Phoenix Suns Portland Trail Blazers
## 24 24
## Sacramento Kings San Antonio Spurs
## 24 24
## Seattle SuperSonics Utah Jazz
## 9 24
## Vancouver Grizzlies
## 2
length(table(division_W$Western_Conference))
## [1] 19
table(division_W$Year)
##
## 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
## 14 14 14 14 14 15 15 15 15 15 15 15 15 15 15 15
## 2016 2017 2018 2019 2020 2021 2022 2023
## 15 15 15 15 15 15 15 15
There were 15 teams per year in the Eastern Conference and 14 or 15 in the Western Conference, yet 18 and 19 distinct team names appear in the data, respectively. The reason is that some teams changed their names over time.
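I leave the historical names as they are, since team identity is not used as a model feature. For reference, here is a sketch (not applied anywhere below) of how the historical names could be mapped to their current franchises with dplyr::recode; the franchise_map object is mine and not part of the cleaning pipeline.
#Sketch only, not applied: map historical team names to current franchises
franchise_map <- c(
  "New Jersey Nets"                   = "Brooklyn Nets",
  "Charlotte Bobcats"                 = "Charlotte Hornets",
  "Seattle SuperSonics"               = "Oklahoma City Thunder",
  "Vancouver Grizzlies"               = "Memphis Grizzlies",
  "New Orleans Hornets"               = "New Orleans Pelicans",
  "New Orleans/Oklahoma City Hornets" = "New Orleans Pelicans"
)
#This would collapse the 19 Western Conference names into the 15 current franchises
length(table(recode(division_W$Western_Conference, !!!franchise_map)))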
Change the name of the “W” column in the playoff dataset.
Create a “performance” column in the playoff dataset.
Create a “division” column in division_E and division_W.
Change the names of the “Western_Conference” and “Eastern_Conference” columns.
#Give the dependent variable a new name
colnames(playoff)[2] <- "wins_playoff"
#Create a column to distinguish teams that made the playoffs, teams that did not, and the team that won the championship
playoff$performance <- ""
playoff$wins_playoff <- as.numeric(playoff$wins_playoff)
playoff$performance <- ifelse((playoff$wins_playoff == 15 & playoff$Year %in% 2000:2002) | (playoff$wins_playoff == 16), "champion", "playoff")
head(playoff)
## # A tibble: 6 × 4
## Team wins_playoff Year performance
## <chr> <dbl> <dbl> <chr>
## 1 Portland Trail Blazers 10 2000 playoff
## 2 Miami Heat 6 2000 playoff
## 3 Indiana Pacers 13 2000 playoff
## 4 Los Angeles Lakers 15 2000 champion
## 5 Milwaukee Bucks 2 2000 playoff
## 6 New York Knicks 9 2000 playoff
#Create a column of division
division_E$division <- "E"
division_W$division <- "W"
#Change the names of columns
colnames(division_E)[1] <- "conference"
colnames(division_W)[1] <- "conference"
Start merging datasets
#binding division_E and division_W
division_total <- rbind(division_E, division_W)
#changing the column name "conference" into "Team"
colnames(division_total)[1] <- "Team"
#Drop the duplicate columns that another dataset already have
division_total <- subset(division_total, select = -c(W, L, SRS))
#joining advanced with pergame
regular_season <- pergame %>%
left_join(advanced, by = c("Team", "Year"), suffix = c("_pergame", "_advanced")) %>%
left_join(division_total, by = c("Team", "Year"))
#joining playoff with regular_season
data <- regular_season %>%
left_join(playoff, by = c("Team", "Year"))
#Give teams that failed to make the playoffs that year a "fail" value in the "performance" column
data$performance <- ifelse(is.na(data$performance), "fail", data$performance)
#Give better names to columns "W" and "L", which are wins and losses in regular seasons
colnames(data)[29] <- "wins_regular"
colnames(data)[30] <- "loses_regular"
dim(data)
## [1] 715 60
Columns that should be numeric
#Columns that should remain character
which(names(data) %in% c("Team", "Arena", "division", "performance"))
## [1] 2 51 58 60
# specify the column indices to convert to numeric
numeric_cols <- setdiff(seq_along(data), c(2, 51, 58, 60))
# convert to numeric
data[, numeric_cols] <- apply(data[, numeric_cols], 2, as.numeric)
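For reference, the same conversion could be written by column name instead of position with dplyr::across; this sketch is an equivalent alternative, not an additional step.
#Equivalent alternative: convert everything except the character columns by name
data <- data %>%
  mutate(across(-c(Team, Arena, division, performance), as.numeric))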
Columns that should be factors
data$division <- as.factor(data$division)
data$performance <- as.factor(data$performance)
The goal is to predict the 2023 NBA champion.
nba_2023 <- data %>%
filter(Year == 2023) %>%
select(-performance)
data <- data %>%
filter(!Year %in% 2023)
#create_report(data)
Note: I will keep the “performance” column for EDA, but I won’t use it later in modeling because we only know the champion “after” the whole season is over.
summary(data$wins_playoff)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 1.000 3.500 5.181 7.250 16.000 321
data %>%
count(wins_playoff) %>%
ggplot(aes(wins_playoff, n)) +
geom_col(fill = "slateblue") +
labs(title = "Distribution of wins in playoff from 2000 - 2023",
x = "wins", y = "count") +
theme_minimal() +
theme(plot.title = element_text(size = 13),
axis.title=element_text(size=8)) +
theme(panel.background = element_rect(fill='white'),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank())+
geom_text(aes(label = n), size = 3, vjust = -0.8, color = "black") +
ylim(0,60)
## Warning: Removed 1 rows containing missing values (`position_stack()`).
## Warning: Removed 1 rows containing missing values (`geom_text()`).
We can see that the distribution is skewed. This is simply because most teams lose in the first round, when the 16 playoff teams are cut down to 8.
data %>%
group_by(Year, performance) %>%
summarise(mean_FG_percent = mean(FG_Percent)) %>%
ggplot(aes(Year, mean_FG_percent, group = performance, color =performance)) +
geom_point() +
geom_line()+
labs(title = "Evolution of Average Field Goals Percentage per game in Regular Season",
x = "Year", y = "Field Goals Percentage") +
theme_minimal() +
theme(plot.title = element_text(size = 13),
axis.title=element_text(size=8)) +
theme(panel.background = element_rect(fill='white'),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank())+
theme(legend.position = c(0.13, 0.8),
legend.background = element_rect(fill = "white", color = "black"))
## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.
Not surprisingly, playoff teams have a much higher field goal percentage than non-playoff teams on average.
3P : 3-point field goals
3PA : 3-point field goal attempts
3P_Percent : 3-point field goal percentage
data %>%
group_by(Year, performance) %>%
summarise(mean_3P = mean(`3P`)) %>%
ggplot(aes(Year, mean_3P, group = performance, color =performance)) +
geom_point() +
geom_line()+
labs(title = "Evolution of Average 3-Point Field Goals per game in Regular Season",
x = "Year", y = "3-Point Field Goals") +
theme_minimal() +
theme(plot.title = element_text(size = 12),
axis.title=element_text(size=8)) +
theme(panel.background = element_rect(fill='white'),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank())+
theme(legend.position = c(0.13, 0.8),
legend.background = element_rect(fill = "white", color = "black"))
## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.
data %>%
group_by(Year, performance) %>%
summarise(mean_3PA = mean(`3PA`)) %>%
ggplot(aes(Year, mean_3PA, group = performance, color =performance)) +
geom_point() +
geom_line()+
labs(title = "Evolution of Average 3-Point Field Goals Attempts per game in Regular Season",
x = "Year", y = "3-Point Field Goals Attempts") +
theme_minimal() +
theme(plot.title = element_text(size = 12),
axis.title=element_text(size=8)) +
theme(panel.background = element_rect(fill='white'),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank())+
theme(legend.position = c(0.13, 0.8),
legend.background = element_rect(fill = "white", color = "black"))
## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.
data %>%
group_by(Year, performance) %>%
summarise(mean_3P_Percent = mean(`3P_Percent`)) %>%
ggplot(aes(Year, mean_3P_Percent, group = performance, color =performance)) +
geom_point() +
geom_line()+
labs(title = "Evolution of Average 3-Point Field Goals Percentage per game in Regular Season",
x = "Year", y = "3-Point Field Goals Percentage") +
theme_minimal() +
theme(plot.title = element_text(size = 12),
axis.title=element_text(size=8)) +
theme(panel.background = element_rect(fill='white'),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank())+
theme(legend.position = c(0.13, 0.8),
legend.background = element_rect(fill = "white", color = "black"))
## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.
We can see that both 3-point field goals made and attempted have increased over time, and teams that make the playoffs tend to make and attempt more 3-point shots during the regular season than teams that do not. In addition, several peaks in the 3-Point Field Goals Percentage chart were created by the Golden State Warriors, including the regular seasons of their championship years 2015, 2017, 2018, and 2022. An insane 3-point percentage!
data %>%
group_by(Year, performance) %>%
summarise(mean_2PA = mean(`2PA`)) %>%
ggplot(aes(Year, mean_2PA, group = performance, color =performance)) +
geom_point() +
geom_line()+
labs(title = "Evolution of Average 2-Point Field Goals Attempt per game in Regular Season",
x = "Year", y = "2-Point Field Goals Attempt") +
theme_minimal() +
theme(plot.title = element_text(size = 12),
axis.title=element_text(size=8)) +
theme(panel.background = element_rect(fill='white'),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank())+
theme(legend.position = c(0.13, 0.3),
legend.background = element_rect(fill = "white", color = "black"))
## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.
Interestingly, the number of 2-point attempts decreased significantly over time, and teams that made the playoffs took fewer 2-point shots than teams that did not. This is likely because strong teams focus their offense on 3-pointers. The lowest 2-point attempt mark among championship teams was the Warriors’ 2022 title season.
data %>%
group_by(Year, performance) %>%
summarise(mean_PTS = mean(`PTS`)) %>%
ggplot(aes(Year, mean_PTS, group = performance, color =performance)) +
geom_point() +
geom_line()+
labs(title = "Evolution of Points per game in Regular Season",
x = "Year", y = "Points") +
theme_minimal() +
theme(plot.title = element_text(size = 12),
axis.title=element_text(size=8)) +
theme(panel.background = element_rect(fill='white'),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank())+
theme(legend.position = c(0.13, 0.8),
legend.background = element_rect(fill = "white", color = "black"))
## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.
Goal: predict the 2023 NBA champion using “nba_2023” as the input. “wins_playoff” will be the dependent variable (Y), and the team with the most predicted wins will be the predicted champion. The final independent variables (X) will be selected later using various methods.
nba_2023 <- as.data.frame(nba_2023)
#Convert column "Team" to row index.
rownames(nba_2023) <- nba_2023$Team
#We don't need column "Team" and "Arena"
nba_2023 <- nba_2023 %>% select(-c(Team,Arena))
There are NAs in the “Attend” and “Attend/G” columns, and because the correlations between those features and “wins_playoff” are low, I decided to drop them.
cor(data$Attend, data$wins_playoff,use="pairwise.complete.obs")
## [1] 0.09790018
cor(data$`Attend/G`, data$wins_playoff,use="pairwise.complete.obs")
## [1] 0.1002934
nba_2023 <- nba_2023 %>% select(-c(Attend, `Attend/G`))
data <- data %>% select(-c(Attend, `Attend/G`))
I only need the teams that made the playoffs each year, and I don’t need the “performance”, “Team”, and “Arena” (too many levels) columns.
data <- data %>%
filter(performance %in% c("champion", "playoff")) %>%
select(-c(performance,Arena))
Cleaning the column names
data <- clean_names(data)
nba_2023 <- clean_names(nba_2023)
Create a copy of data before feature selection
data <- as.data.frame(data)
rownames(data) <- paste0(data$team, data$year)
data_2 <- data
set.seed(1234)
quick_rf <- randomForest(x=data[,-56], y=data$wins_playoff, ntree=500, importance = T)
important <- round(importance(quick_rf), 2)
imp_DF <- data.frame(Variables = row.names(important), MSE = important[,1])
imp_DF <- imp_DF[order(imp_DF$MSE, decreasing = TRUE),]
ggplot(imp_DF[1:20,], aes(x=reorder(Variables, MSE), y=MSE, fill=MSE)) + geom_bar(stat = 'identity') + labs(x = 'Variables', y= '% increase MSE if variable is randomly permuted') + coord_flip() + theme(legend.position="none")
Choose the features with a %IncMSE greater than 5.
rf_features <- imp_DF %>% filter(MSE >5)
rf_features <- rf_features$Variables
numericVars <- which(sapply(data, is.numeric)) #index vector of numeric variables
numericVarNames <- names(numericVars) #saving the names for later use
cat('There are', length(numericVars), 'numeric variables')
## There are 54 numeric variables
#find correlations of all numeric variables
data_numVar <- data[, numericVars]
cor_numVar <- cor(data_numVar, use="pairwise.complete.obs")#pairwise.complete.obs uses the non-NA values when calculating the correlation.
#sort on decreasing correlations with wins_playoff
cor_sorted <- as.matrix(sort(cor_numVar[,'wins_playoff'], decreasing = TRUE))
#select only variables highly correlated with wins_playoff (|correlation| > 0.5)
CorHigh <- names(which(apply(cor_sorted, 1, function(x) abs(x)>0.5)))
cor_numVar <- cor_numVar[CorHigh, CorHigh]
#visualization
corrplot.mixed(cor_numVar, tl.col = "black", tl.pos = "lt")
The correlation plot also makes it clear that multicollinearity would be an issue for some algorithms.
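I keep the correlated features here because tree-based models tolerate multicollinearity reasonably well. If a linear model were planned, one option is caret::findCorrelation; this sketch (not applied) flags columns to drop so that the remaining pairwise correlations stay below a chosen cutoff, 0.9 being an arbitrary choice.
#Sketch only, not applied: columns to drop to keep pairwise |correlation| < 0.9
drop_idx <- caret::findCorrelation(cor_numVar, cutoff = 0.9)
colnames(cor_numVar)[drop_idx]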
#Use all_of() so tidyselect accepts the external character vectors
data <- data %>%
select(all_of(c(rf_features, CorHigh)), wins_playoff, year)
## Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
## ℹ Please use `all_of()` or `any_of()` instead.
## # Was:
## data %>% select(CorHigh)
##
## # Now:
## data %>% select(all_of(CorHigh))
##
## See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
nba_2023 <- nba_2023 %>%
select(all_of(c(rf_features, CorHigh)), wins_playoff)
I will train the models on different slices of the training data and test them all on the same test set (year 2022).
Composing training and testing sets
training = data %>% filter(!year %in% 2022)
testing = data %>% filter(year %in% 2022)
#I don't need the "year" column when building models
training <- training[,-14]
testing <- testing[,-14]
dim(training)
## [1] 348 13
dim(testing)
## [1] 16 13
rf1: Basic RF (baseline)
set.seed(1234)
rf1 <- randomForest(wins_playoff~., data = training, importance = T)
rf1
##
## Call:
## randomForest(formula = wins_playoff ~ ., data = training, importance = T)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 11.4862
## % Var explained: 48.01
plot(rf1)
#The number of trees that minimizes the MSE
which.min(rf1$mse)
## [1] 98
#The RMSE of the best random forest (i.e., the average error in predicted playoff wins):
sqrt(rf1$mse[which.min(rf1$mse)])
## [1] 3.358748
save(rf1, file = "model.Rdata.rf1")
load(file = "model.Rdata.rf1")
Finding Variable importance
round(importance(rf1), 2)
## %IncMSE IncNodePurity
## w_l_percent 19.40 1384.17
## gb 17.21 958.86
## wins_regular 14.20 608.53
## loses_regular 14.09 912.59
## sos 14.80 596.62
## n_rtg 9.47 562.44
## mov 12.73 724.51
## rk_advanced 10.76 388.30
## pl 9.09 289.68
## division 13.57 163.97
## srs 8.75 481.99
## pw 9.44 259.14
important <- importance(rf1)
imp_DF <- data.frame(Variables = row.names(important), MSE = important[,1])
imp_DF <- imp_DF[order(imp_DF$MSE, decreasing = TRUE),]
ggplot(imp_DF, aes(x=reorder(Variables, MSE), y=MSE, fill=MSE)) + geom_bar(stat = 'identity') + labs(x = 'Variables', y= '% increase MSE if variable is randomly permuted') + coord_flip() + theme(legend.position="none")
rf1 prediction
prediction_rf1 <- predict(rf1, testing[, -12])
rf1_MSE <- mean((prediction_rf1 - testing$wins_playoff)^2)
rf1_RMSE <- sqrt(rf1_MSE)
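This predict-then-score pattern repeats for every model below, so a small helper could compute it in one call; this is a hypothetical convenience, not used in the rest of the report.
#Hypothetical helper: RMSE between predictions and actual playoff wins
rmse <- function(predicted, actual) sqrt(mean((predicted - actual)^2))
#e.g. rf1_RMSE <- rmse(prediction_rf1, testing$wins_playoff)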
Tuning
Finding best mtry
# Filter out predictor variable names
features <- setdiff(x = names(training), y = "wins_playoff")
set.seed(123)
RF_tune <- tuneRF(x = training[features], y = training$wins_playoff, mtryStart = 1, ntreeTry = 5000, trace = FALSE)
## -0.0206462 0.05
RF_tune
## mtry OOBError
## 1 1 11.16308
## 2 2 11.39356
Best mtry is 1.
# hyperparameter grid search
hyper_grid <- expand.grid(
mtry = seq(1, 10, by = 1),
node_size = seq(1, 9, by = 1),
sample_size = c(0.55, 0.632, 0.7, 0.8),
OOB_RMSE = 0
)
# total number of combinations
nrow(hyper_grid)
## [1] 360
for (i in 1:nrow(hyper_grid)) {
# train model
ranger_tune <- ranger(
formula = wins_playoff ~ .,
data = training,
num.trees = 500,
mtry = hyper_grid$mtry[i],
min.node.size = hyper_grid$node_size[i],
sample.fraction = hyper_grid$sample_size[i],
seed = 123
)
hyper_grid$OOB_RMSE[i] <- sqrt(ranger_tune$prediction.error)
}
hyper_grid %>%
dplyr::arrange(OOB_RMSE) %>%
head(10)
## mtry node_size sample_size OOB_RMSE
## 1 7 8 0.55 3.297908
## 2 7 7 0.55 3.298633
## 3 10 9 0.55 3.300885
## 4 10 8 0.55 3.301806
## 5 5 9 0.55 3.302102
## 6 7 9 0.55 3.304781
## 7 9 9 0.55 3.305087
## 8 10 7 0.55 3.305618
## 9 9 9 0.70 3.307140
## 10 5 8 0.55 3.307395
The OOB_RMSE falls at roughly 3.3 games.
The optimal values of mtry are spread across the whole 5-10 range, indicating that mtry does not have much impact on OOB_RMSE in this interval.
The optimal minimum node size falls in the range of 7 to 9.
The optimal sampling ratio is about 0.55.
We refit the model with these parameter settings 100 times to estimate the expected error of the model.
OOB_RMSE_ranger <- vector(mode = "numeric", length = 100)
for(i in 1:length(OOB_RMSE_ranger)){
optimal_ranger <- ranger(
formula = wins_playoff ~ .,
data = training,
num.trees = 500,
mtry = 7,
min.node.size = 8,
sample.fraction = 0.55,
importance = 'impurity'
)
OOB_RMSE_ranger[i] <- sqrt(optimal_ranger$prediction.error)
}
hist(OOB_RMSE_ranger, breaks = 20)
rf2 <- ranger(
formula = wins_playoff ~ .,
data = training,
num.trees = 500,
mtry = 7,
min.node.size = 8,
sample.fraction = 0.55,
seed = 123
)
save(rf2, file = "model.Rdata.rf2")
load(file = "model.Rdata.rf2")
rf2 prediction
prediction_rf2 <- predict(rf2, testing[,-12])
prediction_rf2 <- prediction_rf2$predictions
rf2_MSE <- mean((prediction_rf2 - testing$wins_playoff)^2)
rf2_RMSE <- sqrt(rf2_MSE)
Try the default mtry
OOB_RMSE_ranger <- vector(mode = "numeric", length = 100)
for(i in 1:length(OOB_RMSE_ranger)){
optimal_ranger <- ranger(
formula = wins_playoff ~ .,
data = training,
num.trees = 500,
min.node.size = 8,
sample.fraction = 0.55,
importance = 'impurity'
)
OOB_RMSE_ranger[i] <- sqrt(optimal_ranger$prediction.error)
}
hist(OOB_RMSE_ranger, breaks = 20)
rf3 <- ranger(
formula = wins_playoff ~ .,
data = training,
num.trees = 500,
min.node.size = 8,
sample.fraction = 0.55,
seed = 123
)
save(rf3, file = "model.Rdata.rf3")
load(file = "model.Rdata.rf3")
rf3 prediction
prediction_rf3 <- predict(rf3, testing[,-12])
prediction_rf3 <- prediction_rf3$predictions
rf3_MSE <- mean((prediction_rf3 - testing$wins_playoff)^2)
rf3_RMSE <- sqrt(rf3_MSE)
I pick this range because the playoff format changed in 2003; since then, winning the championship has required winning four best-of-seven series (16 wins).
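As a quick sanity check of this rule (a sketch run before the filter below), the maximum playoff win total per year should be 15 through 2002 and 16 from 2003 on.
#Champions won 15 games in 2000-2002 (best-of-5 first round) and 16 since 2003
data %>% group_by(year) %>% summarise(max_wins = max(wins_playoff))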
data = data %>% filter(!year %in% c(2000:2002))
training2 = data %>% filter(!year %in% 2022)
#I don't need the "year" column when building models
training2 <- training2[,-14]
dim(training2)
## [1] 302 13
rf4 <- ranger(
formula = wins_playoff ~ .,
data = training2,
num.trees = 500,
mtry = 7,
min.node.size = 8,
sample.fraction = 0.55,
seed = 123
)
save(rf4, file = "model.Rdata.rf4")
load(file = "model.Rdata.rf4")
rf4 prediction
prediction_rf4 <- predict(rf4, testing[,-12])
prediction_rf4 <- prediction_rf4$predictions
rf4_MSE <- mean((prediction_rf4 - testing$wins_playoff)^2)
rf4_RMSE <- sqrt(rf4_MSE)
I pick this range because teams have focused more on 3-point field goals since then.
data = data %>% filter(!year %in% c(2000:2013))
training3 = data %>% filter(!year %in% 2022)
#I don't need the "year" column when building models
training3 <- training3[,-14]
dim(training3)
## [1] 127 13
rf5 <- ranger(
formula = wins_playoff ~ .,
data = training3,
num.trees = 500,
mtry = 7,
min.node.size = 8,
sample.fraction = 0.55,
seed = 123
)
save(rf5, file = "model.Rdata.rf5")
load(file = "model.Rdata.rf5")
rf5 prediction
prediction_rf5 <- predict(rf5, testing[,-12])
prediction_rf5 <- prediction_rf5$predictions
rf5_MSE <- mean((prediction_rf5 - testing$wins_playoff)^2)
rf5_RMSE <- sqrt(rf5_MSE)
# variable names
features <- setdiff(names(training2), "wins_playoff")
# Create the treatment plan from the training data
treatplan <- vtreat::designTreatmentsZ(training2, features, verbose = FALSE)
# Get the "clean" variable names from the scoreFrame
new_vars <- treatplan %>%
magrittr::use_series(scoreFrame) %>%
dplyr::filter(code %in% c("clean", "lev")) %>%
magrittr::use_series(varName)
# Prepare the training data
features_train <- vtreat::prepare(treatplan, training2, varRestriction = new_vars) %>% as.matrix()
response_train <- training2$wins_playoff
# Prepare the test data
features_test <- vtreat::prepare(treatplan, testing, varRestriction = new_vars) %>% as.matrix()
response_test <- testing$wins_playoff
dim(features_train)
## [1] 302 13
dim(features_test)
## [1] 16 13
To find the best nrounds
set.seed(123)
xgb.fit1 <- xgb.cv(
data = features_train,
label = response_train,
nrounds = 1000,
nfold = 10,
objective = 'reg:squarederror', # for regression models
verbose = 0 # silent
)
print(xgb.fit1,verbose = TRUE)
## ##### xgb.cv 10-folds
## call:
## xgb.cv(data = features_train, nrounds = 1000, nfold = 10, label = response_train,
## verbose = 0, objective = "reg:squarederror")
## params (as set within xgb.cv):
## objective = "reg:squarederror", silent = "1"
## callbacks:
## cb.evaluation.log()
## niter: 1000
## evaluation_log:
## iter train_rmse_mean train_rmse_std test_rmse_mean test_rmse_std
## 1 5.145431127 0.084181931 5.311948 0.9151569
## 2 4.014819961 0.059591623 4.530873 0.9453641
## 3 3.232333047 0.068063746 4.080418 0.8940909
## 4 2.654793427 0.079155921 3.868666 0.8192614
## 5 2.231607191 0.064028583 3.709725 0.7347971
## ---
## 996 0.001576238 0.000121107 3.726573 0.4827094
## 997 0.001576238 0.000121107 3.726573 0.4827094
## 998 0.001576238 0.000121107 3.726573 0.4827094
## 999 0.001576238 0.000121107 3.726573 0.4827094
## 1000 0.001576238 0.000121107 3.726573 0.4827094
xgb.fit1$evaluation_log %>%
dplyr::summarise(
ntrees.train = which(train_rmse_mean == min(train_rmse_mean))[1],
rmse.train = min(train_rmse_mean),
ntrees.test = which(test_rmse_mean == min(test_rmse_mean))[1],
rmse.test = min(test_rmse_mean)
)
## ntrees.train rmse.train ntrees.test rmse.test
## 1 424 0.001576238 9 3.577176
The training error keeps decreasing and approaches zero (about 0.0016) by roughly 424 trees. However, the cross-validation error reaches its minimum RMSE (3.58) at around 9 trees.
The xgboost iteration history is plotted below: the red line represents the training error, and the blue line represents the cross-validation error.
# plot error vs number trees
ggplot(xgb.fit1$evaluation_log) +
geom_line(aes(iter, train_rmse_mean), color = "red") +
geom_line(aes(iter, test_rmse_mean), color = "blue")
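Given how early the cross-validation error bottoms out, one could also let xgb.cv pick the stopping point itself via early stopping. A sketch on the same data (the object name xgb.fit.es is mine):
set.seed(123)
xgb.fit.es <- xgb.cv(
  data = features_train,
  label = response_train,
  nrounds = 1000,
  nfold = 10,
  objective = 'reg:squarederror',
  early_stopping_rounds = 10, #stop after 10 rounds without CV improvement
  verbose = 0
)
xgb.fit.es$best_iteration #the round with the lowest CV RMSE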
Tuning
# create hyperparameter grid
hyper_grid <- expand.grid(
eta = c(.01, .05, .1, .3),
max_depth = c(1:7),
min_child_weight = c(1:7),
subsample = c(.65,.7,.8, 1),
colsample_bytree = c(.5,.6,.7,.8,.9, 1),
optimal_trees = 0, # a place to dump results
min_RMSE = 0 # a place to dump results
)
nrow(hyper_grid)
## [1] 4704
# grid search
for(i in 1:nrow(hyper_grid)) {
# create parameter list
params <- list(
eta = hyper_grid$eta[i],
max_depth = hyper_grid$max_depth[i],
min_child_weight = hyper_grid$min_child_weight[i],
subsample = hyper_grid$subsample[i],
colsample_bytree = hyper_grid$colsample_bytree[i]
)
# reproducibility
set.seed(123)
# train model
xgb.tune <- xgb.cv(
params = params,
data = features_train,
label = response_train,
nrounds = 500,
nfold = 10,
objective = 'reg:squarederror', # for regression models
verbose = 0, # silent,
early_stopping_rounds = 10 # stop if no improvement for 10 consecutive trees
)
# add min training error and trees to grid
hyper_grid$optimal_trees[i] <- which.min(xgb.tune$evaluation_log$test_rmse_mean)
hyper_grid$min_RMSE[i] <- min(xgb.tune$evaluation_log$test_rmse_mean)
}
save(xgb.tune, file = "model.Rdata.xgb.tune")
save(hyper_grid, file = "model.Rdata.xgb_hyper_grid")
load(file = "model.Rdata.xgb.tune")
load(file = "model.Rdata.xgb_hyper_grid")
hyper_grid %>%
dplyr::arrange(min_RMSE) %>%
head(10)
## eta max_depth min_child_weight subsample colsample_bytree optimal_trees
## 1 0.10 5 6 0.7 1.0 29
## 2 0.10 5 7 0.7 1.0 29
## 3 0.10 4 5 0.7 1.0 29
## 4 0.10 5 7 0.8 1.0 29
## 5 0.30 2 6 0.8 1.0 27
## 6 0.10 4 2 0.7 1.0 29
## 7 0.05 3 5 0.8 1.0 88
## 8 0.10 4 1 0.8 0.6 43
## 9 0.10 6 7 0.7 1.0 29
## 10 0.10 4 6 0.7 1.0 29
## min_RMSE
## 1 3.290599
## 2 3.295362
## 3 3.305409
## 4 3.309496
## 5 3.313112
## 6 3.313654
## 7 3.314141
## 8 3.317956
## 9 3.317989
## 10 3.321103
The best hyperparameter combination is:
eta = 0.10
max_depth = 5
min_child_weight = 6
subsample = 0.7
colsample_bytree = 1.0
# parameter list
params <- list(
eta = 0.1,
max_depth = 5,
min_child_weight = 6,
subsample = 0.7,
colsample_bytree = 1.0
)
# train final model
xgb.fit.final <- xgboost(
params = params,
data = features_train,
label = response_train,
nrounds = 424, #From xgb.cv earlier
objective = 'reg:squarederror',
verbose = 0
)
save(xgb.fit.final, file = "model.Rdata.xgb.fit.final")
load(file = "model.Rdata.xgb.fit.final")
Variable importance
# create importance matrix
importance_matrix <- xgb.importance(model = xgb.fit.final)
importance_matrix
## Feature Gain Cover Frequency
## 1: w_l_percent 0.17684653 0.077981210 0.08586724
## 2: gb 0.16667365 0.073023757 0.08693790
## 3: sos 0.12830137 0.177857033 0.21541756
## 4: mov 0.12189073 0.144721516 0.12141328
## 5: srs 0.08462406 0.156085153 0.12655246
## 6: rk_advanced 0.06520816 0.052272246 0.05567452
## 7: loses_regular 0.06236744 0.034365254 0.03618844
## 8: n_rtg 0.05247345 0.108645220 0.10492505
## 9: wins_regular 0.05143428 0.081550094 0.07066381
## 10: pl 0.03758172 0.038697008 0.03875803
## 11: pw 0.03356304 0.047721497 0.04753747
## 12: division_lev_x_E 0.01903556 0.007080012 0.01006424
# variable importance plot
xgb.plot.importance(importance_matrix, top_n = 10, measure = "Gain")
# variable importance plot using 'cover'
xgb.plot.importance(importance_matrix, top_n = 10, measure = "Cover")
xgb.fit.final prediction
# predict values for test data
prediction_xgbfinal <- predict(xgb.fit.final, features_test)
xgbfinal_MSE <- mean((prediction_xgbfinal - testing$wins_playoff)^2)
xgbfinal_RMSE <- sqrt(xgbfinal_MSE)
# variable names
features <- setdiff(names(training3), "wins_playoff")
# Create the treatment plan from the training data
treatplan <- vtreat::designTreatmentsZ(training3, features, verbose = FALSE)
# Get the "clean" variable names from the scoreFrame
new_vars <- treatplan %>%
magrittr::use_series(scoreFrame) %>%
dplyr::filter(code %in% c("clean", "lev")) %>%
magrittr::use_series(varName)
# Prepare the training data
features_train2 <- vtreat::prepare(treatplan, training3, varRestriction = new_vars) %>% as.matrix()
response_train2 <- training3$wins_playoff
dim(features_train2)
## [1] 127 13
# parameter list
params <- list(
eta = 0.1,
max_depth = 5,
min_child_weight = 6,
subsample = 0.7,
colsample_bytree = 1.0
)
# train final model
xgb.fit.final2 <- xgboost(
params = params,
data = features_train2,
label = response_train2,
nrounds = 424, #From xgb.cv earlier
objective = 'reg:squarederror',
verbose = 0
)
save(xgb.fit.final2, file = "model.Rdata.xgb.fit.final2")
load(file = "model.Rdata.xgb.fit.final2")
xgb.fit.final2 prediction
# predict values for test data
prediction_xgbfinal2 <- predict(xgb.fit.final2, features_test)
xgbfinal2_MSE <- mean((prediction_xgbfinal2 - testing$wins_playoff)^2)
xgbfinal2_RMSE <- sqrt(xgbfinal2_MSE)
algorithm <- c("RF", "RF", "RF", "RF", "RF", "XGB", "XGB")
models <- c("rf1", "rf2", "rf3", "rf4", "rf5", "xgb.fit.final1", "xgb.fit.final2")
train_data_year <- c("2000-2022", "2000-2022", "2000-2022", "2003-2022", "2014-2022", "2003-2022", "2014-2022")
RMSE <- c(rf1_RMSE, rf2_RMSE, rf3_RMSE, rf4_RMSE, rf5_RMSE, xgbfinal_RMSE, xgbfinal2_RMSE)
models_RMSE <- data.frame(algorithm, models, train_data_year, RMSE)
models_RMSE %>% arrange(RMSE)
## algorithm models train_data_year RMSE
## 1 RF rf5 2014-2022 3.805945
## 2 RF rf4 2003-2022 3.975823
## 3 RF rf3 2000-2022 4.003766
## 4 RF rf2 2000-2022 4.112936
## 5 RF rf1 2000-2022 4.142098
## 6 XGB xgb.fit.final1 2003-2022 4.364409
## 7 XGB xgb.fit.final2 2014-2022 4.658134
We can see that Random Forest performs better than XGBoost in this case. I will still use the best model from each algorithm to predict the 2023 NBA champion, combining their predictions with a weighted average.
#predict by rf5
rf5_nba2023 <- predict(rf5, nba_2023)
rf5_nba2023 <- rf5_nba2023$predictions
#predict by xgb.fit.final
features_nba2023 <- vtreat::prepare(treatplan, nba_2023[,-12], varRestriction = new_vars) %>% as.matrix()
xgb_nba2023 <- predict(xgb.fit.final, features_nba2023)
#Combine the predictions to the nba_2023
nba_2023$wins_prediction_RF = rf5_nba2023
nba_2023$wins_prediction_XGB = xgb_nba2023
#Filter the teams that made the playoffs in 2023
nba_2023 <- tibble::rownames_to_column(nba_2023, "Team")
nba_2023_champion <- nba_2023 %>%
select(Team, wins_prediction_RF, wins_prediction_XGB) %>%
filter(Team %in% c("Denver Nuggets", "Minnesota Timberwolves", "Phoenix Suns", "Los Angeles Clippers", "Sacramento Kings", "Golden State Warriors", "Memphis Grizzlies", "Los Angeles Lakers", "Milwaukee Bucks", "Miami Heat", "Cleveland Cavaliers", "New York Knicks", "Philadelphia 76ers", "Brooklyn Nets", "Boston Celtics", "Atlanta Hawks"))
#Giving weight 0.7 to the RF prediction and 0.3 to the XGB prediction
nba_2023_champion <- nba_2023_champion %>%
mutate(wins_prediction_final = 0.7*wins_prediction_RF + 0.3*wins_prediction_XGB)
nba_2023_champion %>%
select(Team, wins_prediction_final) %>%
arrange(desc(wins_prediction_final))
## Team wins_prediction_final
## 1 Boston Celtics 11.7704007
## 2 Milwaukee Bucks 11.0499745
## 3 Denver Nuggets 9.1142745
## 4 Memphis Grizzlies 8.3473483
## 5 Philadelphia 76ers 6.3116063
## 6 Cleveland Cavaliers 4.8958520
## 7 Sacramento Kings 3.7613739
## 8 Brooklyn Nets 3.2039380
## 9 Miami Heat 2.8239872
## 10 Golden State Warriors 2.5018025
## 11 Los Angeles Clippers 2.1778907
## 12 Phoenix Suns 2.1547612
## 13 New York Knicks 2.1518132
## 14 Los Angeles Lakers 1.4124989
## 15 Atlanta Hawks 0.9388401
## 16 Minnesota Timberwolves 0.8539958