# Set working directory and path to data
setwd("C:/Users/LENOVO/Downloads/Regression Model") # Example path on Windows
# Clear the workspace
rm(list = ls()) # Clear environment
gc() # Clear unused memory
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 524671 28.1 1168867 62.5 660385 35.3
## Vcells 968739 7.4 8388608 64.0 1769879 13.6
cat("\f") # Clear the console
graphics.off() # Close all open graphics devices
Research Question:
How does the increase in the number of 3-point shots taken by NBA teams affect winning percentage, total points scored and field-goal percentage? How is this relationship influenced by the pace of the game?
Here,
The dependent variables are winning percentage, total points scored and field-goal percentage.
The independent variables describe 3-point shooting by each NBA team: 3-pointers made, 3-point attempts and 3-point percentage.
The control variables are games, field goals and pace of the game.
For this research question we use NBA team statistics for the years 2019 to 2022.
Data Source: Basketball Reference.
We pulled two different tables from the Basketball Reference site for each year: Per Game Stats and Advanced Stats. The Advanced Stats table is needed because the Win, Loss and Pace variables are not present in the Per Game Stats table.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.3
library(stargazer)
##
## Please cite as:
## Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
library(readxl)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#Season Stats Data
df <- read.csv("C:\\Users\\LENOVO\\Downloads\\Regression Model\\Project\\sportsref_download.csv")
head(df)
## Rk Team G MP FG FGA FG. X3P X3PA X3P. X2P X2PA
## 1 1 Milwaukee Bucks* 73 241.0 43.3 90.9 0.476 13.8 38.9 0.355 29.5 52.0
## 2 2 Houston Rockets* 72 241.4 40.8 90.4 0.451 15.6 45.3 0.345 25.1 45.2
## 3 3 Dallas Mavericks* 75 242.3 41.7 90.3 0.461 15.1 41.3 0.367 26.5 49.0
## 4 4 Los Angeles Clippers* 72 241.4 41.6 89.2 0.466 12.4 33.5 0.371 29.1 55.8
## 5 5 New Orleans Pelicans 72 242.1 42.6 91.6 0.465 13.6 36.9 0.370 28.9 54.8
## 6 6 Portland Trail Blazers* 74 241.0 42.2 91.2 0.463 12.9 34.1 0.377 29.3 57.1
## X2P. FT FTA FT. ORB DRB TRB AST STL BLK TOV PF PTS
## 1 0.567 18.3 24.7 0.742 9.5 42.2 51.7 25.9 7.2 5.9 15.1 19.6 118.7
## 2 0.557 20.6 26.1 0.791 9.8 34.5 44.3 21.6 8.7 5.2 14.7 21.8 117.8
## 3 0.541 18.6 23.8 0.779 10.5 36.4 46.9 24.7 6.1 4.8 12.7 19.5 117.0
## 4 0.522 20.8 26.3 0.791 10.7 37.0 47.7 23.7 7.1 4.7 14.6 22.1 116.3
## 5 0.528 17.1 23.4 0.729 11.1 35.4 46.5 26.8 7.5 5.0 16.4 21.2 115.8
## 6 0.514 17.7 22.1 0.804 10.2 35.1 45.3 20.6 6.3 6.1 12.8 21.7 115.0
#Advanced Stats Data
df1 <- read.csv("C:\\Users\\LENOVO\\Downloads\\sportsref_download.csv")
head(df1)
## Rk Team Age W L PW PL MOV SOS SRS ORtg DRtg NRtg
## 1 1 Milwaukee Bucks* 29.2 56 17 57 16 10.08 -0.67 9.41 112.4 102.9 9.5
## 2 2 Boston Celtics* 25.3 48 24 50 22 6.31 -0.47 5.83 113.3 107.0 6.3
## 3 3 Los Angeles Clippers* 27.4 49 23 50 22 6.44 0.21 6.66 113.9 107.6 6.3
## 4 4 Toronto Raptors* 26.6 53 19 50 22 6.24 -0.26 5.97 111.1 105.0 6.1
## 5 5 Los Angeles Lakers* 29.5 52 19 48 23 5.79 0.49 6.28 112.0 106.3 5.7
## 6 6 Dallas Mavericks* 26.1 43 32 49 26 4.95 -0.07 4.87 116.7 111.7 5.0
## Pace FTr X3PAr TS. X Offense.Four.Factors X.1 X.2 X.3 X.4
## 1 105.1 0.271 0.428 0.583 NA eFG% TOV% ORB% FT/FGA NA
## 2 99.5 0.259 0.386 0.570 NA 0.552 12.9 20.7 0.201 NA
## 3 101.5 0.295 0.375 0.577 NA 0.531 12.2 23.9 0.207 NA
## 4 100.9 0.264 0.421 0.574 NA 0.535 12.6 23.5 0.233 NA
## 5 100.9 0.276 0.358 0.573 NA 0.536 13.1 21.3 0.21 NA
## 6 99.3 0.264 0.457 0.581 NA 0.542 13.3 24.5 0.201 NA
## Defense.Four.Factors X.5 X.6 X.7 X.8 X.9 X.10 X.11
## 1 eFG% TOV% DRB% FT/FGA NA Arena Attend. Attend./G
## 2 0.489 12 81.6 0.178 NA Fiserv Forum 549,036 17,711
## 3 0.509 13.5 77.4 0.215 NA TD Garden 610,864 19,090
## 4 0.506 12.2 77.6 0.206 NA STAPLES Center 610,176 19,068
## 5 0.502 14.6 76.7 0.202 NA Scotiabank Arena 633,456 19,796
## 6 0.515 14.1 78.8 0.205 NA STAPLES Center 588,907 18,997
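We modify both datasets with the mutate function and gsub to replace every asterisk (*) with an empty string in all columns. On Basketball Reference the asterisk after a team name marks a playoff team, so it has to be removed before the two tables can be matched by team name. Note that gsub returns character vectors, so this step also converts every column to character; the columns are converted back to numeric further below.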
df <- df %>%
mutate(across(everything(), ~ gsub("\\*", "", .)))
df1 <- df1 %>%
mutate(across(everything(), ~ gsub("\\*", "", .)))
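Because gsub() returns character vectors, applying it to every column turns the numeric statistics into text. A minimal sketch of an alternative that cleans only the Team column is shown below. The toy data frame is a two-row stand-in built from the preview above, not the full downloaded file.

```r
# Toy stand-in for one downloaded table (two rows copied from the preview above)
toy <- data.frame(
  Team = c("Milwaukee Bucks*", "New Orleans Pelicans"),
  PTS  = c(118.7, 115.8),
  stringsAsFactors = FALSE
)

# fixed = TRUE treats "*" as a literal character, so no escaping is needed
toy$Team <- gsub("*", "", toy$Team, fixed = TRUE)

str(toy)  # Team is character, PTS is still numeric
```

With this variant the later conversion back to numeric would not be needed, at the cost of naming the column to clean explicitly.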
We are checking the column names so that we can know the variables in both datasets. Then we will remove the variables that we don’t need for this project.
names(df)
## [1] "Rk" "Team" "G" "MP" "FG" "FGA" "FG." "X3P" "X3PA" "X3P."
## [11] "X2P" "X2PA" "X2P." "FT" "FTA" "FT." "ORB" "DRB" "TRB" "AST"
## [21] "STL" "BLK" "TOV" "PF" "PTS"
names(df1)
## [1] "Rk" "Team" "Age"
## [4] "W" "L" "PW"
## [7] "PL" "MOV" "SOS"
## [10] "SRS" "ORtg" "DRtg"
## [13] "NRtg" "Pace" "FTr"
## [16] "X3PAr" "TS." "X"
## [19] "Offense.Four.Factors" "X.1" "X.2"
## [22] "X.3" "X.4" "Defense.Four.Factors"
## [25] "X.5" "X.6" "X.7"
## [28] "X.8" "X.9" "X.10"
## [31] "X.11"
The variables in both datasets cover the main statistical categories typically recorded for NBA teams, reflecting various aspects of team performance.
In the data merging step below, we merge the Advanced Stats dataset with the Per Game Stats dataset by team. Only the key variables from the Advanced Stats (Win, Loss and Pace) are kept; the remaining variables are removed because they are not needed for this research question.
# Drop every Advanced Stats column except Team, W, L and Pace
df_new <- subset(df1, select = -c(Age, Rk, PW, PL, MOV, SOS, SRS, ORtg, DRtg, NRtg, FTr, X3PAr, TS., X, Offense.Four.Factors, X.1, X.2, X.3, X.4, Defense.Four.Factors, X.5, X.6, X.7, X.8, X.9, X.10, X.11))
head(df_new)
## Team W L Pace
## 1 Milwaukee Bucks 56 17 105.1
## 2 Boston Celtics 48 24 99.5
## 3 Los Angeles Clippers 49 23 101.5
## 4 Toronto Raptors 53 19 100.9
## 5 Los Angeles Lakers 52 19 100.9
## 6 Dallas Mavericks 43 32 99.3
merged_df <- merge(df, df_new, by = "Team")
names(merged_df)
## [1] "Team" "Rk" "G" "MP" "FG" "FGA" "FG." "X3P" "X3PA" "X3P."
## [11] "X2P" "X2PA" "X2P." "FT" "FTA" "FT." "ORB" "DRB" "TRB" "AST"
## [21] "STL" "BLK" "TOV" "PF" "PTS" "W" "L" "Pace"
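merge() performs an inner join by default, so a team whose name is spelled differently in the two files would be dropped without any warning. A small sketch of a check that could be run before merging follows. The two toy data frames and the deliberately mismatched name "LA Clippers" are illustrative assumptions.

```r
stats_toy <- data.frame(
  Team = c("Boston Celtics", "Los Angeles Clippers"),
  PTS  = c(113.7, 116.3),
  stringsAsFactors = FALSE
)
adv_toy <- data.frame(
  Team = c("Boston Celtics", "LA Clippers"),  # mismatch introduced on purpose
  Pace = c(99.5, 101.5),
  stringsAsFactors = FALSE
)

# Teams present in one table but not in the other
unmatched <- union(setdiff(stats_toy$Team, adv_toy$Team),
                   setdiff(adv_toy$Team, stats_toy$Team))
print(unmatched)

# Inner join: only teams found in both tables survive
merged_toy <- merge(stats_toy, adv_toy, by = "Team")
nrow(merged_toy)  # fewer rows than either input signals a dropped team
```

An empty result for the unmatched teams, together with an unchanged row count after the merge, would indicate that every team was matched.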
Next, the rank and 2-point shot variables are removed from the merged dataset.
df_2019 <- subset(merged_df, select = -c(Rk, X2P, X2PA, X2P.))
The dataset has no year column, so we create one and set it to 2019. This identifies the year of each row once the four yearly datasets are combined.
df_2019 <- df_2019 %>% mutate(Year = 2019)
names(df_2019)
## [1] "Team" "G" "MP" "FG" "FGA" "FG." "X3P" "X3PA" "X3P." "FT"
## [11] "FTA" "FT." "ORB" "DRB" "TRB" "AST" "STL" "BLK" "TOV" "PF"
## [21] "PTS" "W" "L" "Pace" "Year"
The original column names are abbreviated and hard to read; for example, the 3-point column is named X3P. The columns are therefore renamed with descriptive names.
names(df_2019) <- c("Team", "Games", "MinutesPlayed", "FieldGoal", "FieldGoalsAttempted", "FieldGoalsPercentage", "3Point", "3PointAttempted", "3PointPercentage", "FreeThrows", "FreeThrowsAttempted", "FreeThrowPercentage", "Offensive Rebounds", "Defensive Rebounds", "Total Rebounds", "Assists", "Steals", "Blocks", "Turnovers", "PersonalFouls", "Points", "Wins", "Losses", "Pace", "Year")
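Names that start with a digit or contain a space, such as 3Point or Offensive Rebounds, are not syntactic in R and must be wrapped in backticks whenever they are used. The sketch below shows how base R's make.names() would rewrite them. The alternative spellings at the end are only a suggestion and are not names used elsewhere in this analysis.

```r
new_names <- c("3Point", "3PointAttempted", "Offensive Rebounds")

# make.names() prefixes a leading digit with "X" and replaces spaces with "."
make.names(new_names)

# Hypothetical syntactic alternatives that need no backticks
syntactic <- c("ThreePoint", "ThreePointAttempted", "OffensiveRebounds")
identical(make.names(syntactic), syntactic)  # TRUE means already syntactic
```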
Merging placed the Wins, Losses, Pace and Year columns at the end of the dataset, so they are relocated: Year before Games, Wins and Losses before MinutesPlayed, and Pace before Points.
df_2019 <- df_2019 %>%
relocate(Wins, Losses, .before = MinutesPlayed) %>%
relocate(Year, .before = Games) %>% relocate(Pace, .before = Points)
The same steps are repeated for the datasets of the remaining years.
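Since the cleaning, column selection, merge and year-tagging steps are identical for every year, they could be wrapped in one function. The following is a minimal sketch under that assumption: prepare_season() is a hypothetical helper, not part of the original analysis, and it is demonstrated on two-row toy tables. It keeps the four needed Advanced Stats columns by name, which also sidesteps the differing column names across yearly files. Renaming and relocating are left out for brevity.

```r
# Hypothetical helper: clean, reduce, merge and tag one season
prepare_season <- function(stats, advanced, year) {
  # Strip the playoff asterisk from team names only
  stats$Team    <- gsub("*", "", stats$Team, fixed = TRUE)
  advanced$Team <- gsub("*", "", advanced$Team, fixed = TRUE)

  # Keep only the Advanced Stats columns the analysis needs
  advanced <- advanced[, c("Team", "W", "L", "Pace")]

  # Inner join on team name, then record the season
  merged <- merge(stats, advanced, by = "Team")
  merged$Year <- year
  merged
}

# Toy inputs (values copied from the 2019 previews above)
stats_2019 <- data.frame(
  Team = c("Milwaukee Bucks*", "Boston Celtics*"),
  PTS  = c(118.7, 113.7),
  stringsAsFactors = FALSE
)
adv_2019 <- data.frame(
  Team = c("Milwaukee Bucks*", "Boston Celtics*"),
  Age  = c(29.2, 25.3),
  W    = c(56, 48),
  L    = c(17, 24),
  Pace = c(105.1, 99.5),
  stringsAsFactors = FALSE
)

season_2019 <- prepare_season(stats_2019, adv_2019, 2019)
print(season_2019)
```

Calling the helper once per year and binding the results would replace the four near-identical code sections with four short calls.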
#Stats Dataset
df21 <- read.csv("C:\\Users\\LENOVO\\Downloads\\Regression Model\\Project\\sportsref_download (1).csv")
#Advanced Dataset
df2 <- read.csv("C:\\Users\\LENOVO\\Downloads\\2020-21.csv")
head(df21)
## Rk Team G MP FG FGA FG. X3P X3PA X3P. X2P X2PA
## 1 1 Milwaukee Bucks* 72 240.7 44.7 91.8 0.487 14.4 37.1 0.389 30.3 54.7
## 2 2 Brooklyn Nets* 72 241.7 43.1 87.3 0.494 14.2 36.1 0.392 29.0 51.2
## 3 3 Washington Wizards* 72 241.7 43.2 90.9 0.475 10.2 29.0 0.351 33.0 61.9
## 4 4 Utah Jazz* 72 241.0 41.3 88.1 0.468 16.7 43.0 0.389 24.5 45.1
## 5 5 Portland Trail Blazers* 72 240.3 41.3 91.1 0.453 15.7 40.8 0.385 25.6 50.3
## 6 6 Phoenix Suns* 72 242.8 43.3 88.3 0.490 13.1 34.6 0.378 30.3 53.7
## X2P. FT FTA FT. ORB DRB TRB AST STL BLK TOV PF PTS
## 1 0.554 16.2 21.4 0.760 10.3 37.8 48.1 25.5 8.1 4.6 13.8 17.3 120.1
## 2 0.565 18.1 22.5 0.804 8.9 35.5 44.4 26.8 6.7 5.3 13.5 19.0 118.6
## 3 0.533 20.1 26.2 0.769 9.7 35.5 45.2 25.5 7.3 4.1 14.4 21.6 116.6
## 4 0.544 17.2 21.5 0.799 10.6 37.6 48.3 23.7 6.6 5.2 14.2 18.5 116.4
## 5 0.509 17.8 21.6 0.823 10.6 33.9 44.5 21.3 6.9 5.0 11.1 18.9 116.1
## 6 0.563 15.6 18.7 0.834 8.8 34.2 42.9 26.9 7.2 4.3 12.5 19.1 115.3
# Remove asterisks
df21 <- df21 %>%
mutate(across(everything(), ~ gsub("\\*", "", .)))
df2 <- df2 %>%
mutate(across(everything(), ~ gsub("\\*", "", .)))
#Removing Columns
df_new1 <- subset(df2, select = -c(Age, Rk, PW, PL, MOV, SOS, SRS, ORtg, DRtg, NRtg, FTr, X3PAr, TS., X, Offense.Four.Factors, X.1, X.2, X.3, X.4, Defense.Four.Factors, X.5, X.6, X.7, X.8, X.9, X.10, X.11))
head(df_new1)
## Team W L Pace
## 1 Utah Jazz 52 20 98.5
## 2 Los Angeles Clippers 47 25 96.9
## 3 Phoenix Suns 51 21 97.2
## 4 Milwaukee Bucks 46 26 102.2
## 5 Philadelphia 76ers 49 23 99.5
## 6 Denver Nuggets 47 25 97.1
#Merge by Team
merged_df1 <- merge(df21, df_new1, by = "Team")
head(merged_df1)
## Team Rk G MP FG FGA FG. X3P X3PA X3P. X2P X2PA
## 1 Atlanta Hawks 11 72 241.7 40.8 87.2 0.468 12.4 33.4 0.373 28.4 53.9
## 2 Boston Celtics 16 72 241.4 41.5 88.9 0.466 13.6 36.4 0.374 27.9 52.5
## 3 Brooklyn Nets 2 72 241.7 43.1 87.3 0.494 14.2 36.1 0.392 29 51.2
## 4 Charlotte Hornets 23 72 241 39.9 87.8 0.455 13.7 37 0.369 26.3 50.8
## 5 Chicago Bulls 21 72 241.4 42.2 88.6 0.476 12.6 34 0.37 29.6 54.6
## 6 Cleveland Cavaliers 30 72 242.1 38.6 85.8 0.45 10 29.7 0.336 28.6 56
## X2P. FT FTA FT. ORB DRB TRB AST STL BLK TOV PF PTS W L Pace
## 1 0.526 19.7 24.2 0.812 10.6 35.1 45.6 24.1 7 4.8 13.2 19.3 113.7 41 31 97.6
## 2 0.53 16.1 20.8 0.775 10.6 33.6 44.3 23.5 7.7 5.3 14.1 20.4 112.6 36 36 98.3
## 3 0.565 18.1 22.5 0.804 8.9 35.5 44.4 26.8 6.7 5.3 13.5 19 118.6 48 24 99.5
## 4 0.517 15.9 20.9 0.761 10.6 33.2 43.8 26.8 7.8 4.8 14.8 18 109.5 33 39 98.3
## 5 0.542 13.8 17.5 0.791 9.6 35.3 45 26.8 6.7 4.2 15.1 18.9 110.7 31 41 99
## 6 0.51 16.7 22.4 0.743 10.4 32.3 42.8 23.8 7.8 4.5 15.5 18.2 103.8 22 50 97.3
df_2020 <- subset(merged_df1, select = -c(Rk, X2P, X2PA, X2P.))
# Create Year Variable
df_2020 <- df_2020 %>% mutate(Year = 2020)
head(df_2020)
## Team G MP FG FGA FG. X3P X3PA X3P. FT FTA FT.
## 1 Atlanta Hawks 72 241.7 40.8 87.2 0.468 12.4 33.4 0.373 19.7 24.2 0.812
## 2 Boston Celtics 72 241.4 41.5 88.9 0.466 13.6 36.4 0.374 16.1 20.8 0.775
## 3 Brooklyn Nets 72 241.7 43.1 87.3 0.494 14.2 36.1 0.392 18.1 22.5 0.804
## 4 Charlotte Hornets 72 241 39.9 87.8 0.455 13.7 37 0.369 15.9 20.9 0.761
## 5 Chicago Bulls 72 241.4 42.2 88.6 0.476 12.6 34 0.37 13.8 17.5 0.791
## 6 Cleveland Cavaliers 72 242.1 38.6 85.8 0.45 10 29.7 0.336 16.7 22.4 0.743
## ORB DRB TRB AST STL BLK TOV PF PTS W L Pace Year
## 1 10.6 35.1 45.6 24.1 7 4.8 13.2 19.3 113.7 41 31 97.6 2020
## 2 10.6 33.6 44.3 23.5 7.7 5.3 14.1 20.4 112.6 36 36 98.3 2020
## 3 8.9 35.5 44.4 26.8 6.7 5.3 13.5 19 118.6 48 24 99.5 2020
## 4 10.6 33.2 43.8 26.8 7.8 4.8 14.8 18 109.5 33 39 98.3 2020
## 5 9.6 35.3 45 26.8 6.7 4.2 15.1 18.9 110.7 31 41 99 2020
## 6 10.4 32.3 42.8 23.8 7.8 4.5 15.5 18.2 103.8 22 50 97.3 2020
#Renaming the columns
names(df_2020) <- c("Team", "Games", "MinutesPlayed", "FieldGoal", "FieldGoalsAttempted", "FieldGoalsPercentage", "3Point", "3PointAttempted", "3PointPercentage", "FreeThrows", "FreeThrowsAttempted", "FreeThrowPercentage", "Offensive Rebounds", "Defensive Rebounds", "Total Rebounds", "Assists", "Steals", "Blocks", "Turnovers", "PersonalFouls", "Points", "Wins", "Losses", "Pace", "Year")
#Relocating the columns
df_2020 <- df_2020 %>%
relocate(Wins, Losses, .before = MinutesPlayed) %>%
relocate(Year, .before = Games) %>% relocate(Pace, .before = Points)
# Combine datasets
combined_data <- bind_rows(df_2019, df_2020)
#Stats Dataset
df31 <- read.csv("C:\\Users\\LENOVO\\Downloads\\Regression Model\\Project\\sportsref_download (2).csv")
#Advanced Dataset
df3 <- read.csv("C:\\Users\\LENOVO\\Downloads\\2021-22.csv")
#Removing Asterisks
df31 <- df31 %>%
mutate(across(everything(), ~ gsub("\\*", "", .)))
df3 <- df3 %>%
mutate(across(everything(), ~ gsub("\\*", "", .)))
df_new2 <- subset(df3, select = -c(Age, Rk, PW, PL, MOV, SOS, SRS, ORtg, DRtg, NRtg, FTr, X3PAr, TS., X, eFG., TOV., ORB., FT.FGA, X.1, X.2, eFG..1, TOV..1, DRB., FT.FGA.1, Arena, Attend., Attend..G))
#Merge By Team
merged_df2 <- merge(df31, df_new2, by = "Team")
df_2021 <- subset(merged_df2, select = -c(Rk, X2P, X2PA, X2P.))
#Create Year Variable
df_2021 <- df_2021 %>% mutate(Year = 2021)
#Renaming the columns
names(df_2021) <- c("Team", "Games", "MinutesPlayed", "FieldGoal", "FieldGoalsAttempted", "FieldGoalsPercentage", "3Point", "3PointAttempted", "3PointPercentage", "FreeThrows", "FreeThrowsAttempted", "FreeThrowPercentage", "Offensive Rebounds", "Defensive Rebounds", "Total Rebounds", "Assists", "Steals", "Blocks", "Turnovers", "PersonalFouls", "Points", "Wins", "Losses", "Pace", "Year")
#Relocating the columns
df_2021 <- df_2021 %>%
relocate(Wins, Losses, .before = MinutesPlayed) %>%
relocate(Year, .before = Games) %>% relocate(Pace, .before = Points)
#Stats Dataset
df41 <- read.csv("C:\\Users\\LENOVO\\Downloads\\Regression Model\\Project\\sportsref_download (3).csv")
#Advanced Dataset
df4 <- read.csv("C:\\Users\\LENOVO\\Downloads\\2022-23.csv")
#Removing Asterisks
df41 <- df41 %>%
mutate(across(everything(), ~ gsub("\\*", "", .)))
df4 <- df4 %>%
mutate(across(everything(), ~ gsub("\\*", "", .)))
#Removing columns
df_new4 <- subset(df4, select = -c(Age, Rk, PW, PL, MOV, SOS, SRS, ORtg, DRtg, NRtg, FTr, X3PAr, TS., X, Offense.Four.Factors, X.1, X.2, X.3, X.4, Defense.Four.Factors, X.5, X.6, X.7, X.8, X.9, X.10, X.11))
#Merge by Team
merged_df3 <- merge(df41, df_new4, by = "Team")
df_2022 <- subset(merged_df3, select = -c(Rk, X2P, X2PA, X2P.))
#Create Year Variable
df_2022 <- df_2022 %>% mutate(Year = 2022)
#Renaming the columns
names(df_2022) <- c("Team", "Games", "MinutesPlayed", "FieldGoal", "FieldGoalsAttempted", "FieldGoalsPercentage", "3Point", "3PointAttempted", "3PointPercentage", "FreeThrows", "FreeThrowsAttempted", "FreeThrowPercentage", "Offensive Rebounds", "Defensive Rebounds", "Total Rebounds", "Assists", "Steals", "Blocks", "Turnovers", "PersonalFouls", "Points", "Wins", "Losses", "Pace", "Year")
#Relocating the columns
df_2022 <- df_2022 %>%
relocate(Wins, Losses, .before = MinutesPlayed) %>%
relocate(Year, .before = Games) %>% relocate(Pace, .before = Points)
# Combine datasets
combined_data <- bind_rows(combined_data, df_2021, df_2022)
The datasets from 2019 to 2022 are now merged, and the data is ready for exploration, visualization and regression modeling.
data <- combined_data %>%
mutate(
Team = as.factor(Team),
Year = as.factor(Year)
)
This code transforms the Team and Year columns in the combined_data data frame into factors, which is useful for categorical data analysis, especially when dealing with statistical models or plotting where categorical distinctions are needed.
#Structure of the dataset
str(data)
## 'data.frame': 120 obs. of 25 variables:
## $ Team : Factor w/ 30 levels "Atlanta Hawks",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Year : Factor w/ 4 levels "2019","2020",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Games : chr "67" "72" "72" "65" ...
## $ Wins : chr "20" "48" "35" "23" ...
## $ Losses : chr "47" "24" "37" "42" ...
## $ MinutesPlayed : chr "243" "242.1" "242.8" "242.3" ...
## $ FieldGoal : chr "40.6" "41.3" "40.4" "37.3" ...
## $ FieldGoalsAttempted : chr "90.6" "89.6" "90.3" "85.9" ...
## $ FieldGoalsPercentage: chr "0.449" "0.461" "0.448" "0.434" ...
## $ 3Point : chr "12" "12.6" "13.1" "12.1" ...
## $ 3PointAttempted : chr "36.1" "34.5" "38.1" "34.3" ...
## $ 3PointPercentage : chr "0.333" "0.364" "0.343" "0.352" ...
## $ FreeThrows : chr "18.5" "18.6" "17.9" "16.2" ...
## $ FreeThrowsAttempted : chr "23.4" "23.2" "24.1" "21.6" ...
## $ FreeThrowPercentage : chr "0.79" "0.801" "0.745" "0.748" ...
## $ Offensive Rebounds : chr "9.9" "10.7" "10.6" "11" ...
## $ Defensive Rebounds : chr "33.4" "35.4" "37.3" "31.8" ...
## $ Total Rebounds : chr "43.3" "46.1" "47.9" "42.8" ...
## $ Assists : chr "24" "23" "24.5" "23.8" ...
## $ Steals : chr "7.8" "8.3" "6.4" "6.6" ...
## $ Blocks : chr "5.1" "5.6" "4.5" "4.1" ...
## $ Turnovers : chr "16.2" "13.8" "15.3" "14.6" ...
## $ PersonalFouls : chr "23.1" "21.6" "21" "18.8" ...
## $ Pace : chr "103" "99.5" "101.4" "95.8" ...
## $ Points : chr "111.8" "113.7" "111.8" "102.9" ...
Columns such as Games, Wins, Losses, MinutesPlayed and FieldGoal are stored as character rather than numeric, which prevents mathematical operations and analyses on them. These columns therefore have to be converted to numeric.
# Convert every column except Team and Year from character to numeric
stat_cols <- setdiff(names(data), c("Team", "Year"))
data[stat_cols] <- lapply(data[stat_cols], as.numeric)
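as.numeric() turns any string it cannot parse into NA and only issues a warning, so a stray label or thousands separator could go unnoticed. The sketch below, on made-up values, shows one way to list the entries that fail to convert. The vector raw and the "n/a" entry are assumptions for illustration.

```r
raw <- c("12.5", "36.1", "n/a")  # made-up values; "n/a" cannot be parsed

# Convert, silencing the coercion warning because failures are inspected below
converted <- suppressWarnings(as.numeric(raw))

# Values that were present before conversion but are NA afterwards
failed <- raw[!is.na(raw) & is.na(converted)]
print(failed)
```

An empty result means every value converted cleanly.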
We need the winning percentage as one of the y variables. It provides a direct measure of team performance, which is essential for understanding the impact of 3-point shooting and other metrics.
Formula for the winning percentage: WP = Wins / (Wins + Losses).
data$WinningPercentage <- round(data$Wins / (data$Wins + data$Losses), 3)
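As a worked example of the formula, the snippet below uses the Milwaukee Bucks' 2019 record from the Advanced Stats preview (56 wins, 17 losses). It also shows why the parentheses matter: without them, R divides first and adds the losses afterwards.

```r
wins   <- 56  # Milwaukee Bucks, 2019, from the Advanced Stats preview
losses <- 17

# Correct: wins divided by total games played
wp <- round(wins / (wins + losses), 3)
print(wp)

# Without parentheses, division binds tighter than addition: (56 / 56) + 17
wrong <- wins / wins + losses
print(wrong)
```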
# Write the merged dataset to a CSV file
write.csv(data, file = "Final_NBA_dataset.csv", row.names = FALSE)
str(data)
## 'data.frame': 120 obs. of 26 variables:
## $ Team : Factor w/ 30 levels "Atlanta Hawks",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Year : Factor w/ 4 levels "2019","2020",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Games : num 67 72 72 65 65 65 75 73 66 65 ...
## $ Wins : num 20 48 35 23 22 19 43 46 20 15 ...
## $ Losses : num 47 24 37 42 43 46 32 27 46 50 ...
## $ MinutesPlayed : num 243 242 243 242 241 ...
## $ FieldGoal : num 40.6 41.3 40.4 37.3 39.6 40.3 41.7 42 39.3 38.6 ...
## $ FieldGoalsAttempted : num 90.6 89.6 90.3 85.9 88.6 87.9 90.3 88.9 85.7 88.2 ...
## $ FieldGoalsPercentage: num 0.449 0.461 0.448 0.434 0.447 0.458 0.461 0.473 0.459 0.438 ...
## $ 3Point : num 12 12.6 13.1 12.1 12.2 11.2 15.1 11 12 10.4 ...
## $ 3PointAttempted : num 36.1 34.5 38.1 34.3 35.1 31.8 41.3 30.6 32.7 31.3 ...
## $ 3PointPercentage : num 0.333 0.364 0.343 0.352 0.348 0.351 0.367 0.359 0.367 0.334 ...
## $ FreeThrows : num 18.5 18.6 17.9 16.2 15.5 15.1 18.6 16.2 16.6 18.7 ...
## $ FreeThrowsAttempted : num 23.4 23.2 24.1 21.6 20.5 19.9 23.8 20.9 22.4 23.2 ...
## $ FreeThrowPercentage : num 0.79 0.801 0.745 0.748 0.755 0.758 0.779 0.777 0.743 0.803 ...
## $ Offensive Rebounds : num 9.9 10.7 10.6 11 10.5 10.8 10.5 10.8 9.8 10 ...
## $ Defensive Rebounds : num 33.4 35.4 37.3 31.8 31.4 33.4 36.4 33.4 32 32.9 ...
## $ Total Rebounds : num 43.3 46.1 47.9 42.8 41.9 44.2 46.9 44.1 41.7 42.8 ...
## $ Assists : num 24 23 24.5 23.8 23.2 23.1 24.7 26.7 24.1 25.6 ...
## $ Steals : num 7.8 8.3 6.4 6.6 10 6.9 6.1 8 7.4 8.2 ...
## $ Blocks : num 5.1 5.6 4.5 4.1 4.1 3.2 4.8 4.6 4.5 4.6 ...
## $ Turnovers : num 16.2 13.8 15.3 14.6 15.5 16.5 12.7 13.8 15.3 14.9 ...
## $ PersonalFouls : num 23.1 21.6 21 18.8 21.8 18.3 19.5 20.3 19.7 20.1 ...
## $ Pace : num 103 99.5 101.4 95.8 99.7 ...
## $ Points : num 112 114 112 103 107 ...
## $ WinningPercentage : num 0.299 0.667 0.486 0.354 0.338 0.292 0.573 0.63 0.303 0.231 ...
Every variable in the dataset is now numeric, with the exception of Team and Year, which are factors, so the dataset is ready for analysis.
dim_info <- dim(data)
num_rows <- dim_info[1]
num_cols <- dim_info[2]
cat("Dimensions of the dataset: Number of rows:", num_rows, ", Number of cols:", num_cols, "\n")
## Dimensions of the dataset: Number of rows: 120 , Number of cols: 26
stargazer(data, type = "text", summary.stat = c("mean", "min", "max", "sd", "median"))
##
## =============================================================
## Statistic Mean Min Max St. Dev. Median
## -------------------------------------------------------------
## Games 76.650 64 82 5.640 78.5
## Wins 38.325 15 64 11.108 41
## Losses 38.325 17 65 10.549 38.5
## MinutesPlayed 241.583 240.000 243.700 0.810 241.500
## FieldGoal 41.163 37.300 44.700 1.563 41.300
## FieldGoalsAttempted 88.407 83.700 94.400 2.234 88.400
## FieldGoalsPercentage 0.466 0.429 0.504 0.015 0.468
## 3Point 12.420 9.600 16.700 1.495 12.200
## 3PointAttempted 34.532 28.000 45.300 3.609 34.200
## 3PointPercentage 0.359 0.323 0.411 0.016 0.358
## FreeThrows 17.536 13.800 21.000 1.455 17.500
## FreeThrowsAttempted 22.575 17.500 26.600 1.816 22.400
## FreeThrowPercentage 0.777 0.694 0.839 0.028 0.779
## Offensive Rebounds 10.172 7.600 14.100 1.127 10.150
## Defensive Rebounds 34.077 30.300 42.200 1.708 34.050
## Total Rebounds 44.242 38.800 51.700 1.982 44.200
## Assists 24.788 20.600 29.800 1.774 24.700
## Steals 7.538 6.100 10.000 0.790 7.450
## Blocks 4.787 3.000 6.600 0.683 4.750
## Turnovers 14.063 11.100 16.500 1.095 14.150
## PersonalFouls 19.922 17.200 23.100 1.272 19.900
## Pace 99.216 95.400 105.100 2.015 98.950
## Points 112.266 102.900 120.700 3.853 112.850
## WinningPercentage 0.499 0.207 0.780 0.138 0.512
## -------------------------------------------------------------
sapply(data, function(x) sum(is.na(x)))
## Team Year Games
## 0 0 0
## Wins Losses MinutesPlayed
## 0 0 0
## FieldGoal FieldGoalsAttempted FieldGoalsPercentage
## 0 0 0
## 3Point 3PointAttempted 3PointPercentage
## 0 0 0
## FreeThrows FreeThrowsAttempted FreeThrowPercentage
## 0 0 0
## Offensive Rebounds Defensive Rebounds Total Rebounds
## 0 0 0
## Assists Steals Blocks
## 0 0 0
## Turnovers PersonalFouls Pace
## 0 0 0
## Points WinningPercentage
## 0 0
The dataset contains no missing values.
hist(data$Points, main="Distribution of Points", xlab="Points", col="lightblue")
hist(data$'3PointPercentage', main="Distribution of 3-Point Percentage", xlab="3-Point Percentage", col="lightgreen")
hist(data$Pace, main = "Distribution of Pace", xlab = "Pace", col = "lightblue")
Distribution of Points:
The histogram has a roughly symmetric, bell-like shape, which suggests that team scoring is approximately normally distributed and that most teams score within a fairly narrow range.
Range: Most teams score between 105 and 120 points per game, and this range forms the central cluster of scoring performances.
Center: The distribution is centered at about 112 points, so a typical team scores close to this value.
Frequency: The peak lies between 112 and 114 points, the most common scoring range in the dataset. A few teams score above 120 or below 105 points, but with low frequency.
Distribution of 3-Point Percentage:
Distribution type: The histogram has a single peak at about 0.36 and a central tendency around 0.35 to 0.36, which suggests a roughly normal, single-peaked distribution. Frequencies decrease on both sides of this center.
Frequency: The highest frequency occurs at about 0.36, so most teams have a 3-point percentage close to this value.
Spread: The range from 0.32 to 0.42 shows some variability in shooting accuracy, but most teams are concentrated around the center.
Gap: The histogram shows a notable gap between 0.40 and 0.41, suggesting that 3-point percentages in this interval are rare.
Distribution of Pace:
The histogram again has a roughly symmetric shape, which suggests that most teams play at a similar pace.
Central tendency: The distribution is centered at about 100, so the typical pace for most teams is around this value.
Spread: The range from 96 to 106 shows the overall variability in team pace.
Peak frequency: The tallest bar has a frequency above 30 and marks the pace range in which most teams fall.
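Judging normality from a histogram alone is subjective, so a numeric check can complement the visual reading. The sketch below runs a Shapiro-Wilk test and computes the sample skewness on simulated data. The vector points_sim is an assumption: 120 random draws using the mean and standard deviation of Points from the summary table, not the real observations.

```r
set.seed(1)
# Simulated stand-in for data$Points (mean and SD taken from the summary table)
points_sim <- rnorm(120, mean = 112.266, sd = 3.853)

# Shapiro-Wilk test: a small p-value would be evidence against normality
sw <- shapiro.test(points_sim)
print(sw$p.value)

# Sample skewness: values near 0 indicate a symmetric distribution
skew <- mean((points_sim - mean(points_sim))^3) / sd(points_sim)^3
print(skew)
```

Applied to the real columns, the same two numbers would show whether the normal-looking shapes described above hold up.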
plot(data$'3PointPercentage' ~ data$WinningPercentage, main="3-Point Percentage vs Winning Percentage", xlab="Winning Percentage", ylab="3-Point Percentage")
abline(lm(data$'3PointPercentage' ~ data$WinningPercentage), col="red")
The scatter plot shows a positive trend, meaning as the winning percentage increases, the 3-point field goal percentage also tends to increase.
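Since winning percentage is a dependent variable in the research question, the regression line could also be fitted with it on the left-hand side of the formula. The sketch below does so on the first ten rows shown in the str() output. Ten rows are used only to keep the example self-contained, so the coefficients are not the full-sample estimates, and the column name ThreePointPercentage is an assumed syntactic spelling.

```r
# First ten 2019 rows, copied from the str() output of the dataset
toy <- data.frame(
  ThreePointPercentage = c(0.333, 0.364, 0.343, 0.352, 0.348,
                           0.351, 0.367, 0.359, 0.367, 0.334),
  WinningPercentage    = c(0.299, 0.667, 0.486, 0.354, 0.338,
                           0.292, 0.573, 0.630, 0.303, 0.231)
)

# Winning percentage as the response, 3-point percentage as the predictor
fit <- lm(WinningPercentage ~ ThreePointPercentage, data = toy)
print(coef(fit))

# Same orientation in the plot: predictor on x, response on y
plot(WinningPercentage ~ ThreePointPercentage, data = toy,
     xlab = "3-Point Percentage", ylab = "Winning Percentage")
abline(fit, col = "red")
```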
pairs(data[, c("Points", "3PointPercentage", "FieldGoalsPercentage", "Pace")], main="Pairwise Plot")
Average Metrics by Year:
library(dplyr)
data %>%
group_by(Year) %>%
summarise(across(where(is.numeric), \(x) mean(x, na.rm = TRUE)))
## # A tibble: 4 × 25
## Year Games Wins Losses MinutesPlayed FieldGoal FieldGoalsAttempted
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2019 70.6 35.3 35.3 242. 40.8 88.8
## 2 2020 72 36 36 241. 41.2 88.4
## 3 2021 82 41 41 241. 40.6 88.1
## 4 2022 82 41 41 242. 42.0 88.3
## # ℹ 18 more variables: FieldGoalsPercentage <dbl>, `3Point` <dbl>,
## # `3PointAttempted` <dbl>, `3PointPercentage` <dbl>, FreeThrows <dbl>,
## # FreeThrowsAttempted <dbl>, FreeThrowPercentage <dbl>,
## # `Offensive Rebounds` <dbl>, `Defensive Rebounds` <dbl>,
## # `Total Rebounds` <dbl>, Assists <dbl>, Steals <dbl>, Blocks <dbl>,
## # Turnovers <dbl>, PersonalFouls <dbl>, Pace <dbl>, Points <dbl>,
## # WinningPercentage <dbl>
Games:
The average number of games rose from 70.6 in 2019 to 72 in 2020 and 82 in 2021 and 2022. The shorter 2019 and 2020 seasons most likely reflect schedule adjustments for the pandemic.
Wins and Losses:
Average wins equal average losses in every year (35.3, 36, 41 and 41), because every game produces one win and one loss. The higher averages in 2021 and 2022 therefore reflect the longer 82-game seasons rather than a change in competitive balance.
Minutes Played:
The minutes played per game remain fairly consistent across the years, with a slight decrease in 2020 and 2021 compared to 2019 and a slight increase in 2022.
Field Goals and 3-Point Statistics:
Field Goals: Field goals attempted fell slightly, from 88.8 in 2019 to 88.3 in 2022, while field goals made rose from 40.8 to 42.0.
3-Point Statistics: There is a slight increase in 3-point field goals and percentage over the years, indicating a growing emphasis on the 3-point shot.
Pace:
The pace has slightly decreased from 2019 to 2022. This variation may reflect changes in playing style or game strategies over the years.
Points:
Points scored per game have generally increased, indicating higher scoring games or improved offensive strategies.
Winning Percentage:
The average winning percentage stays close to 0.5 in every year. This is expected, because each game yields one win and one loss, so the league-wide average cannot move far from 0.5 regardless of changes in other metrics.
data %>%
group_by(Team) %>%
summarise(AverageWinningPercentage = mean(WinningPercentage, na.rm = TRUE)) %>%
arrange(desc(AverageWinningPercentage))
## # A tibble: 30 × 2
## Team AverageWinningPercentage
## <fct> <dbl>
## 1 Milwaukee Bucks 0.684
## 2 Philadelphia 76ers 0.638
## 3 Denver Nuggets 0.629
## 4 Phoenix Suns 0.626
## 5 Boston Celtics 0.621
## 6 Los Angeles Clippers 0.596
## 7 Utah Jazz 0.596
## 8 Miami Heat 0.586
## 9 Memphis Grizzlies 0.575
## 10 Dallas Mavericks 0.563
## # ℹ 20 more rows
The Milwaukee Bucks have the highest average winning percentage from 2019 to 2022 (0.684), while the Detroit Pistons have the lowest (0.27). The remaining teams mostly average between about 0.4 and 0.6.
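The printed tibble shows only the first ten rows, so the lowest-ranked team is not visible in the output above. A minimal base R sketch for pulling out both ends of the ranking follows. The team labels and values in toy are made up for illustration.

```r
# Made-up data: three teams, two seasons each
toy <- data.frame(
  Team = rep(c("Team A", "Team B", "Team C"), each = 2),
  WinningPercentage = c(0.70, 0.66, 0.50, 0.52, 0.25, 0.29),
  stringsAsFactors = FALSE
)

# Average winning percentage per team, sorted from lowest to highest
avg <- aggregate(WinningPercentage ~ Team, data = toy, FUN = mean)
avg <- avg[order(avg$WinningPercentage), ]

head(avg, 1)  # team with the lowest average
tail(avg, 1)  # team with the highest average
```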
boxplot(Points ~ Year, data=data, main="Points by Year", xlab="Year", ylab="Points")
boxplot(`3PointPercentage` ~ Year, data=data, main="3-Point Percentage by Year", xlab="Year", ylab="3-Point Percentage")
The first boxplot shows the distribution of points for the years 2019 to 2022. Each year is shown on the x-axis, while the y-axis indicates points per game. For each year, the boxplot shows the median (the horizontal line within the box), the interquartile range (the box itself), and whiskers that extend from the box to cover the data range excluding outliers.
The second boxplot shows the distribution of 3-point percentage for the years 2019, 2020, 2021 and 2022. Each box represents the interquartile range (IQR) for a specific year, with the central line indicating the median. The whiskers extend to the minimum and maximum values within a certain range, while any potential outliers appear as individual points outside this range.
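The quantities a boxplot draws can also be computed directly, which makes year-to-year comparisons easier to state in numbers. The sketch below computes the median and IQR per year with tapply(). The toy data frame holds only a handful of Points values copied from the previews above, so the results do not describe the full dataset.

```r
# A few Points values copied from the 2019 and 2020 previews
toy <- data.frame(
  Year   = factor(c(rep("2019", 4), rep("2020", 6))),
  Points = c(111.8, 113.7, 111.8, 102.9,
             113.7, 112.6, 118.6, 109.5, 110.7, 103.8)
)

# Median (the line inside each box) and IQR (the height of each box) by year
med <- tapply(toy$Points, toy$Year, median)
iqr <- tapply(toy$Points, toy$Year, IQR)

print(med)
print(iqr)
```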
library(ggplot2)
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.3.3
cor_matrix <- cor(data[, sapply(data, is.numeric)])
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.3
## corrplot 0.92 loaded
# Plot the correlation matrix with adjustments
corrplot(cor_matrix,
method = 'square',
order = 'FPC',
type = 'lower',
diag = FALSE,
addCoef.col = "black",
number.cex = 0.3, # adjust coefficient font size
tl.cex = 0.5) # adjust variable label font size
Wins and WinningPercentage:
There is a strong positive correlation (0.97) between the number of wins and the winning percentage. This makes sense because as the number of wins increases, the winning percentage naturally increases as well.
FieldGoalsPercentage and Points:
A strong positive correlation (0.69) exists between field goal percentage and points scored. Teams with a higher field goal percentage tend to score more points.
3PointAttempts and 3PointPercentage:
The correlation between 3-point attempts and 3-point percentage is relatively low (0.16). This suggests that a team’s volume of 3-point attempts doesn’t strongly predict their accuracy in 3-point shooting.
Pace and FieldGoalsAttempted:
There is a moderate positive correlation (0.64) between pace and field goals attempted. Teams that play faster (higher pace) tend to attempt more field goals.
MinutesPlayed and Points:
Minutes played and points scored have a very weak positive correlation (0.09). The value is close to 0, so points scored rise only marginally, if at all, as minutes played increase.
FieldGoal and FieldGoalsPercentage:
A strong positive correlation (0.75) between field goal percentage and field goals made suggests that teams that are more accurate in their shooting will make more field goals.
Losses and WinningPercentage:
The number of losses has a very strong negative correlation (-0.95) with the winning percentage, indicating that as losses increase, the winning percentage decreases significantly.
3PointPercentage and WinningPercentage:
There is a moderate positive correlation (0.61) between 3-point shooting percentage and winning percentage, suggesting that teams that shoot well from beyond the arc are more likely to win.
Notable Observations:
FieldGoalsAttempted and FieldGoalsPercentage have a weak negative correlation (-0.13), so taking more shots is not associated with making a higher share of them; if anything, accuracy is marginally lower at higher shot volumes.
Pace and FieldGoalsPercentage have a correlation of approximately 0, so there is no linear association between them: knowing a team's pace tells us essentially nothing about its expected field goal percentage.
Points and WinningPercentage are moderately positively correlated (0.55), which is intuitive as teams that score more points are more likely to win games.
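The correlation matrix reports point estimates only. A rough way to judge whether a given correlation is distinguishable from zero is cor.test(), which returns a p-value and a confidence interval. The sketch below uses simulated data; the data frame `sim` and its values are assumptions for illustration, with column names borrowed from the report.

```r
set.seed(42)
n <- 120  # same number of team-seasons as in the regressions

# Simulated stand-ins for two of the report's variables
sim <- data.frame(Pace = rnorm(n, mean = 100, sd = 2))
sim$FieldGoalsAttempted <- 88 + 0.8 * (sim$Pace - 100) + rnorm(n, sd = 1.5)

r <- cor(sim$Pace, sim$FieldGoalsAttempted)
test <- cor.test(sim$Pace, sim$FieldGoalsAttempted)  # Pearson by default

r              # point estimate, as shown in a correlation matrix
test$p.value   # test of the null hypothesis that the true correlation is 0
test$conf.int  # 95% confidence interval for the correlation
```

With 120 observations, small correlations such as 0.09 or -0.13 may well have intervals that include zero, which is worth checking before interpreting their sign.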
library(ggplot2)
ggplot(data, aes(x=Year, y=Points, group=Team)) +
geom_line(aes(color=Team)) +
labs(title="Points Trend Over Time", x="Year", y="Points") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
library(stargazer)
library(car) # For VIF function
## Warning: package 'car' was built under R version 4.3.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.3.3
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
# Assuming 'data' contains all independent variables
vif_model <- lm(WinningPercentage ~ `3PointPercentage` + `3PointAttempted` + `3Point` + Games + FieldGoal + Pace + Team, data = data)
vif(vif_model)
## GVIF Df GVIF^(1/(2*Df))
## `3PointPercentage` 109.451084 1 10.461887
## `3PointAttempted` 582.655811 1 24.138264
## `3Point` 772.263898 1 27.789637
## Games 1.676959 1 1.294975
## FieldGoal 4.194103 1 2.047951
## Pace 3.669094 1 1.915488
## Team 19.353422 29 1.052411
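Based on the variance inflation factors (VIF):
VIF values:
3PointPercentage: 109.45, 3PointAttempted: 582.66, 3Point: 772.26. The GVIF^(1/(2*Df)) column, which for these 1-df terms is the square root of the VIF, shows 10.46, 24.14, and 27.79. All three are far above the common rule-of-thumb threshold of 10, while Games, FieldGoal, and Pace are all below 5.
Choosing a key independent variable:
Because 3Point, 3PointAttempted, and 3PointPercentage are highly collinear with one another (3-point makes are essentially attempts multiplied by percentage), only one of them should enter the models. We select 3PointPercentage: it has the lowest VIF of the three and is the standard measure of shooting efficiency. This choice means the models measure 3-point accuracy rather than shot volume.
Model specification:
We proceed with models that use only 3PointPercentage as the key independent variable.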
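To see where such large VIFs come from, the sketch below computes VIFs by hand on simulated data in which 3-point makes are exactly attempts times percentage. The data frame `sim`, its column names, and the helper `vif_by_hand()` are hypothetical; only base R is used.

```r
set.seed(123)
n <- 120

# Simulated shooting data (not the NBA data used in the report)
sim <- data.frame(
  attempts = rnorm(n, mean = 34, sd = 4),       # 3-point attempts per game
  pct      = rnorm(n, mean = 0.36, sd = 0.015)  # 3-point percentage as a proportion
)
sim$made <- sim$attempts * sim$pct              # makes are determined by the other two

# VIF for a predictor is 1 / (1 - R^2), where R^2 comes from regressing
# that predictor on all the other predictors.
vif_by_hand <- function(target, others, data) {
  r2 <- summary(lm(reformulate(others, response = target), data = data))$r.squared
  1 / (1 - r2)
}

vifs <- c(
  attempts = vif_by_hand("attempts", c("pct", "made"), sim),
  pct      = vif_by_hand("pct", c("attempts", "made"), sim),
  made     = vif_by_hand("made", c("attempts", "pct"), sim)
)

vifs        # ordinary VIFs
sqrt(vifs)  # comparable to the GVIF^(1/(2*Df)) column for 1-df terms
```

Dropping any two of the three variables removes the near-exact linear dependence, which is the reasoning behind keeping only one of them.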
Model 1: Impact on Winning Percentage
We fit a series of linear regression models to understand the relationship between "WinningPercentage" (the dependent variable) and various predictors (independent variables).
Model 1
Model 1 examines the relationship between WinningPercentage and 3PointPercentage, with no other predictors included.
Estimated Equation: WinningPercentage = β0 + β1 × 3PointPercentage + ϵ
Where,
β0 is the intercept. β1 is the coefficient for 3PointPercentage. ϵ is the error term.
# Model 1: y ~ key x
model1_wp <- lm(WinningPercentage ~ `3PointPercentage`, data = data)
Model 2
Model 2 includes 3PointPercentage along with additional control variables: Games, FieldGoal, and Pace.
Estimated Equation: WinningPercentage = β0 + β1 × 3PointPercentage + β2 × Games + β3 × FieldGoal + β4 × Pace + ϵ
Where,
β0 is the intercept. β1 is the coefficient for 3PointPercentage. β2 is the coefficient for Games. β3 is the coefficient for FieldGoal. β4 is the coefficient for Pace. ϵ is the error term.
# Model 2: y ~ key x + controls
model2_wp <- lm(WinningPercentage ~ `3PointPercentage` + Games + FieldGoal + Pace, data = data)
Model 3
Model 3 adds dummy variables for Team to account for team-specific effects. factor(Team) creates a set of dummy variables, one for each team except the reference team.
Estimated Equation: WinningPercentage = β0 + β1 × 3PointPercentage + β2 × Games + β3 × FieldGoal + β4 × Pace + γ1 × Team1 + γ2 × Team2 + … + γk × Teamk + ϵ
Where,
β0 is the intercept. β1 is the coefficient for 3PointPercentage. β2 is the coefficient for Games. β3 is the coefficient for FieldGoal. β4 is the coefficient for Pace. γ1 to γk are the coefficients for the team dummies (excluding one reference team to avoid perfect multicollinearity). ϵ is the error term.
# Model 3: y ~ key x + controls + team dummies
model3_wp <- lm(WinningPercentage ~ `3PointPercentage` + Games + FieldGoal + Pace + factor(Team), data = data)
Model 4
Model 4 further includes dummy variables for Year to account for time-specific effects. factor(Year) creates a set of dummy variables, one for each year except the reference year.
Estimated Equation:
WinningPercentage = β0 + β1 × 3PointPercentage + β2 × Games + β3 × FieldGoal + β4 × Pace + γ1 × Team1 + γ2 × Team2 + … + γk × Teamk + δ1 × Year1 + δ2 × Year2 + … + δm × Yearm + ϵ
Where,
β0 is the intercept. β1 is the coefficient for 3PointPercentage. β2 is the coefficient for Games. β3 is the coefficient for FieldGoal. β4 is the coefficient for Pace. γ1 to γk are the coefficients for the team dummies (excluding one reference team to avoid perfect multicollinearity). δ1 to δm are the coefficients for the year dummies (excluding one reference year). ϵ is the error term.
# Model 4: y ~ key x + controls + team dummies + year dummies
model4_wp <- lm(WinningPercentage ~ `3PointPercentage` + Games + FieldGoal + Pace + factor(Team) + factor(Year), data = data)
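Including factor(Team) is equivalent to comparing each team with its own average over the sample years. The sketch below illustrates this on simulated panel data: the slope from the dummy-variable regression matches the slope from a regression on team-demeaned variables. All names and values in `sim` are hypothetical.

```r
set.seed(7)
teams <- paste0("Team", 1:6)
sim <- expand.grid(Team = teams, Year = 2019:2022)

# Each simulated team has a persistent level that shifts both x and y
team_effect <- setNames(rnorm(6, sd = 0.1), teams)
te <- unname(team_effect[as.character(sim$Team)])
sim$x <- 0.36 + 0.1 * te + rnorm(nrow(sim), sd = 0.01)
sim$y <- 0.5 + 2 * sim$x + te + rnorm(nrow(sim), sd = 0.03)

# (a) Dummy-variable regression, as in Models 3 and 4
fit_dummies <- lm(y ~ x + factor(Team), data = sim)

# (b) Within transformation: subtract each team's own mean
sim$x_within <- sim$x - ave(sim$x, sim$Team)
sim$y_within <- sim$y - ave(sim$y, sim$Team)
fit_within <- lm(y_within ~ x_within, data = sim)

coef(fit_dummies)["x"]
coef(fit_within)["x_within"]
```

The two slopes agree, so the 3PointPercentage coefficient in a team fixed-effects model is identified from variation within teams over time. The standard errors of the two fits differ because the demeaned regression does not count the degrees of freedom used by the team means.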
library(stargazer)
stargazer(
model1_wp, model2_wp, model3_wp, model4_wp,
type = 'text',
dep.var.labels = c("Winning Percentage"),
column.labels = c("Model 1", "Model 2", "Model 3", "Model 4"),
title = "Regression Results for Winning Percentage",
align = TRUE,
no.space = TRUE,
column.sep.width = "0.5pt",
keep = c("3PointPercentage", "Games", "FieldGoal", "^Pace$"),
add.lines = list(c("Entity FE", "No", "No", "Yes", "Yes")),
out = "model_summary_wp_cleaned.txt"
)
##
## Regression Results for Winning Percentage
## =================================================================================================================
## Dependent variable:
## ---------------------------------------------------------------------------------------------
## Winning Percentage
## Model 1 Model 2 Model 3 Model 4
## (1) (2) (3) (4)
## -----------------------------------------------------------------------------------------------------------------
## `3PointPercentage` 5.181*** 3.678*** 2.736*** 2.337***
## (0.626) (0.683) (0.750) (0.713)
## Games 0.001 0.0001 0.015***
## (0.002) (0.002) (0.006)
## FieldGoal 0.036*** 0.033*** 0.045***
## (0.008) (0.010) (0.010)
## Pace -0.016*** -0.014** -0.023***
## (0.006) (0.007) (0.007)
## -----------------------------------------------------------------------------------------------------------------
## Entity FE No No Yes Yes
## Observations 120 120 120 120
## R2 0.367 0.483 0.705 0.768
## Adjusted R2 0.362 0.465 0.592 0.667
## Residual Std. Error 0.110 (df = 118) 0.101 (df = 115) 0.088 (df = 86) 0.080 (df = 83)
## F Statistic 68.497*** (df = 1; 118) 26.900*** (df = 4; 115) 6.225*** (df = 33; 86) 7.635*** (df = 36; 83)
## =================================================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
The regression results show how adding controls and fixed effects changes the estimated relationship between 3PointPercentage and WinningPercentage.
Model 1 (Simple Relationship)
Coefficient of 3PointPercentage: 5.181 (p < 0.01)
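Interpretation: In the simplest model, a one-unit increase in 3PointPercentage is associated with a 5.181-unit increase in WinningPercentage. Both variables appear to be recorded as proportions (the Basketball Reference format, e.g. 0.360 for 36%), so a one-unit change is far outside the observed range. A more useful reading is that a one-percentage-point (0.01) increase in 3-point percentage is associated with an increase of about 0.052 in winning percentage, or 5.2 percentage points. This indicates a strong positive relationship between 3-point accuracy and winning.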
Model 2 (Including Controls)
Coefficient of 3PointPercentage: 3.678 (p < 0.01)
Interpretation: When controlling for Games, FieldGoal, and Pace, the coefficient on 3PointPercentage falls but remains positive and significant. This suggests that part of the association in Model 1 was picking up other aspects of team performance.
Games: The coefficient of 0.001 means each additional game is associated with a 0.001 increase in WinningPercentage (0.1 percentage points on the 0-1 scale). This effect is not statistically significant.
FieldGoal: The coefficient of 0.036 means each additional field goal made is associated with a 0.036 increase in WinningPercentage (3.6 percentage points). This effect is statistically significant.
Pace: The coefficient of -0.016 means each additional unit of Pace is associated with a 0.016 decrease in WinningPercentage (1.6 percentage points). This effect is statistically significant.
Model 3 (Including Team Dummies)
Coefficient of 3PointPercentage: 2.736 (p < 0.01)
Interpretation: Adding team-specific effects reduces the coefficient further. This decrease indicates that part of the association between 3PointPercentage and winning percentage reflects persistent differences between teams.
Games: The coefficient is 0.0001 and remains statistically insignificant.
FieldGoal: The coefficient is 0.033, a positive and statistically significant effect (p < 0.01).
Pace: The coefficient is -0.014, a negative effect that is significant at the 5% level.
Team Dummies: The coefficients on the individual team dummies are estimated but omitted from the table (via the keep argument); they absorb team-specific differences.
Model 4 (Including Year Dummies)
Coefficient of 3PointPercentage: 2.337 (p < 0.01)
Interpretation: Including both team and year dummies reduces the coefficient further. This suggests that part of the association also reflects season-specific factors that affect all teams.
Games: The coefficient rises to 0.015 and becomes statistically significant (p < 0.01).
FieldGoal: The coefficient rises to 0.045 (p < 0.01).
Pace: The coefficient becomes more negative, at -0.023 (p < 0.01).
Magnitude of Impact: The coefficient of 3PointPercentage decreases from 5.181 in Model 1 to 2.337 in Model 4. The decrease occurs as we account for additional variables, indicating that while 3PointPercentage has a substantial positive association with winning percentage, other factors (such as team quality and year-specific effects) also play a role.
Effect After Controls: Even in the most comprehensive model (Model 4), which controls for team and year effects, the coefficient remains significant and positive. This suggests that more accurate 3-point shooting continues to be associated with a higher winning percentage, although the effect is moderated when other factors are considered. Because the models use 3PointPercentage rather than 3PointAttempted, they measure accuracy rather than the number of shots taken.
Practical Implications: The results imply that better 3-point shooting is associated with a higher winning percentage, but the magnitude depends on which other variables are held fixed. Teams should weigh the benefits of a strong 3-point game alongside overall team strategy and seasonal variation.
Contextual Factors: Adding year dummies raises R2 from 0.705 to 0.768 and changes the coefficients on Games, FieldGoal, and Pace, which indicates that season-specific factors matter. These could include changes in game dynamics, rule changes, or evolving strategies in the league.
Why Pace is Negative
Increased Pace and Mistakes: One possible explanation is that a faster pace leads to more mistakes, turnovers, or less control over the game, which could lower the winning percentage. Teams may struggle to maintain high performance at a faster tempo.
Fatigue: A higher pace could also increase player fatigue over time, which might impair performance and reduce the chances of winning. Neither explanation is tested directly by these models.
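The winning-percentage coefficients above are easier to read once rescaled. The sketch below takes the Model 4 estimate and standard error from the table, converts the coefficient to a per-percentage-point effect, and builds a 95% confidence interval. It assumes that both 3PointPercentage and WinningPercentage are stored as proportions, which the report does not state explicitly.

```r
# Values copied from the Model 4 column of the winning-percentage table
estimate  <- 2.337  # coefficient on 3PointPercentage
std_error <- 0.713  # its standard error
resid_df  <- 83     # residual degrees of freedom

# Effect of a one-percentage-point (0.01) rise in 3-point percentage,
# expressed in percentage points of winning percentage (x 100)
effect_pp <- estimate * 0.01 * 100

# 95% confidence interval for the coefficient, using the t distribution
t_crit <- qt(0.975, df = resid_df)
ci <- estimate + c(-1, 1) * t_crit * std_error

effect_pp        # percentage points of winning percentage per 3P% point
ci               # interval on the original coefficient scale
ci * 0.01 * 100  # the same interval per percentage point of 3P%
```

On the fitted model itself, confint(model4_wp) returns the same kind of interval for every coefficient.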
Model 2: Impact on Total Points Scored
Model 1
Model 1 evaluates the relationship between Points and 3PointPercentage, with no additional variables.
Estimated Equation: Points = β0 + β1 × 3PointPercentage + ϵ
Where,
β0 is the intercept (the expected value of Points when 3PointPercentage is zero). β1 is the coefficient for 3PointPercentage (indicating how Points change with a one-unit change in 3PointPercentage). ϵ is the error term (captures the variability in Points not explained by 3PointPercentage).
# Model 1: y ~ key x
model1_pts <- lm(Points ~ `3PointPercentage`, data = data)
Model 2
Model 2 extends Model 1 by including additional control variables: Games, FieldGoal, and Pace.
Estimated Equation: Points = β0 + β1 × 3PointPercentage + β2 × Games + β3 × FieldGoal + β4 × Pace + ϵ
Where,
β0 is the intercept. β1 is the coefficient for 3PointPercentage. β2 is the coefficient for Games (shows how Points change with the number of games). β3 is the coefficient for FieldGoal (shows how Points change with the number of field goals made). β4 is the coefficient for Pace (shows how Points change with the pace of the game). ϵ is the error term.
# Model 2: y ~ key x + controls
model2_pts <- lm(Points ~ `3PointPercentage` + Games + FieldGoal + Pace, data = data)
Model 3
Model 3 adds dummy variables for each team (i.e., factor(Team)) to account for team-specific effects.
Estimated Equation: Points = β0 + β1 × 3PointPercentage + β2 × Games + β3 × FieldGoal + β4 × Pace + γ1 × Team1 + γ2 × Team2 + … + γk × Teamk + ϵ
Where,
β0 is the intercept. β1 to β4 are the coefficients for 3PointPercentage, Games, FieldGoal, and Pace. γ1, γ2, …, γk are the coefficients for the team dummy variables. Each γi represents the effect of being a specific team (relative to a reference team, which is excluded to avoid perfect multicollinearity). ϵ is the error term.
# Model 3: y ~ key x + controls + team dummies
model3_pts <- lm(Points ~ `3PointPercentage` + Games + FieldGoal + Pace + factor(Team), data = data)
Model 4
Model 4 further includes dummy variables for each year (factor(Year)) to account for year-specific effects.
Estimated Equation: Points = β0 + β1 × 3PointPercentage + β2 × Games + β3 × FieldGoal + β4 × Pace + γ1 × Team1 + γ2 × Team2 + … + γk × Teamk + δ1 × Year1 + δ2 × Year2 + … + δm × Yearm + ϵ
Where,
β0 is the intercept. β1 to β4 are the coefficients for 3PointPercentage, Games, FieldGoal, and Pace. γ1, γ2, …, γk are the coefficients for the team dummy variables. Each γi represents the effect of being a specific team (relative to a reference team, which is excluded to avoid perfect multicollinearity). δ1 to δm are the coefficients for the year dummies (excluding one reference year). ϵ is the error term.
# Model 4: y ~ key x + controls + team dummies + year dummies
model4_pts <- lm(Points ~ `3PointPercentage` + Games + FieldGoal + Pace + factor(Team) + factor(Year), data = data)
stargazer(
model1_pts, model2_pts, model3_pts, model4_pts,
type = 'text',
dep.var.labels = c("Points"),
column.labels = c("Model 1", "Model 2", "Model 3", "Model 4"),
title = "Regression Results for Total Points Scored",
align = TRUE,
no.space = TRUE,
column.sep.width = "0.5pt",
keep = c("3PointPercentage", "Games", "FieldGoal", "^Pace$"),
add.lines = list(c("Entity FE", "No", "No", "Yes", "Yes")),
out = "model_summary_pts_cleaned.txt" # separate file so the winning-percentage table is not overwritten
)
##
## Regression Results for Total Points Scored
## ====================================================================================================================
## Dependent variable:
## ------------------------------------------------------------------------------------------------
## Points
## Model 1 Model 2 Model 3 Model 4
## (1) (2) (3) (4)
## --------------------------------------------------------------------------------------------------------------------
## `3PointPercentage` 115.372*** 81.892*** 62.416*** 65.209***
## (19.252) (11.136) (12.226) (12.546)
## Games 0.131*** 0.103*** 0.056
## (0.030) (0.029) (0.097)
## FieldGoal 1.403*** 1.732*** 1.613***
## (0.131) (0.162) (0.171)
## Pace 0.587*** 0.543*** 0.512***
## (0.098) (0.115) (0.121)
## --------------------------------------------------------------------------------------------------------------------
## Entity FE No No Yes Yes
## Observations 120 120 120 120
## R2 0.233 0.824 0.900 0.908
## Adjusted R2 0.227 0.818 0.861 0.868
## Residual Std. Error 3.388 (df = 118) 1.643 (df = 115) 1.436 (df = 86) 1.399 (df = 83)
## F Statistic 35.914*** (df = 1; 118) 134.838*** (df = 4; 115) 23.355*** (df = 33; 86) 22.759*** (df = 36; 83)
## ====================================================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Model 1:
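3PointPercentage has a large positive coefficient on Points (115.372): a one-percentage-point (0.01) increase in 3-point percentage is associated with about 1.15 more points. This is the raw association without controlling for other factors.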
Model 2:
3PointPercentage effect decreases to 81.892 after adding controls for Games, FieldGoal, and Pace. This suggests the initial estimate in Model 1 was partly due to these other factors.
Games, FieldGoal, and Pace all positively affect Total Points.
Model 3:
3PointPercentage effect decreases further to 62.416 after accounting for team-specific differences. This indicates some of the effect was due to differences between teams.
FieldGoal and Pace remain statistically significant after adjusting for team differences; the FieldGoal coefficient rises (1.403 to 1.732) while the Pace coefficient falls slightly (0.587 to 0.543).
Model 4:
3PointPercentage effect increases slightly to 65.209 after including year-specific differences. The change is small relative to the standard error (12.546), so year effects do little to alter the estimate.
Games effect becomes insignificant, indicating year-specific factors may explain its previous relationship with Total Points.
Adding Controls: In Model 2, introducing controls (Games, FieldGoal, Pace) adjusts the coefficient for 3PointPercentage, reflecting a more accurate relationship by accounting for other factors.
Including Fixed Effects: Adding team dummies in Model 3 controls for team-specific effects, which reduces the coefficient for 3PointPercentage as it accounts for differences between teams. The inclusion of year dummies in Model 4 further adjusts the coefficient by accounting for variations over time.
Correlated Predictors: Coefficients change when the added variables are correlated with 3PointPercentage. Variance in the outcome that Model 1 attributed to 3PointPercentage alone is then shared with the new variables, which alters its coefficient.
Specification Changes: The inclusion of fixed effects controls for unobserved heterogeneity (i.e., differences between teams and years), which can lead to changes in coefficient estimates.
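Whether the team dummies add explanatory power as a group can be checked with a nested-model F-test. The sketch below runs anova() on a restricted and a full model fitted to simulated data; `sim` and its values are hypothetical stand-ins, not the report's data.

```r
set.seed(11)
teams <- paste0("Team", 1:10)
sim <- expand.grid(Team = teams, Year = 2019:2022)

# Simulated team-level differences in scoring
team_effect <- rnorm(10, sd = 2)
sim$pct <- rnorm(nrow(sim), mean = 0.36, sd = 0.015)
sim$Points <- 90 + 60 * sim$pct + team_effect[as.integer(sim$Team)] +
  rnorm(nrow(sim), sd = 1)

restricted <- lm(Points ~ pct, data = sim)                 # no team dummies
full       <- lm(Points ~ pct + factor(Team), data = sim)  # with team dummies

# F-test of the joint null that all team dummy coefficients are zero
comparison <- anova(restricted, full)
comparison
```

Applied to the report's models, anova(model2_pts, model3_pts) and anova(model3_pts, model4_pts) would test the team and year dummies respectively.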
Model 3: Impact on Field-Goal Percentage
Model 1: FieldGoalsPercentage = β0 + β1 × 3PointPercentage + ϵ
Where,
β0 is the intercept, β1 is the coefficient for 3PointPercentage, ϵ is the error term.
# Model 1: y ~ key x
model1_fg <- lm(FieldGoalsPercentage ~ `3PointPercentage`, data = data)
Model 2
Model 2 includes Games, FieldGoal, and Pace as additional control variables. It estimates how FieldGoalsPercentage is related to 3PointPercentage while accounting for the potential influence of these other variables.
Estimated Equation: FieldGoalsPercentage = β0 + β1 × 3PointPercentage + β2 × Games + β3 × FieldGoal + β4 × Pace + ϵ
Where,
β0 is the intercept (baseline FieldGoalsPercentage). β1 is the coefficient for 3PointPercentage. β2 is the coefficient for Games (shows how FieldGoalsPercentage changes with the number of games). β3 is the coefficient for FieldGoal. β4 is the coefficient for Pace. ϵ is the error term.
# Model 2: y ~ key x + controls
model2_fg <- lm(FieldGoalsPercentage ~ `3PointPercentage` + Games + FieldGoal + Pace, data = data)
Model 3
Model 3 adds factor(Team) to account for team-specific effects by including dummy variables for each team. It adjusts for differences between teams that might affect FieldGoalsPercentage.
Estimated Equation: FieldGoalsPercentage = β0 + β1 × 3PointPercentage + β2 × Games + β3 × FieldGoal + β4 × Pace + γ1 × Team1 + γ2 × Team2 + … + γk × Teamk + ϵ
Where,
β0 is the intercept (the baseline level of FieldGoalsPercentage when all predictors are zero). β1 is the effect of a one-unit change in 3PointPercentage on FieldGoalsPercentage. β2 is the effect of an additional game on FieldGoalsPercentage. β3 is the effect of a one-unit change in FieldGoal on FieldGoalsPercentage. β4 is the effect of a one-unit increase in Pace on FieldGoalsPercentage. γ1 to γk are the coefficients for the team dummies, which capture team-specific variation (relative to a reference team). ϵ is the error term (captures unexplained variability in FieldGoalsPercentage).
# Model 3: y ~ key x + controls + team dummies
model3_fg <- lm(FieldGoalsPercentage ~ `3PointPercentage` + Games + FieldGoal + Pace + factor(Team), data = data)
Model 4
Model 4 includes both team dummies and year dummies, allowing for adjustments based on both team-specific and year-specific effects. It captures the influence of both team and year on FieldGoalsPercentage, making the model more comprehensive.
Estimated Equation: FieldGoalsPercentage = β0 + β1 × 3PointPercentage + β2 × Games + β3 × FieldGoal + β4 × Pace + γ1 × Team1 + γ2 × Team2 + … + γk × Teamk + δ1 × Year1 + δ2 × Year2 + … + δm × Yearm + ϵ
Where,
β0 is the intercept (baseline FieldGoalsPercentage). β1 is the effect of 3PointPercentage on FieldGoalsPercentage. β2 is the effect of additional games on FieldGoalsPercentage. β3 is the effect of FieldGoal on FieldGoalsPercentage. β4 is the effect of Pace on FieldGoalsPercentage. γ1 to γk are the team-specific effects (team dummies, excluding one reference team). δ1 to δm are the year-specific effects (year dummies, excluding one reference year). ϵ is the error term (captures unexplained variability in FieldGoalsPercentage).
# Model 4: y ~ key x + controls + team dummies + year dummies
model4_fg <- lm(FieldGoalsPercentage ~ `3PointPercentage` + Games + FieldGoal + Pace + factor(Team) + factor(Year), data = data)
stargazer(
model1_fg, model2_fg, model3_fg, model4_fg,
type = 'text',
dep.var.labels = c("Field Goal Percentage"),
column.labels = c("Model 1", "Model 2", "Model 3", "Model 4"),
title = "Regression Results for Field Goal Percentage",
align = TRUE,
no.space = TRUE,
column.sep.width = "0.5pt",
keep = c("3PointPercentage", "Games", "FieldGoal", "^Pace$"),
add.lines = list(c("Entity FE", "No", "No", "Yes", "Yes")),
out = "model_summary_fg_cleaned.txt" # separate file so earlier tables are not overwritten
)
##
## Regression Results for Field Goal Percentage
## ===================================================================================================================
## Dependent variable:
## -----------------------------------------------------------------------------------------------
## Field Goal Percentage
## Model 1 Model 2 Model 3 Model 4
## (1) (2) (3) (4)
## -------------------------------------------------------------------------------------------------------------------
## `3PointPercentage` 0.524*** 0.253*** 0.257*** 0.253***
## (0.070) (0.052) (0.057) (0.058)
## Games 0.0002* 0.0003** 0.0004
## (0.0001) (0.0001) (0.0004)
## FieldGoal 0.007*** 0.007*** 0.006***
## (0.001) (0.001) (0.001)
## Pace -0.002*** -0.002*** -0.002***
## (0.0005) (0.001) (0.001)
## -------------------------------------------------------------------------------------------------------------------
## Entity FE No No Yes Yes
## Observations 120 120 120 120
## R2 0.323 0.744 0.851 0.868
## Adjusted R2 0.318 0.735 0.794 0.810
## Residual Std. Error 0.012 (df = 118) 0.008 (df = 115) 0.007 (df = 86) 0.006 (df = 83)
## F Statistic 56.403*** (df = 1; 118) 83.402*** (df = 4; 115) 14.933*** (df = 33; 86) 15.105*** (df = 36; 83)
## ===================================================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Why the Coefficients Change
Model 1: Includes only the 3-Point Percentage, so the coefficient of 0.524 reflects the raw relationship between 3-Point Percentage and Field Goals Percentage without controlling for other variables.
Models 2, 3, and 4: Once the controls (Games, FieldGoal, Pace) are added, the coefficient for 3-Point Percentage falls to about 0.25 and then stays essentially unchanged when team and year dummies are included (0.253, 0.257, 0.253). The other predictors account for some of the variability that Model 1 attributed to 3-Point Percentage alone, while team and year effects add little beyond them.
Why the Pace is Negative
Pace: A negative coefficient for Pace suggests that as the pace of the game increases (i.e., more possessions per game), the Field Goals Percentage tends to decrease. One possible reason is that faster play leads to more rushed, lower-quality shot attempts.
# Models were fitted in the sections above
# Extract AIC and BIC values
aic_values_wp <- c(AIC(model1_wp), AIC(model2_wp), AIC(model3_wp), AIC(model4_wp))
bic_values_wp <- c(BIC(model1_wp), BIC(model2_wp), BIC(model3_wp), BIC(model4_wp))
aic_values_points <- c(AIC(model1_pts), AIC(model2_pts), AIC(model3_pts), AIC(model4_pts))
bic_values_points <- c(BIC(model1_pts), BIC(model2_pts), BIC(model3_pts), BIC(model4_pts))
aic_values_fg <- c(AIC(model1_fg), AIC(model2_fg), AIC(model3_fg), AIC(model4_fg))
bic_values_fg <- c(BIC(model1_fg), BIC(model2_fg), BIC(model3_fg), BIC(model4_fg))
# Print AIC and BIC values
print(data.frame(Model = c("Model 1", "Model 2", "Model 3", "Model 4"),
AIC_WP = aic_values_wp,
BIC_WP = bic_values_wp,
AIC_Points = aic_values_points,
BIC_Points = bic_values_points,
AIC_FG = aic_values_fg,
BIC_FG = bic_values_fg))
## Model AIC_WP BIC_WP AIC_Points BIC_Points AIC_FG BIC_FG
## 1 Model 1 -184.8636 -176.5011 637.3680 645.7305 -711.2876 -702.9252
## 2 Model 2 -203.1888 -186.4638 466.6080 483.3329 -821.7513 -805.0263
## 3 Model 3 -212.3853 -114.8231 457.4017 554.9640 -829.1958 -731.6336
## 4 Model 4 -235.2935 -129.3688 452.9184 558.8431 -837.0166 -731.0919
1. Winning Percentage (WP)
Model 1: AIC = -184.86, BIC = -176.50
Model 2: AIC = -203.19, BIC = -186.46
Model 3: AIC = -212.39, BIC = -114.82
Model 4: AIC = -235.29, BIC = -129.37
Best Model: Model 4 has the lowest AIC (-235.29), while Model 2 has the lowest BIC (-186.46). BIC penalizes the 32 additional team and year dummy coefficients in Model 4 more heavily than AIC does. We prefer Model 4 on the basis of AIC and because it controls for team and year effects. It includes 3-Point Percentage, Games, Field Goal, Pace, and both Team and Year factors.
2. Total Points (Points)
Model 1: AIC = 637.37, BIC = 645.73
Model 2: AIC = 466.61, BIC = 483.33
Model 3: AIC = 457.40, BIC = 554.96
Model 4: AIC = 452.92, BIC = 558.84
Best Model: Model 4 has the lowest AIC (452.92), but Model 2 has by far the lowest BIC (483.33); Models 3 and 4 are close to each other on BIC. We prefer Model 4 on the basis of AIC and its team and year controls, noting that BIC favors the more parsimonious Model 2. Model 4 incorporates 3-Point Percentage, Games, Field Goal, Pace, and both Team and Year factors.
3. Field Goals Percentage (FG%)
Model 1: AIC = -711.29, BIC = -702.93
Model 2: AIC = -821.75, BIC = -805.03
Model 3: AIC = -829.20, BIC = -731.63
Model 4: AIC = -837.02, BIC = -731.09
Best Model: Model 4 has the lowest AIC (-837.02), while Model 2 has the lowest BIC (-805.03); Models 3 and 4 are nearly tied on BIC. We again prefer Model 4 on the basis of AIC and its team and year controls. It includes the same predictors as the Model 4 specifications for the other outcomes.
Overall Best Models:
Winning Percentage: Model 4
Total Points: Model 4
Field Goals Percentage: Model 4
Rationale: Model 4 has the lowest AIC for all three outcome variables and the highest adjusted R2. BIC, which penalizes additional parameters more strongly, favors Model 2 in each case, so the choice of Model 4 rests on AIC and on the value of controlling for team and year effects. Model 4 incorporates 3-Point Percentage, Games, Field Goals, Pace, and both Team and Year effects.
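AIC and BIC differ only in how strongly they penalize each estimated parameter: AIC charges 2 per parameter, BIC charges log(n). The sketch below reproduces both criteria by hand for a simple model fitted to simulated data; `sim` and `fit` are hypothetical.

```r
set.seed(3)
n <- 120  # same sample size as the report's regressions

sim <- data.frame(x = rnorm(n))
sim$y <- 1 + 0.5 * sim$x + rnorm(n)
fit <- lm(y ~ x, data = sim)

ll <- logLik(fit)
k  <- attr(ll, "df")  # estimated parameters: coefficients plus the error variance

aic_manual <- -2 * as.numeric(ll) + 2 * k
bic_manual <- -2 * as.numeric(ll) + log(n) * k

c(manual = aic_manual, builtin = AIC(fit))
c(manual = bic_manual, builtin = BIC(fit))

log(n)  # BIC's per-parameter penalty at n = 120, versus 2 for AIC
```

With n = 120, log(n) is about 4.79, so each of the 32 extra dummy coefficients in Model 4 costs more than twice as much under BIC as under AIC. That is why the two criteria can rank Model 2 and Model 4 differently.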
stargazer(
model4_wp, model4_pts, model4_fg,
type = 'text',
dep.var.labels = c("Winning Percentage", "Points", "Field Goal Percentage"),
column.labels = c("Model 4", "Model 4", "Model 4"),
title = "Model 4 Results for All Three Outcomes",
align = TRUE,
no.space = TRUE,
column.sep.width = "0.5pt",
keep = c("3PointPercentage", "Games", "FieldGoal", "^Pace$"),
add.lines = list(c("Entity FE", "Yes", "Yes", "Yes")), # all three are Model 4, which includes team dummies
out = "model_summary_model4_cleaned.txt" # separate file so earlier tables are not overwritten
)
##
## Model 4 Results for All Three Outcomes
## ===============================================================================
## Dependent variable:
## -------------------------------------------------
## Winning Percentage Points Field Goal Percentage
## Model 4 Model 4 Model 4
## (1) (2) (3)
## -------------------------------------------------------------------------------
## `3PointPercentage` 2.337*** 65.209*** 0.253***
## (0.713) (12.546) (0.058)
## Games 0.015*** 0.056 0.0004
## (0.006) (0.097) (0.0004)
## FieldGoal 0.045*** 1.613*** 0.006***
## (0.010) (0.171) (0.001)
## Pace -0.023*** 0.512*** -0.002***
## (0.007) (0.121) (0.001)
## -------------------------------------------------------------------------------
## Entity FE Yes Yes Yes
## Observations 120 120 120
## R2 0.768 0.908 0.868
## Adjusted R2 0.667 0.868 0.810
## Residual Std. Error (df = 83) 0.080 1.399 0.006
## F Statistic (df = 36; 83) 7.635*** 22.759*** 15.105***
## ===============================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
1. Impact on Winning Percentage
Coefficient for 3PointPercentage: 2.337
Interpretation: A one-percentage-point (0.01) increase in 3PointPercentage is associated with an increase of about 0.023 in winning percentage (2.3 percentage points), holding the controls and the team and year effects fixed. This positive and significant effect suggests that teams that shoot 3-pointers more accurately generally win more. Because the model uses 3PointPercentage rather than 3PointAttempted, it speaks to accuracy rather than shot volume.
Coefficient for Pace: -0.023
Interpretation: The negative coefficient indicates that an increase in the pace of the game (i.e., more possessions per game) is associated with a slight decrease in winning percentage. This implies that while a faster pace might lead to more scoring opportunities, it could negatively impact overall team performance and winning chances.
2. Impact on Total Points Scored
Coefficient for 3PointPercentage: 65.209
Interpretation: A one-percentage-point (0.01) increase in 3PointPercentage is associated with about 0.65 more points scored (65.209 × 0.01). This large and statistically significant effect shows that teams that shoot 3-pointers more accurately score more points, highlighting the impact of 3-point shooting on scoring.
Coefficient for Pace: 0.512
Interpretation: The positive coefficient for Pace indicates that a faster pace leads to more points scored. This suggests that playing at a higher tempo increases scoring opportunities, thereby contributing to higher total points.
3. Impact on Field-Goal Percentage
Coefficient for 3PointPercentage: 0.253
Interpretation: A one-percentage-point increase in 3PointPercentage is associated with an increase of about 0.25 percentage points in field-goal percentage. This indicates that as teams shoot 3-pointers more accurately, their overall field-goal percentage also improves, though the effect is smaller than its impact on total points.
Coefficient for Pace: -0.002
Interpretation: The negative coefficient for Pace suggests that a faster pace is associated with a slight decrease in field-goal percentage. This implies that while a faster pace might increase scoring opportunities, it could also reduce shooting accuracy.
Overall Interpretation
Increase in 3-Point Percentage:
Winning Percentage: Higher 3-point accuracy is significantly associated with a higher winning percentage. Teams that shoot 3-pointers more accurately tend to win more games.
Total Points Scored: Higher 3-point accuracy is associated with a substantial increase in total points scored, highlighting the effectiveness of 3-point shooting in boosting scoring.
Field-Goal Percentage: There is a positive association with field-goal percentage. The coefficient is numerically smaller than in the points model, but the outcomes are measured on different scales, so the magnitudes are not directly comparable.
Influence of Pace:
Winning Percentage: A faster pace is associated with a slightly lower winning percentage, possibly due to less control over the game and more scoring variability.
Total Points Scored: A faster pace is associated with more total points scored, indicating that more possessions lead to more scoring opportunities.
Field-Goal Percentage: A faster pace is associated with a slightly lower field-goal percentage, possibly because faster play leads to lower-quality shot attempts.
# Load necessary libraries
library(ggplot2)
library(car) # For Variance Inflation Factor (VIF)
Residuals vs. Fitted Values Plot
This plot helps verify if the relationship between predictors and the outcome is linear.
# Plot residuals vs. fitted values for Model 4_Winning_Percentage
plot(model4_wp$fitted.values, model4_wp$residuals,
xlab = "Fitted Values", ylab = "Residuals",
main = "Residuals vs Fitted Values (Model 4 WP)")
abline(h = 0, col = "red")
Linearity explanation
The “Residuals vs Fitted Values” plot shows that the residuals (errors) are randomly scattered around zero without any clear pattern. This suggests that the regression model is a good fit for the data, with no obvious issues like non-linearity or heteroscedasticity (uneven spread of residuals). The even spread of residuals across the range of fitted values indicates that the model’s predictions are consistent and reliable.
# Plot residuals vs. fitted values for Model 4_Points
plot(model4_pts$fitted.values, model4_pts$residuals,
xlab = "Fitted Values", ylab = "Residuals",
main = "Residuals vs Fitted Values (Model 4 Points)")
abline(h = 0, col = "red")
The plot suggests that the regression model fits the data well. The residuals are evenly spread around zero and show no patterns, indicating that the model’s assumptions are likely satisfied and that the predictions are reliable across the entire range of fitted values.
# Plot residuals vs. fitted values for Model 4_FieldGoal
plot(model4_fg$fitted.values, model4_fg$residuals,
xlab = "Fitted Values", ylab = "Residuals",
main = "Residuals vs Fitted Values (Model 4 FG)")
abline(h = 0, col = "red")
The residuals are relatively small, mostly staying within a narrow range (approximately -0.015 to 0.005). This indicates that the model’s predictions are quite close to the actual values, with only minor deviations. The plot suggests that the regression model is well-suited for the data, with no obvious issues like bias, non-linearity, or heteroscedasticity. The model’s predictions are consistent and reliable across the entire range of fitted values.
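A visual check of linearity can be backed up with a formal one. The sketch below is a RESET-style test written in base R: it adds powers of the fitted values to the model and tests whether they improve the fit. It runs on simulated stand-in data, so the names `x`, `y`, `fit`, and `fit_aug` are hypothetical and not objects from this report.

```r
set.seed(1)

# Simulated stand-in data (hypothetical; not the NBA data set)
n <- 120
x <- runif(n, 0, 10)
y <- 1 + 2 * x + rnorm(n)

# Baseline linear model
fit <- lm(y ~ x)

# RESET-style check: add squared and cubed fitted values as regressors.
# If the relationship is linear, these terms should not improve the fit.
yhat <- fitted(fit)
fit_aug <- lm(y ~ x + I(yhat^2) + I(yhat^3))

# F-test comparing the two nested models; a small p-value would point
# to non-linearity that the baseline model misses
reset_test <- anova(fit, fit_aug)
print(reset_test)
```

For the models in this report, the same idea would mean adding powers of each model's fitted values to its formula and comparing the two fits with anova().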
Residuals vs. Fitted Values (Absolute Residuals) Plot
# Plot residuals vs. fitted values (absolute residuals) for Model 4
plot(model4_wp$fitted.values, abs(model4_wp$residuals),
xlab = "Fitted Values", ylab = "Absolute Residuals",
main = "Absolute Residuals vs Fitted Values (Model 4 WP)")
abline(h = mean(abs(model4_wp$residuals)), col = "red")
The x-axis represents the “Fitted Values” while the y-axis represents “Absolute Residuals.” The points are scattered around the red horizontal line at approximately 0.05, which marks the mean absolute residual of the model. The points vary across the range of fitted values without displaying a clear trend or pattern, which gives no indication of heteroscedasticity.
# Plot absolute residuals vs. fitted values for Model 4 (Points)
plot(model4_pts$fitted.values, abs(model4_pts$residuals),
xlab = "Fitted Values", ylab = "Absolute Residuals",
main = "Absolute Residuals vs Fitted Values (Model 4 Points)")
# Reference line at the mean absolute residual of the points model
abline(h = mean(abs(model4_pts$residuals)), col = "red")
The scatter plot shows the relationship between “Fitted Values” and “Absolute Residuals,” where the points are scattered without a clear pattern. The x-axis ranges from 105 to 120 for “Fitted Values,” and the y-axis ranges from 0 to 3 for “Absolute Residuals.” The red horizontal line marks the mean absolute residual of the points model, about 0.93 points.
# Plot absolute residuals vs. fitted values for Model 4 (FG)
plot(model4_fg$fitted.values, abs(model4_fg$residuals),
xlab = "Fitted Values", ylab = "Absolute Residuals",
main = "Absolute Residuals vs Fitted Values (Model 4 FG)")
# Reference line at the mean absolute residual of the field-goal model
abline(h = mean(abs(model4_fg$residuals)), col = "red")
Here, the x-axis represents the “Fitted Values” ranging from 0.44 to 0.50, while the y-axis represents the “Absolute Residuals” ranging from 0.00 to 0.015. The plot shows a spread of points without any clear pattern or trend. The red horizontal line marks the mean absolute residual of the field-goal model, about 0.004.
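The absolute-residual plots are a visual check for heteroscedasticity. A numeric counterpart is the studentized (Koenker) Breusch-Pagan test, sketched below in base R on simulated stand-in data. The names `x1`, `x2`, `fit`, and `aux` are hypothetical and not objects from this report.

```r
set.seed(2)

# Simulated stand-in data with constant error variance (hypothetical)
n <- 120
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 0.5 * x1 - 0.3 * x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)

# Auxiliary regression: squared residuals on the same predictors
aux <- lm(resid(fit)^2 ~ x1 + x2)

# Studentized Breusch-Pagan statistic: n times the R-squared of the
# auxiliary regression, compared with a chi-squared distribution whose
# degrees of freedom equal the number of predictors
bp_stat <- n * summary(aux)$r.squared
bp_df <- length(coef(aux)) - 1
bp_p <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

print(c(statistic = bp_stat, df = bp_df, p_value = bp_p))
```

A large p-value is consistent with constant residual variance. The car package loaded earlier also provides ncvTest() for a closely related score test.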
Q-Q Plot
# Q-Q plot for Model 4
qqnorm(model4_wp$residuals, main = "Q-Q Plot (Model 4 WP)")
qqline(model4_wp$residuals, col = "red")
# Q-Q plot for Model 4
qqnorm(model4_pts$residuals, main = "Q-Q Plot (Model 4 PTS)")
qqline(model4_pts$residuals, col = "red")
# Q-Q plot for Model 4
qqnorm(model4_fg$residuals, main = "Q-Q Plot (Model 4 FG)")
qqline(model4_fg$residuals, col = "red")
The Q-Q plots suggest that the residuals of all three models (winning percentage, points, and field-goal percentage) are approximately normally distributed. Most points lie on or close to the reference line. In each of the three plots a few points deviate from the line, mainly at the tails, and these may be outliers.
# Shapiro-Wilk test for normality
shapiro.test(model4_wp$residuals)
##
## Shapiro-Wilk normality test
##
## data: model4_wp$residuals
## W = 0.9924, p-value = 0.7582
W Statistic: The value 0.9924 is close to 1, indicating that the residuals are nearly normally distributed. p-value: The p-value of 0.7582 is much higher than common significance levels (e.g., 0.05 or 0.01). This means that the test does not provide enough evidence to reject the null hypothesis of normality.
# Shapiro-Wilk test for normality
shapiro.test(model4_pts$residuals)
##
## Shapiro-Wilk normality test
##
## data: model4_pts$residuals
## W = 0.99405, p-value = 0.8942
W Statistic: The value 0.99405 is very close to 1, which indicates that the residuals from model4_pts are nearly normally distributed. p-value: The p-value of 0.8942 is substantially above common thresholds for significance (e.g., 0.05 or 0.01). This means that the test does not provide sufficient evidence to reject the null hypothesis of normality.
# Shapiro-Wilk test for normality
shapiro.test(model4_fg$residuals)
##
## Shapiro-Wilk normality test
##
## data: model4_fg$residuals
## W = 0.99172, p-value = 0.6946
W Statistic: The value 0.99172 suggests that the residuals are close to normally distributed, but not perfectly. p-value: The p-value of 0.6946 is well above common significance levels (like 0.05 or 0.01). This means that there is no significant evidence to reject the null hypothesis of normality.
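When several models are checked, it can help to collect the Shapiro-Wilk results in one table. The sketch below does this for two models fitted to simulated stand-in data. The data frame `sim` and the list `models` are hypothetical. In this report the list would hold model4_wp, model4_pts, and model4_fg.

```r
set.seed(3)

# Simulated stand-in data and models (hypothetical)
sim <- data.frame(x = rnorm(120))
sim$y1 <- 1 + 2 * sim$x + rnorm(120)
sim$y2 <- 3 - sim$x + rnorm(120)
models <- list(m1 = lm(y1 ~ x, data = sim),
               m2 = lm(y2 ~ x, data = sim))

# Run the Shapiro-Wilk test on each model's residuals and keep the
# W statistic and p-value, one row per model
normality <- t(sapply(models, function(m) {
  s <- shapiro.test(resid(m))
  c(W = unname(s$statistic), p_value = s$p.value)
}))

print(round(normality, 4))
```

A p-value above the chosen threshold means the test found no evidence against normality. It does not prove that the residuals are normal.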
# Calculate VIF for Model 4
vif(model4_wp)
## GVIF Df GVIF^(1/(2*Df))
## `3PointPercentage` 2.489760 1 1.577897
## Games 18.112269 1 4.255851
## FieldGoal 4.329421 1 2.080726
## Pace 3.620380 1 1.902730
## factor(Team) 11.584348 29 1.043140
## factor(Year) 27.631588 3 1.738739
vif(model4_pts)
## GVIF Df GVIF^(1/(2*Df))
## `3PointPercentage` 2.489760 1 1.577897
## Games 18.112269 1 4.255851
## FieldGoal 4.329421 1 2.080726
## Pace 3.620380 1 1.902730
## factor(Team) 11.584348 29 1.043140
## factor(Year) 27.631588 3 1.738739
vif(model4_fg)
## GVIF Df GVIF^(1/(2*Df))
## `3PointPercentage` 2.489760 1 1.577897
## Games 18.112269 1 4.255851
## FieldGoal 4.329421 1 2.080726
## Pace 3.620380 1 1.902730
## factor(Team) 11.584348 29 1.043140
## factor(Year) 27.631588 3 1.738739
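High VIF Values: Games has the highest adjusted value (GVIF = 18.11, GVIF^(1/(2*Df)) = 4.26), suggesting that it is highly collinear with other predictors in the model, possibly the year fixed effects, whose raw GVIF (27.63) is also large. This could affect the stability and interpretability of its regression coefficient.
Moderate VIF Values: FieldGoal (GVIF = 4.33, adjusted value 2.08) shows some level of multicollinearity, but it is not extremely high.
Low VIF Values: 3PointPercentage, Pace, factor(Team), and factor(Year) have adjusted values below 2, indicating less concern regarding multicollinearity. The values are identical across the three models because the models share the same predictors, and VIFs depend only on the predictors.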
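The adjusted column in the output above is the GVIF raised to the power 1/(2*Df), which makes terms with different degrees of freedom comparable. The sketch below reproduces that column from the reported GVIF and Df values. Squaring the adjusted value puts it back on the scale of an ordinary VIF, so it can be compared with a rule-of-thumb cutoff. The vector names are illustrative.

```r
# GVIF and Df values copied from the vif() output above
gvif <- c("3PointPercentage" = 2.489760,
          "Games" = 18.112269,
          "FieldGoal" = 4.329421,
          "Pace" = 3.620380,
          "factor(Team)" = 11.584348,
          "factor(Year)" = 27.631588)
df <- c(1, 1, 1, 1, 29, 3)

# Adjusted GVIF: comparable across terms with different Df
adj <- gvif^(1 / (2 * df))

# Squared adjusted value: on the same scale as an ordinary VIF
vif_scale <- adj^2

print(round(cbind(GVIF = gvif, Df = df,
                  Adjusted = adj, Squared = vif_scale), 3))
```

For single-Df terms such as Games, the squared adjusted value equals the GVIF itself. For factor(Team) and factor(Year), it is much smaller than the raw GVIF.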
# Predictions for fixed effects models
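# predict() takes its data through `newdata`; an argument named `data` is silently ignored
predictions_fe_winning <- predict(model4_wp, newdata = data)
predictions_fe_total_points <- predict(model4_pts, newdata = data)
predictions_fe_field_goal_percentage <- predict(model4_fg, newdata = data)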
# Comparison with actual values
comparison_winning_percentage <- data.frame(Actual = data$WinningPercentage, Predicted = predictions_fe_winning)
comparison_winning_percentage <- round(comparison_winning_percentage, 3)
comparison_total_points <- data.frame(Actual = data$Points, Predicted = predictions_fe_total_points)
comparison_total_points <- round(comparison_total_points)
comparison_field_goal_percentage <- data.frame(Actual = data$FieldGoalsPercentage, Predicted = predictions_fe_field_goal_percentage)
comparison_field_goal_percentage <- round(comparison_field_goal_percentage, 3)
head(comparison_total_points)
## Actual Predicted
## 1 112 112
## 2 114 114
## 3 112 111
## 4 103 102
## 5 107 106
## 6 107 108
head(comparison_winning_percentage)
## Actual Predicted
## 1 0.299 0.281
## 2 0.667 0.633
## 3 0.486 0.453
## 4 0.354 0.320
## 5 0.338 0.294
## 6 0.292 0.384
head(comparison_field_goal_percentage)
## Actual Predicted
## 1 0.449 0.445
## 2 0.461 0.462
## 3 0.448 0.457
## 4 0.434 0.437
## 5 0.447 0.454
## 6 0.458 0.461
# Define a function to calculate MAE, MSE, and R-squared
calculate_metrics <- function(actual, predicted) {
# Mean Absolute Error (MAE)
mae <- mean(abs(actual - predicted))
# Mean Squared Error (MSE)
mse <- mean((actual - predicted)^2)
# R-squared
ss_total <- sum((actual - mean(actual))^2)
ss_residual <- sum((actual - predicted)^2)
r_squared <- 1 - (ss_residual / ss_total)
return(c(MAE = mae, MSE = mse, R_squared = r_squared))
}
# Calculate metrics for each model
metrics_winning_percentage <- calculate_metrics(data$WinningPercentage, predictions_fe_winning)
metrics_total_points <- calculate_metrics(data$Points, predictions_fe_total_points)
metrics_field_goal_percentage <- calculate_metrics(data$FieldGoalsPercentage, predictions_fe_field_goal_percentage)
# Print metrics
cat("Metrics for Winning Percentage Model:\n")
## Metrics for Winning Percentage Model:
print(metrics_winning_percentage)
## MAE MSE R_squared
## 0.052265223 0.004374383 0.768068301
cat("\nMetrics for Total Points Model:\n")
##
## Metrics for Total Points Model:
print(metrics_total_points)
## MAE MSE R_squared
## 0.9304140 1.3540631 0.9080146
cat("\nMetrics for Field Goal Percentage Model:\n")
##
## Metrics for Field Goal Percentage Model:
print(metrics_field_goal_percentage)
## MAE MSE R_squared
## 4.295843e-03 2.905416e-05 8.675778e-01
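The metrics above are computed on the same data used to fit the models, so they describe in-sample fit. With 30 team indicators and 4 year indicators in each model, in-sample R-squared can be optimistic. The sketch below shows one way to compare in-sample and held-out R-squared, using simulated stand-in data. The data frame `sim`, the split, and the helper `r_squared` are hypothetical and not part of the original analysis.

```r
set.seed(4)

# Simulated stand-in data (hypothetical; not the NBA data set)
n <- 120
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 1 + 0.8 * sim$x1 - 0.5 * sim$x2 + rnorm(n, sd = 0.5)

# Hold out 30 rows for testing and fit on the remaining 90
test_idx <- sample(n, size = 30)
train <- sim[-test_idx, ]
test <- sim[test_idx, ]
fit <- lm(y ~ x1 + x2, data = train)

# Same R-squared formula as in calculate_metrics()
r_squared <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

# In-sample fit versus fit on rows the model has not seen
in_sample <- r_squared(train$y, predict(fit, newdata = train))
out_sample <- r_squared(test$y, predict(fit, newdata = test))

print(c(in_sample = in_sample, out_of_sample = out_sample))
```

With team fixed effects, a random split of rows would need every team to appear in the training rows. Holding out one season is an alternative only if the year effect is handled separately.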
As per the fitted models:
Winning Percentage: Higher 3-point percentage is associated with a higher winning percentage, while pace enters the model as a control with a small negative coefficient. The full model, including the team and year fixed effects, explains 76.81% of the in-sample variance in winning percentage.
Total Points Scored: Higher 3-point percentage and a faster pace are both associated with more points scored. The model captures 90.80% of the in-sample variance in total points scored.
Field Goal Percentage: Higher 3-point percentage is associated with a higher field-goal percentage (coefficient 0.253), while a faster pace is associated with a slightly lower one. The model explains 86.76% of the in-sample variance in field-goal percentage.
Overall, 3-point shooting is significantly associated with each of these statistical measures. Because the models include pace only as a control and not as an interaction with 3PointPercentage, they show the separate association of pace with each outcome rather than how pace changes the effect of 3-point shooting.