NBA 3Point Project

# Set working directory and path to data
  setwd("C:/Users/LENOVO/Downloads/Regression Model")  # Example path on Windows


# Clear the workspace
  rm(list = ls()) # Clear environment
  gc()            # Clear unused memory

##          used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 524671 28.1    1168867 62.5   660385 35.3
## Vcells 968739  7.4    8388608 64.0  1769879 13.6

  cat("\f")       # Clear the console

  graphics.off()  # Close any open graphics devices

Introduction

Research Question:

How does an increase in the number of 3-point shots taken by NBA teams affect winning percentage, total points scored, and field-goal percentage? How is this relationship influenced by the pace of the game?

Here,

Dependent Variables are winning percentage, total points scored and field-goal percentage.

Independent Variables are the 3-point shooting measures for each NBA team, i.e. 3-pointers made, 3-point attempts and 3-point percentage.

Control Variables are games, field goal and pace of the game.

For this research question we use NBA team statistics for the years 2019 to 2022.

Data Source: Basketball Reference.

Loading Libraries and Reading Data(2019):

We pulled two tables from the Basketball Reference site for each year: Per Game Stats and Advanced Stats. The Advanced Stats table is needed because the Win, Loss and Pace variables are not present in the Per Game Stats table.

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.3.3

library(stargazer)

## 
## Please cite as:

##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.

##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer

library(readxl)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

#Season Stats Data
df <- read.csv("C:\\Users\\LENOVO\\Downloads\\Regression Model\\Project\\sportsref_download.csv")

head(df)

##   Rk                    Team  G    MP   FG  FGA   FG.  X3P X3PA  X3P.  X2P X2PA
## 1  1        Milwaukee Bucks* 73 241.0 43.3 90.9 0.476 13.8 38.9 0.355 29.5 52.0
## 2  2        Houston Rockets* 72 241.4 40.8 90.4 0.451 15.6 45.3 0.345 25.1 45.2
## 3  3       Dallas Mavericks* 75 242.3 41.7 90.3 0.461 15.1 41.3 0.367 26.5 49.0
## 4  4   Los Angeles Clippers* 72 241.4 41.6 89.2 0.466 12.4 33.5 0.371 29.1 55.8
## 5  5    New Orleans Pelicans 72 242.1 42.6 91.6 0.465 13.6 36.9 0.370 28.9 54.8
## 6  6 Portland Trail Blazers* 74 241.0 42.2 91.2 0.463 12.9 34.1 0.377 29.3 57.1
##    X2P.   FT  FTA   FT.  ORB  DRB  TRB  AST STL BLK  TOV   PF   PTS
## 1 0.567 18.3 24.7 0.742  9.5 42.2 51.7 25.9 7.2 5.9 15.1 19.6 118.7
## 2 0.557 20.6 26.1 0.791  9.8 34.5 44.3 21.6 8.7 5.2 14.7 21.8 117.8
## 3 0.541 18.6 23.8 0.779 10.5 36.4 46.9 24.7 6.1 4.8 12.7 19.5 117.0
## 4 0.522 20.8 26.3 0.791 10.7 37.0 47.7 23.7 7.1 4.7 14.6 22.1 116.3
## 5 0.528 17.1 23.4 0.729 11.1 35.4 46.5 26.8 7.5 5.0 16.4 21.2 115.8
## 6 0.514 17.7 22.1 0.804 10.2 35.1 45.3 20.6 6.3 6.1 12.8 21.7 115.0

#Advanced Stats Data
df1 <- read.csv("C:\\Users\\LENOVO\\Downloads\\sportsref_download.csv")

head(df1)

##   Rk                  Team  Age  W  L PW PL   MOV   SOS  SRS  ORtg  DRtg NRtg
## 1  1      Milwaukee Bucks* 29.2 56 17 57 16 10.08 -0.67 9.41 112.4 102.9  9.5
## 2  2       Boston Celtics* 25.3 48 24 50 22  6.31 -0.47 5.83 113.3 107.0  6.3
## 3  3 Los Angeles Clippers* 27.4 49 23 50 22  6.44  0.21 6.66 113.9 107.6  6.3
## 4  4      Toronto Raptors* 26.6 53 19 50 22  6.24 -0.26 5.97 111.1 105.0  6.1
## 5  5   Los Angeles Lakers* 29.5 52 19 48 23  5.79  0.49 6.28 112.0 106.3  5.7
## 6  6     Dallas Mavericks* 26.1 43 32 49 26  4.95 -0.07 4.87 116.7 111.7  5.0
##    Pace   FTr X3PAr   TS.  X Offense.Four.Factors  X.1  X.2    X.3 X.4
## 1 105.1 0.271 0.428 0.583 NA                 eFG% TOV% ORB% FT/FGA  NA
## 2  99.5 0.259 0.386 0.570 NA                0.552 12.9 20.7  0.201  NA
## 3 101.5 0.295 0.375 0.577 NA                0.531 12.2 23.9  0.207  NA
## 4 100.9 0.264 0.421 0.574 NA                0.535 12.6 23.5  0.233  NA
## 5 100.9 0.276 0.358 0.573 NA                0.536 13.1 21.3   0.21  NA
## 6  99.3 0.264 0.457 0.581 NA                0.542 13.3 24.5  0.201  NA
##   Defense.Four.Factors  X.5  X.6    X.7 X.8              X.9    X.10      X.11
## 1                 eFG% TOV% DRB% FT/FGA  NA            Arena Attend. Attend./G
## 2                0.489   12 81.6  0.178  NA     Fiserv Forum 549,036    17,711
## 3                0.509 13.5 77.4  0.215  NA        TD Garden 610,864    19,090
## 4                0.506 12.2 77.6  0.206  NA   STAPLES Center 610,176    19,068
## 5                0.502 14.6 76.7  0.202  NA Scotiabank Arena 633,456    19,796
## 6                0.515 14.1 78.8  0.205  NA   STAPLES Center 588,907    18,997

Data Cleaning:

Removing Asterisks:

We modify both datasets with mutate() and gsub(), replacing every asterisk (*) with an empty string in all columns. On Basketball Reference an asterisk after a team name marks a playoff team, so it has to be removed before the team names can be used as a merge key. Note that gsub() returns character vectors, so this step also converts every column to character.

df <- df %>%
  mutate(across(everything(), ~ gsub("\\*", "", .)))

df1 <- df1 %>%
  mutate(across(everything(), ~ gsub("\\*", "", .)))
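Because gsub() always returns a character vector, applying it across everything() turns the numeric columns into text as well. The following is a minimal sketch of a narrower alternative, assuming that only the Team column carries asterisks. The toy data frame is illustrative and is not the project data.

```r
# Toy stand-in for the per-game table (two rows, two columns)
toy <- data.frame(
  Team = c("Milwaukee Bucks*", "New Orleans Pelicans"),
  PTS  = c(118.7, 115.8)
)

# Strip the asterisk from Team only; fixed = TRUE treats "*" as a literal character
toy$Team <- gsub("*", "", toy$Team, fixed = TRUE)

# PTS keeps its numeric type because it was never passed through gsub()
str(toy)
```

With this approach the later character-to-numeric conversion would not be needed for the columns that were left untouched.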

Checking Column Names:

We check the column names to see which variables each dataset contains, then remove the variables that are not needed for this project.

names(df)

##  [1] "Rk"   "Team" "G"    "MP"   "FG"   "FGA"  "FG."  "X3P"  "X3PA" "X3P."
## [11] "X2P"  "X2PA" "X2P." "FT"   "FTA"  "FT."  "ORB"  "DRB"  "TRB"  "AST" 
## [21] "STL"  "BLK"  "TOV"  "PF"   "PTS"

names(df1)

##  [1] "Rk"                   "Team"                 "Age"                 
##  [4] "W"                    "L"                    "PW"                  
##  [7] "PL"                   "MOV"                  "SOS"                 
## [10] "SRS"                  "ORtg"                 "DRtg"                
## [13] "NRtg"                 "Pace"                 "FTr"                 
## [16] "X3PAr"                "TS."                  "X"                   
## [19] "Offense.Four.Factors" "X.1"                  "X.2"                 
## [22] "X.3"                  "X.4"                  "Defense.Four.Factors"
## [25] "X.5"                  "X.6"                  "X.7"                 
## [28] "X.8"                  "X.9"                  "X.10"                
## [31] "X.11"

The variables in the Per Game Stats dataset are defined below. Each row is a team, and the shooting, rebounding and other counting statistics are per-game averages for that team's season. R replaces the % sign with a period on import, so FG% appears as FG. in the data frame.

Rk: Rank
- The team's rank in the source table, which here follows points per game.
Team:
- The name of the NBA team.
G: Games
- The number of games played by the team during the season.
MP: Minutes Played
- The team's total player minutes per game.
FG: Field Goals Made
- The number of field goals made by the team per game.
FGA: Field Goals Attempted
- The number of field goals attempted by the team per game.
FG%: Field Goal Percentage
- The share of field goal attempts that were made. Calculated as FG / FGA and stored as a proportion (for example 0.476).
X3P: Three-Point Field Goals Made
- The number of three-point shots made by the team per game.
X3PA: Three-Point Field Goals Attempted
- The number of three-point shots attempted by the team per game.
X3P%: Three-Point Field Goal Percentage
- The share of three-point attempts that were made. Calculated as X3P / X3PA and stored as a proportion.
X2P: Two-Point Field Goals Made
- The number of two-point shots made by the team per game.
X2PA: Two-Point Field Goals Attempted
- The number of two-point shots attempted by the team per game.
X2P%: Two-Point Field Goal Percentage
- The share of two-point attempts that were made. Calculated as X2P / X2PA and stored as a proportion.
FT: Free Throws Made
- The number of free throws made by the team per game.
FTA: Free Throws Attempted
- The number of free throws attempted by the team per game.
FT%: Free Throw Percentage
- The share of free throw attempts that were made. Calculated as FT / FTA and stored as a proportion.
ORB: Offensive Rebounds
- The number of rebounds collected by the team on the offensive end of the court per game.
DRB: Defensive Rebounds
- The number of rebounds collected by the team on the defensive end of the court per game.
TRB: Total Rebounds
- The number of rebounds (both offensive and defensive) collected by the team per game. Calculated as ORB + DRB.
AST: Assists
- The number of assists per game, i.e. passes that directly lead to a score.
STL: Steals
- The number of times per game the team gains possession by intercepting or stealing the ball from the opponent.
BLK: Blocks
- The number of opponent shots blocked by the team per game.
TOV: Turnovers
- The number of times per game the team loses possession of the ball to the opponent due to errors.
PF: Personal Fouls
- The number of personal fouls committed by the team per game.
PTS: Points
- The number of points scored by the team per game.

These definitions cover all columns of the Per Game Stats dataset.

Removing Columns Using Subset:

In this step we keep only the key variables from the Advanced Stats dataset, i.e. Win, Loss and Pace, together with Team, which serves as the merge key. The remaining Advanced Stats variables are removed because they are not needed for this research question. The reduced dataset is then merged with the Stats dataset by team.

# Drop every Advanced Stats column except Team, W, L and Pace
df_new <- subset(df1, select = -c(Age, Rk, PW, PL, MOV, SOS, SRS, ORtg, DRtg, NRtg, FTr, X3PAr, TS., X, Offense.Four.Factors, X.1, X.2, X.3, X.4, Defense.Four.Factors, X.5, X.6, X.7, X.8, X.9, X.10, X.11))

head(df_new)

##                   Team  W  L  Pace
## 1      Milwaukee Bucks 56 17 105.1
## 2       Boston Celtics 48 24  99.5
## 3 Los Angeles Clippers 49 23 101.5
## 4      Toronto Raptors 53 19 100.9
## 5   Los Angeles Lakers 52 19 100.9
## 6     Dallas Mavericks 43 32  99.3
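The same result can be reached by naming the columns to keep instead of the columns to drop, which may be easier to maintain when the exported column names differ between seasons. This is only a sketch under that assumption; the toy data frame and the names adv, keep and adv_small are hypothetical.

```r
# Toy stand-in for the Advanced Stats table with an abbreviated column set
adv <- data.frame(
  Team = c("Milwaukee Bucks", "Boston Celtics"),
  Age  = c(29.2, 25.3),
  W    = c(56, 48),
  L    = c(17, 24),
  Pace = c(105.1, 99.5)
)

keep <- c("Team", "W", "L", "Pace")

# Fail early if an expected column is missing from this season's export
stopifnot(all(keep %in% names(adv)))

# Keep only the listed columns, in the listed order
adv_small <- adv[, keep]
names(adv_small)
```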

Merging Data Frames By Team:

merged_df <- merge(df, df_new, by = "Team")

names(merged_df)

##  [1] "Team" "Rk"   "G"    "MP"   "FG"   "FGA"  "FG."  "X3P"  "X3PA" "X3P."
## [11] "X2P"  "X2PA" "X2P." "FT"   "FTA"  "FT."  "ORB"  "DRB"  "TRB"  "AST" 
## [21] "STL"  "BLK"  "TOV"  "PF"   "PTS"  "W"    "L"    "Pace"
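By default merge() performs an inner join, so a team whose name is spelled differently in the two tables is dropped without any message. A minimal sketch of a check for this, using toy data in which "LA Clippers" is a deliberately mismatched, hypothetical spelling:

```r
# Toy tables; the third team name is spelled differently on purpose
stats <- data.frame(
  Team = c("Atlanta Hawks", "Boston Celtics", "LA Clippers"),
  PTS  = c(111.8, 113.7, 116.3)
)
adv <- data.frame(
  Team = c("Atlanta Hawks", "Boston Celtics", "Los Angeles Clippers"),
  Pace = c(103.0, 99.5, 101.5)
)

merged <- merge(stats, adv, by = "Team")

# Teams present in one table but not the other
unmatched_stats <- setdiff(stats$Team, adv$Team)
unmatched_adv   <- setdiff(adv$Team, stats$Team)

nrow(merged)      # fewer rows than the inputs signals dropped teams
unmatched_stats
unmatched_adv
```

If both setdiff() results are empty and the row count matches the inputs, no team was lost in the merge.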

The Rank and 2-point shot variables are then removed from the merged dataset.

df_2019 <- subset(merged_df, select = -c(Rk, X2P, X2PA, X2P.))

Adding a Year Column:

The dataset has no year column, so we create one and set it to 2019. This keeps the year of each row identifiable once the four yearly datasets are combined.

df_2019 <- df_2019 %>% mutate(Year = 2019)

names(df_2019)

##  [1] "Team" "G"    "MP"   "FG"   "FGA"  "FG."  "X3P"  "X3PA" "X3P." "FT"  
## [11] "FTA"  "FT."  "ORB"  "DRB"  "TRB"  "AST"  "STL"  "BLK"  "TOV"  "PF"  
## [21] "PTS"  "W"    "L"    "Pace" "Year"

Renaming and Relocating The Column:

The variable names in the dataset are terse and hard to read; for example, the 3-point column is named X3P. The columns are therefore renamed to more readable names.

names(df_2019) <- c("Team", "Games", "MinutesPlayed", "FieldGoal", "FieldGoalsAttempted", "FieldGoalsPercentage", "3Point", "3PointAttempted", "3PointPercentage", "FreeThrows", "FreeThrowsAttempted", "FreeThrowPercentage", "Offensive Rebounds", "Defensive Rebounds", "Total Rebounds", "Assists", "Steals", "Blocks", "Turnovers", "PersonalFouls", "Points", "Wins", "Losses", "Pace", "Year")
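Assigning a full vector to names() depends on the columns being in exactly the expected order. As a hedged alternative, the sketch below renames by lookup so that each old name is mapped explicitly. The lookup vector and the toy data frame are hypothetical and cover only three columns. It also shows that names starting with a digit, such as 3Point, have to be wrapped in backticks when accessed.

```r
# Toy row with three of the original column names
toy <- data.frame(Team = "Atlanta Hawks", G = 67, X3P = 12, X3PA = 36.1)

# Hypothetical mapping from old names to readable names
lookup <- c(G = "Games", X3P = "3Point", X3PA = "3PointAttempted")

# Rename only the columns that appear in the lookup
hits <- names(toy) %in% names(lookup)
names(toy)[hits] <- unname(lookup[names(toy)[hits]])

names(toy)

# Non-syntactic names need backticks
toy$`3Point`
```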

Wins, Losses, Year and Pace are relocated because merging placed these columns at the end of the dataset. Wins and Losses are moved before MinutesPlayed, Year before Games, and Pace before Points.

df_2019 <- df_2019 %>%
  relocate(Wins, Losses, .before = MinutesPlayed) %>%
  relocate(Year, .before = Games) %>% relocate(Pace, .before = Points)

The same steps are repeated for the datasets of the remaining years.
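Since the steps are identical for each year, they could be wrapped in one function instead of being copied. The sketch below is an illustration in base R, not the code used in this project: prepare_season() is a hypothetical helper, it cleans only the Team column, and the toy inputs contain just a few columns.

```r
# Hypothetical helper: prepare one season from its two raw tables
prepare_season <- function(stats, advanced, year) {
  # Strip playoff asterisks from the team name only
  stats$Team    <- gsub("*", "", stats$Team, fixed = TRUE)
  advanced$Team <- gsub("*", "", advanced$Team, fixed = TRUE)

  # Keep only the Advanced Stats columns the analysis needs
  advanced <- advanced[, c("Team", "W", "L", "Pace")]

  merged <- merge(stats, advanced, by = "Team")

  # Drop rank and two-point columns if they are present
  drop_cols <- c("Rk", "X2P", "X2PA", "X2P.")
  merged <- merged[, setdiff(names(merged), drop_cols), drop = FALSE]

  merged$Year <- year
  merged
}

# Toy inputs with an abbreviated column set
stats_2019 <- data.frame(
  Rk   = 1:2,
  Team = c("Milwaukee Bucks*", "Dallas Mavericks*"),
  X3PA = c(38.9, 41.3),
  X2P  = c(29.5, 26.5)
)
adv_2019 <- data.frame(
  Team = c("Milwaukee Bucks*", "Dallas Mavericks*"),
  W    = c(56, 43),
  L    = c(17, 32),
  Pace = c(105.1, 99.3)
)

season_2019 <- prepare_season(stats_2019, adv_2019, 2019)
season_2019
```

Calling the helper once per year and combining the results with bind_rows() or rbind() would then replace the repeated blocks.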

Reading Data(2020):

Data Cleaning:

#Stats Dataset
df21 <- read.csv("C:\\Users\\LENOVO\\Downloads\\Regression Model\\Project\\sportsref_download (1).csv")
#Advanced Dataset
df2 <- read.csv("C:\\Users\\LENOVO\\Downloads\\2020-21.csv")

head(df21)

##   Rk                    Team  G    MP   FG  FGA   FG.  X3P X3PA  X3P.  X2P X2PA
## 1  1        Milwaukee Bucks* 72 240.7 44.7 91.8 0.487 14.4 37.1 0.389 30.3 54.7
## 2  2          Brooklyn Nets* 72 241.7 43.1 87.3 0.494 14.2 36.1 0.392 29.0 51.2
## 3  3     Washington Wizards* 72 241.7 43.2 90.9 0.475 10.2 29.0 0.351 33.0 61.9
## 4  4              Utah Jazz* 72 241.0 41.3 88.1 0.468 16.7 43.0 0.389 24.5 45.1
## 5  5 Portland Trail Blazers* 72 240.3 41.3 91.1 0.453 15.7 40.8 0.385 25.6 50.3
## 6  6           Phoenix Suns* 72 242.8 43.3 88.3 0.490 13.1 34.6 0.378 30.3 53.7
##    X2P.   FT  FTA   FT.  ORB  DRB  TRB  AST STL BLK  TOV   PF   PTS
## 1 0.554 16.2 21.4 0.760 10.3 37.8 48.1 25.5 8.1 4.6 13.8 17.3 120.1
## 2 0.565 18.1 22.5 0.804  8.9 35.5 44.4 26.8 6.7 5.3 13.5 19.0 118.6
## 3 0.533 20.1 26.2 0.769  9.7 35.5 45.2 25.5 7.3 4.1 14.4 21.6 116.6
## 4 0.544 17.2 21.5 0.799 10.6 37.6 48.3 23.7 6.6 5.2 14.2 18.5 116.4
## 5 0.509 17.8 21.6 0.823 10.6 33.9 44.5 21.3 6.9 5.0 11.1 18.9 116.1
## 6 0.563 15.6 18.7 0.834  8.8 34.2 42.9 26.9 7.2 4.3 12.5 19.1 115.3

# Remove asterisks
df21 <- df21 %>%
  mutate(across(everything(), ~ gsub("\\*", "", .)))

df2 <- df2 %>%
  mutate(across(everything(), ~ gsub("\\*", "", .)))

#Removing Columns
df_new1 <- subset(df2, select = -c(Age, Rk, PW, PL, MOV, SOS, SRS, ORtg, DRtg, NRtg, FTr, X3PAr, TS., X, Offense.Four.Factors, X.1, X.2, X.3, X.4, Defense.Four.Factors, X.5, X.6, X.7, X.8, X.9, X.10, X.11))

head(df_new1)

##                   Team  W  L  Pace
## 1            Utah Jazz 52 20  98.5
## 2 Los Angeles Clippers 47 25  96.9
## 3         Phoenix Suns 51 21  97.2
## 4      Milwaukee Bucks 46 26 102.2
## 5   Philadelphia 76ers 49 23  99.5
## 6       Denver Nuggets 47 25  97.1

#Merge by Team
merged_df1 <- merge(df21, df_new1, by = "Team")

head(merged_df1)

##                  Team Rk  G    MP   FG  FGA   FG.  X3P X3PA  X3P.  X2P X2PA
## 1       Atlanta Hawks 11 72 241.7 40.8 87.2 0.468 12.4 33.4 0.373 28.4 53.9
## 2      Boston Celtics 16 72 241.4 41.5 88.9 0.466 13.6 36.4 0.374 27.9 52.5
## 3       Brooklyn Nets  2 72 241.7 43.1 87.3 0.494 14.2 36.1 0.392   29 51.2
## 4   Charlotte Hornets 23 72   241 39.9 87.8 0.455 13.7   37 0.369 26.3 50.8
## 5       Chicago Bulls 21 72 241.4 42.2 88.6 0.476 12.6   34  0.37 29.6 54.6
## 6 Cleveland Cavaliers 30 72 242.1 38.6 85.8  0.45   10 29.7 0.336 28.6   56
##    X2P.   FT  FTA   FT.  ORB  DRB  TRB  AST STL BLK  TOV   PF   PTS  W  L Pace
## 1 0.526 19.7 24.2 0.812 10.6 35.1 45.6 24.1   7 4.8 13.2 19.3 113.7 41 31 97.6
## 2  0.53 16.1 20.8 0.775 10.6 33.6 44.3 23.5 7.7 5.3 14.1 20.4 112.6 36 36 98.3
## 3 0.565 18.1 22.5 0.804  8.9 35.5 44.4 26.8 6.7 5.3 13.5   19 118.6 48 24 99.5
## 4 0.517 15.9 20.9 0.761 10.6 33.2 43.8 26.8 7.8 4.8 14.8   18 109.5 33 39 98.3
## 5 0.542 13.8 17.5 0.791  9.6 35.3   45 26.8 6.7 4.2 15.1 18.9 110.7 31 41   99
## 6  0.51 16.7 22.4 0.743 10.4 32.3 42.8 23.8 7.8 4.5 15.5 18.2 103.8 22 50 97.3

df_2020 <- subset(merged_df1, select = -c(Rk, X2P, X2PA, X2P.))

# Create Year Variable
df_2020 <- df_2020 %>% mutate(Year = 2020)

head(df_2020)

##                  Team  G    MP   FG  FGA   FG.  X3P X3PA  X3P.   FT  FTA   FT.
## 1       Atlanta Hawks 72 241.7 40.8 87.2 0.468 12.4 33.4 0.373 19.7 24.2 0.812
## 2      Boston Celtics 72 241.4 41.5 88.9 0.466 13.6 36.4 0.374 16.1 20.8 0.775
## 3       Brooklyn Nets 72 241.7 43.1 87.3 0.494 14.2 36.1 0.392 18.1 22.5 0.804
## 4   Charlotte Hornets 72   241 39.9 87.8 0.455 13.7   37 0.369 15.9 20.9 0.761
## 5       Chicago Bulls 72 241.4 42.2 88.6 0.476 12.6   34  0.37 13.8 17.5 0.791
## 6 Cleveland Cavaliers 72 242.1 38.6 85.8  0.45   10 29.7 0.336 16.7 22.4 0.743
##    ORB  DRB  TRB  AST STL BLK  TOV   PF   PTS  W  L Pace Year
## 1 10.6 35.1 45.6 24.1   7 4.8 13.2 19.3 113.7 41 31 97.6 2020
## 2 10.6 33.6 44.3 23.5 7.7 5.3 14.1 20.4 112.6 36 36 98.3 2020
## 3  8.9 35.5 44.4 26.8 6.7 5.3 13.5   19 118.6 48 24 99.5 2020
## 4 10.6 33.2 43.8 26.8 7.8 4.8 14.8   18 109.5 33 39 98.3 2020
## 5  9.6 35.3   45 26.8 6.7 4.2 15.1 18.9 110.7 31 41   99 2020
## 6 10.4 32.3 42.8 23.8 7.8 4.5 15.5 18.2 103.8 22 50 97.3 2020

#Renaming the columns
names(df_2020) <- c("Team", "Games", "MinutesPlayed", "FieldGoal", "FieldGoalsAttempted", "FieldGoalsPercentage", "3Point", "3PointAttempted", "3PointPercentage", "FreeThrows", "FreeThrowsAttempted", "FreeThrowPercentage", "Offensive Rebounds", "Defensive Rebounds", "Total Rebounds", "Assists", "Steals", "Blocks", "Turnovers", "PersonalFouls", "Points", "Wins", "Losses", "Pace", "Year")

#Relocating the columns
df_2020 <- df_2020 %>%
  relocate(Wins, Losses, .before = MinutesPlayed) %>%
  relocate(Year, .before = Games) %>% relocate(Pace, .before = Points)

# Combine datasets
combined_data <- bind_rows(df_2019, df_2020)

Reading Data(2021):

#Stats Dataset
df31 <- read.csv("C:\\Users\\LENOVO\\Downloads\\Regression Model\\Project\\sportsref_download (2).csv")
#Advanced Dataset
df3 <- read.csv("C:\\Users\\LENOVO\\Downloads\\2021-22.csv")

#Removing Asterisks
df31 <- df31 %>%
  mutate(across(everything(), ~ gsub("\\*", "", .)))

df3 <- df3 %>%
  mutate(across(everything(), ~ gsub("\\*", "", .)))

df_new2 <- subset(df3, select = -c(Age, Rk, PW, PL, MOV, SOS, SRS, ORtg, DRtg, NRtg,  FTr, X3PAr, TS., X, eFG., TOV., ORB., FT.FGA, X.1, X.2, eFG..1, TOV..1, DRB., FT.FGA.1, Arena, Attend., Attend..G))

#Merge By Team
merged_df2 <- merge(df31, df_new2, by = "Team")

df_2021 <- subset(merged_df2, select = -c(Rk, X2P, X2PA, X2P.))

#Create Year Variable
df_2021 <- df_2021 %>% mutate(Year = 2021)

#Renaming the columns
names(df_2021) <- c("Team", "Games", "MinutesPlayed", "FieldGoal", "FieldGoalsAttempted", "FieldGoalsPercentage", "3Point", "3PointAttempted", "3PointPercentage", "FreeThrows", "FreeThrowsAttempted", "FreeThrowPercentage", "Offensive Rebounds", "Defensive Rebounds", "Total Rebounds", "Assists", "Steals", "Blocks", "Turnovers", "PersonalFouls", "Points", "Wins", "Losses", "Pace", "Year")

#Relocating the columns
df_2021 <- df_2021 %>%
  relocate(Wins, Losses, .before = MinutesPlayed) %>%
  relocate(Year, .before = Games) %>% relocate(Pace, .before = Points)

Reading Data(2022):

#Stats Dataset
df41 <- read.csv("C:\\Users\\LENOVO\\Downloads\\Regression Model\\Project\\sportsref_download (3).csv")
#Advanced Dataset
df4 <- read.csv("C:\\Users\\LENOVO\\Downloads\\2022-23.csv")

#Removing Asterisks
df41 <- df41 %>%
  mutate(across(everything(), ~ gsub("\\*", "", .)))

df4 <- df4 %>%
  mutate(across(everything(), ~ gsub("\\*", "", .)))

#Removing columns
df_new4 <- subset(df4, select = -c(Age, Rk, PW, PL, MOV, SOS, SRS, ORtg, DRtg, NRtg,  FTr, X3PAr, TS., X, Offense.Four.Factors, X.1, X.2, X.3, X.4, Defense.Four.Factors, X.5, X.6, X.7, X.8, X.9, X.10, X.11))

#Merge by Team
merged_df3 <- merge(df41, df_new4, by = "Team")

df_2022 <- subset(merged_df3, select = -c(Rk, X2P, X2PA, X2P.))

#Create Year Variable
df_2022 <- df_2022 %>% mutate(Year = 2022)

#Renaming the columns
names(df_2022) <- c("Team", "Games", "MinutesPlayed", "FieldGoal", "FieldGoalsAttempted", "FieldGoalsPercentage", "3Point", "3PointAttempted", "3PointPercentage", "FreeThrows", "FreeThrowsAttempted", "FreeThrowPercentage", "Offensive Rebounds", "Defensive Rebounds", "Total Rebounds", "Assists", "Steals", "Blocks", "Turnovers", "PersonalFouls", "Points", "Wins", "Losses", "Pace", "Year")

#Relocating the columns
df_2022 <- df_2022 %>%
  relocate(Wins, Losses, .before = MinutesPlayed) %>%
  relocate(Year, .before = Games) %>% relocate(Pace, .before = Points)

# Combine datasets
combined_data <- bind_rows(combined_data, df_2021, df_2022)

The datasets from 2019 to 2022 are now merged into a single data frame, which will be used for exploration, visualization, and regression modeling.

data <- combined_data %>%
  mutate(
    Team = as.factor(Team),
    Year = as.factor(Year)
  )

This code transforms the Team and Year columns in the combined_data data frame into factors, which is useful for categorical data analysis, especially when dealing with statistical models or plotting where categorical distinctions are needed.

#Structure of the dataset
str(data)

## 'data.frame':    120 obs. of  25 variables:
##  $ Team                : Factor w/ 30 levels "Atlanta Hawks",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Year                : Factor w/ 4 levels "2019","2020",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Games               : chr  "67" "72" "72" "65" ...
##  $ Wins                : chr  "20" "48" "35" "23" ...
##  $ Losses              : chr  "47" "24" "37" "42" ...
##  $ MinutesPlayed       : chr  "243" "242.1" "242.8" "242.3" ...
##  $ FieldGoal           : chr  "40.6" "41.3" "40.4" "37.3" ...
##  $ FieldGoalsAttempted : chr  "90.6" "89.6" "90.3" "85.9" ...
##  $ FieldGoalsPercentage: chr  "0.449" "0.461" "0.448" "0.434" ...
##  $ 3Point              : chr  "12" "12.6" "13.1" "12.1" ...
##  $ 3PointAttempted     : chr  "36.1" "34.5" "38.1" "34.3" ...
##  $ 3PointPercentage    : chr  "0.333" "0.364" "0.343" "0.352" ...
##  $ FreeThrows          : chr  "18.5" "18.6" "17.9" "16.2" ...
##  $ FreeThrowsAttempted : chr  "23.4" "23.2" "24.1" "21.6" ...
##  $ FreeThrowPercentage : chr  "0.79" "0.801" "0.745" "0.748" ...
##  $ Offensive Rebounds  : chr  "9.9" "10.7" "10.6" "11" ...
##  $ Defensive Rebounds  : chr  "33.4" "35.4" "37.3" "31.8" ...
##  $ Total Rebounds      : chr  "43.3" "46.1" "47.9" "42.8" ...
##  $ Assists             : chr  "24" "23" "24.5" "23.8" ...
##  $ Steals              : chr  "7.8" "8.3" "6.4" "6.6" ...
##  $ Blocks              : chr  "5.1" "5.6" "4.5" "4.1" ...
##  $ Turnovers           : chr  "16.2" "13.8" "15.3" "14.6" ...
##  $ PersonalFouls       : chr  "23.1" "21.6" "21" "18.8" ...
##  $ Pace                : chr  "103" "99.5" "101.4" "95.8" ...
##  $ Points              : chr  "111.8" "113.7" "111.8" "102.9" ...

Columns such as Games, Wins, Losses, MinutesPlayed and FieldGoal are stored as character instead of numeric, because the earlier gsub() step returns character vectors for every column it touches. Character columns cannot be used directly in arithmetic or statistical models, so they have to be converted to numeric.

Convert Character Columns to Numeric:

# Convert every column except Team and Year from character to numeric
data <- data %>%
  mutate(across(-c(Team, Year), as.numeric))
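as.numeric() turns any string it cannot parse into NA, with only a warning. It can therefore be worth checking which values, if any, failed to convert. The sketch below uses a made-up character vector; the entry with a thousands separator stands in for the kind of value that would not parse.

```r
# Made-up character values; "19,068" and "" cannot be parsed as numbers
raw <- c("243", "242.1", "19,068", "")

converted <- suppressWarnings(as.numeric(raw))

# Entries that could not be parsed as numbers
failed <- raw[is.na(converted) & !is.na(raw)]

converted
failed
```

An empty failed vector would indicate that every value converted cleanly.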

Adding new variable in the dataset:

Winning percentage is needed as one of the dependent (y) variables. It gives a direct measure of team performance, which is essential for understanding the impact of 3-point shooting and other metrics.

Formula for calculating winning percentage: WP = Wins / (Wins + Losses).

data$WinningPercentage <- round(data$Wins / (data$Wins + data$Losses), 3)
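The parentheses around the denominator matter, because R evaluates division before addition. A short worked example using two records that appear in the data above (56 wins and 17 losses, 20 wins and 47 losses):

```r
wins   <- c(56, 20)
losses <- c(17, 47)

# Correct: wins divided by games decided
wp <- round(wins / (wins + losses), 3)
wp      # 0.767 0.299

# Without parentheses this is (wins / wins) + losses, i.e. 1 + losses
wrong <- wins / wins + losses
wrong   # 18 48
```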

# Write the merged dataset to a CSV file
write.csv(data, file = "Final_NBA_dataset.csv", row.names = FALSE)

Exploratory Data Analysis (EDA):

Double check the structure of the data

str(data)

## 'data.frame':    120 obs. of  26 variables:
##  $ Team                : Factor w/ 30 levels "Atlanta Hawks",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Year                : Factor w/ 4 levels "2019","2020",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Games               : num  67 72 72 65 65 65 75 73 66 65 ...
##  $ Wins                : num  20 48 35 23 22 19 43 46 20 15 ...
##  $ Losses              : num  47 24 37 42 43 46 32 27 46 50 ...
##  $ MinutesPlayed       : num  243 242 243 242 241 ...
##  $ FieldGoal           : num  40.6 41.3 40.4 37.3 39.6 40.3 41.7 42 39.3 38.6 ...
##  $ FieldGoalsAttempted : num  90.6 89.6 90.3 85.9 88.6 87.9 90.3 88.9 85.7 88.2 ...
##  $ FieldGoalsPercentage: num  0.449 0.461 0.448 0.434 0.447 0.458 0.461 0.473 0.459 0.438 ...
##  $ 3Point              : num  12 12.6 13.1 12.1 12.2 11.2 15.1 11 12 10.4 ...
##  $ 3PointAttempted     : num  36.1 34.5 38.1 34.3 35.1 31.8 41.3 30.6 32.7 31.3 ...
##  $ 3PointPercentage    : num  0.333 0.364 0.343 0.352 0.348 0.351 0.367 0.359 0.367 0.334 ...
##  $ FreeThrows          : num  18.5 18.6 17.9 16.2 15.5 15.1 18.6 16.2 16.6 18.7 ...
##  $ FreeThrowsAttempted : num  23.4 23.2 24.1 21.6 20.5 19.9 23.8 20.9 22.4 23.2 ...
##  $ FreeThrowPercentage : num  0.79 0.801 0.745 0.748 0.755 0.758 0.779 0.777 0.743 0.803 ...
##  $ Offensive Rebounds  : num  9.9 10.7 10.6 11 10.5 10.8 10.5 10.8 9.8 10 ...
##  $ Defensive Rebounds  : num  33.4 35.4 37.3 31.8 31.4 33.4 36.4 33.4 32 32.9 ...
##  $ Total Rebounds      : num  43.3 46.1 47.9 42.8 41.9 44.2 46.9 44.1 41.7 42.8 ...
##  $ Assists             : num  24 23 24.5 23.8 23.2 23.1 24.7 26.7 24.1 25.6 ...
##  $ Steals              : num  7.8 8.3 6.4 6.6 10 6.9 6.1 8 7.4 8.2 ...
##  $ Blocks              : num  5.1 5.6 4.5 4.1 4.1 3.2 4.8 4.6 4.5 4.6 ...
##  $ Turnovers           : num  16.2 13.8 15.3 14.6 15.5 16.5 12.7 13.8 15.3 14.9 ...
##  $ PersonalFouls       : num  23.1 21.6 21 18.8 21.8 18.3 19.5 20.3 19.7 20.1 ...
##  $ Pace                : num  103 99.5 101.4 95.8 99.7 ...
##  $ Points              : num  112 114 112 103 107 ...
##  $ WinningPercentage   : num  0.299 0.667 0.486 0.354 0.338 0.292 0.573 0.63 0.303 0.231 ...

Every variable in the dataset is now numeric, with the exception of the Team and Year factors, so the dataset is ready for analysis.

dim_info <- dim(data)
num_rows <- dim_info[1]
num_cols <- dim_info[2]

cat("Dimensions of dataset:   Number of rows:", num_rows, ", Number of cols:", num_cols, "\n")

## Dimensions of dataset:   Number of rows: 120 , Number of cols: 26

stargazer(data, type = "text", summary.stat = c("mean", "min", "max", "sd", "median"))

## 
## =============================================================
## Statistic             Mean     Min     Max   St. Dev. Median 
## -------------------------------------------------------------
## Games                76.650    64      82     5.640    78.5  
## Wins                 38.325    15      64     11.108    41   
## Losses               38.325    17      65     10.549   38.5  
## MinutesPlayed        241.583 240.000 243.700  0.810   241.500
## FieldGoal            41.163  37.300  44.700   1.563   41.300 
## FieldGoalsAttempted  88.407  83.700  94.400   2.234   88.400 
## FieldGoalsPercentage  0.466   0.429   0.504   0.015    0.468 
## 3Point               12.420   9.600  16.700   1.495   12.200 
## 3PointAttempted      34.532  28.000  45.300   3.609   34.200 
## 3PointPercentage      0.359   0.323   0.411   0.016    0.358 
## FreeThrows           17.536  13.800  21.000   1.455   17.500 
## FreeThrowsAttempted  22.575  17.500  26.600   1.816   22.400 
## FreeThrowPercentage   0.777   0.694   0.839   0.028    0.779 
## Offensive Rebounds   10.172   7.600  14.100   1.127   10.150 
## Defensive Rebounds   34.077  30.300  42.200   1.708   34.050 
## Total Rebounds       44.242  38.800  51.700   1.982   44.200 
## Assists              24.788  20.600  29.800   1.774   24.700 
## Steals                7.538   6.100  10.000   0.790    7.450 
## Blocks                4.787   3.000   6.600   0.683    4.750 
## Turnovers            14.063  11.100  16.500   1.095   14.150 
## PersonalFouls        19.922  17.200  23.100   1.272   19.900 
## Pace                 99.216  95.400  105.100  2.015   98.950 
## Points               112.266 102.900 120.700  3.853   112.850
## WinningPercentage     0.499   0.207   0.780   0.138    0.512 
## -------------------------------------------------------------

Missing Values Check:

sapply(data, function(x) sum(is.na(x)))

##                 Team                 Year                Games 
##                    0                    0                    0 
##                 Wins               Losses        MinutesPlayed 
##                    0                    0                    0 
##            FieldGoal  FieldGoalsAttempted FieldGoalsPercentage 
##                    0                    0                    0 
##               3Point      3PointAttempted     3PointPercentage 
##                    0                    0                    0 
##           FreeThrows  FreeThrowsAttempted  FreeThrowPercentage 
##                    0                    0                    0 
##   Offensive Rebounds   Defensive Rebounds       Total Rebounds 
##                    0                    0                    0 
##              Assists               Steals               Blocks 
##                    0                    0                    0 
##            Turnovers        PersonalFouls                 Pace 
##                    0                    0                    0 
##               Points    WinningPercentage 
##                    0                    0

The dataset contains no missing values.
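An equivalent and slightly shorter check uses colSums() on the logical matrix returned by is.na(), and complete.cases() can list the affected rows when something is missing. This is a sketch on a toy data frame with one NA inserted on purpose:

```r
# Toy data with one missing Pace value inserted for illustration
toy <- data.frame(
  Pace   = c(103, 99.5, NA),
  Points = c(111.8, 113.7, 111.8)
)

# Number of missing values per column
na_counts <- colSums(is.na(toy))
na_counts

# Rows that contain at least one missing value
incomplete <- toy[!complete.cases(toy), ]
incomplete
```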

hist(data$Points, main="Distribution of Points", xlab="Points", col="lightblue")

hist(data$'3PointPercentage', main="Distribution of 3-Point Percentage", xlab="3-Point Percentage", col="lightgreen")

hist(data$Pace, main = "Distribution of Pace", xlab = "Pace", col = "lightblue")

Interpretation:

Distribution Of Points:

Shape: The histogram is roughly symmetric and bell-shaped, resembling a normal distribution. This suggests that most teams score within a fairly narrow range, with regular scoring performance across the dataset.

Range: Most teams score between 105 and 120 points per game, which forms the central cluster of scoring performances.

Center: The distribution is centered at about 112 points, so a typical team scores close to this value.

Frequency: The peak lies in the 112 to 114 bin, making it the most common scoring range in the dataset. A few teams score above 120 or below 105 points, but with low frequency.

Distribution Of 3 Point Percentage

Distribution Type: The histogram peaks at 0.36, with a central tendency around 0.35 to 0.36. Together with the spread, this suggests a roughly normal, single-peaked distribution: most teams cluster around 0.35 to 0.36, and frequencies fall as you move away from this center.

Frequency: The highest frequency is at 0.36, so most teams have a 3-point shooting percentage close to this value.

Spread: The range from 0.32 to 0.42 shows some variability in shooting accuracy, but most teams are concentrated around the center.

Gap: The histogram shows a notable gap between 0.40 and 0.41, suggesting that percentages in this range are uncommon.

Distribution of Pace

Central Tendency: With a center at 100, the typical pace for most teams is around this value.

Spread: The range from 96 to 106 indicates the overall variability in team paces.

Peak Frequency: The tallest bar has a frequency above 30, marking the pace range where most teams fall.
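The visual readings above (center, spread, peak bin) can be cross-checked numerically. The following is a minimal sketch that uses a small hypothetical data frame, `toy`, in place of the real `data` object; the values are illustrative only and are not taken from the NBA dataset.

```r
# Hypothetical toy data standing in for the real `data` frame (illustrative values only)
toy <- data.frame(
  Points = c(105.2, 108.9, 111.5, 112.4, 113.1, 114.0, 116.3, 119.8),
  Pace   = c(96.5, 98.2, 99.4, 100.1, 100.3, 101.0, 102.7, 105.6)
)

# Center and spread for each column
summary_stats <- sapply(toy, function(x) {
  c(mean = mean(x), median = median(x), min = min(x), max = max(x), sd = sd(x))
})
print(round(summary_stats, 2))

# Bin midpoints and counts that hist() would draw, computed without plotting
points_bins <- hist(toy$Points, plot = FALSE)
print(data.frame(mid = points_bins$mids, count = points_bins$counts))
```

Applied to the real data, the same calls would give the exact mean, median, and range instead of values read off the plot.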

Scatterplot for key relationships:

plot(data$'3PointPercentage' ~ data$WinningPercentage, main="3-Point Percentage vs Winning Percentage", xlab="Winning Percentage", ylab="3-Point Percentage")
abline(lm(data$'3PointPercentage' ~ data$WinningPercentage), col="red")

The scatter plot shows a positive trend, meaning as the winning percentage increases, the 3-point field goal percentage also tends to increase.

pairs(data[, c("Points", "3PointPercentage", "FieldGoalsPercentage", "Pace")], main="Pairwise Plot")

Grouped Analysis

Average Metrics by Year:

library(dplyr)
data %>%
  group_by(Year) %>%
  summarise(across(where(is.numeric), \(x) mean(x, na.rm = TRUE)))

## # A tibble: 4 × 25
##   Year  Games  Wins Losses MinutesPlayed FieldGoal FieldGoalsAttempted
##   <fct> <dbl> <dbl>  <dbl>         <dbl>     <dbl>               <dbl>
## 1 2019   70.6  35.3   35.3          242.      40.8                88.8
## 2 2020   72    36     36            241.      41.2                88.4
## 3 2021   82    41     41            241.      40.6                88.1
## 4 2022   82    41     41            242.      42.0                88.3
## # ℹ 18 more variables: FieldGoalsPercentage <dbl>, `3Point` <dbl>,
## #   `3PointAttempted` <dbl>, `3PointPercentage` <dbl>, FreeThrows <dbl>,
## #   FreeThrowsAttempted <dbl>, FreeThrowPercentage <dbl>,
## #   `Offensive Rebounds` <dbl>, `Defensive Rebounds` <dbl>,
## #   `Total Rebounds` <dbl>, Assists <dbl>, Steals <dbl>, Blocks <dbl>,
## #   Turnovers <dbl>, PersonalFouls <dbl>, Pace <dbl>, Points <dbl>,
## #   WinningPercentage <dbl>

Games:

The average number of games per team rose from 70.6 in 2019 and 72 in 2020 to 82 in 2021 and 2022. This likely reflects the shortened seasons during the pandemic.

Wins and Losses:

Average wins equal average losses in every year (for example, 41 and 41 in 2021 and 2022) because every game produces exactly one win and one loss. The yearly averages therefore reflect season length, not a change in competitive balance.

Minutes Played:

Minutes played per game remain fairly consistent across the years, at about 241 to 242.

Field Goals and 3-Point Statistics:

Field Goals: Field goals attempted per game fell slightly (from 88.8 in 2019 to 88.3 in 2022), while field goals made rose from 40.8 to 42.0.

3-Point Statistics: There is a slight increase in 3-point field goals and percentage over the years, indicating a growing emphasis on the 3-point shot.

Pace:

The pace has slightly decreased from 2019 to 2022. This variation may reflect changes in playing style or game strategies over the years.

Points:

Points scored per game have generally increased, indicating higher scoring games or improved offensive strategies.

Winning Percentage:

The winning percentage has been relatively stable, with a small decrease in 2021 and 2022. This stability suggests that winning percentages have not fluctuated drastically despite changes in other metrics.

Winning Percentage by Team:

data %>%
  group_by(Team) %>%
  summarise(AverageWinningPercentage = mean(WinningPercentage, na.rm = TRUE)) %>%
  arrange(desc(AverageWinningPercentage))

## # A tibble: 30 × 2
##    Team                 AverageWinningPercentage
##    <fct>                                   <dbl>
##  1 Milwaukee Bucks                         0.684
##  2 Philadelphia 76ers                      0.638
##  3 Denver Nuggets                          0.629
##  4 Phoenix Suns                            0.626
##  5 Boston Celtics                          0.621
##  6 Los Angeles Clippers                    0.596
##  7 Utah Jazz                               0.596
##  8 Miami Heat                              0.586
##  9 Memphis Grizzlies                       0.575
## 10 Dallas Mavericks                        0.563
## # ℹ 20 more rows

The Milwaukee Bucks have the highest average winning percentage (0.684) over 2019-2022, while the Detroit Pistons registered the lowest at 0.27. The remaining teams average roughly 0.4 to 0.6.

Boxplot:

boxplot(Points ~ Year, data=data, main="Points by Year", xlab="Year", ylab="Points")

boxplot(`3PointPercentage` ~ Year, data=data, main="3-Point Percentage by Year", xlab="Year", ylab="3-Point Percentage")

The boxplot represents the distribution of points for the years 2019 to 2022. Each year is shown on the x-axis, while the y-axis indicates the number of points. The boxplot for each year includes the median (represented by the horizontal line within the box), the interquartile range (the box itself), and the whiskers that extend from the box indicating the data range excluding outliers.

The second boxplot shows the distribution of 3-point percentage for the years 2019, 2020, 2021, and 2022. Each box represents the interquartile range (IQR) for a specific year, with the central line indicating the median value. The whiskers extend to the minimum and maximum values within a certain range, while any potential outliers appear as individual points outside this range.

Correlation:

library(ggplot2)
library(reshape2)

## Warning: package 'reshape2' was built under R version 4.3.3

cor_matrix <- cor(data[, sapply(data, is.numeric)])
library(corrplot)

## Warning: package 'corrplot' was built under R version 4.3.3

## corrplot 0.92 loaded

# Plot the correlation matrix with adjustments
corrplot(cor_matrix,
         method = 'square',
         order = 'FPC',
         type = 'lower',
         diag = FALSE,
         addCoef.col = "black",  # print the coefficients in black
         number.cex = 0.3,       # size of the correlation coefficients
         tl.cex = 0.5)           # size of the variable labels

Wins and WinningPercentage:

There is a strong positive correlation (0.97) between the number of wins and the winning percentage. This makes sense because as the number of wins increases, the winning percentage naturally increases as well.

FieldGoalsPercentage and Points:

A strong positive correlation (0.69) exists between field goal percentage and points scored. Teams with a higher field goal percentage tend to score more points.

3PointAttempts and 3PointPercentage:

The correlation between 3-point attempts and 3-point percentage is relatively low (0.16). This suggests that a team’s volume of 3-point attempts doesn’t strongly predict their accuracy in 3-point shooting.

Pace and FieldGoalsAttempted:

There is a moderate positive correlation (0.64) between pace and field goals attempted. Teams that play faster (higher pace) tend to attempt more field goals.

MinutesPlayed and Points:

Minutes played and points scored have a very weak positive correlation (0.09). As “Minutes Played” increases, “Points Scored” tends to increase only slightly.

FieldGoal and FieldGoalsPercentage:

A strong positive correlation (0.75) between field goal percentage and field goals made suggests that teams that are more accurate in their shooting will make more field goals.

Losses and WinningPercentage:

The number of losses has a very strong negative correlation (-0.95) with the winning percentage, indicating that as losses increase, the winning percentage decreases significantly.

3PointPercentage and WinningPercentage:

There is a moderate positive correlation (0.61) between 3-point shooting percentage and winning percentage, suggesting that teams that shoot well from beyond the arc are more likely to win.

Notable Observations:

FieldGoalsAttempted and FieldGoalsPercentage have a weak negative correlation (-0.13), meaning that a higher number of attempts does not go together with a higher percentage. This might indicate variability in shooting accuracy based on shot volume.

Pace and FieldGoalsPercentage have a correlation of about 0, so changes in “Pace” are not linearly associated with changes in “Field Goal Percentage.” In other words, knowing the pace at which a game is played gives essentially no linear information about the expected field goal percentage.

Points and WinningPercentage are moderately positively correlated (0.55), which is intuitive as teams that score more points are more likely to win games.
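Reading individual coefficients off a dense correlation plot is error-prone, so it can help to list the pairs in a table sorted by strength. The sketch below is one possible way to do this; the data frame `toy` and its columns `a`, `b`, `c` are simulated placeholders, not variables from the NBA dataset.

```r
# Simulated placeholder data: b is built from a, c is independent noise
set.seed(1)
toy <- data.frame(a = rnorm(50))
toy$b <- toy$a + rnorm(50, sd = 0.5)
toy$c <- rnorm(50)

cor_matrix <- cor(toy)

# Keep each pair once by taking the upper triangle of the matrix
idx <- which(upper.tri(cor_matrix), arr.ind = TRUE)
pairs_tbl <- data.frame(
  var1 = rownames(cor_matrix)[idx[, "row"]],
  var2 = colnames(cor_matrix)[idx[, "col"]],
  r    = cor_matrix[idx]
)

# Sort by absolute correlation, strongest first
pairs_tbl <- pairs_tbl[order(-abs(pairs_tbl$r)), ]
print(pairs_tbl, row.names = FALSE)
```

With the real `cor_matrix`, the first rows of `pairs_tbl` would surface pairs such as Wins and WinningPercentage without relying on the small printed labels.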

library(ggplot2)

ggplot(data, aes(x=Year, y=Points, group=Team)) +
  geom_line(aes(color=Team)) +
  labs(title="Points Trend Over Time", x="Year", y="Points") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Model Building

Multiple Linear Regression Models:

library(stargazer)
library(car)  # For VIF function

## Warning: package 'car' was built under R version 4.3.3

## Loading required package: carData

## Warning: package 'carData' was built under R version 4.3.3

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

# Assuming 'data' contains all independent variables
vif_model <- lm(WinningPercentage ~ `3PointPercentage` + `3PointAttempted` + `3Point` + Games + FieldGoal + Pace + Team, data = data)
vif(vif_model)

##                          GVIF Df GVIF^(1/(2*Df))
## `3PointPercentage` 109.451084  1       10.461887
## `3PointAttempted`  582.655811  1       24.138264
## `3Point`           772.263898  1       27.789637
## Games                1.676959  1        1.294975
## FieldGoal            4.194103  1        2.047951
## Pace                 3.669094  1        1.915488
## Team                19.353422 29        1.052411

VIF Interpretation

Based on the Variance Inflation Factor(VIF),

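GVIF^(1/(2*Df)) values (for one-degree-of-freedom terms, the square root of the GVIF):

3PointPercentage: 10.46, 3PointAttempted: 24.14, 3Point: 27.79

The corresponding GVIF values are 109.45, 582.66, and 772.26, far above the common rule-of-thumb threshold of 10.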

Choose a Key Independent Variable:

Given that 3Point, 3PointAttempted, and 3PointPercentage exhibit high multicollinearity with each other, only one of them should enter the models. Based on the VIF values and relevance, we select 3PointPercentage: it has the lowest VIF of the three and is a common measure of shooting efficiency.
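To see why three such closely related variables inflate each other's variance, the VIF can be computed by hand as 1 / (1 - R²), where R² comes from regressing one predictor on the others. The sketch below uses simulated data and a hypothetical helper, `vif_by_hand`; the columns `attempts`, `pct`, and `made` only mimic the structure "made ≈ attempts × percentage" and are not the real dataset.

```r
# Simulated data (assumption): made is almost exactly attempts * pct
set.seed(42)
n <- 120
attempts <- rnorm(n, mean = 34, sd = 3)
pct      <- rnorm(n, mean = 0.36, sd = 0.015)
made     <- attempts * pct + rnorm(n, sd = 0.1)
toy <- data.frame(attempts, pct, made)

# Hypothetical helper: regress `target` on all other columns, return 1 / (1 - R^2)
vif_by_hand <- function(df, target) {
  others <- setdiff(names(df), target)
  aux <- lm(reformulate(others, response = target), data = df)
  1 / (1 - summary(aux)$r.squared)
}

vifs <- sapply(names(toy), function(v) vif_by_hand(toy, v))
print(round(vifs, 1))
```

Because each variable is nearly determined by the other two, all three values come out large, which mirrors the pattern in the `vif()` output and motivates keeping only one of them.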

Model Specification:

Proceed with models using only 3PointPercentage as the key independent variable.

Model 1: Impact on Winning Percentage

We fit a series of linear regression models to understand the relationship between “WinningPercentage” (the dependent variable) and various predictors (independent variables).

Model 1

Model 1 examines the relationship between WinningPercentage and 3PointPercentage, with no other predictors included.

Estimated Equation: $\text{WinningPercentage} = \beta_0 + \beta_1 \times \text{3PointPercentage} + \epsilon$

Where,

$\beta_0$ is the intercept. $\beta_1$ is the coefficient for 3PointPercentage. $\epsilon$ is the error term.

# Model 1: y ~ key x
model1_wp <- lm(WinningPercentage ~ `3PointPercentage`, data = data)

Model 2

Model 2 includes 3PointPercentage along with additional control variables: Games, FieldGoal, and Pace.

Estimated Equation: $\text{WinningPercentage} = \beta_0 + \beta_1 \times \text{3PointPercentage} + \beta_2 \times \text{Games} + \beta_3 \times \text{FieldGoal} + \beta_4 \times \text{Pace} + \epsilon$

Where,

$\beta_0$ is the intercept. $\beta_1$ is the coefficient for 3PointPercentage. $\beta_2$ is the coefficient for Games. $\beta_3$ is the coefficient for FieldGoal. $\beta_4$ is the coefficient for Pace. $\epsilon$ is the error term.

# Model 2: y ~ key x + controls
model2_wp <- lm(WinningPercentage ~ `3PointPercentage` + Games + FieldGoal + Pace, data = data)

Model 3

Model 3 adds dummy variables for Team to account for team-specific effects. factor(Team) creates a set of dummy variables, one for each team except the reference team.

Estimated Equation: $\text{WinningPercentage} = \beta_0 + \beta_1 \times \text{3PointPercentage} + \beta_2 \times \text{Games} + \beta_3 \times \text{FieldGoal} + \beta_4 \times \text{Pace} + \gamma_1 \times \text{Team}_1 + \gamma_2 \times \text{Team}_2 + \dots + \gamma_k \times \text{Team}_k + \epsilon$

Where,

$\beta_0$ is the intercept. $\beta_1$ is the coefficient for 3PointPercentage. $\beta_2$ is the coefficient for Games. $\beta_3$ is the coefficient for FieldGoal. $\beta_4$ is the coefficient for Pace. $\gamma_1$ to $\gamma_k$ are the coefficients for the team dummies (excluding one reference team to avoid multicollinearity). $\epsilon$ is the error term.

# Model 3: y ~ key x + controls + team dummies
model3_wp <- lm(WinningPercentage ~ `3PointPercentage` + Games + FieldGoal + Pace + factor(Team), data = data)
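The dummy coding that `factor()` triggers inside `lm()` can be inspected directly with `model.matrix()`. This is a minimal sketch on a hypothetical three-team data frame (`toy`, with teams "A", "B", "C"); it only illustrates the encoding and does not use the real team names.

```r
# Hypothetical example: three teams observed twice each
toy <- data.frame(Team = c("A", "B", "C", "A", "B", "C"))

# Design matrix that lm() would build for ~ factor(Team)
X <- model.matrix(~ factor(Team), data = toy)
print(X)

# Team "A" is the reference level: its rows are zero in every dummy column,
# so its effect is absorbed by the intercept
print(colnames(X))
```

With 30 teams, the same encoding yields 29 dummy columns, which matches the 29 degrees of freedom reported for Team in the `vif()` output.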

Model 4

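Model 4 further includes dummy variables for Year to account for time-specific effects. factor(Year) creates a set of dummy variables, one for each year except the reference year.

Estimated Equation:

$\text{WinningPercentage} = \beta_0 + \beta_1 \times \text{3PointPercentage} + \beta_2 \times \text{Games} + \beta_3 \times \text{FieldGoal} + \beta_4 \times \text{Pace} + \gamma_1 \times \text{Team}_1 + \gamma_2 \times \text{Team}_2 + \dots + \gamma_k \times \text{Team}_k + \delta_1 \times \text{Year}_1 + \delta_2 \times \text{Year}_2 + \dots + \delta_m \times \text{Year}_m + \epsilon$

Where $\delta_1$ to $\delta_m$ are the coefficients for the year dummies (excluding one reference year), and the remaining terms are as defined for Model 3.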

# Model 4: y ~ key x + controls + team dummies + year dummies
model4_wp <- lm(WinningPercentage ~ `3PointPercentage` + Games + FieldGoal + Pace + factor(Team) + factor(Year), data = data)

library(stargazer)

stargazer(
  model1_wp, model2_wp, model3_wp, model4_wp,
  type = 'text',
  dep.var.labels = c("Winning Percentage"),
  column.labels = c("Model 1", "Model 2", "Model 3", "Model 4"),
  title = "Regression Results for Winning Percentage",
  align = TRUE,
  no.space = TRUE,
  column.sep.width = "0.5pt",
  keep = c("3PointPercentage", "Games", "FieldGoal", "^Pace$"),
  add.lines = list(c("Entity FE", "No", "No", "Yes", "Yes")),
  out = "model_summary_wp_cleaned.txt"
)

## 
## Regression Results for Winning Percentage
## =================================================================================================================
##                                                          Dependent variable:                                     
##                     ---------------------------------------------------------------------------------------------
##                                                          Winning Percentage                                      
##                             Model 1                 Model 2                Model 3                Model 4        
##                               (1)                     (2)                    (3)                    (4)          
## -----------------------------------------------------------------------------------------------------------------
## `3PointPercentage`         5.181***                3.678***                2.736***               2.337***       
##                             (0.626)                 (0.683)                (0.750)                (0.713)        
## Games                                                0.001                  0.0001                0.015***       
##                                                     (0.002)                (0.002)                (0.006)        
## FieldGoal                                          0.036***                0.033***               0.045***       
##                                                     (0.008)                (0.010)                (0.010)        
## Pace                                               -0.016***               -0.014**              -0.023***       
##                                                     (0.006)                (0.007)                (0.007)        
## -----------------------------------------------------------------------------------------------------------------
## Entity FE                     No                      No                     Yes                    Yes          
## Observations                  120                     120                    120                    120          
## R2                           0.367                   0.483                  0.705                  0.768         
## Adjusted R2                  0.362                   0.465                  0.592                  0.667         
## Residual Std. Error    0.110 (df = 118)        0.101 (df = 115)        0.088 (df = 86)        0.080 (df = 83)    
## F Statistic         68.497*** (df = 1; 118) 26.900*** (df = 4; 115) 6.225*** (df = 33; 86) 7.635*** (df = 36; 83)
## =================================================================================================================
## Note:                                                                                 *p<0.1; **p<0.05; ***p<0.01

The regression results show how including additional variables changes the estimated relationship between 3PointPercentage and WinningPercentage.

Model 1 (Simple Relationship)

Coefficient of 3PointPercentage: 5.181 (p < 0.01)

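Interpretation: Both variables are measured as proportions, so the coefficient is easiest to read per 0.01 change. In the simplest model, a 0.01 (one percentage point) increase in 3PointPercentage is associated with an increase of about 0.052 in winning percentage, or roughly 5.2 percentage points. This indicates a strong positive relationship between the proportion of 3-point shots made and winning percentage.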
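Since 3PointPercentage and WinningPercentage are both stored as proportions between 0 and 1, a short calculation helps put the Model 1 coefficient on a practical scale. The sketch below takes the coefficient from the regression table; the 82-game conversion is an illustrative assumption, and the result describes an association, not a causal effect.

```r
beta_3pt  <- 5.181   # Model 1 coefficient for 3PointPercentage (from the table)
delta_3pt <- 0.01    # a one-percentage-point change, expressed as a proportion

# Implied change in winning percentage (also a proportion)
delta_win <- beta_3pt * delta_3pt
print(delta_win)     # 0.05181

# Illustrative assumption: translate into wins over an 82-game season
extra_wins <- delta_win * 82
print(round(extra_wins, 2))
```

The same rescaling applies to the smaller coefficients in Models 2 to 4.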

Model 2 (Including Controls)

Coefficient of 3PointPercentage: 3.678 (p < 0.01)

Interpretation: When controlling for other factors like Games, FieldGoal, and Pace, the impact of 3PointPercentage on winning percentage decreases but remains positive and significant. This suggests that while 3PointPercentage is important, its effect is somewhat influenced by other aspects of team performance.

Games: The coefficient of 0.001 implies that each additional game is associated with a 0.001 increase in winning percentage (0.1 percentage points). This effect is not statistically significant.

FieldGoal: The coefficient of 0.036 implies that each additional field goal made per game is associated with a 0.036 increase in winning percentage (3.6 percentage points). This effect is statistically significant.

Pace: The coefficient of -0.016 implies that each one-unit increase in Pace is associated with a 0.016 decrease in winning percentage (1.6 percentage points). This effect is statistically significant.

Model 3 (Including Team Dummies)

Coefficient of 3PointPercentage: 2.736 (p < 0.01)

Interpretation: Adding team-specific effects reduces the coefficient further. This decrease reflects that part of the impact of 3PointPercentage on winning percentage can be attributed to differences between teams.

Games: The coefficient is 0.0001 and remains statistically insignificant.

FieldGoal: The coefficient is 0.033, a positive and statistically significant effect on Winning Percentage (p < 0.01).

Pace: The coefficient is -0.014, a negative effect on Winning Percentage that is still statistically significant (p < 0.05).

Team Dummies: The team coefficients absorb team-specific differences; they are omitted from the table output.

Model 4 (Including Year Dummies)

Coefficient of 3PointPercentage: 2.337 (p < 0.01)

Interpretation: Including both team and year dummies further reduces the coefficient. This suggests that the effect of 3PointPercentage is also influenced by temporal factors affecting team performance across different seasons.

Games: The coefficient rises to 0.015 and becomes statistically significant (p < 0.01).

FieldGoal: The coefficient rises to 0.045 (p < 0.01).

Pace: The coefficient strengthens to -0.023 (p < 0.01).

Detailed Interpretation

Magnitude of Impact: The coefficient of 3PointPercentage decreases from 5.181 in Model 1 to 2.337 in Model 4. The decrease occurs as we account for additional variables, indicating that while 3PointPercentage has a substantial positive association with winning percentage, other factors (such as team quality and year-specific effects) also play a role.

Effect After Controls: Even in the most comprehensive model (Model 4), which controls for team and year effects, the coefficient remains positive and significant. This suggests that better 3-point shooting accuracy continues to be associated with a higher winning percentage, although the effect is moderated when other factors are considered.

Practical Implications: The results imply that improving 3-point shooting accuracy is associated with a higher winning percentage, but the magnitude depends on the other variables in the model. Teams should weigh the benefits of a strong 3-point game alongside overall team strategy and seasonal variations.

Contextual Factors: The coefficients shift once year dummies are added (for example, Games becomes significant in Model 4), which suggests that season-level factors matter. This could be due to changes in game dynamics, rule changes, or evolving strategies in the league.

Why Pace is Negative

Increased Pace and Mistakes: A faster pace might lead to more mistakes, turnovers, or less control over the game, which could negatively impact the winning percentage. Teams may struggle to maintain high performance under a faster tempo.

Fatigue: Higher pace could result in increased player fatigue over time, which might impair performance and reduce the chances of winning.

Model 2: Impact on Total Points Scored

Model 1

Model 1 evaluates the relationship between Points and 3PointPercentage, with no additional variables.

Estimated Equation: $\text{Points} = \beta_0 + \beta_1 \times \text{3PointPercentage} + \epsilon$

$\beta_0$ is the intercept (the expected value of Points when 3PointPercentage is zero). $\beta_1$ is the coefficient for 3PointPercentage (indicating how Points change with a one-unit change in 3PointPercentage). $\epsilon$ is the error term (captures the variability in Points not explained by 3PointPercentage).

# Model 1: y ~ key x
model1_pts <- lm(Points ~ `3PointPercentage`, data = data)

Model 2

Model 2 extends Model 1 by including additional control variables: Games, FieldGoal, and Pace.

Estimated Equation: $\text{Points} = \beta_0 + \beta_1 \times \text{3PointPercentage} + \beta_2 \times \text{Games} + \beta_3 \times \text{FieldGoal} + \beta_4 \times \text{Pace} + \epsilon$

Where,

$\beta_0$ is the intercept. $\beta_1$ is the coefficient for 3PointPercentage. $\beta_2$ is the coefficient for Games (shows how Points change with the number of games). $\beta_3$ is the coefficient for FieldGoal (shows how Points change with the number of field goals made). $\beta_4$ is the coefficient for Pace (shows how Points change with the pace of the game). $\epsilon$ is the error term.

# Model 2: y ~ key x + controls
model2_pts <- lm(Points ~ `3PointPercentage` + Games + FieldGoal + Pace, data = data)

Model 3

This model adds dummy variables for each team (i.e., factor(Team)) to account for team-specific effects.

Estimated Equation: $\text{Points} = \beta_0 + \beta_1 \times \text{3PointPercentage} + \beta_2 \times \text{Games} + \beta_3 \times \text{FieldGoal} + \beta_4 \times \text{Pace} + \gamma_1 \times \text{Team}_1 + \gamma_2 \times \text{Team}_2 + \dots + \gamma_k \times \text{Team}_k + \epsilon$

$\beta_0$ is the intercept. $\beta_1$ to $\beta_4$ are the coefficients for 3PointPercentage, Games, FieldGoal, and Pace. $\gamma_1, \gamma_2, \dots, \gamma_k$ are the coefficients for the team dummy variables. Each $\gamma_i$ represents the effect of being in a specific team (relative to a reference team, which is excluded to avoid multicollinearity). $\epsilon$ is the error term.

# Model 3: y ~ key x + controls + team dummies
model3_pts <- lm(Points ~ `3PointPercentage` + Games + FieldGoal + Pace + factor(Team), data = data)

Model 4

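Model 4 further includes dummy variables for each year (factor(Year)) to account for year-specific effects.

Estimated Equation: $\text{Points} = \beta_0 + \beta_1 \times \text{3PointPercentage} + \beta_2 \times \text{Games} + \beta_3 \times \text{FieldGoal} + \beta_4 \times \text{Pace} + \gamma_1 \times \text{Team}_1 + \gamma_2 \times \text{Team}_2 + \dots + \gamma_k \times \text{Team}_k + \delta_1 \times \text{Year}_1 + \delta_2 \times \text{Year}_2 + \dots + \delta_m \times \text{Year}_m + \epsilon$

Where $\delta_1$ to $\delta_m$ are the coefficients for the year dummies (excluding one reference year), and the remaining terms are as defined for Model 3.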

# Model 4: y ~ key x + controls + team dummies + year dummies
model4_pts <- lm(Points ~ `3PointPercentage` + Games + FieldGoal + Pace + factor(Team) + factor(Year), data = data)

stargazer(
  model1_pts, model2_pts, model3_pts, model4_pts,
  type = 'text',
  dep.var.labels = c("Points"),
  column.labels = c("Model 1", "Model 2", "Model 3", "Model 4"),
  title = "Regression Results for Points",
  align = TRUE,
  no.space = TRUE,
  column.sep.width = "0.5pt",
  keep = c("3PointPercentage", "Games", "FieldGoal", "^Pace$"),
  add.lines = list(c("Entity FE", "No", "No", "Yes", "Yes")),
  out = "model_summary_pts_cleaned.txt"
)

## 
## Regression Results for Points
## ====================================================================================================================
##                                                           Dependent variable:                                       
##                     ------------------------------------------------------------------------------------------------
##                                                                  Points                                             
##                             Model 1                 Model 2                  Model 3                 Model 4        
##                               (1)                     (2)                      (3)                     (4)          
## --------------------------------------------------------------------------------------------------------------------
## `3PointPercentage`        115.372***               81.892***                62.416***               65.209***       
##                            (19.252)                 (11.136)                (12.226)                (12.546)        
## Games                                               0.131***                0.103***                  0.056         
##                                                     (0.030)                  (0.029)                 (0.097)        
## FieldGoal                                           1.403***                1.732***                1.613***        
##                                                     (0.131)                  (0.162)                 (0.171)        
## Pace                                                0.587***                0.543***                0.512***        
##                                                     (0.098)                  (0.115)                 (0.121)        
## --------------------------------------------------------------------------------------------------------------------
## Entity FE                     No                       No                      Yes                     Yes          
## Observations                  120                     120                      120                     120          
## R2                           0.233                   0.824                    0.900                   0.908         
## Adjusted R2                  0.227                   0.818                    0.861                   0.868         
## Residual Std. Error    3.388 (df = 118)         1.643 (df = 115)         1.436 (df = 86)         1.399 (df = 83)    
## F Statistic         35.914*** (df = 1; 118) 134.838*** (df = 4; 115) 23.355*** (df = 33; 86) 22.759*** (df = 36; 83)
## ====================================================================================================================
## Note:                                                                                    *p<0.1; **p<0.05; ***p<0.01

Model 1:

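3PointPercentage has a large positive effect on Total Points (115.372). Because 3PointPercentage is a proportion, this corresponds to about 1.15 additional points for a 0.01 increase. This is the raw effect without controlling for other factors.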

Model 2:

3PointPercentage effect decreases to 81.892 after adding controls for Games, FieldGoal, and Pace. This suggests the initial estimate in Model 1 was partly due to these other factors.

Games, FieldGoal, and Pace all positively affect Total Points.

Model 3:

3PointPercentage effect decreases further to 62.416 after accounting for team-specific differences. This indicates some of the effect was due to differences between teams.

FieldGoal and Pace effects are adjusted for team differences and both remain significant: the FieldGoal coefficient rises from 1.403 to 1.732, while the Pace coefficient falls slightly from 0.587 to 0.543.

Model 4:

3PointPercentage effect rises slightly to 65.209 after adding year dummies, so season-level differences do not explain away the relationship.

Games effect becomes insignificant, indicating year-specific factors may explain its previous relationship with Total Points.

Adding Controls: In Model 2, introducing controls (Games, FieldGoal, Pace) adjusts the coefficient for 3PointPercentage, reflecting a more accurate relationship by accounting for other factors.

Including Fixed Effects: Adding team dummies in Model 3 controls for team-specific effects, which reduces the coefficient for 3PointPercentage as it accounts for differences between teams. The inclusion of year dummies in Model 4 further adjusts the coefficient by accounting for variations over time.

Multicollinearity: The introduction of additional variables often changes coefficients due to multicollinearity. As more variables are added, some of the variance explained by 3PointPercentage may be shared with these new variables, altering its coefficient.

Specification Changes: The inclusion of fixed effects controls for unobserved heterogeneity (i.e., differences between teams and years), which can lead to changes in coefficient estimates.

Model 3: Impact on Field-Goal Percentage

Model 1

Estimated Equation: $\text{FieldGoalsPercentage} = \beta_0 + \beta_1 \times \text{3PointPercentage} + \epsilon$

Where,

$\beta_0$ is the intercept, $\beta_1$ is the coefficient for 3PointPercentage, and $\epsilon$ is the error term.

# Model 1: y ~ key x
model1_fg <- lm(FieldGoalsPercentage ~ `3PointPercentage`, data = data)

Model 2

Model 2 includes Games, FieldGoal, and Pace as additional control variables. It estimates how FieldGoalsPercentage is related to 3PointPercentage while accounting for the potential influence of these other variables.

Estimated Equation: $\text{FieldGoalsPercentage} = \beta_0 + \beta_1 \times \text{3PointPercentage} + \beta_2 \times \text{Games} + \beta_3 \times \text{FieldGoal} + \beta_4 \times \text{Pace} + \epsilon$

Where,

$\beta_0$ is the intercept (the baseline FieldGoalsPercentage). $\beta_1$ is the coefficient for 3PointPercentage. $\beta_2$ is the coefficient for Games (shows how FieldGoalsPercentage changes with the number of games). $\beta_3$ is the coefficient for FieldGoal. $\beta_4$ is the coefficient for Pace. $\epsilon$ is the error term.

# Model 2: y ~ key x + controls
model2_fg <- lm(FieldGoalsPercentage ~ `3PointPercentage` + Games + FieldGoal + Pace, data = data)

Model 3

Model 3 adds factor(Team) to account for team-specific effects by including dummy variables for each team. It adjusts for differences between teams that might affect FieldGoalsPercentage.

Estimated Equation: $\text{FieldGoalsPercentage} = \beta_0 + \beta_1 \times \text{3PointPercentage} + \beta_2 \times \text{Games} + \beta_3 \times \text{FieldGoal} + \beta_4 \times \text{Pace} + \sum_{i=1}^{n} \gamma_i \times \text{Team}_i + \epsilon$

Where,

$\beta_0$: intercept; the baseline level of FieldGoalsPercentage when all predictors are zero. $\beta_1$: effect of a one-unit change in 3PointPercentage on FieldGoalsPercentage. $\beta_2$: effect of an additional game on FieldGoalsPercentage. $\beta_3$: effect of a one-unit change in FieldGoal on FieldGoalsPercentage. $\beta_4$: effect of a one-unit increase in Pace on FieldGoalsPercentage. $\sum_{i=1}^{n} \gamma_i \times \text{Team}_i$: effect of being on team $i$, with team dummies capturing team-specific variations. $\epsilon$: error term; captures unexplained variability in FieldGoalsPercentage.

# Model 3: y ~ key x + controls + team dummies
model3_fg <- lm(FieldGoalsPercentage ~ `3PointPercentage` + Games + FieldGoal + Pace + factor(Team), data = data)

Model 4

Model 4 includes both team dummies and year dummies, allowing for adjustments based on both team-specific and year-specific effects. It captures the influence of both team and year on FieldGoalsPercentage, making the model more comprehensive.

Estimated Equation: $\text{FieldGoalsPercentage} = \beta_0 + \beta_1 \times \text{3PointPercentage} + \beta_2 \times \text{Games} + \beta_3 \times \text{FieldGoal} + \beta_4 \times \text{Pace} + \sum_{i=1}^{n} \gamma_i \times \text{Team}_i + \sum_{j=1}^{m} \delta_j \times \text{Year}_j + \epsilon$

Where,

$\beta_0$: intercept; baseline FieldGoalsPercentage. $\beta_1$: effect of 3PointPercentage on FieldGoalsPercentage. $\beta_2$: effect of additional games on FieldGoalsPercentage. $\beta_3$: effect of FieldGoal on FieldGoalsPercentage. $\beta_4$: effect of Pace on FieldGoalsPercentage. $\sum_{i=1}^{n} \gamma_i \times \text{Team}_i$: team-specific effects with dummy variables. $\sum_{j=1}^{m} \delta_j \times \text{Year}_j$: year-specific effects with dummy variables. $\epsilon$: error term; captures unexplained variability in FieldGoalsPercentage.

# Model 4: y ~ key x + controls + team dummies + year dummies
model4_fg <- lm(FieldGoalsPercentage ~ `3PointPercentage` + Games + FieldGoal + Pace + factor(Team) + factor(Year), data = data)

stargazer(
  model1_fg, model2_fg, model3_fg, model4_fg,
  type = 'text',
  dep.var.labels = c("Field Goals Percentage"),
  column.labels = c("Model 1", "Model 2", "Model 3", "Model 4"),
  title = "Regression Results for Field Goals Percentage",
  align = TRUE,
  no.space = TRUE,
  column.sep.width = "0.5pt",
  keep = c("3PointPercentage", "Games", "FieldGoal", "^Pace$"),
  add.lines = list(c("Entity FE", "No", "No", "Yes", "Yes")),
  out = "model_summary_fg_cleaned.txt"
)

## 
## Regression Results for Field Goals Percentage
## ===================================================================================================================
##                                                           Dependent variable:                                      
##                     -----------------------------------------------------------------------------------------------
##                                                         Field Goals Percentage                                     
##                             Model 1                 Model 2                 Model 3                 Model 4        
##                               (1)                     (2)                     (3)                     (4)          
## -------------------------------------------------------------------------------------------------------------------
## `3PointPercentage`         0.524***                0.253***                0.257***                0.253***        
##                             (0.070)                 (0.052)                 (0.057)                 (0.058)        
## Games                                               0.0002*                0.0003**                 0.0004         
##                                                    (0.0001)                (0.0001)                (0.0004)        
## FieldGoal                                          0.007***                0.007***                0.006***        
##                                                     (0.001)                 (0.001)                 (0.001)        
## Pace                                               -0.002***               -0.002***               -0.002***       
##                                                    (0.0005)                 (0.001)                 (0.001)        
## -------------------------------------------------------------------------------------------------------------------
## Entity FE                     No                      No                      Yes                     Yes          
## Observations                  120                     120                     120                     120          
## R2                           0.323                   0.744                   0.851                   0.868         
## Adjusted R2                  0.318                   0.735                   0.794                   0.810         
## Residual Std. Error    0.012 (df = 118)        0.008 (df = 115)         0.007 (df = 86)         0.006 (df = 83)    
## F Statistic         56.403*** (df = 1; 118) 83.402*** (df = 4; 115) 14.933*** (df = 33; 86) 15.105*** (df = 36; 83)
## ===================================================================================================================
## Note:                                                                                   *p<0.1; **p<0.05; ***p<0.01

Why the Coefficients Change

Model 1: Includes only 3-Point Percentage, so the coefficient of 0.524 reflects the raw association between 3-Point Percentage and Field Goals Percentage without controlling for other variables.

Models 2, 3, and 4: Once the controls (Games, FieldGoal, Pace) are added, the coefficient on 3-Point Percentage falls to roughly 0.25 (0.253, 0.257, and 0.253) and stays there when team and year dummies are included. The drop indicates that part of the variation attributed to 3-Point Percentage in Model 1 is shared with the controls. The stability across Models 2 to 4 suggests that the remaining association is not driven by differences between teams or seasons.
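The mechanism behind this kind of coefficient change can be illustrated with simulated data. The sketch below is not based on the NBA data: the variables x, z, and y and the true coefficients (1 and 2) are assumptions chosen only to show how omitting a correlated control inflates the coefficient on the key regressor.

```r
set.seed(1)                      # fixed seed so the sketch is reproducible
n <- 500
z <- rnorm(n)                    # hypothetical control variable
x <- z + rnorm(n)                # key regressor, correlated with the control
y <- 1 * x + 2 * z + rnorm(n)    # true coefficient on x is 1

short <- lm(y ~ x)               # analogous to Model 1: no controls
long  <- lm(y ~ x + z)           # analogous to Model 2: control included

b_short <- unname(coef(short)["x"])
b_long  <- unname(coef(long)["x"])

# The short regression attributes part of z's effect to x
print(c(short = b_short, long = b_long))
```

In this setup the short regression should give a coefficient near 2 and the long regression a coefficient near 1, which mirrors the drop from 0.524 to about 0.25 seen in the table.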

Why the Pace is Negative

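Pace: A negative coefficient for Pace means that, holding the other predictors fixed, a faster pace (i.e., more possessions per game) is associated with a lower Field Goals Percentage, by about 0.002 per additional possession. One plausible explanation is that faster play produces lower-quality shot attempts.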

Choosing the Best-Fit Model:

# Extract AIC and BIC values from the models fitted above
aic_values_wp <- c(AIC(model1_wp), AIC(model2_wp), AIC(model3_wp), AIC(model4_wp))
bic_values_wp <- c(BIC(model1_wp), BIC(model2_wp), BIC(model3_wp), BIC(model4_wp))

aic_values_points <- c(AIC(model1_pts), AIC(model2_pts), AIC(model3_pts), AIC(model4_pts))
bic_values_points <- c(BIC(model1_pts), BIC(model2_pts), BIC(model3_pts), BIC(model4_pts))

aic_values_fg <- c(AIC(model1_fg), AIC(model2_fg), AIC(model3_fg), AIC(model4_fg))
bic_values_fg <- c(BIC(model1_fg), BIC(model2_fg), BIC(model3_fg), BIC(model4_fg))

# Print AIC and BIC values
print(data.frame(Model = c("Model 1", "Model 2", "Model 3", "Model 4"),
                 AIC_WP = aic_values_wp,
                 BIC_WP = bic_values_wp,
                 AIC_Points = aic_values_points,
                 BIC_Points = bic_values_points,
                 AIC_FG = aic_values_fg,
                 BIC_FG = bic_values_fg))

##     Model    AIC_WP    BIC_WP AIC_Points BIC_Points    AIC_FG    BIC_FG
## 1 Model 1 -184.8636 -176.5011   637.3680   645.7305 -711.2876 -702.9252
## 2 Model 2 -203.1888 -186.4638   466.6080   483.3329 -821.7513 -805.0263
## 3 Model 3 -212.3853 -114.8231   457.4017   554.9640 -829.1958 -731.6336
## 4 Model 4 -235.2935 -129.3688   452.9184   558.8431 -837.0166 -731.0919
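As a quick check on the table above, the following sketch identifies which model minimizes each criterion for each outcome. The numbers are transcribed from the printed output; the data frame name ic is introduced here only for illustration.

```r
# AIC and BIC values transcribed from the printed table (Models 1 to 4)
ic <- data.frame(
  Outcome = rep(c("WP", "Points", "FG"), each = 4),
  Model   = rep(1:4, times = 3),
  AIC = c(-184.8636, -203.1888, -212.3853, -235.2935,
           637.3680,  466.6080,  457.4017,  452.9184,
          -711.2876, -821.7513, -829.1958, -837.0166),
  BIC = c(-176.5011, -186.4638, -114.8231, -129.3688,
           645.7305,  483.3329,  554.9640,  558.8431,
          -702.9252, -805.0263, -731.6336, -731.0919)
)

# Within each outcome the rows are ordered Model 1..4, so which.min()
# returns the model number with the lowest criterion value
best_aic <- tapply(ic$AIC, ic$Outcome, which.min)
best_bic <- tapply(ic$BIC, ic$Outcome, which.min)
print(rbind(AIC = best_aic, BIC = best_bic))
```

AIC selects Model 4 for every outcome, while BIC selects Model 2 for every outcome, because BIC applies a heavier penalty to the team and year dummies.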

Interpretation of Model Selection

1. Winning Percentage (WP)

Model 1: AIC = -184.86, BIC = -176.50
Model 2: AIC = -203.19, BIC = -186.46
Model 3: AIC = -212.39, BIC = -114.82
Model 4: AIC = -235.29, BIC = -129.37

Best Model: Model 4 has the lowest AIC (-235.29). BIC, which penalizes the additional team and year dummies more heavily, is lowest for Model 2 (-186.46). Model 4 is retained for predicting Winning Percentage because it accounts for team- and year-specific effects; it includes 3-Point Percentage, Games, Field Goal, Pace, and both Team and Year factors.

2. Total Points (Points)

Model 1: AIC = 637.37, BIC = 645.73
Model 2: AIC = 466.61, BIC = 483.33
Model 3: AIC = 457.40, BIC = 554.96
Model 4: AIC = 452.92, BIC = 558.84

Best Model: Model 4 has the lowest AIC (452.92), but its BIC (558.84) is slightly higher than that of Model 3 (554.96) and well above that of Model 2 (483.33), which BIC favors. On the AIC criterion, Model 4 is the preferred model for predicting Total Points, incorporating 3-Point Percentage, Games, Field Goal, Pace, and both Team and Year factors.

3. Field Goals Percentage (FG%)

Model 1: AIC = -711.29, BIC = -702.93
Model 2: AIC = -821.75, BIC = -805.03
Model 3: AIC = -829.20, BIC = -731.63
Model 4: AIC = -837.02, BIC = -731.09

Best Model: Model 4 has the lowest AIC (-837.02). Its BIC is close to that of Model 3, but BIC is lowest for Model 2 (-805.03). On the AIC criterion, Model 4 is also selected for Field Goals Percentage, with the same predictors as for the other outcomes.

Overall Best Models:

Winning Percentage: Model 4
Total Points: Model 4
Field Goals Percentage: Model 4

Rationale: Model 4 has the lowest AIC for all three outcome variables, whereas BIC favors the more parsimonious Model 2 in every case. The choice of Model 4 therefore rests on AIC and on the aim of controlling for Team and Year effects, at the cost of 32 additional parameters relative to Model 2. It incorporates 3-Point Percentage, Games, Field Goals, Pace, and both Team and Year effects.

stargazer(
  model4_wp, model4_pts, model4_fg,
  type = 'text',
  dep.var.labels = c("Winning Percentage", "Points", "FieldGoalsPercentage"),
  column.labels = c("Model 4", "Model 4", "Model 4"),
  title = "Regression Results for the Best-Fit Models (Model 4)",
  align = TRUE,
  no.space = TRUE,
  column.sep.width = "0.5pt",
  keep = c("3PointPercentage", "Games", "FieldGoal", "^Pace$"),
  add.lines = list(c("Entity FE", "Yes", "Yes", "Yes")),
  out = "model_summary_best_fit.txt"
)

## 
## Regression Results for the Best-Fit Models (Model 4)
## ===============================================================================
##                                              Dependent variable:               
##                               -------------------------------------------------
##                               Winning Percentage  Points   FieldGoalsPercentage
##                                    Model 4        Model 4        Model 4       
##                                      (1)            (2)            (3)         
## -------------------------------------------------------------------------------
## `3PointPercentage`                 2.337***      65.209***       0.253***      
##                                    (0.713)       (12.546)        (0.058)       
## Games                              0.015***        0.056          0.0004       
##                                    (0.006)        (0.097)        (0.0004)      
## FieldGoal                          0.045***      1.613***        0.006***      
##                                    (0.010)        (0.171)        (0.001)       
## Pace                              -0.023***      0.512***       -0.002***      
##                                    (0.007)        (0.121)        (0.001)       
## -------------------------------------------------------------------------------
## Entity FE                            Yes            Yes            Yes         
## Observations                         120            120            120         
## R2                                  0.768          0.908          0.868        
## Adjusted R2                         0.667          0.868          0.810        
## Residual Std. Error (df = 83)       0.080          1.399          0.006        
## F Statistic (df = 36; 83)          7.635***      22.759***      15.105***      
## ===============================================================================
## Note:                                               *p<0.1; **p<0.05; ***p<0.01

Best Fit Model Interpretations

1. Impact on Winning Percentage

Coefficient for 3PointPercentage: 2.337

Interpretation: 3PointPercentage measures shooting accuracy (3-pointers made divided by 3-pointers attempted), and both it and winning percentage appear to be recorded as proportions. A full one-unit increase (from 0% to 100%) is therefore outside the observed range. On a practical scale, a one-percentage-point increase in 3PointPercentage is associated with an increase of about 0.023 in winning percentage (2.3 percentage points), holding the other predictors fixed. This positive and significant association suggests that teams that shoot 3-pointers more accurately generally have a higher winning percentage.

Coefficient for Pace: -0.023

Interpretation: The negative coefficient indicates that each additional possession per game is associated with a winning percentage about 0.023 (2.3 percentage points) lower, holding the other predictors fixed. A faster pace might create more scoring opportunities, but in this sample it is associated with slightly worse overall results.

2. Impact on Total Points Scored

Coefficient for 3PointPercentage: 65.209

Interpretation: With 3PointPercentage recorded as a proportion, the coefficient of 65.209 implies that a one-percentage-point increase in 3PointPercentage is associated with about 0.65 more points scored per game, holding the other predictors fixed. This statistically significant association shows that teams that shoot 3-pointers more accurately score more points.

Coefficient for Pace: 0.512

Interpretation: The positive coefficient for Pace indicates that each additional possession per game is associated with about 0.51 more points scored. Playing at a higher tempo increases scoring opportunities and is therefore associated with higher total points.

3. Impact on Field-Goal Percentage

Coefficient for 3PointPercentage: 0.253

Interpretation: Each one-percentage-point increase in 3PointPercentage is associated with a 0.253 percentage point increase in field-goal percentage, holding the other predictors fixed. Part of this association is mechanical, because made 3-pointers are themselves counted as made field goals.

Coefficient for Pace: -0.002

Interpretation: The negative coefficient for Pace suggests that a faster pace is associated with a slight decrease in field-goal percentage (about 0.2 percentage points per additional possession). This implies that while a faster pace might increase scoring opportunities, it could also reduce shooting accuracy.
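To put the 3PointPercentage coefficients above on a more natural scale, the sketch below converts them to the change associated with a one-percentage-point increase. It assumes that 3PointPercentage is stored as a proportion (for example 0.36 rather than 36), which is consistent with the magnitudes in the table but is not stated explicitly in the data description.

```r
# Model 4 coefficients on 3PointPercentage, copied from the regression table
coef_3p <- c(WinningPercentage    = 2.337,
             Points               = 65.209,
             FieldGoalsPercentage = 0.253)

# Assumption: the predictor is a proportion, so one percentage point = 0.01
per_point <- coef_3p * 0.01
print(per_point)
```

Under that assumption, one extra percentage point of 3-point accuracy corresponds to about 0.023 in winning percentage, 0.65 points per game, and 0.0025 in field-goal percentage.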

Overall Interpretation

3-Point Shooting:

Winning Percentage: Higher 3-point percentage is associated with a significantly higher winning percentage. Because the fitted models use 3-point percentage rather than 3-point attempts, this result speaks to shooting accuracy and not directly to shot volume.

Total Points Scored: Higher 3-point percentage is associated with a substantial increase in total points scored, highlighting the effectiveness of accurate 3-point shooting in boosting scoring.

Field-Goal Percentage: There is also a positive association with field-goal percentage. The coefficients are not directly comparable across outcomes, because the three outcomes are measured on different scales.

Influence of Pace:

Winning Percentage: A faster pace is associated with a slightly lower winning percentage, possibly due to less control over the game and more scoring variability.

Total Points Scored: A faster pace is associated with more total points scored, indicating that more possessions lead to more scoring opportunities.

Field-Goal Percentage: A faster pace is associated with a slightly lower field-goal percentage, likely because faster play may lead to lower-quality shot attempts.

# Load necessary libraries
library(ggplot2)
library(car)  # For Variance Inflation Factor (VIF)

Check Linearity

Residuals vs. Fitted Values Plot

This plot helps verify if the relationship between predictors and the outcome is linear.

# Plot residuals vs. fitted values for Model 4_Winning_Percentage
plot(model4_wp$fitted.values, model4_wp$residuals,
     xlab = "Fitted Values", ylab = "Residuals",
     main = "Residuals vs Fitted Values (Model 4 WP)")
abline(h = 0, col = "red")

Linearity explanation

The “Residuals vs Fitted Values” plot shows that the residuals (errors) are randomly scattered around zero without any clear pattern. This suggests that the regression model is a good fit for the data, with no obvious issues like non-linearity or heteroscedasticity (uneven spread of residuals). The even spread of residuals across the range of fitted values indicates that the model’s predictions are consistent and reliable.

# Plot residuals vs. fitted values for Model 4_Points
plot(model4_pts$fitted.values, model4_pts$residuals,
     xlab = "Fitted Values", ylab = "Residuals",
     main = "Residuals vs Fitted Values (Model 4 Points)")
abline(h = 0, col = "red")

The plot suggests that the regression model fits the data well. The residuals are evenly spread around zero and show no patterns, indicating that the model’s assumptions are likely satisfied and that the predictions are reliable across the entire range of fitted values.

# Plot residuals vs. fitted values for Model 4_FieldGoal
plot(model4_fg$fitted.values, model4_fg$residuals,
     xlab = "Fitted Values", ylab = "Residuals",
     main = "Residuals vs Fitted Values (Model 4 FG)")
abline(h = 0, col = "red")

The residuals are relatively small, mostly staying within a narrow range (approximately -0.015 to 0.005). This indicates that the model’s predictions are quite close to the actual values, with only minor deviations. The plot shows no obvious issues such as bias, non-linearity, or heteroscedasticity, and the spread of the residuals is similar across the entire range of fitted values.

Check Homoscedasticity

Residuals vs. Fitted Values (Absolute Residuals) Plot

# Plot residuals vs. fitted values (absolute residuals) for Model 4
plot(model4_wp$fitted.values, abs(model4_wp$residuals),
     xlab = "Fitted Values", ylab = "Absolute Residuals",
     main = "Absolute Residuals vs Fitted Values (Model 4 WP)")
abline(h = mean(abs(model4_wp$residuals)), col = "red")

The x-axis represents the “Fitted Values” while the y-axis represents “Absolute Residuals.” The points are scattered around the red horizontal line at approximately 0.05, which marks the mean absolute residual. They vary across the range of fitted values without displaying a clear trend or funnel shape, so there is no visual sign of heteroscedasticity.

# Plot absolute residuals vs. fitted values for Model 4 (Points)
plot(model4_pts$fitted.values, abs(model4_pts$residuals),
     xlab = "Fitted Values", ylab = "Absolute Residuals",
     main = "Absolute Residuals vs Fitted Values (Model 4 Points)")
# Reference line at the mean absolute residual of the Points model
abline(h = mean(abs(model4_pts$residuals)), col = "red")

The scatter plot shows the relationship between “Fitted Values” and “Absolute Residuals.” The x-axis ranges from 105 to 120 for “Fitted Values,” and the y-axis ranges from 0 to 3 for “Absolute Residuals.” The points are scattered without a clear pattern around the reference line, which marks the mean absolute residual of this model (about 0.93 points).

# Plot absolute residuals vs. fitted values for Model 4 (Field Goals Percentage)
plot(model4_fg$fitted.values, abs(model4_fg$residuals),
     xlab = "Fitted Values", ylab = "Absolute Residuals",
     main = "Absolute Residuals vs Fitted Values (Model 4 FG)")
# Reference line at the mean absolute residual of the FG model
abline(h = mean(abs(model4_fg$residuals)), col = "red")

Here, the x-axis represents the “Fitted Values” ranging from 0.44 to 0.50, while the y-axis represents the “Absolute Residuals” ranging from 0.00 to 0.015. The plot shows a spread of points without any clear pattern or trend. The horizontal line marks the mean absolute residual of this model (about 0.004).
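The visual checks above can be complemented with a formal test. The following is a minimal sketch of the studentized (Koenker) Breusch-Pagan test computed by hand in base R. It runs on the built-in mtcars data purely for illustration; the same steps could be applied to the residuals and predictors of the Model 4 fits.

```r
# Illustrative fit on the built-in mtcars data (not the NBA data)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Auxiliary regression: squared residuals on the same predictors
e2  <- resid(fit)^2
aux <- lm(e2 ~ wt + hp, data = mtcars)

# Test statistic: n * R^2 of the auxiliary regression, compared with a
# chi-squared distribution whose df equals the number of slope terms
lm_stat <- nobs(fit) * summary(aux)$r.squared
df      <- length(coef(aux)) - 1
p_value <- pchisq(lm_stat, df = df, lower.tail = FALSE)

print(c(LM = lm_stat, df = df, p = p_value))
```

A small p-value would indicate that the residual variance changes with the predictors; a large one is consistent with homoscedasticity.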

Check Normality of Residuals

Q-Q Plot

# Q-Q plot for Model 4
qqnorm(model4_wp$residuals, main = "Q-Q Plot (Model 4 WP)")
qqline(model4_wp$residuals, col = "red")

# Q-Q plot for Model 4
qqnorm(model4_pts$residuals, main = "Q-Q Plot (Model 4 PTS)")
qqline(model4_pts$residuals, col = "red")

# Q-Q plot for Model 4
qqnorm(model4_fg$residuals, main = "Q-Q Plot (Model 4 FG)")
qqline(model4_fg$residuals, col = "red")

In all three Q-Q plots the residuals follow the reference line closely over most of the range, with a few points deviating from it. This suggests that the residuals of the winning percentage, points, and field-goal percentage models are approximately normally distributed, apart from a small number of potential outliers.

# Shapiro-Wilk test for normality
shapiro.test(model4_wp$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  model4_wp$residuals
## W = 0.9924, p-value = 0.7582

W Statistic: The value 0.9924 is close to 1, indicating that the residuals are nearly normally distributed. p-value: The p-value of 0.7582 is much higher than common significance levels (e.g., 0.05 or 0.01). This means that the test does not provide enough evidence to reject the null hypothesis of normality.

# Shapiro-Wilk test for normality
shapiro.test(model4_pts$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  model4_pts$residuals
## W = 0.99405, p-value = 0.8942

W Statistic: The value 0.99405 is very close to 1, which indicates that the residuals from model4_pts are nearly normally distributed. p-value: The p-value of 0.8942 is substantially above common thresholds for significance (e.g., 0.05 or 0.01). This means that the test does not provide sufficient evidence to reject the null hypothesis.

# Shapiro-Wilk test for normality
shapiro.test(model4_fg$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  model4_fg$residuals
## W = 0.99172, p-value = 0.6946

W Statistic: The value 0.99172 suggests that the residuals are close to normally distributed, but not perfectly. p-value: The p-value of 0.6946 is well above common significance levels (like 0.05 or 0.01). This means that there is no significant evidence to reject the null hypothesis of normality.

# Calculate VIF for Model 4
vif(model4_wp)

##                         GVIF Df GVIF^(1/(2*Df))
## `3PointPercentage`  2.489760  1        1.577897
## Games              18.112269  1        4.255851
## FieldGoal           4.329421  1        2.080726
## Pace                3.620380  1        1.902730
## factor(Team)       11.584348 29        1.043140
## factor(Year)       27.631588  3        1.738739

vif(model4_pts)

##                         GVIF Df GVIF^(1/(2*Df))
## `3PointPercentage`  2.489760  1        1.577897
## Games              18.112269  1        4.255851
## FieldGoal           4.329421  1        2.080726
## Pace                3.620380  1        1.902730
## factor(Team)       11.584348 29        1.043140
## factor(Year)       27.631588  3        1.738739

vif(model4_fg)

##                         GVIF Df GVIF^(1/(2*Df))
## `3PointPercentage`  2.489760  1        1.577897
## Games              18.112269  1        4.255851
## FieldGoal           4.329421  1        2.080726
## Pace                3.620380  1        1.902730
## factor(Team)       11.584348 29        1.043140
## factor(Year)       27.631588  3        1.738739

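High VIF Values: Games has a notably high value (GVIF = 18.11, adjusted GVIF^(1/(2*Df)) = 4.26), suggesting that it is highly collinear with other predictors in the model. A plausible source is the year dummies: the standard error on Games rises from 0.0001 to 0.0004 once factor(Year) is added in the Field Goals Percentage models. This could affect the stability and interpretability of the Games coefficient.

Moderate VIF Values: FieldGoal (GVIF = 4.33) shows some level of multicollinearity, but it is not extremely high.

Low VIF Values: 3PointPercentage, Pace, factor(Team), and factor(Year) have relatively low adjusted values (1.58, 1.90, 1.04, and 1.74), indicating less concern regarding multicollinearity. For the multi-column terms factor(Team) and factor(Year), the adjusted GVIF^(1/(2*Df)) is the figure to compare, not the raw GVIF.

The three VIF tables are identical because VIF depends only on the predictors, which are the same in all three Model 4 specifications.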
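For a single-column predictor, the VIF is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below computes this by hand in base R on the built-in mtcars data; the helper name vif_by_hand is hypothetical and is not part of the car package.

```r
# Hypothetical helper: VIF for each single-column predictor, by hand
vif_by_hand <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    # Regress predictor p on all remaining predictors
    aux <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(aux)$r.squared)
  })
}

# Illustrative predictors from mtcars (not the NBA data)
vifs <- vif_by_hand(mtcars, c("wt", "hp", "disp"))
print(round(vifs, 2))
```

For terms with one degree of freedom, GVIF equals this ordinary VIF, so the GVIF of 18.11 for Games corresponds to an R² of about 0.94 when Games is regressed on the remaining predictors.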

Predict The Model

# In-sample predictions for the fixed effects models
# (predict() takes `newdata`; an argument named `data` is silently ignored)
predictions_fe_winning <- predict(model4_wp, newdata = data)
predictions_fe_total_points <- predict(model4_pts, newdata = data)
predictions_fe_field_goal_percentage <- predict(model4_fg, newdata = data)

# Comparison with actual values
comparison_winning_percentage <- data.frame(Actual = data$WinningPercentage, Predicted = predictions_fe_winning)
comparison_winning_percentage <- round(comparison_winning_percentage, 3)

comparison_total_points <- data.frame(Actual = data$Points, Predicted = predictions_fe_total_points)
comparison_total_points <- round(comparison_total_points)

comparison_field_goal_percentage <- data.frame(Actual = data$FieldGoalsPercentage, Predicted = predictions_fe_field_goal_percentage)
comparison_field_goal_percentage <- round(comparison_field_goal_percentage, 3)

head(comparison_total_points)

##   Actual Predicted
## 1    112       112
## 2    114       114
## 3    112       111
## 4    103       102
## 5    107       106
## 6    107       108

head(comparison_winning_percentage)

##   Actual Predicted
## 1  0.299     0.281
## 2  0.667     0.633
## 3  0.486     0.453
## 4  0.354     0.320
## 5  0.338     0.294
## 6  0.292     0.384

head(comparison_field_goal_percentage)

##   Actual Predicted
## 1  0.449     0.445
## 2  0.461     0.462
## 3  0.448     0.457
## 4  0.434     0.437
## 5  0.447     0.454
## 6  0.458     0.461

# Define a function to calculate MAE, MSE, and R-squared
calculate_metrics <- function(actual, predicted) {
  # Mean Absolute Error (MAE)
  mae <- mean(abs(actual - predicted))
  
  # Mean Squared Error (MSE)
  mse <- mean((actual - predicted)^2)
  
  # R-squared
  ss_total <- sum((actual - mean(actual))^2)
  ss_residual <- sum((actual - predicted)^2)
  r_squared <- 1 - (ss_residual / ss_total)
  
  return(c(MAE = mae, MSE = mse, R_squared = r_squared))
}

# Calculate metrics for each model
metrics_winning_percentage <- calculate_metrics(data$WinningPercentage, predictions_fe_winning)
metrics_total_points <- calculate_metrics(data$Points, predictions_fe_total_points)
metrics_field_goal_percentage <- calculate_metrics(data$FieldGoalsPercentage, predictions_fe_field_goal_percentage)

# Print metrics
cat("Metrics for Winning Percentage Model:\n")

## Metrics for Winning Percentage Model:

print(metrics_winning_percentage)

##         MAE         MSE   R_squared 
## 0.052265223 0.004374383 0.768068301

cat("\nMetrics for Total Points Model:\n")

## 
## Metrics for Total Points Model:

print(metrics_total_points)

##       MAE       MSE R_squared 
## 0.9304140 1.3540631 0.9080146

cat("\nMetrics for Field Goal Percentage Model:\n")

## 
## Metrics for Field Goal Percentage Model:

print(metrics_field_goal_percentage)

##          MAE          MSE    R_squared 
## 4.295843e-03 2.905416e-05 8.675778e-01
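The metrics above are computed on the same rows used to fit the models, which is why the R-squared values match those in the regression table (0.768, 0.908, 0.868). A rough sense of out-of-sample error can be obtained without refitting through the leave-one-out identity for linear models, where each held-out residual equals e / (1 - h). The sketch below uses the built-in mtcars data for illustration only.

```r
# Illustrative fit on the built-in mtcars data (not the NBA data)
fit <- lm(mpg ~ wt + hp, data = mtcars)

e <- resid(fit)       # in-sample residuals
h <- hatvalues(fit)   # leverage of each observation

# Leave-one-out residuals via the identity e_i / (1 - h_i)
loo_resid <- e / (1 - h)

rmse_in  <- sqrt(mean(e^2))
rmse_loo <- sqrt(mean(loo_resid^2))
print(c(in_sample = rmse_in, leave_one_out = rmse_loo))
```

The leave-one-out error is never smaller than the in-sample error. For Model 4, with 37 estimated coefficients and 120 observations, the average leverage is about 0.31, so the gap could be noticeable.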

Conclusion

As per the fitted models:

Winning Percentage: Higher 3-point percentage is associated with a higher winning percentage, while a faster pace is associated with a slightly lower one. Model 4 explains 76.81% of the in-sample variance in winning percentage, using 3-point percentage, the controls, and the team and year dummies together.

Total Points Scored: Higher 3-point percentage and a faster pace are both associated with more total points scored. The model captures 90.80% of the in-sample variance in total points scored.

Field Goal Percentage: Higher 3-point percentage is associated with a higher field-goal percentage (coefficient 0.253), while a faster pace is associated with a slightly lower one. The model explains 86.76% of the in-sample variance in field-goal percentage.

Overall, 3-point shooting is significantly associated with each of these statistical measures. Pace enters the models as a control with its own association with each outcome. Because no interaction between pace and 3-point shooting was estimated, the results do not show whether pace changes the size of the 3-point effect.

NBA 3Point Project

Ganesh Kumar

2024-08-13
