NBA Data Analysis

Introduction
Background
Analysis
Descriptive Analysis
Inferential Analysis and Plots
Does Location Affect A Team’s Winning Or Losing?
How is the Final Margin affected by the other variables?
How does winning depend on the shot result?
Conclusion
Appendices
Reference and citations

Introduction

The National Basketball Association (NBA) is a professional basketball league in North America. The league is composed of 30 teams (29 in the United States and 1 in Canada) and is one of the four major professional sports leagues in the United States and Canada. It is the premier men’s professional basketball league in the world. The NBA is an active member of USA Basketball (USAB), which is recognized by FIBA (International Basketball Federation) as the national governing body for basketball in the United States. The league’s international offices and individual team offices are directed from its head offices in Midtown Manhattan, while its NBA Entertainment and NBA TV studios are directed from offices in Secaucus, New Jersey. By revenue, the NBA is the third wealthiest professional sports league, after the National Football League (NFL) and Major League Baseball (MLB). Our aim here is to find how winning depends on other factors and which factors affect the final margin.

Background

Multiple regression analysis is used to see if there is a statistically significant relationship between sets of variables. It’s used to find trends in those sets of data. Multiple regression analysis is almost the same as simple linear regression. The only difference between simple linear regression and multiple regression is the number of predictors (\(x\) variables) used in the regression.

Simple regression analysis uses a single \(x\) variable for each dependent \(y\) variable. For example: \((x_1, y_1)\).
Multiple regression uses multiple \(x\) variables for each dependent variable. For example: \((x_1, x_2, x_3, y_1)\).

In linear regression, one dependent variable is regressed on one or more independent variables.
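
The difference between the two kinds of regression can be seen in a minimal sketch. It uses R's built-in `mtcars` data set purely for illustration (it is not part of the NBA data), and the choice of `mpg`, `wt` and `hp` as variables is an assumption made for the example.

```r
# Illustration only: mtcars ships with R and is unrelated to the NBA data.

# Simple linear regression: one predictor (wt) for the dependent variable (mpg)
fit_simple <- lm(mpg ~ wt, data = mtcars)

# Multiple regression: several predictors (wt and hp) for the same dependent variable
fit_multi <- lm(mpg ~ wt + hp, data = mtcars)

# One coefficient per predictor, plus the intercept
coef(fit_simple)
coef(fit_multi)
```

The only change between the two calls is the right-hand side of the formula, which is where the extra predictors are listed.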

In logistic regression, the quantity being modeled is the logit of the probability \(P\) of the outcome, which is the natural log of the odds, that is,

\(\log{(odds)} = logit(P) = \ln{\frac{P}{1-P}}\)

So the logit is the log of the odds, and the odds are a function of \(P\). In logistic regression we fit \(logit(P) = a + bX\), that is, the log odds (logit) are assumed to be linearly related to \(X\). To recover a simple probability, we solve for \(P\):

\(\ln{\frac{P}{1-P}} = a + bX\)

\(\frac{P}{1-P} = e^{a+bX}\)

\(P = \frac{e^{a+bX}}{1+e^{a+bX}}\)
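
The conversion between log odds and probability can be checked numerically. The sketch below is illustrative only: the values of `a`, `b` and `x` are made up, and the helper names `logit` and `inv_logit` are hypothetical. Base R's `plogis()` computes the same inverse transformation.

```r
# Hypothetical helper functions written for this illustration
logit <- function(p) log(p / (1 - p))                  # probability -> log odds
inv_logit <- function(eta) exp(eta) / (1 + exp(eta))   # log odds -> probability

# Made-up coefficients and predictor values
a <- -1
b <- 0.5
x <- c(0, 2, 4)

# P = e^(a + bX) / (1 + e^(a + bX))
p <- inv_logit(a + b * x)
p

# Applying the logit to P recovers the linear predictor a + bX
all.equal(logit(p), a + b * x)

# The built-in logistic distribution function gives the same probabilities
all.equal(p, plogis(a + b * x))
```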

The Chi-Square test of independence is used to determine if there is a significant relationship between two nominal (categorical) variables. The frequency of each category for one nominal variable is compared across the categories of the second nominal variable. The data can be displayed in a contingency table where each row represents a category for one variable and each column represents a category for the other variable. For example, say a researcher wants to examine the relationship between gender (male vs. female) and empathy (high vs. low). The chi-square test of independence can be used to examine this relationship. The null hypothesis for this test is that there is no relationship between gender and empathy. The alternative hypothesis is that there is a relationship between gender and empathy (e.g. there are more high-empathy females than high-empathy males).

Null hypothesis: there is no association between the two variables.
Alternative hypothesis: there is an association between the two variables.
Hypothesis testing: the chi-square test of independence works as it does for other tests like ANOVA, where a test statistic is computed and compared to a critical value. The critical value for the chi-square statistic is determined by the level of significance (typically .05) and the degrees of freedom. The degrees of freedom for the chi-square are calculated using the following formula: \(df = (r-1)(c-1)\) where \(r\) is the number of rows and \(c\) is the number of columns. If the observed chi-square test statistic is greater than the critical value, the null hypothesis can be rejected.
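
As a small sketch of the rule above, the critical value for a \(2 \times 2\) table at the .05 level can be obtained with `qchisq()`. The variable names are chosen for this illustration.

```r
# A 2 x 2 contingency table, e.g. two categories for each variable
n_rows <- 2
n_cols <- 2
alpha <- 0.05

# df = (r - 1)(c - 1)
df <- (n_rows - 1) * (n_cols - 1)

# Upper-tail critical value of the chi-square distribution
critical_value <- qchisq(1 - alpha, df = df)
critical_value
```

An observed statistic larger than `critical_value` leads to rejecting the null hypothesis of independence at level `alpha`.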

Analysis

Let us first read the data as follows,

data <- read.csv("C:/Users/Lenovo/OneDrive/Desktop/instadata/R Programming_Samraan_23rd Jan/nbadata.csv", header = TRUE)
head(data)

Descriptive Analysis

Let us find the descriptive statistics for the NBA data as follows,

summary(data)

##    ï..GAME_ID           DATE            HOME_TEAM          AWAY_TEAM        
##  Min.   :21400001   Length:128069      Length:128069      Length:128069     
##  1st Qu.:21400233   Class :character   Class :character   Class :character  
##  Median :21400449   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :21400452                                                           
##  3rd Qu.:21400673                                                           
##  Max.   :21400908                                                           
##                                                                             
##  PLAYER_NAME          PLAYER_ID        LOCATION              W            
##  Length:128069      Min.   :   708   Length:128069      Length:128069     
##  Class :character   1st Qu.:101162   Class :character   Class :character  
##  Mode  :character   Median :201939   Mode  :character   Mode  :character  
##                     Mean   :157238                                        
##                     3rd Qu.:202704                                        
##                     Max.   :204060                                        
##                                                                           
##   FINAL_MARGIN       SHOT_NUMBER         PERIOD       GAME_CLOCK       
##  Min.   :-53.0000   Min.   : 1.000   Min.   :1.000   Length:128069     
##  1st Qu.: -8.0000   1st Qu.: 3.000   1st Qu.:1.000   Class :character  
##  Median :  1.0000   Median : 5.000   Median :2.000   Mode  :character  
##  Mean   :  0.2087   Mean   : 6.507   Mean   :2.469                     
##  3rd Qu.:  9.0000   3rd Qu.: 9.000   3rd Qu.:3.000                     
##  Max.   : 53.0000   Max.   :38.000   Max.   :7.000                     
##                                                                        
##    SHOT_CLOCK       DRIBBLES        TOUCH_TIME         SHOT_DIST    
##  Min.   : 0.00   Min.   : 0.000   Min.   :-163.600   Min.   : 0.00  
##  1st Qu.: 8.20   1st Qu.: 0.000   1st Qu.:   0.900   1st Qu.: 4.70  
##  Median :12.30   Median : 1.000   Median :   1.600   Median :13.70  
##  Mean   :12.45   Mean   : 2.023   Mean   :   2.766   Mean   :13.57  
##  3rd Qu.:16.68   3rd Qu.: 2.000   3rd Qu.:   3.700   3rd Qu.:22.50  
##  Max.   :24.00   Max.   :32.000   Max.   :  24.900   Max.   :47.20  
##  NA's   :5567                                                       
##     PTS_TYPE     SHOT_RESULT        CLOSEST_DEFENDER   CLOSEST_DEFENDER_ID
##  Min.   :2.000   Length:128069      Length:128069      Min.   :   708     
##  1st Qu.:2.000   Class :character   Class :character   1st Qu.:101249     
##  Median :2.000   Mode  :character   Mode  :character   Median :201949     
##  Mean   :2.265                                         Mean   :159039     
##  3rd Qu.:3.000                                         3rd Qu.:203079     
##  Max.   :3.000                                         Max.   :530027     
##                                                                           
##  CLOSE_DEF_DIST        FGM              PTS        
##  Min.   : 0.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 2.300   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median : 3.700   Median :0.0000   Median :0.0000  
##  Mean   : 4.123   Mean   :0.4521   Mean   :0.9973  
##  3rd Qu.: 5.300   3rd Qu.:1.0000   3rd Qu.:2.0000  
##  Max.   :53.200   Max.   :1.0000   Max.   :3.0000  
##
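
The first column name in the summary above appears as `ï..GAME_ID` rather than `GAME_ID`. This pattern usually means the CSV file starts with a UTF-8 byte-order mark (BOM) that was read as part of the header. A possible remedy, sketched here on a small temporary file rather than on the real `nbadata.csv`, is the `fileEncoding = "UTF-8-BOM"` argument of `read.csv()`. The two-column file below is made up for the demonstration.

```r
# Write a tiny made-up CSV that begins with a UTF-8 byte-order mark
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)                   # the BOM bytes
writeBin(charToRaw("GAME_ID,W\n21400001,W\n"), con)          # header and one row
close(con)

# Telling read.csv about the BOM keeps it out of the first column name
demo <- read.csv(tmp, header = TRUE, fileEncoding = "UTF-8-BOM")
names(demo)

unlink(tmp)
```

Assuming the NBA file has the same issue, adding the same argument to the original `read.csv()` call should give a clean `GAME_ID` column name.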

Let us get some idea about the distributions of the variables by comparing each mean with its median, which is a rough indicator of the direction of skew. For FINAL_MARGIN, mean < median, which suggests that its distribution is negatively skewed. For SHOT_NUMBER, PERIOD and SHOT_CLOCK, mean > median, which suggests that their distributions are positively skewed. Similarly, for the variables DRIBBLES, TOUCH_TIME, PTS_TYPE, CLOSE_DEF_DIST, FGM and PTS, mean > median, which suggests that their distributions are positively skewed. For SHOT_DIST, mean < median, which suggests that its distribution is negatively skewed. Note also that TOUCH_TIME has a negative minimum (\(-163.6\)), which cannot be a valid duration and points to some erroneous records.
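
The mean versus median comparison is only a heuristic. A more direct check is the moment-based sample skewness, sketched below. The function `sample_skewness` is a hypothetical helper written for this illustration (base R has no built-in skewness function), and the two vectors are made-up toy data.

```r
# Hypothetical helper: moment-based sample skewness, m3 / m2^(3/2)
sample_skewness <- function(x) {
  x <- x[!is.na(x)]           # drop missing values, as summary() reports them separately
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Toy data: a long right tail, and its mirror image with a long left tail
right_tailed <- c(1, 2, 2, 3, 3, 3, 4, 20)
left_tailed <- -right_tailed

sample_skewness(right_tailed)   # positive: mean > median
sample_skewness(left_tailed)    # negative: mean < median
```

With the NBA data loaded as `data`, a call such as `sample_skewness(data$SHOT_DIST)` would give the corresponding value for a single variable.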

Inferential Analysis and Plots

Does Location Affect A Team’s Winning Or Losing?

table(data$W, data$LOCATION)

##    
##         A     H
##   L 35496 27978
##   W 28639 35956

Each count in the table is a number of shot records. For away matches, the number of records from losses (35,496) is higher than the number from wins (28,639). For home matches, the number from wins (35,956) is higher than the number from losses (27,978). So, winning could be dependent on the location. Let us perform a Chi-square test to check the independence of location and winning as follows,

chisq.test(data$W,data$LOCATION)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  data$W and data$LOCATION
## X-squared = 1718.5, df = 1, p-value < 2.2e-16

As the p-value for the test is less than \(0.05\), we reject the null hypothesis of independence at level \(0.05\). So, winning is dependent on the location.
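
To see where the statistic comes from, the test can be reproduced from the four counts in the table alone. The sketch below rebuilds the contingency table by hand, computes the expected counts under independence, and applies the Yates continuity correction that `chisq.test()` uses by default for \(2 \times 2\) tables. The object names are chosen for this illustration.

```r
# Counts copied from the W by LOCATION table (filled column by column)
counts <- matrix(c(35496, 28639, 27978, 35956), nrow = 2,
                 dimnames = list(W = c("L", "W"), LOCATION = c("A", "H")))

# Share of losses and wins within each location
prop.table(counts, margin = 2)

# Expected counts if W and LOCATION were independent
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Chi-square statistic with Yates' continuity correction
x2_yates <- sum((abs(counts - expected) - 0.5)^2 / expected)
x2_yates

# The built-in test on the same table should agree with the manual value
res <- chisq.test(counts)
res$statistic
res$parameter   # degrees of freedom, (2 - 1) * (2 - 1)
```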

How is the Final Margin affected by the other variables?

Let us fit a linear model with explanatory variables LOCATION, W, SHOT_CLOCK, DRIBBLES, SHOT_DIST, SHOT_RESULT with dependent variable FINAL_MARGIN as follows,

model <- lm(FINAL_MARGIN~LOCATION+W+SHOT_CLOCK+DRIBBLES+SHOT_DIST+SHOT_RESULT,data=data)
summary(model)

## 
## Call:
## lm(formula = FINAL_MARGIN ~ LOCATION + W + SHOT_CLOCK + DRIBBLES + 
##     SHOT_DIST + SHOT_RESULT, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -42.112  -5.145  -0.030   5.058  41.870 
## 
## Coefficients:
##                     Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)       -11.540693   0.080624 -143.142  < 2e-16 ***
## LOCATIONH           1.652482   0.044695   36.973  < 2e-16 ***
## WW                 21.334155   0.044769  476.543  < 2e-16 ***
## SHOT_CLOCK          0.017838   0.003935    4.533 5.81e-06 ***
## DRIBBLES            0.014508   0.006567    2.209   0.0272 *  
## SHOT_DIST           0.013798   0.002620    5.267 1.39e-07 ***
## SHOT_RESULTmissed  -0.537384   0.045501  -11.810  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.761 on 122495 degrees of freedom
##   (5567 observations deleted due to missingness)
## Multiple R-squared:  0.6601, Adjusted R-squared:  0.6601 
## F-statistic: 3.965e+04 on 6 and 122495 DF,  p-value: < 2.2e-16

For the overall F-test of the model, the p-value is less than \(0.05\), so we reject the null hypothesis at level \(0.05\). So, our fitted model is significant. The adjusted R-squared for the fitted model is \(0.6601\), so the model explains about 66% of the variance in FINAL_MARGIN. Much of this is likely due to W, since a win implies a positive final margin by definition. The residual standard error of the model is \(7.761\) points, which is small relative to the observed range of FINAL_MARGIN (\(-53\) to \(53\)).

For the tests of significance of the individual coefficients, the p-value for every coefficient is less than \(0.05\), so we reject each null hypothesis at level \(0.05\). So, the variables LOCATION, W, SHOT_CLOCK, DRIBBLES, SHOT_DIST and SHOT_RESULT are significant for FINAL_MARGIN. With more than 120,000 observations, however, even very small effects, such as \(0.0145\) points per dribble, reach significance.
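
To make the coefficients concrete, the fitted equation can be evaluated by hand for a single shot. The sketch below copies the estimates from the model summary. The shot itself (a made shot in a home win, with 10 seconds on the shot clock, 2 dribbles and a distance of 10) is a made-up example, not a row from the data.

```r
# Coefficient estimates copied from the model summary
coefs <- c("(Intercept)"     = -11.540693,
           LOCATIONH         =   1.652482,
           WW                =  21.334155,
           SHOT_CLOCK        =   0.017838,
           DRIBBLES          =   0.014508,
           SHOT_DIST         =   0.013798,
           SHOT_RESULTmissed =  -0.537384)

# Hypothetical shot: dummy variables are 1 when the condition holds, 0 otherwise
shot <- c("(Intercept)"     = 1,
          LOCATIONH         = 1,    # home game
          WW                = 1,    # the team won
          SHOT_CLOCK        = 10,
          DRIBBLES          = 2,
          SHOT_DIST         = 10,
          SHOT_RESULTmissed = 0)    # the shot was made

# Predicted FINAL_MARGIN is the sum of coefficient times value
predicted_margin <- sum(coefs * shot[names(coefs)])
predicted_margin
```

Switching `WW` from 1 to 0 lowers the prediction by about 21.3 points, far more than any other single term, which shows how strongly W drives the fit.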

From the plots in the appendix, observe that in the residual plot the residuals are scattered randomly at each level of the fitted values, hence the fit is good. Also, in the normal Q-Q plot the points are near the theoretical line, so the errors are consistent with the normality assumption. Finally, the residuals vs leverage plot flags \(3\) observations: \(17333, 238430, 16176\).

How does winning depend on the shot result?

Here we want to know whether winning depends on whether the shot was “made” or “missed”. So, we will perform a t-test as follows,

# Code the game outcome as 1 for a win and 0 for a loss
win <- as.integer(data$W == "W")
pairwise.t.test(win, data$SHOT_RESULT)

## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  win and data$SHOT_RESULT 
## 
##        made  
## missed <2e-16
## 
## P value adjustment method: holm

Here observe that the p-value for the test is less than \(0.05\), so we reject the null hypothesis at level \(0.05\). So, the proportion of wins differs between made and missed shots.
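
With only two groups, the pairwise test above reduces to an ordinary pooled two-sample t-test on the 0/1 win indicator, so it compares the share of wins among made shots with the share among missed shots. The sketch below illustrates this on made-up toy data (100 made and 100 missed shots), not on the NBA data.

```r
# Toy data: 60 of 100 made shots and 45 of 100 missed shots come from wins
win <- c(rep(1, 60), rep(0, 40), rep(1, 45), rep(0, 55))
shot_result <- rep(c("made", "missed"), each = 100)

# Group means of a 0/1 variable are the win proportions
tapply(win, shot_result, mean)

# Pairwise test with pooled SD (the Holm adjustment does nothing for a single comparison)
pairwise_p <- pairwise.t.test(win, shot_result)$p.value["missed", "made"]

# Classical two-sample t-test with equal variances
student_p <- t.test(win ~ shot_result, var.equal = TRUE)$p.value

all.equal(pairwise_p, student_p)
```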

Conclusion

From this project the conclusions are as follows,

Winning a match depends on whether it is an away match or a home match.
The variables LOCATION, W, SHOT_CLOCK, DRIBBLES, SHOT_DIST and SHOT_RESULT affect FINAL_MARGIN.
The proportion of wins differs between made and missed shots.

Appendices

plot(model)

Reference and citations

Here we took the help of following websites –

The important citations related to this work are,

Bhandari, I., Colet, E., Parker, J. et al. (1997). Advanced Scout: Data Mining and Knowledge Discovery in NBA Data. Data Mining and Knowledge Discovery, 1, 121-125. https://doi.org/10.1023/A:1009782106822
Chatterjee, S., Campbell, M.R. and Wiseman, F. (1994). Take that jam! An analysis of winning percentage for NBA teams. Managerial and Decision Economics, 15, 521-535. https://doi.org/10.1002/mde.4090150514