Abstract
This project works with NBA data, and our aim is to extract information from it. We first perform a descriptive analysis to get an idea of the distributions of the variables. Next, we test whether winning depends on the location of the game. We then examine which factors the final margin depends on. Finally, we test whether winning depends on the shot result.
The National Basketball Association (NBA) is a professional basketball league in North America. The league is composed of 30 teams (29 in the United States and 1 in Canada) and is one of the four major professional sports leagues in the United States and Canada. It is the premier men’s professional basketball league in the world. The NBA is an active member of USA Basketball (USAB), which is recognized by FIBA (the International Basketball Federation) as the national governing body for basketball in the United States. The league’s international offices and individual team offices are directed from its head offices in Midtown Manhattan, while its NBA Entertainment and NBA TV studios are directed from offices in Secaucus, New Jersey. By revenue, the NBA is the third wealthiest professional sports league, after the National Football League (NFL) and Major League Baseball (MLB).
Our aim is to find how winning depends on the other factors and which factors affect the final margin.
Multiple regression analysis is used to see whether there is a statistically significant relationship between sets of variables and to find trends in those data. It is almost the same as simple linear regression; the only difference is the number of predictors (\(x\) variables) used in the regression.
Simple regression analysis uses a single \(x\) variable for each dependent \(y\) variable, for example \((x_1, y_1)\).
Multiple regression uses several \(x\) variables for each dependent variable, for example \(x_1, x_2, x_3\).
In linear regression, a single dependent variable is modelled as a linear function of the independent variables.
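The difference between the two kinds of regression can be sketched in R on simulated data. The variables `x1`, `x2` and `y` and the coefficients used to generate them are hypothetical and are not columns of the NBA data.

```r
# Simulated data: y depends linearly on two predictors plus noise.
set.seed(123)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n, sd = 0.5)

# Simple linear regression: one predictor.
simple_fit <- lm(y ~ x1)

# Multiple linear regression: two predictors.
multiple_fit <- lm(y ~ x1 + x2)

# The simple fit estimates an intercept and one slope;
# the multiple fit estimates an intercept and two slopes.
print(coef(simple_fit))
print(coef(multiple_fit))
```

Only the right-hand side of the formula changes between the two calls; the fitting function and the interpretation of each slope (the expected change in `y` per unit change in that predictor, holding the others fixed) stay the same.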
In logistic regression, the dependent variable is modelled on the logit scale. The logit is the natural log of the odds, that is,
\(\log(\text{odds}) = \operatorname{logit}(P) = \ln{\frac{P}{1-P}}\)
So the logit is the log of the odds, and the odds are a function of \(P\). In logistic regression we fit \(\operatorname{logit}(P) = a + bX\); that is, the log odds are assumed to be linearly related to \(X\). To recover the probability \(P\), we exponentiate both sides and solve for \(P\):
\(\ln{\frac{P}{1-P}} = a + bX\)
\(\frac{P}{1-P} = e^{a+bX}\)
\(P = \frac{e^{a+bX}}{1+e^{a+bX}}\)
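The conversion from log odds back to a probability can be checked numerically. The following is a minimal sketch in which the intercept `a`, the slope `b` and the values of `x` are hypothetical; `plogis()` and `qlogis()` are the logistic distribution function and its inverse in base R.

```r
# Hypothetical coefficients and predictor values, chosen only for illustration.
a <- -1.0
b <- 0.5
x <- c(-2, 0, 2, 4)

# Linear predictor on the log-odds (logit) scale.
log_odds <- a + b * x

# Probability from the closed-form expression P = e^(a+bX) / (1 + e^(a+bX)).
p <- exp(log_odds) / (1 + exp(log_odds))

# plogis() is the inverse logit and qlogis() is the logit,
# so both comparisons should report TRUE.
print(all.equal(p, plogis(log_odds)))
print(all.equal(qlogis(p), log_odds))
```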
The Chi-Square test of independence is used to determine if there is a significant relationship between two nominal (categorical) variables. The frequency of each category for one nominal variable is compared across the categories of the second nominal variable. The data can be displayed in a contingency table where each row represents a category for one variable and each column represents a category for the other variable. For example, say a researcher wants to examine the relationship between gender (male vs. female) and empathy (high vs. low). The chi-square test of independence can be used to examine this relationship. The null hypothesis for this test is that there is no relationship between gender and empathy. The alternative hypothesis is that there is a relationship between gender and empathy (e.g. there are more high-empathy females than high-empathy males).
Null hypothesis: there is no association between the two variables. Alternative hypothesis: there is an association between the two variables. Hypothesis testing: the chi-square test of independence proceeds as other tests such as ANOVA do, in that a test statistic is computed and compared to a critical value. The critical value for the chi-square statistic is determined by the level of significance (typically \(0.05\)) and the degrees of freedom. The degrees of freedom are calculated as \(df = (r-1)(c-1)\), where \(r\) is the number of rows and \(c\) is the number of columns of the contingency table. If the observed chi-square test statistic is greater than the critical value, the null hypothesis is rejected.
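The procedure described above can be sketched by hand in R. The counts in this gender-by-empathy table are invented purely for illustration. The statistic is computed without a continuity correction so that it can be compared with `chisq.test(..., correct = FALSE)`.

```r
# Hypothetical 2 x 2 contingency table (invented counts).
observed <- matrix(c(30, 20,
                     15, 35),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(gender  = c("female", "male"),
                                   empathy = c("high", "low")))

# Expected counts under independence: row total * column total / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson chi-square statistic and its degrees of freedom, df = (r - 1)(c - 1).
stat <- sum((observed - expected)^2 / expected)
df   <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Critical value at the 0.05 level; reject if the statistic exceeds it.
critical <- qchisq(0.95, df)
reject   <- stat > critical

# The built-in test without Yates' correction should give the same statistic.
builtin <- chisq.test(observed, correct = FALSE)
print(c(by_hand = stat, built_in = unname(builtin$statistic), critical = critical))
```

For a 2 x 2 table, `chisq.test()` applies Yates' continuity correction by default, so its default statistic is slightly smaller than the hand-computed one.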
We first read the data as follows:
data <- read.csv("C:/Users/Lenovo/OneDrive/Desktop/instadata/R Programming_Samraan_23rd Jan/nbadata.csv", header = TRUE)
head(data)
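The first column name in the summary output appears as `ï..GAME_ID`, which is typically what R produces when a CSV file starts with a UTF-8 byte-order mark (BOM). A possible way to avoid it, assuming that is the cause here, is to pass `fileEncoding = "UTF-8-BOM"` to `read.csv()`. The sketch below uses a small temporary file with hypothetical contents rather than the original `nbadata.csv`.

```r
# Write a tiny CSV that starts with a UTF-8 byte-order mark (EF BB BF).
# The file contents are hypothetical stand-ins for the real data.
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
writeBin(charToRaw("GAME_ID,W\n21400001,W\n21400002,L\n"), con)
close(con)

# Declaring the encoding lets R drop the BOM before parsing the header.
fixed <- read.csv(tmp, header = TRUE, fileEncoding = "UTF-8-BOM")
print(names(fixed))

unlink(tmp)
```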
We obtain descriptive statistics for the NBA data as follows:
summary(data)
##    ï..GAME_ID           DATE            HOME_TEAM          AWAY_TEAM
##  Min.   :21400001   Length:128069      Length:128069      Length:128069
##  1st Qu.:21400233   Class :character   Class :character   Class :character
##  Median :21400449   Mode  :character   Mode  :character   Mode  :character
##  Mean   :21400452
##  3rd Qu.:21400673
##  Max.   :21400908
##
##  PLAYER_NAME          PLAYER_ID        LOCATION              W
##  Length:128069      Min.   :   708   Length:128069      Length:128069
##  Class :character   1st Qu.:101162   Class :character   Class :character
##  Mode  :character   Median :201939   Mode  :character   Mode  :character
##                     Mean   :157238
##                     3rd Qu.:202704
##                     Max.   :204060
##
##   FINAL_MARGIN        SHOT_NUMBER         PERIOD        GAME_CLOCK
##  Min.   :-53.0000   Min.   : 1.000   Min.   :1.000   Length:128069
##  1st Qu.: -8.0000   1st Qu.: 3.000   1st Qu.:1.000   Class :character
##  Median :  1.0000   Median : 5.000   Median :2.000   Mode  :character
##  Mean   :  0.2087   Mean   : 6.507   Mean   :2.469
##  3rd Qu.:  9.0000   3rd Qu.: 9.000   3rd Qu.:3.000
##  Max.   : 53.0000   Max.   :38.000   Max.   :7.000
##
##    SHOT_CLOCK       DRIBBLES         TOUCH_TIME          SHOT_DIST
##  Min.   : 0.00   Min.   : 0.000   Min.   :-163.600   Min.   : 0.00
##  1st Qu.: 8.20   1st Qu.: 0.000   1st Qu.:   0.900   1st Qu.: 4.70
##  Median :12.30   Median : 1.000   Median :   1.600   Median :13.70
##  Mean   :12.45   Mean   : 2.023   Mean   :   2.766   Mean   :13.57
##  3rd Qu.:16.68   3rd Qu.: 2.000   3rd Qu.:   3.700   3rd Qu.:22.50
##  Max.   :24.00   Max.   :32.000   Max.   :  24.900   Max.   :47.20
##  NA's   :5567
##     PTS_TYPE     SHOT_RESULT        CLOSEST_DEFENDER   CLOSEST_DEFENDER_ID
##  Min.   :2.000   Length:128069      Length:128069      Min.   :   708
##  1st Qu.:2.000   Class :character   Class :character   1st Qu.:101249
##  Median :2.000   Mode  :character   Mode  :character   Median :201949
##  Mean   :2.265                                         Mean   :159039
##  3rd Qu.:3.000                                         3rd Qu.:203079
##  Max.   :3.000                                         Max.   :530027
##
##  CLOSE_DEF_DIST        FGM              PTS
##  Min.   : 0.000   Min.   :0.0000   Min.   :0.0000
##  1st Qu.: 2.300   1st Qu.:0.0000   1st Qu.:0.0000
##  Median : 3.700   Median :0.0000   Median :0.0000
##  Mean   : 4.123   Mean   :0.4521   Mean   :0.9973
##  3rd Qu.: 5.300   3rd Qu.:1.0000   3rd Qu.:2.0000
##  Max.   :53.200   Max.   :1.0000   Max.   :3.0000
##
Comparing the mean with the median gives a rough idea of the shape of each distribution. For FINAL_MARGIN the mean (0.2087) is less than the median (1), which suggests that its distribution is negatively skewed. For SHOT_NUMBER the mean (6.507) exceeds the median (5), which suggests positive skew, and the same holds for PERIOD (2.469 vs. 2) and SHOT_CLOCK (12.45 vs. 12.30). Likewise, for DRIBBLES, TOUCH_TIME, PTS_TYPE, CLOSE_DEF_DIST, FGM and PTS the mean exceeds the median, which suggests that their distributions are positively skewed. For SHOT_DIST the mean (13.57) is slightly less than the median (13.70), which suggests mild negative skew.
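The mean-versus-median comparison can be automated over every numeric column. The sketch below runs on a simulated data frame `toy` with hypothetical column names; with the real data, the same `sapply()` call could be applied to the data frame read earlier. The sign of the difference is only a heuristic indication of skewness, not a formal measure.

```r
# Simulated stand-in for the real data: one right-skewed numeric column,
# one left-skewed numeric column and one character column.
set.seed(1)
toy <- data.frame(
  right_skewed = rexp(1000),
  left_skewed  = -rexp(1000),
  label        = sample(c("a", "b"), 1000, replace = TRUE)
)

# Keep only the numeric columns.
num_cols <- vapply(toy, is.numeric, logical(1))

# Mean minus median for each numeric column (missing values ignored):
# positive values hint at positive skew, negative values at negative skew.
skew_hint <- sapply(toy[num_cols],
                    function(x) mean(x, na.rm = TRUE) - median(x, na.rm = TRUE))
print(skew_hint)
```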
table(data$W, data$LOCATION)
##
## A H
## L 35496 27978
## W 28639 35956
Among records from away games, losses (35,496) outnumber wins (28,639). Among records from home games, wins (35,956) outnumber losses (27,978). So winning could depend on the location. We perform a chi-square test of independence between location and winning as follows:
chisq.test(data$W,data$LOCATION)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: data$W and data$LOCATION
## X-squared = 1718.5, df = 1, p-value < 2.2e-16
Since the p-value for the test is less than \(0.05\), we reject the null hypothesis of independence at level \(0.05\). So winning is dependent on the location.
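The test can be reproduced from the contingency table alone, without the raw data. The sketch below enters the four counts reported in the table above as a matrix; the object names `counts`, `win_share` and `res` are illustrative choices.

```r
# Counts copied from the W x LOCATION table (rows: L/W, columns: A/H).
counts <- matrix(c(35496, 27978,
                   28639, 35956),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(W = c("L", "W"), LOCATION = c("A", "H")))

# Share of wins within each location (column proportions).
win_share <- prop.table(counts, margin = 2)["W", ]
print(round(win_share, 4))

# Chi-square test of independence; for a 2 x 2 table R applies
# Yates' continuity correction by default, as in the output above.
res <- chisq.test(counts)
print(res)
```

The column proportions put the counts on a common scale: the share of wins is higher among home records than among away records, which is the pattern the test detects.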
We fit a linear model with FINAL_MARGIN as the dependent variable and LOCATION, W, SHOT_CLOCK, DRIBBLES, SHOT_DIST and SHOT_RESULT as explanatory variables as follows:
model <- lm(FINAL_MARGIN~LOCATION+W+SHOT_CLOCK+DRIBBLES+SHOT_DIST+SHOT_RESULT,data=data)
summary(model)
##
## Call:
## lm(formula = FINAL_MARGIN ~ LOCATION + W + SHOT_CLOCK + DRIBBLES +
## SHOT_DIST + SHOT_RESULT, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.112 -5.145 -0.030 5.058 41.870
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11.540693 0.080624 -143.142 < 2e-16 ***
## LOCATIONH 1.652482 0.044695 36.973 < 2e-16 ***
## WW 21.334155 0.044769 476.543 < 2e-16 ***
## SHOT_CLOCK 0.017838 0.003935 4.533 5.81e-06 ***
## DRIBBLES 0.014508 0.006567 2.209 0.0272 *
## SHOT_DIST 0.013798 0.002620 5.267 1.39e-07 ***
## SHOT_RESULTmissed -0.537384 0.045501 -11.810 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.761 on 122495 degrees of freedom
## (5567 observations deleted due to missingness)
## Multiple R-squared: 0.6601, Adjusted R-squared: 0.6601
## F-statistic: 3.965e+04 on 6 and 122495 DF, p-value: < 2.2e-16
For the overall F-test of the model, the p-value (\(< 2.2 \times 10^{-16}\)) is less than \(0.05\), so we reject the null hypothesis at level \(0.05\) and conclude that the fitted model is significant. The adjusted R-squared is \(0.6601\), so the model explains about 66% of the variability in FINAL_MARGIN, which indicates a reasonably good fit. The residual standard error is \(7.761\), which is small relative to the observed range of FINAL_MARGIN (\(-53\) to \(53\)).
For the tests of significance of the individual coefficients, every p-value is less than \(0.05\) (the largest is \(0.0272\), for DRIBBLES), so we reject each null hypothesis at level \(0.05\). Hence the variables LOCATION, W, SHOT_CLOCK, DRIBBLES, SHOT_DIST and SHOT_RESULT are all significant for FINAL_MARGIN.
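The quantities used in this interpretation can be pulled out of a fitted model programmatically rather than read off the printed summary. The following sketch uses a simulated data frame `toy` with hypothetical columns `home`, `dist` and `margin`; it is not the model fitted above.

```r
# Simulated data loosely imitating a margin that depends on location and distance.
set.seed(7)
n   <- 300
toy <- data.frame(
  home = factor(sample(c("A", "H"), n, replace = TRUE)),
  dist = runif(n, 0, 30)
)
toy$margin <- -2 + 3 * (toy$home == "H") + 0.1 * toy$dist + rnorm(n, sd = 5)

fit <- lm(margin ~ home + dist, data = toy)
s   <- summary(fit)

# Coefficient table: estimate, standard error, t value and p-value per term.
coef_table <- s$coefficients
print(coef_table)

# Adjusted R-squared and residual standard error.
adj_r2 <- s$adj.r.squared
rse    <- s$sigma

# p-value of the overall F-test, computed from the stored F statistic.
f         <- s$fstatistic
overall_p <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)
print(c(adj_r2 = adj_r2, rse = rse, overall_p = overall_p))
```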
From the diagnostic plots in the appendix, the residuals are scattered randomly at each level of the fitted values, so the fit is adequate. In the normal Q-Q plot the points lie close to the theoretical line, so the errors appear to satisfy the normality assumption. The residuals vs. leverage plot flags \(3\) observations: \(17333, 238430, 16176\).
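Leverage can also be inspected numerically with `hatvalues()`. Note that `plot()` on an `lm` object labels the three most extreme points by default (`id.n = 3`), so three labelled observations appear regardless of how unusual they are. The sketch below uses simulated data and flags points whose leverage exceeds twice the average; that cutoff is a common rule of thumb and is an assumption here, not the criterion used by the plots.

```r
# Simulated data: 49 ordinary points plus one point far out in x.
set.seed(11)
x <- c(rnorm(49), 10)
y <- 2 + x + rnorm(50)

fit <- lm(y ~ x)

# Leverage (diagonal of the hat matrix) for each observation.
h <- hatvalues(fit)

# Rule-of-thumb cutoff: twice the mean leverage, i.e. 2 * p / n.
threshold     <- 2 * mean(h)
high_leverage <- which(h > threshold)
print(high_leverage)
```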
Here we want to know whether winning depends on whether the shot was “made” or “missed”. We therefore perform a t-test as follows:
win = 1*(data$W == "W")
pairwise.t.test(win,data$SHOT_RESULT)
##
## Pairwise comparisons using t tests with pooled SD
##
## data: win and data$SHOT_RESULT
##
## made
## missed <2e-16
##
## P value adjustment method: holm
The p-value for the test is less than \(0.05\), so we reject the null hypothesis at level \(0.05\): the proportion of wins differs between made and missed shots.
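With only two groups, `pairwise.t.test()` with a pooled standard deviation reduces to an ordinary two-sample t-test with equal variances, and the Holm adjustment leaves a single p-value unchanged. The sketch below illustrates this on simulated data; the vectors `shot` and `win` and the win probabilities are hypothetical.

```r
# Simulated shot outcomes and a 0/1 win indicator whose mean differs by group.
set.seed(42)
shot <- sample(c("made", "missed"), 500, replace = TRUE)
win  <- rbinom(500, 1, ifelse(shot == "made", 0.55, 0.45))

# Pairwise comparison with pooled SD (the default), Holm-adjusted.
pw <- pairwise.t.test(win, shot)

# Equivalent classical two-sample t-test assuming equal variances.
tt <- t.test(win ~ shot, var.equal = TRUE)

# Group means of a 0/1 variable are win proportions.
print(tapply(win, shot, mean))
print(c(pairwise = unname(pw$p.value[1, 1]), two_sample = tt$p.value))
```

Because `win` is coded 0/1, the group means being compared are simply the proportions of wins among made and missed shots.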
The conclusions of this project are as follows:
Winning a match depends on whether it is an away match or a home match.
The variables LOCATION, W, SHOT_CLOCK, DRIBBLES, SHOT_DIST and SHOT_RESULT affect FINAL_MARGIN.
The proportion of wins differs between made and missed shots.
plot(model)
Here we took the help of following websites –
The important citations related to this work are:
Bhandari, I., Colet, E., Parker, J. et al. (1997). Advanced Scout: Data Mining and Knowledge Discovery in NBA Data. Data Mining and Knowledge Discovery, 1: 121-125. https://doi.org/10.1023/A:1009782106822
Chatterjee, S., Campbell, M.R. and Wiseman, F. (1994). Take that jam! An analysis of winning percentage for NBA teams. Manage. Decis. Econ., 15: 521-535. https://doi.org/10.1002/mde.4090150514