class: inverse, center, middle
background-image: url(https://cdn.theathletic.com/app/uploads/2020/08/11185316/GettyImages-1210436449-1024x683.jpeg)
background-size: cover

<h1 style="color:yellow;">NCAA Basketball Presentation</h1>

<font color="yellow">.large[Louis Tintner | MATH1324: Applied Analytics | 2020]

---

# Research Interest
This analysis uses 2020 data on NCAA Division I college basketball teams in the USA (Sundberg, 2020).

Linear regression will be used to answer the following questions of interest:
- Can knowledge of the strength of a basketball team’s offensive or defensive skills predict which teams win?
- Which type of skill contributes most to the variability in the number of games won?
- My interest in this topic stems from my cousin playing for the University of Idaho. I am also interested in the following questions:
- How many games is my cousin’s team likely to win next year based on its current skill level?
- What should the coach focus on next season?

---

# Multiple Linear Regression
Multiple linear regression is a statistical technique that models the relationship between two or more predictor (independent) variables and a response or outcome (dependent) variable. Linear regression can be used to (Fiddel & Tabachnik, 2014):
- Measure the strength of the relationship and determine whether it is statistically significant.
- Identify variables that contribute to variation in the dependent variable.
- Forecast or predict future values.
- Identify which values are best optimised to improve outcomes.

---

# NCAA College Basketball Data Set
https://www.kaggle.com/andrewsundberg/college-basketball-dataset

Data was scraped from https://www.barttorvik.com/#, which obtained its data from third parties.
- Data sets for the 2015 - 2020 seasons were available under a Creative Commons licence; only the 2020 data set was used.
- The data set included 24 variables related to skills and identifiers.
Only 5 variables were required and subset for the analysis.
- Data included: team ID, athletic conference (competition group), number of games won (wins), adj_offence and adj_defence.
- The data was examined for missing values, NaN values, special characters, structure and type.

```r
library(tidyverse)  # provides read_csv(), %>% and rename()

basketball <- read_csv("cbb20.csv")
# Give the columns used in the analysis descriptive names
basketball <- basketball %>%
  rename(wins = "W", adj_offence = "ADJOE", adj_defence = "ADJDE")
head(basketball)
str(basketball)
sum(is.na(basketball))               # count of missing values
sum(is.nan(as.matrix(basketball)))   # count of NaN values
```

---

# Data Summary and Description
- The data set contains 353 observations and 23 variables. Four variables are relevant to this analysis:
- Team (TEAM): team name; a categorical variable.
- Wins (wins): dependent variable; the number of wins the team had in the season.
- Adjusted offensive efficiency (adj_offence): independent variable; the number of points scored per 100 possessions.
- Adjusted defensive efficiency (adj_defence): independent variable; the number of points a team has given up per 100 possessions.
- Adjusted variables (i.e. adj_offence and adj_defence) take into account the expected number of possessions per game considering the level of competition from the rival team (Pomeroy, 2012).

---

# Summary Statistics and Exploration of the Data

```r
summary(basketball[,c(2,3,5,6,7)]) %>% knitr::kable(format = "html")
```
<table>
<thead>
<tr> <th style="text-align:left;"> </th> <th style="text-align:left;"> TEAM </th> <th style="text-align:left;"> CONF </th> <th style="text-align:left;"> wins </th> <th style="text-align:left;"> adj_offence </th> <th style="text-align:left;"> adj_defence </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> Length:353 </td> <td style="text-align:left;"> Length:353 </td> <td style="text-align:left;"> Min. : 1.00 </td> <td style="text-align:left;"> Min. : 80.1 </td> <td style="text-align:left;"> Min. 
: 85.6 </td> </tr>
<tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> Class :character </td> <td style="text-align:left;"> Class :character </td> <td style="text-align:left;"> 1st Qu.:13.00 </td> <td style="text-align:left;"> 1st Qu.: 97.3 </td> <td style="text-align:left;"> 1st Qu.: 98.0 </td> </tr>
<tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> Mode :character </td> <td style="text-align:left;"> Mode :character </td> <td style="text-align:left;"> Median :16.00 </td> <td style="text-align:left;"> Median :102.2 </td> <td style="text-align:left;"> Median :102.0 </td> </tr>
<tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> NA </td> <td style="text-align:left;"> NA </td> <td style="text-align:left;"> Mean :16.31 </td> <td style="text-align:left;"> Mean :102.2 </td> <td style="text-align:left;"> Mean :102.2 </td> </tr>
<tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> NA </td> <td style="text-align:left;"> NA </td> <td style="text-align:left;"> 3rd Qu.:20.00 </td> <td style="text-align:left;"> 3rd Qu.:106.7 </td> <td style="text-align:left;"> 3rd Qu.:106.4 </td> </tr>
<tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> NA </td> <td style="text-align:left;"> NA </td> <td style="text-align:left;"> Max. :31.00 </td> <td style="text-align:left;"> Max. :121.3 </td> <td style="text-align:left;"> Max. :122.7 </td> </tr>
</tbody>
</table>

---

# Histograms
- Histograms show that the numeric variables are approximately normally distributed.

.pull-left[
```r
basketball[,c(2,3,5,6,7)] %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram()
```
]
.pull-right[
<!-- -->
]

---

# Box Plots
- A small number of univariate outliers were identified using box plots.
- Outliers were checked and were not the result of data entry error. Multivariate outliers will be tested later.
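
The box-plot check described above can also be reproduced numerically. The following is a minimal sketch, assuming the common 1.5 × IQR rule that box-plot whiskers follow; `flag_outliers()` and the `wins_example` vector are hypothetical and are not part of the original analysis.

```r
# Hypothetical helper: TRUE for values outside the 1.5 * IQR fences
flag_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[[2]] - q[[1]])
  x < q[[1]] - fence | x > q[[2]] + fence
}

# Made-up win counts with one obviously extreme value
wins_example <- c(13, 14, 15, 16, 16, 17, 18, 20, 45)
wins_example[flag_outliers(wins_example)]  # returns 45
```

With the real data, the same helper could be applied to `basketball$wins`, `basketball$adj_offence` and `basketball$adj_defence`.
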
.pull-left[
```r
basketball[,c(2,3,5,6,7)] %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_boxplot() +
  coord_flip()
```
]
.pull-right[
<!-- -->
]

---

# Scatter Plots and Correlation
Scatter plots show the correlation between each predictor and the dependent variable (wins): r = 0.698 for adj_offence and r = -0.642 for adj_defence.

.pull-left[
```r
ggplot(basketball, aes(x=adj_offence, y=wins)) +
  geom_point()
```
<img src="NCAA-Presentation_files/figure-html/unnamed-chunk-6-1.png" width="85%" />
]
.pull-right[
```r
ggplot(basketball, aes(x=adj_defence, y=wins)) +
  geom_point()
```
<img src="NCAA-Presentation_files/figure-html/unnamed-chunk-7-1.png" width="85%" />
]

---

# Analysis and Hypothesis
A multiple linear regression will be used to test the following model:
- Model: Wins = α + β<sub>1</sub> adj_defence + β<sub>2</sub> adj_offence + ɛ
- Null hypothesis: There is no relationship between the number of wins for a team in the NCAA basketball season and adjusted defensive and offensive efficiency.
- Alternative hypothesis: There is a statistically significant relationship between the number of wins for a team in the NCAA basketball season and adjusted defensive and offensive efficiency.

---

# Assumptions: Multiple Linear Regression
- There is more than one independent variable, each continuous or a factor, and the dependent variable is measured on at least an interval scale.
- Linear relationship between the independent variables and the mean of the dependent variable.
- No influential outliers.
- Independence of errors.
- Homoscedasticity: the variance of the residuals is equal across fitted values.
- Residuals are normally distributed.
- Additivity (little multicollinearity).
- Residuals of the model will be used to determine whether the assumptions of the multiple linear regression model are met.
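
As a rough illustration of the final point, the sketch below fits a model of the same form to simulated data and collects the quantities that residual-based checks rely on. The data frame `sim`, its generating coefficients and the object names are assumptions made for illustration only; they are not the NCAA data or the fitted results.

```r
# Simulated stand-in for the basketball data (illustrative values only)
set.seed(1)
n <- 100
sim <- data.frame(
  adj_offence = rnorm(n, mean = 102, sd = 7),
  adj_defence = rnorm(n, mean = 102, sd = 6)
)
sim$wins <- 9 + 0.4 * sim$adj_offence - 0.34 * sim$adj_defence + rnorm(n, sd = 4)

# Same model form as the analysis: wins ~ adj_defence + adj_offence
fit <- lm(wins ~ adj_defence + adj_offence, data = sim)

# Quantities used by residual-based assumption checks
diagnostics <- data.frame(
  fitted      = fitted(fit),    # predicted wins
  residual    = resid(fit),     # observed minus predicted wins
  studentized = rstudent(fit)   # residuals scaled for outlier checks
)
head(diagnostics)
```

Plotting `residual` against `fitted` is the basis of the linearity and homoscedasticity checks, while the studentized residuals feed the outlier and normality checks.
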
```r
lmwins <- lm(formula = wins ~ adj_defence + adj_offence, data = basketball)
```

---

# Linearity
<p style="font-size:20px">Most parametric tests assume a linear relationship between the predictor and outcome variables. Linearity was checked using a scatter plot of the residuals against the fitted values. The relationship is considered linear because the plot does not display a distinct pattern, such as a curvilinear one.</p>

.pull-left[
```r
lmwins %>% plot (which = 1)
```
]
.pull-right[
<!-- -->
]

---

# Outliers
- Outliers can reduce the effectiveness of a linear regression model by distorting means, standard errors and standard deviations. It is important to scan for outliers and influential values (Armstrong, 2016).
- Cases with leverage sit far from the mean.
- Influential values are outliers with leverage.
- The Bonferroni outlier test shows no outliers.
- Possible outliers were not influential, with Cook's distance showing little influence.

```r
outlierTest(lmwins)
```

```
## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
##      rstudent unadjusted p-value Bonferroni p
## 116 -3.042895          0.0025208      0.88983
```

---

# Leverage Plot with Cook's Distance
.center[
```r
plot(lmwins, which = 5)
```
<img src="NCAA-Presentation_files/figure-html/unnamed-chunk-11-1.png" width="50%" />
]

---

# Homoscedasticity
.pull-left[
- The residuals are assumed to have roughly equal variance across the fitted values, with a mean of zero, and to be spread relatively uniformly along the Y axis.
- Where residuals show a pattern and are not consistent across the predicted values, the data is considered to be heteroscedastic.
]
.pull-left[
```r
lmwins %>% plot (which = 3)
```
]
.pull-right[
<!-- -->
]

---

# Additivity
<p style="font-size:16px">- There should be little or no correlation between the predictor variables of a linear model. - The Variance Inflation Factor (VIF) was checked to assess correlation.
VIF values below 5 indicate at most moderate correlation and are acceptable.</p>

```r
ols_vif_tol(lmwins)
```

```
##     Variables Tolerance      VIF
## 1 adj_defence 0.7631261 1.310399
## 2 adj_offence 0.7631261 1.310399
```

# Independence of Errors
<p style="font-size:16px">- The Durbin-Watson statistic is 1.97. The statistic ranges between 0 and 4. A value of approximately 2 indicates no autocorrelation of errors.</p>

```r
durbinWatsonTest(lmwins)
```

```
##  lag Autocorrelation D-W Statistic p-value
##    1      0.01258962      1.974426   0.752
##  Alternative hypothesis: rho != 0
```

---

# Distribution of Standardised Residuals
- The standardised residuals appear to be approximately normally distributed in the histogram below. Normality was further supported by a Q-Q plot of the residuals (not included due to space limitations).

.pull-left[
```r
standardised <- rstudent(lmwins)
hist(standardised, freq=FALSE, main="Distribution of Studentized Residuals")
# Overlay a standard normal density for comparison
xfit<-seq(min(standardised),max(standardised),length=40)
yfit<-dnorm(xfit)
lines(xfit, yfit, col="darkblue")
```
]
.pull-right[
<!-- -->
]

---

# Results
The assumptions for the linear regression model were met and the following regression model was fitted using R:
- Wins = α + β<sub>1</sub> adj_defence + β<sub>2</sub> adj_offence + ɛ

```r
tab_model(lmwins)
```
<table style="border-collapse:collapse; border:none;">
<tr> <th style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; text-align:left; "> </th> <th colspan="3" style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; ">wins</th> </tr>
<tr> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; text-align:left; ">Predictors</td> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">Estimates</td> <td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal; ">CI</td> <td style=" text-align:center; 
border-bottom:1px solid; font-style:italic; font-weight:normal; ">p</td> </tr>
<tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">(Intercept)</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">9.16</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">-1.91 – 20.22</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.104</td> </tr>
<tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">adj_defence</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">-0.34</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">-0.40 – -0.27</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; "><strong>&lt;0.001</strong></td> </tr>
<tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">adj_offence</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.41</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; ">0.35 – 0.47</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center; "><strong>&lt;0.001</strong></td> </tr>
<tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; padding-top:0.1cm; padding-bottom:0.1cm; border-top:1px solid;">Observations</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left; border-top:1px solid;" colspan="3">353</td> </tr>
<tr> <td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; padding-top:0.1cm; padding-bottom:0.1cm;">R<sup>2</sup> / R<sup>2</sup> adjusted</td> <td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left;" colspan="3">0.607 / 0.605</td> </tr>
</table>

---
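
# Worked Prediction

The fitted equation can be applied by hand to a single team. The following is a minimal sketch using the coefficient estimates and Idaho's efficiency values quoted in the Conclusion (adj_offence = 93.7, adj_defence = 107.2); the object names `coefs`, `idaho` and `predicted_wins` are illustrative assumptions rather than part of the original analysis.

```r
# Coefficient estimates as quoted in the Conclusion
coefs <- c(intercept = 9.158, adj_defence = -0.338, adj_offence = 0.408)

# Idaho's efficiency values as quoted in the Conclusion
idaho <- c(adj_defence = 107.2, adj_offence = 93.7)

# Wins = intercept + b1 * adj_defence + b2 * adj_offence
predicted_wins <- coefs[["intercept"]] + sum(coefs[names(idaho)] * idaho)
round(predicted_wins, 1)  # roughly 11 wins
```

With the fitted model object available, base R's `predict()` with a `newdata` data frame and `interval = "confidence"` is the usual route to the same point estimate together with an interval.

---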
# Conclusion
- The utility of the linear regression model must be assessed. The R-squared value indicates how much of the variation in the dependent variable is explained by the independent variables.
- The model was statistically significant, F(2, 350) = 270.1, p < 0.001. As such, the null hypothesis is rejected.
- The model, which included adjusted defensive and offensive efficiency, explained 60.7% (R<sup>2</sup> = 0.607) of the variation in a team’s number of wins during the NCAA 2020 basketball season. The model was: Wins = 9.158 - 0.338 adj_defence + 0.408 adj_offence
- The coefficient estimates for both adj_defence (estimate = -0.338, t value = -10.342, p < .001) and adj_offence (estimate = 0.408, p < .001) were statistically significant.
- Based on current efficiency (adj_offence = 93.7 and adj_defence = 107.2), Idaho is predicted to win 11 games with a CI of 10.55 - 11.73.
- As the scales are the same for the predictors, the coefficient estimates can be compared. Adjusted offensive efficiency is a stronger predictor of the number of wins in an NCAA season than adjusted defensive efficiency. Idaho's coach would benefit from focusing on offensive efficiency.

---

## Limitations
- The calculation of adjusted offensive and defensive efficiency varies. The Ken Pomeroy model is widely used but is a proprietary algorithm that lacks transparency as to how it is calculated. Other measures may result in different outcomes for the model.
- There are a number of skill variables that were not included in this model but that may improve its predictive power.

## Reference List
- Armstrong, D. 2020, _Outliers and Influential Data_, University of Wisconsin – Milwaukee.
Viewed October 16, <https://quantoid.net/files/reg3/lecture9_2016_4.pdf>
- Fiddel, L. and Tabachnik, B. 2014, _Understanding Multivariate Statistics_, Pearson, CA.
- Pomeroy, K. 2012, _Ratings Glossary_, viewed October 12, <https://kenpom.com/blog/ratings-glossary/>
- Sundberg, A. 2020, _College Basketball Data Set_, electronic data set, Kaggle, viewed October 10, <https://www.kaggle.com/andrewsundberg/college-basketball-dataset>