NCAA Basketball Presentation

class: inverse, center, middle
background-image: url(https://cdn.theathletic.com/app/uploads/2020/08/11185316/GettyImages-1210436449-1024x683.jpeg)
background-size: cover

<h1 style="color:yellow;">NCAA Basketball Presentation</h1>

<font color="yellow">.large[Louis Tintner | MATH1324: Applied Analytics | 2020]</font>

---

# Research Interest

This analysis uses 2020 season data for NCAA Division 1 college basketball teams in the USA (Sundberg, 2020).

Linear regression will be used to answer the following questions of interest:

- Can knowledge about the strength of a basketball team’s offence or defence skills predict winning teams?

- What type of skill contributes most to the variability in the number of games won?

- My interest in this topic stems from my cousin playing for the University of Idaho. I am also interested in the following questions:

- How many games is my cousin’s team likely to win next year based on its current skill level?

- What should the coach focus on next season?

---
# Multiple Linear Regression

Multiple linear regression is a statistical technique that models the relationship between two or more predictor variables (independent variables) and a response or outcome variable (dependent variable).
Linear regression can be used to (Fiddel & Tabachnik, 2014):
 
- Measure the strength of the relationship and determine whether it is statistically significant.

- Identify variables which contribute to variation in the dependent variable.

- Forecast or predict future values.

- Identify which values are best optimised to improve outcomes.

---
# NCAA College Basketball Data Set

Source: https://www.kaggle.com/andrewsundberg/college-basketball-dataset. The data was scraped from https://www.barttorvik.com/#, which obtained it from third parties.

- Data sets for the 2015 - 2020 seasons were available under a Creative Commons licence; only the 2020 data set was used.

- The data set included 24 variables related to skills and identifiers. Only 5 variables were required and subset for the analysis.

- Data included: Team ID, athletic conference (competition group), number of games won (wins), adj_offence and adj_defence.

- The data was examined for missing values, NaN, special characters, structure and type.

```r
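library(tidyverse)  # provides read_csv(), rename() and %>%

# Load the 2020 season file and give the key columns readable names
basketball <- read_csv("cbb20.csv")
basketball <- basketball %>%
  rename(wins = "W", adj_offence = "ADJOE", adj_defence = "ADJDE")
head(basketball)                    # first rows
str(basketball)                     # structure and column types
sum(is.na(basketball))              # count of missing values
sum(is.nan(as.matrix(basketball)))  # count of NaN values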
```
---
# Data Summary and Description
- The data set contains 353 observations and 23 variables. Four variables are relevant to this analysis:

- Team (TEAM): Team name. Categorical variable.

- Wins (wins): Number of wins the team had in the season. Dependent variable.

- Adjusted offensive efficiency (adj_offence): Number of points scored per 100 possessions. Independent variable.

- Adjusted defensive efficiency (adj_defence): Number of points a team has given up per 100 possessions. Independent variable.

- Adjusted variables (i.e. adj_offence and adj_defence) take into account the expected number of possessions per game, considering the level of competition from the rival team (Pomeroy, 2012).
---
# Summary Statistics and Exploration of the Data

```r
summary(basketball[,c(2,3,5,6,7)]) %>% 
  knitr::kable(format = "html")
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:left;">     TEAM </th>
   <th style="text-align:left;">     CONF </th>
   <th style="text-align:left;">      wins </th>
   <th style="text-align:left;">  adj_offence </th>
   <th style="text-align:left;">  adj_defence </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;">  </td>
   <td style="text-align:left;"> Length:353 </td>
   <td style="text-align:left;"> Length:353 </td>
   <td style="text-align:left;"> Min.   : 1.00 </td>
   <td style="text-align:left;"> Min.   : 80.1 </td>
   <td style="text-align:left;"> Min.   : 85.6 </td>
  </tr>
  <tr>
   <td style="text-align:left;">  </td>
   <td style="text-align:left;"> Class :character </td>
   <td style="text-align:left;"> Class :character </td>
   <td style="text-align:left;"> 1st Qu.:13.00 </td>
   <td style="text-align:left;"> 1st Qu.: 97.3 </td>
   <td style="text-align:left;"> 1st Qu.: 98.0 </td>
  </tr>
  <tr>
   <td style="text-align:left;">  </td>
   <td style="text-align:left;"> Mode  :character </td>
   <td style="text-align:left;"> Mode  :character </td>
   <td style="text-align:left;"> Median :16.00 </td>
   <td style="text-align:left;"> Median :102.2 </td>
   <td style="text-align:left;"> Median :102.0 </td>
  </tr>
  <tr>
   <td style="text-align:left;">  </td>
   <td style="text-align:left;"> NA </td>
   <td style="text-align:left;"> NA </td>
   <td style="text-align:left;"> Mean   :16.31 </td>
   <td style="text-align:left;"> Mean   :102.2 </td>
   <td style="text-align:left;"> Mean   :102.2 </td>
  </tr>
  <tr>
   <td style="text-align:left;">  </td>
   <td style="text-align:left;"> NA </td>
   <td style="text-align:left;"> NA </td>
   <td style="text-align:left;"> 3rd Qu.:20.00 </td>
   <td style="text-align:left;"> 3rd Qu.:106.7 </td>
   <td style="text-align:left;"> 3rd Qu.:106.4 </td>
  </tr>
  <tr>
   <td style="text-align:left;">  </td>
   <td style="text-align:left;"> NA </td>
   <td style="text-align:left;"> NA </td>
   <td style="text-align:left;"> Max.   :31.00 </td>
   <td style="text-align:left;"> Max.   :121.3 </td>
   <td style="text-align:left;"> Max.   :122.7 </td>
  </tr>
</tbody>
</table>
---
# Histograms
- The histograms show approximately normal distributions for the numeric variables.

.pull-left[

```r
basketball[,c(2,3,5,6,7)] %>%  keep(is.numeric) %>% 
  gather() %>% 
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram()
```
]
.pull-right[
![](NCAA-Presentation_files/figure-html/unnamed-chunk-4-1.png)
]
---
# Box Plots

- A small number of univariate outliers were identified using box plots.

- The outliers were checked and were not the result of data entry errors. Multivariate outliers are tested later.

.pull-left[

```r
basketball[,c(2,3,5,6,7)] %>%  keep(is.numeric) %>% 
  gather() %>% 
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_boxplot()+coord_flip()
```
]
.pull-right[
![](NCAA-Presentation_files/figure-html/unnamed-chunk-5-1.png)
]
---
# Scatter Plots and Correlation
Scatter plots show the correlation between each predictor and the dependent variable (wins): r(adj_offence) = 0.698 and r(adj_defence) = -0.642.
                       
.pull-left[

```r
ggplot(basketball, aes(x=adj_offence, y=wins))+ 
  geom_point()
```

<img src="NCAA-Presentation_files/figure-html/unnamed-chunk-6-1.png" width="85%" />
]
.pull-right[

```r
ggplot(basketball, aes(x=adj_defence, y=wins))+ 
  geom_point() 
```

<img src="NCAA-Presentation_files/figure-html/unnamed-chunk-7-1.png" width="85%" />
]
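
The correlation coefficients quoted on this slide can be obtained with base R's `cor()`. The sketch below is only illustrative: `toy` is a small made-up data frame (its values are invented, not taken from `cbb20.csv`), so its output will not match the figures above. With the real data, `basketball` would take the place of `toy`.

```r
# Toy data: invented values for illustration only, NOT from cbb20.csv
toy <- data.frame(
  wins        = c(8, 12, 15, 16, 20, 24, 28),
  adj_offence = c(92.1, 97.5, 99.8, 103.4, 104.9, 110.2, 114.6),
  adj_defence = c(108.3, 104.1, 105.6, 101.2, 99.7, 96.4, 92.8)
)

# Pearson correlation of each predictor with the response
r_offence <- cor(toy$adj_offence, toy$wins)
r_defence <- cor(toy$adj_defence, toy$wins)

# Positive for offence (more points scored), negative for defence (more points conceded)
round(c(offence = r_offence, defence = r_defence), 3)
```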
---
# Analysis and Hypothesis

A multiple linear regression will be used to test the following model:

- Model: Wins = α + β<sub>1</sub> adj_defence + β<sub>2</sub> adj_offence  +  ɛ

- Null hypothesis: There is no relationship between the number of wins for a team in the NCAA basketball season and adjusted defensive and offensive efficiency.

- Alternative hypothesis: There is a statistically significant relationship between the number of wins for a team in the NCAA basketball season and adjusted defensive and offensive efficiency.

---
# Assumptions: Multiple Linear Regression

- There are two or more independent variables, each continuous or a factor, and the dependent variable is at least interval level.

- Linear relationship between the independent variables and the mean of the dependent variable.

- No influential outliers.

- Independence of errors.

- Homoscedasticity: the variance of the residuals is constant across fitted values.

- Residuals are normally distributed.

- Additivity (little multicollinearity).

- The residuals of the fitted model are used to check whether the assumptions of multiple linear regression are met.

```r
lmwins <- lm(formula = wins ~ adj_defence + adj_offence, data = basketball)
```
---
# Linearity

<p style="font-size:20px">Most parametric tests assume a linear relationship between the predictor and outcome variables. Linearity was checked using a scatter plot of the residuals against the fitted values. The linearity assumption appears to be met because the plot does not display a distinct pattern, such as a curvilinear one.</p>

.pull-left[

```r
lmwins %>% plot (which = 1)
```
]
.pull-right[
![](NCAA-Presentation_files/figure-html/unnamed-chunk-9-1.png)
]
---
# Outliers

- Outliers can reduce the effectiveness of a linear regression model by distorting means, standard errors and standard deviations. It is important to scan for outliers and influential values (Armstrong, 2016).

- Cases with leverage sit far from the mean of the predictors.

- Influential values are outliers with leverage.

- The Bonferroni outlier test shows no outliers.

- Possible outliers were not influential: Cook's distance showed little influence.

```r
outlierTest(lmwins)
```

```
## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
##      rstudent unadjusted p-value Bonferroni p
## 116 -3.042895          0.0025208      0.88983
```
---
# Leverage Plot with Cook's Distance

.center[

```r
plot(lmwins, which = 5)
```

<img src="NCAA-Presentation_files/figure-html/unnamed-chunk-11-1.png" width="50%" />
]
---
# Homoscedasticity

.pull-left[ 
- It is assumed that the residuals have a mean of zero and roughly equal variance across the fitted values, so that they are spread relatively evenly along the Y axis.

- Where the residuals show a pattern and their spread is not consistent across the predicted values, the data is considered heteroscedastic.]

.pull-left[

```r
lmwins %>% plot (which = 3)
```
]
.pull-right[
![](NCAA-Presentation_files/figure-html/unnamed-chunk-12-1.png)
]

---
# Additivity

<p style="font-size:16px">- There should be little or no correlation between the predictor variables of a linear model.

- The Variance Inflation Factor (VIF) was checked to assess correlation. VIF values below 5 indicate at most moderate correlation and are acceptable.</p>

```r
ols_vif_tol(lmwins)
```

```
##     Variables Tolerance      VIF
## 1 adj_defence 0.7631261 1.310399
## 2 adj_offence 0.7631261 1.310399
```
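
Tolerance and VIF are two views of the same quantity, and with only two predictors both follow directly from the correlation between them. A minimal sketch using the tolerance printed above (the variable names are illustrative):

```r
tolerance <- 0.7631261   # reported above for both predictors

# VIF is the reciprocal of tolerance
vif <- 1 / tolerance

# Tolerance = 1 - R^2 from regressing one predictor on the other.
# With exactly two predictors, that R^2 is their squared correlation,
# so the implied absolute correlation between them is:
r_between <- sqrt(1 - tolerance)

round(c(VIF = vif, abs_correlation = r_between), 4)
```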
# Independence of Errors

<p style="font-size:16px">- The Durbin-Watson statistic is 1.97. The statistic ranges between 0 and 4. A value of approximately 2 indicates no autocorrelation of errors.</p>

```r
durbinWatsonTest(lmwins) 
```

```
##  lag Autocorrelation D-W Statistic p-value
##    1      0.01258962      1.974426   0.752
##  Alternative hypothesis: rho != 0
```

---
# Distribution of Standardised Residuals
- The standardised residuals (computed here as studentized residuals with `rstudent()`) appear to be approximately normally distributed in the histogram below. Normality was further supported by a Q-Q plot of the residuals (not included due to space limitations).

.pull-left[

```r
standardised <- rstudent(lmwins) 
hist(standardised, freq=FALSE, 
     main="Distribution of Studentized Residuals")
xfit<-seq(min(standardised),max(standardised),length=40) 
yfit<-dnorm(xfit) 
lines(xfit, yfit, col="darkblue")
```
]
.pull-right[
![](NCAA-Presentation_files/figure-html/unnamed-chunk-15-1.png)
]
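
A Q-Q plot of this kind can be drawn with base R. The sketch below is not the omitted plot: it fits a model to simulated data (the coefficients and sample size are invented for illustration) so that it runs without `cbb20.csv`. With the real model, `rstudent(lmwins)` would replace `res`.

```r
# Simulated stand-in data; values are illustrative only
set.seed(42)
n   <- 200
off <- rnorm(n, mean = 100, sd = 7)
def <- rnorm(n, mean = 100, sd = 7)
y   <- 9 + 0.4 * off - 0.34 * def + rnorm(n, sd = 4)

fit <- lm(y ~ def + off)
res <- rstudent(fit)

# Points close to the reference line suggest approximately normal residuals
qqnorm(res, main = "Normal Q-Q Plot of Studentized Residuals")
qqline(res, col = "darkblue")

# Numeric summary: correlation between theoretical and sample quantiles
qq <- qqnorm(res, plot.it = FALSE)
qq_cor <- cor(qq$x, qq$y)
```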

---
# Results
Assumptions for the linear regression model were met and the following regression model was fitted using R:

- Wins = α + β<sub>1</sub> adj_defence + β<sub>2</sub> adj_offence  +  ɛ

```r
tab_model(lmwins)
```

<table style="border-collapse:collapse; border:none;">
<tr>
<th style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm;  text-align:left; ">&nbsp;</th>
<th colspan="3" style="border-top: double; text-align:center; font-style:normal; font-weight:bold; padding:0.2cm; ">wins</th>
</tr>
<tr>
<td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal;  text-align:left; ">Predictors</td>
<td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal;  ">Estimates</td>
<td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal;  ">CI</td>
<td style=" text-align:center; border-bottom:1px solid; font-style:italic; font-weight:normal;  ">p</td>
</tr>
<tr>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">(Intercept)</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center;  ">9.16</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center;  ">-1.91&nbsp;&ndash;&nbsp;20.22</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center;  ">0.104</td>
</tr>
<tr>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">adj_defence</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center;  ">-0.34</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center;  ">-0.40&nbsp;&ndash;&nbsp;-0.27</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center;  "><strong>&lt;0.001</strong></td>
</tr>
<tr>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; ">adj_offence</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center;  ">0.41</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center;  ">0.35&nbsp;&ndash;&nbsp;0.47</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:center;  "><strong>&lt;0.001</strong></td>
</tr>
<tr>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; padding-top:0.1cm; padding-bottom:0.1cm; border-top:1px solid;">Observations</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left; border-top:1px solid;" colspan="3">353</td>
</tr>
<tr>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; text-align:left; padding-top:0.1cm; padding-bottom:0.1cm;">R<sup>2</sup> / R<sup>2</sup> adjusted</td>
<td style=" padding:0.2cm; text-align:left; vertical-align:top; padding-top:0.1cm; padding-bottom:0.1cm; text-align:left;" colspan="3">0.607 / 0.605</td>
</tr>

</table>
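
The overall F-test of the model can be recovered from the R<sup>2</sup> and the number of observations in the table. This is a rough check using the rounded R<sup>2</sup>, so the result will differ slightly from the value R computes from unrounded figures.

```r
r2 <- 0.607   # R-squared from the table
n  <- 353     # observations
k  <- 2       # predictors: adj_defence and adj_offence

# F = (explained variance per predictor) / (unexplained variance per residual df)
f_stat <- (r2 / k) / ((1 - r2) / (n - k - 1))

# Upper-tail p-value on (k, n - k - 1) degrees of freedom
p_val <- pf(f_stat, df1 = k, df2 = n - k - 1, lower.tail = FALSE)

c(F = f_stat, p = p_val)   # F is roughly 270
```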

---
# Conclusion

- The utility of the linear regression model must be assessed. The R-squared value is a good indicator of the model's predictive ability based on the independent variables.

- The model was statistically significant, F(2, 350) = 270.1, p < .001. As such, the null hypothesis is rejected.

- The model, which included adjusted defensive and offensive efficiency, explained 60.7% of the variance (R<sup>2</sup> = 0.607) in a team’s number of wins during the NCAA 2020 basketball season. The model was:
Wins = 9.158 - 0.338 adj_defence + 0.408 adj_offence

- The coefficient estimates for both adj_defence (estimate = -0.338, t = -10.342, p < .001) and adj_offence (estimate = 0.408, p < .001) were statistically significant.

- Based on current efficiency (adj_offence = 93.7 and adj_defence = 107.2), Idaho is predicted to win 11 games, with a CI of 10.55 - 11.73.

- As the predictors are on the same scale, the coefficient estimates can be compared. Adjusted offensive efficiency is a stronger predictor of the number of wins in an NCAA season than adjusted defensive efficiency. Idaho's coach would benefit from focusing on offensive efficiency.
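
The Idaho prediction can be checked by plugging the efficiency values into the fitted equation. The sketch uses the rounded coefficients quoted above, so it gives only the point estimate; the interval would need the fitted model object, for example via `predict()` with `interval = "confidence"`.

```r
# Rounded coefficients from the fitted model
coefs <- c(intercept = 9.158, adj_defence = -0.338, adj_offence = 0.408)

# Idaho's current adjusted efficiencies
idaho <- c(adj_defence = 107.2, adj_offence = 93.7)

# Wins = intercept + b1 * adj_defence + b2 * adj_offence
pred <- unname(coefs["intercept"] + sum(coefs[names(idaho)] * idaho))

round(pred, 2)   # about 11.15, i.e. roughly 11 wins
```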

---
## Limitations
- The calculation of adjusted offensive and defensive efficiency varies. The Ken Pomeroy model is widely used but is a proprietary algorithm that lacks transparency as to how it is calculated. Other measures may result in different outcomes for the model.
- There are a number of skill variables that were not included in this model but may improve its predictive power.

## Reference List
- Armstrong, D 2016, _Outliers and Influential Data_, University of Wisconsin – Milwaukee, viewed October 16, <https://quantoid.net/files/reg3/lecture9_2016_4.pdf>

- Fiddel, L and Tabachnik, B 2014, _Understanding Multivariate Statistics_, Pearson, CA

- Pomeroy, K 2012, _Ratings Glossary_, viewed October 12, <https://kenpom.com/blog/ratings-glossary/>

- Sundberg, A 2020, _College Basketball Data Set_, electronic data set, Kaggle, viewed October 10, <https://www.kaggle.com/andrewsundberg/college-basketball-dataset>