Our first data set contains the in-game statistics for each men’s college basketball game going back to 2003. It has over 92,000 observations, and each observation represents an individual game. Our second data set contains the in-game statistics for each women’s college basketball game going back to 2010. Both data sets were compiled by Kenneth Massey, and we obtained them from Kaggle.
These data sets are interesting for several reasons. First and foremost, they have many variables, each of which is an in-game statistic, such as the number of shots taken, the number of personal fouls, and the number of free throws allowed. For this project it was not necessary to use all fifty-eight of them, but we did use some of them to create new columns. For example, we added a percentage of three pointers taken column and a point differential column.
Our group wanted to see whether the share of a team’s shots taken from three point range had a significant effect on the outcome of a college basketball game. We understand that the men’s and women’s games are different, so we were also interested in seeing whether one depends more on three pointers than the other. For the men’s data set in particular, the NCAA has moved the three point line back more than once since 2003, so we also wanted to see the overall effect of these rule changes and how the game adapted to a deeper three point line.
load("C:/Users/mayor/Downloads/results_samp_in_game-stats (1).rda")
CBBdata <- results_samp_in_game_stats
load("C:/Users/mayor/Downloads/w_results_samp_in_game-stats.rda")
CBBdata_w <- w_results_samp_in_game_stats
First, we want to create columns for the percentage of 3 point shots and the score differential.
CBBdata$Percent_of_3pt_Shots <- CBBdata$FGA3 / CBBdata$FGA
CBBdata$ScoreDifferential <- CBBdata$Score - CBBdata$Opp_Score
CBBdata_w$Percent_of_3pt_Shots <- CBBdata_w$FGA3 / CBBdata_w$FGA
CBBdata_w$ScoreDifferential <- CBBdata_w$Score - CBBdata_w$Opp_Score
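The same two derived columns can be checked by hand on a tiny made-up data frame. This is only a sketch: the helper name `add_shot_features` and the toy values are assumptions for illustration, and only the column names `FGA3`, `FGA`, `Score`, and `Opp_Score` come from the code above.

```r
# Hypothetical helper: adds the two derived columns used in this report.
add_shot_features <- function(games) {
  # Share of field goal attempts that were three point attempts (0 to 1)
  games$Percent_of_3pt_Shots <- games$FGA3 / games$FGA
  # Positive when the team won, negative when it lost
  games$ScoreDifferential <- games$Score - games$Opp_Score
  games
}

# Made-up games, not rows from the real data
toy <- data.frame(
  FGA3 = c(20, 15, 30),
  FGA = c(60, 50, 60),
  Score = c(75, 62, 80),
  Opp_Score = c(70, 68, 80)
)
toy <- add_shot_features(toy)
print(toy)
```

Note that, despite its name, `Percent_of_3pt_Shots` is stored as a proportion between 0 and 1, which matters when reading the regression coefficients later on.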
(We fit multiple models. Below we answer questions 2, 3, and 4 in order for each model.)
Men’s College Basketball Hypothesis:
H0: There is no relationship between the percentage of three point shots taken and score differential in a men’s college basketball game.
H1: There is a relationship between the percentage of three point shots taken and score differential in a men’s college basketball game.
lmTest <- lm(ScoreDifferential ~ Percent_of_3pt_Shots, data = CBBdata)
summary(lmTest)
##
## Call:
## lm(formula = ScoreDifferential ~ Percent_of_3pt_Shots, data = CBBdata)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -69.255  -9.924   0.588   9.950  94.093
##
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)            0.7428     0.1823   4.075 4.62e-05 ***
## Percent_of_3pt_Shots  -2.2120     0.5100  -4.337 1.44e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.11 on 92830 degrees of freedom
## Multiple R-squared: 0.0002026, Adjusted R-squared: 0.0001918
## F-statistic: 18.81 on 1 and 92830 DF, p-value: 1.445e-05
Since our p-value (1.445e-05) is below 0.05, we reject our null hypothesis that there is no relationship between the percent of three point shots taken and the outcome of a men’s college basketball game. The coefficient tells us that a one-unit increase in the percent of three point shots taken is associated with a 2.2120 point decrease in score differential. Because the variable is stored as a proportion, one unit means going from 0% to 100% of shots, so in more practical terms a 10 percentage point increase is associated with a decrease of about 0.22 points. In other words, teams that take a larger share of their shots from three tend to have slightly worse score differentials. Finally, we have a very small adjusted R-squared (0.0001918): the percent of three point shots taken accounts for only 0.01918% of the variation in score differential. This makes sense since so many factors go into winning a basketball game; the percent of three point shots taken alone cannot explain much of the result.
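The arithmetic behind this interpretation can be written out directly. The numbers below are copied from the `summary(lmTest)` output above; the variable names and the choice of a 10 percentage point step are assumptions made for illustration.

```r
# Values copied from the summary(lmTest) output above
slope <- -2.2120
adj_r2 <- 0.0001918

# A one-unit change in Percent_of_3pt_Shots means going from 0% to 100% threes,
# so a 10 percentage point change (0.10 on the proportion scale) is a more
# realistic step to interpret.
effect_per_10_points <- slope * 0.10

# Adjusted R-squared expressed as a percentage of variation explained
adj_r2_percent <- adj_r2 * 100

print(effect_per_10_points)
print(adj_r2_percent)
```

Rescaling the coefficient does not change the model or its p-value; it only makes the effect size easier to read.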
library(ggplot2)
ggplot(data = CBBdata, aes(Percent_of_3pt_Shots, ScoreDifferential)) +
geom_smooth(se = FALSE, colour = "blue") +
theme_minimal() +
labs(title = "Percent of 3 Point Shots vs Scoring Differential", y = "Scoring Differential", x = "Percent of 3pt Shots")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
As you can see from the above visualization, there is a negative relationship between the percent of three point shots that a men’s college basketball team takes and their point differential. This supports our hypothesis test and the negative coefficient we got for percent of three point shots taken.
Men’s College Basketball Interaction Hypothesis:
H0: The season has no significant effect on the relationship between the percentage of 3 point shots taken and the score differential.
H1: The season has a significant effect on the relationship between the percentage of 3 point shots taken and the score differential.
We decided to run a linear model with an interaction term between percentage of 3 point shots taken by a men’s college basketball team and season year.
CBBdata$Season <- as.factor(CBBdata$Season)
lmTest2 <- lm(ScoreDifferential ~ Percent_of_3pt_Shots * Season, data = CBBdata)
summary(lmTest2)
##
## Call:
## lm(formula = ScoreDifferential ~ Percent_of_3pt_Shots * Season,
## data = CBBdata)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -69.212  -9.883   0.107   9.913  94.128
##
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)                       2.1194     0.7875   2.691 0.007121 **
## Percent_of_3pt_Shots             -6.7824     2.3514  -2.884 0.003922 **
## Season2004                        0.6033     1.1371   0.531 0.595711
## Season2005                        1.1343     1.1271   1.006 0.314214
## Season2006                       -0.2888     1.1233  -0.257 0.797100
## Season2007                        0.1555     1.1035   0.141 0.887944
## Season2008                       -1.1305     1.1128  -1.016 0.309662
## Season2009                       -0.4161     1.0824  -0.384 0.700690
## Season2010                       -1.4460     1.0791  -1.340 0.180247
## Season2011                       -0.7875     1.0845  -0.726 0.467751
## Season2012                       -2.3193     1.0932  -2.122 0.033878 *
## Season2013                       -1.9725     1.0952  -1.801 0.071692 .
## Season2014                       -0.8162     1.0847  -0.752 0.451766
## Season2015                       -1.2807     1.1008  -1.163 0.244660
## Season2016                       -1.8026     1.1244  -1.603 0.108898
## Season2017                       -2.8646     1.1314  -2.532 0.011349 *
## Season2018                       -4.0154     1.1511  -3.488 0.000486 ***
## Season2019                       -3.4817     1.1719  -2.971 0.002969 **
## Season2020                       -3.2402     1.1372  -2.849 0.004382 **
## Percent_of_3pt_Shots:Season2004  -1.3112     3.3600  -0.390 0.696367
## Percent_of_3pt_Shots:Season2005  -2.7904     3.3178  -0.841 0.400335
## Percent_of_3pt_Shots:Season2006   0.6863     3.3042   0.208 0.835466
## Percent_of_3pt_Shots:Season2007  -0.0663     3.2031  -0.021 0.983486
## Percent_of_3pt_Shots:Season2008   3.7053     3.2168   1.152 0.249377
## Percent_of_3pt_Shots:Season2009   1.5163     3.1830   0.476 0.633820
## Percent_of_3pt_Shots:Season2010   4.7879     3.2008   1.496 0.134704
## Percent_of_3pt_Shots:Season2011   3.0589     3.2065   0.954 0.340111
## Percent_of_3pt_Shots:Season2012   6.1114     3.2216   1.897 0.057835 .
## Percent_of_3pt_Shots:Season2013   7.5258     3.2246   2.334 0.019605 *
## Percent_of_3pt_Shots:Season2014   2.3164     3.1995   0.724 0.469073
## Percent_of_3pt_Shots:Season2015   4.0660     3.1899   1.275 0.202432
## Percent_of_3pt_Shots:Season2016   6.1704     3.2128   1.921 0.054789 .
## Percent_of_3pt_Shots:Season2017   9.0331     3.1912   2.831 0.004646 **
## Percent_of_3pt_Shots:Season2018  11.4416     3.1947   3.581 0.000342 ***
## Percent_of_3pt_Shots:Season2019  10.8673     3.2031   3.393 0.000692 ***
## Percent_of_3pt_Shots:Season2020   9.4096     3.1651   2.973 0.002950 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.1 on 92796 degrees of freedom
## Multiple R-squared: 0.001017, Adjusted R-squared: 0.0006399
## F-statistic: 2.698 on 35 and 92796 DF, p-value: 2.33e-07
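Some season terms are individually significant and others are not, but the model as a whole is significant because the p-value for the F-statistic (2.33e-07) is less than 0.05. Based on this, we reject our null hypothesis that season has no significant effect on the relationship between the percentage of 3 point shots taken and the score differential. Note that this F-statistic tests all of the model’s terms together rather than the interaction terms alone; the individually significant interaction terms for 2013 and for 2017 through 2020 point in the same direction. Another interesting insight is that our adjusted R-squared (0.0006399) improved compared to our original linear model (0.0001918), although it is still very small.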
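One way to test the interaction terms on their own is to compare the interaction model against a model with the same predictors but no interaction, using `anova()`. The sketch below runs on simulated data, not on `CBBdata`; the three seasons, the sample sizes, and the "true" slopes are all made up so that the example is self-contained.

```r
set.seed(42)

# Simulated stand-in for the real data: 3 seasons, 200 games each (made up)
n_per_season <- 200
sim <- data.frame(
  Season = factor(rep(c("2018", "2019", "2020"), each = n_per_season)),
  Percent_of_3pt_Shots = runif(3 * n_per_season, min = 0.2, max = 0.5)
)

# Assumed true slopes differ by season so there is an interaction to detect
true_slope <- c("2018" = -5, "2019" = 0, "2020" = 5)
sim$ScoreDifferential <- unname(true_slope[as.character(sim$Season)]) *
  sim$Percent_of_3pt_Shots + rnorm(nrow(sim), sd = 15)

# Reduced model: season shifts the intercept only
additive_fit <- lm(ScoreDifferential ~ Percent_of_3pt_Shots + Season, data = sim)
# Full model: season also changes the slope
interaction_fit <- lm(ScoreDifferential ~ Percent_of_3pt_Shots * Season, data = sim)

# F-test for the interaction terms alone (full model vs reduced model)
interaction_test <- anova(additive_fit, interaction_fit)
print(interaction_test)
```

Applied to the real data, the analogous comparison would be between `lm(ScoreDifferential ~ Percent_of_3pt_Shots + Season, data = CBBdata)` and `lmTest2`; we have not run that comparison here.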
We now are going to create a visualization that shows the relationship between percentage of 3 point shots taken and score differential grouped by year for men.
ggplot(data = CBBdata, aes(Percent_of_3pt_Shots, ScoreDifferential)) +
geom_smooth() +
facet_wrap(~Season) +
theme_minimal() +
labs(title = "Relationship between 3's and Score Differential by Year", y ="Scoring Differential", x = "Percent of 3pt Shots")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
As you can see, the slope of the relationship between the percent of three point shots taken and point differential changes depending on the season, which is consistent with season having an effect on that relationship. An interesting takeaway is that in the seasons immediately after the three point line was moved back (2009 and 2020), the slope declined slightly compared to the prior year.
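The season-specific slopes implied by the interaction model can be recovered by adding each season’s interaction coefficient to the baseline slope. The values below are copied from the `summary(lmTest2)` output above (2003 is the baseline season); the variable names are ours, and only the four seasons around the two rule changes are shown.

```r
# Values copied from the summary(lmTest2) output above.
# The baseline season is 2003, so its slope is the main-effect coefficient.
baseline_slope <- -6.7824
interaction_terms <- c(
  "2008" = 3.7053, "2009" = 1.5163,
  "2019" = 10.8673, "2020" = 9.4096
)

# Slope for a given season = baseline slope + that season's interaction term
season_slopes <- baseline_slope + interaction_terms
print(season_slopes)

# Change in slope in the first season after each rule change
change_2009 <- unname(season_slopes["2009"] - season_slopes["2008"])
change_2020 <- unname(season_slopes["2020"] - season_slopes["2019"])
print(c(change_2009 = change_2009, change_2020 = change_2020))
```

Both changes are negative, which matches the plot, but they are small relative to the standard errors of the interaction terms (roughly 3.2), so the year-to-year drops should be read as suggestive rather than conclusive.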
Final Visualization:
We created a data set that shows the mean percent of three point shots taken, grouped by season.
library(dplyr)
by_year <- CBBdata %>%
select(Season, Percent_of_3pt_Shots) %>%
group_by(Season) %>%
summarize(mean = mean(Percent_of_3pt_Shots))
## `summarise()` ungrouping output (override with `.groups` argument)
ggplot(data = by_year, aes( x = Season, y = mean)) +
geom_col(fill = "blue") +
theme_minimal() +
labs(title = "Percent of 3 Point Shots Taken by Season", y = "Avg Percent of 3pt Shots", x = "Season")
Our visualization shows us two very interesting insights. First, the 2009 and 2020 seasons show a decline in the percentage of threes taken per game compared to the prior year. This makes sense, as these were the first seasons after the NCAA moved the three point line back. In the first year after each change, the percentage of three pointers shot by men’s basketball teams decreased, and after 2009 it began to climb again. This tells us that teams were able to adjust to the deeper three point line over time. Second, the overall trend is increasing over time. The game of basketball is changing, and taking more three pointers is becoming a strategy for most teams.
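The year-over-year comparison read off the bar chart can also be computed directly. The sketch below uses base R on a small made-up data frame (the values are invented, not taken from `CBBdata`) and flags seasons in which the mean share of three point attempts fell compared with the prior season.

```r
# Made-up games for illustration; the real report uses CBBdata
toy_games <- data.frame(
  Season = c(2007, 2007, 2008, 2008, 2009, 2009, 2010, 2010),
  Percent_of_3pt_Shots = c(0.33, 0.35, 0.34, 0.36, 0.31, 0.33, 0.33, 0.35)
)

# Mean share of three point attempts per season
# (base R equivalent of group_by() + summarize())
by_season <- aggregate(Percent_of_3pt_Shots ~ Season, data = toy_games, FUN = mean)

# Year-over-year change; the first season has no prior year to compare with
by_season$change <- c(NA, diff(by_season$Percent_of_3pt_Shots))

# Seasons where the share dropped compared with the prior season
declining_seasons <- by_season$Season[!is.na(by_season$change) & by_season$change < 0]
print(by_season)
print(declining_seasons)
```

In the toy data only 2009 is flagged; running the same steps on the real `by_year` table would show whether 2009 and 2020 are the only declining seasons.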
Women’s College Basketball Hypothesis:
H0: There is no relationship between the percentage of three point shots taken and score differential in a women’s college basketball game.
H1: There is a relationship between the percentage of three point shots taken and score differential in a women’s college basketball game.
lmTest_w <- lm(ScoreDifferential ~ Percent_of_3pt_Shots, data = CBBdata_w)
summary(lmTest_w)
##
## Call:
## lm(formula = ScoreDifferential ~ Percent_of_3pt_Shots, data = CBBdata_w)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -89.077 -11.791  -0.692  11.973 108.016
##
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)            0.2045     0.2319   0.882    0.378
## Percent_of_3pt_Shots  -1.0472     0.7350  -1.425    0.154
##
## Residual standard error: 18.05 on 56791 degrees of freedom
## Multiple R-squared: 3.574e-05, Adjusted R-squared: 1.814e-05
## F-statistic: 2.03 on 1 and 56791 DF, p-value: 0.1542
Our p-value (0.154) is greater than 0.05, so it is not significant. We therefore fail to reject our null hypothesis that there is no relationship between the percentage of three pointers taken and the final outcome of a women’s college basketball game. Additionally, our adjusted R-squared of 1.814e-05 tells us that the model explains almost none of the variation in score differential.
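A confidence interval for the slope tells the same story as the p-value. The sketch below rebuilds a 95% interval and the two-sided p-value from the estimate, standard error, and residual degrees of freedom printed in the `summary(lmTest_w)` output above; the variable names are ours.

```r
# Values copied from the summary(lmTest_w) output above
slope_w <- -1.0472
slope_w_se <- 0.7350
resid_df <- 56791

# 95% confidence interval using the t distribution with the residual df
t_crit <- qt(0.975, df = resid_df)
slope_w_ci <- slope_w + c(-1, 1) * t_crit * slope_w_se
print(round(slope_w_ci, 3))

# The interval contains zero exactly when the two-sided p-value exceeds 0.05
contains_zero <- slope_w_ci[1] < 0 && slope_w_ci[2] > 0
print(contains_zero)

# Two-sided p-value recomputed from the t statistic
p_value <- 2 * pt(-abs(slope_w / slope_w_se), df = resid_df)
print(round(p_value, 3))
```

With the fitted model object available, `confint(lmTest_w)` gives the same interval without copying numbers by hand.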
ggplot(data = CBBdata_w, aes(Percent_of_3pt_Shots, ScoreDifferential)) +
geom_smooth(se = FALSE, colour = "pink") +
theme_minimal() +
labs(title = "Percent of 3 Point Shots vs Scoring Differential", y = "Scoring Differential", x = "Percent of 3pt Shots")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
While the visualization shows a negative relationship between the percent of three point shots taken by a women’s college basketball team and score differential, we know from our linear model that it is not significant. Although the slope in our visualization looks steep, the y-axis only spans about half a point.
Women’s College Basketball Interaction Hypothesis:
H0: The season has no significant effect on the relationship between the percentage of 3 point shots taken and the score differential.
H1: The season has a significant effect on the relationship between the percentage of 3 point shots taken and the score differential.
We decided to run a linear model with an interaction term between percentage of 3 point shots taken by a women’s college basketball team and season year.
CBBdata_w$Season <- as.factor(CBBdata_w$Season)
lmTest2_w <- lm(ScoreDifferential ~ Percent_of_3pt_Shots * Season, data = CBBdata_w)
summary(lmTest2_w)
##
## Call:
## lm(formula = ScoreDifferential ~ Percent_of_3pt_Shots * Season,
## data = CBBdata_w)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -89.810 -11.586  -0.192  11.726 108.157
##
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)                       1.3599     0.7716   1.762  0.07799 .
## Percent_of_3pt_Shots             -4.5117     2.5679  -1.757  0.07893 .
## Season2011                       -0.1545     1.1061  -0.140  0.88893
## Season2012                       -1.0149     1.0742  -0.945  0.34479
## Season2013                       -0.5380     1.0657  -0.505  0.61364
## Season2014                       -1.2033     1.0625  -1.133  0.25742
## Season2015                       -2.5555     1.0611  -2.408  0.01603 *
## Season2016                       -1.9644     1.0919  -1.799  0.07202 .
## Season2017                       -1.0650     1.1233  -0.948  0.34307
## Season2018                       -0.4183     1.1258  -0.372  0.71020
## Season2019                       -1.0047     1.1285  -0.890  0.37330
## Season2020                       -2.2203     1.1348  -1.956  0.05042 .
## Percent_of_3pt_Shots:Season2011  -0.8190     3.6434  -0.225  0.82214
## Percent_of_3pt_Shots:Season2012   1.4689     3.6310   0.405  0.68582
## Percent_of_3pt_Shots:Season2013   0.8532     3.5649   0.239  0.81085
## Percent_of_3pt_Shots:Season2014   4.2640     3.5022   1.218  0.22342
## Percent_of_3pt_Shots:Season2015  10.0640     3.4987   2.877  0.00402 **
## Percent_of_3pt_Shots:Season2016   5.3484     3.5470   1.508  0.13160
## Percent_of_3pt_Shots:Season2017   2.3649     3.5788   0.661  0.50874
## Percent_of_3pt_Shots:Season2018   1.8781     3.5450   0.530  0.59625
## Percent_of_3pt_Shots:Season2019   3.9038     3.5242   1.108  0.26800
## Percent_of_3pt_Shots:Season2020   6.3671     3.5697   1.784  0.07448 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.05 on 56771 degrees of freedom
## Multiple R-squared: 0.0005626, Adjusted R-squared: 0.0001929
## F-statistic: 1.522 on 21 and 56771 DF, p-value: 0.05919
After running a linear model with an interaction term for women’s college basketball, we see that the p-value of the F-statistic (0.05919) is still not significant at our 0.05 cutoff, so we fail to reject the null hypothesis. It is, however, smaller than the p-value for our original linear model (0.1542) and close to the cutoff, so the data are at most suggestive of a slight season effect on the relationship between the percentage of three point shots and score differential. Also, our adjusted R-squared (0.0001929) has improved compared to our original women’s linear model (1.814e-05).
We created a visualization that shows the relationship between percentage of 3 point shots and score differential grouped by season for women.
ggplot(data = CBBdata_w, aes(Percent_of_3pt_Shots, ScoreDifferential)) +
geom_smooth(colour = "pink") +
facet_wrap(~Season) +
theme_minimal() +
labs(title = "Relationship between 3's and Score Differential by Year", y ="Scoring Differential", x = "Percent of 3pt Shots")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
As you can see, this visualization shows a similar trend to the men’s. As the seasons go by, the slope of our line generally becomes less negative. It is interesting to see that in 2015 the slope of our line is positive, which is somewhat of an anomaly compared to other seasons. After doing more research, we found that this could be because many highly successful women’s teams also shot a large number of three pointers during this season. UCONN is an example: they won their games by an average of forty points per game in 2015 and were top ten in three pointers attempted for that season.
We created a data set that shows the mean percent of three point shots taken, grouped by season, for women’s college basketball.
by_year_women <- CBBdata_w %>%
select("Season", "Percent_of_3pt_Shots") %>%
group_by(Season) %>%
summarize(mean = mean(Percent_of_3pt_Shots))
## `summarise()` ungrouping output (override with `.groups` argument)
by_year_women$Season <- as.factor(by_year_women$Season)
ggplot(data = by_year_women, aes( x = Season, y = mean)) +
geom_col(fill = "pink") +
theme_minimal() +
labs(title = "Percent of 3 Point Shots Taken by Season", y = "Avg Percent of 3pt Shots", x = "Season")
Unfortunately, our women’s data set only goes back to 2010, so we cannot see how the change in three point distance in 2009 affected women’s college basketball. But similarly to the men’s visualization, there is an upward trend in the percentage of three pointers taken as time goes on. This may be because, over time, more teams are realizing that the average points per possession is higher for three point field goals than for two point field goals and are adopting three point shooting as a strategy to win games.
Create a Mixed Model for Men
library(lme4)
## Warning: package 'lme4' was built under R version 4.0.3
## Loading required package: Matrix
riMod <- lmer(ScoreDifferential ~ Percent_of_3pt_Shots + (Percent_of_3pt_Shots|Season),
data = CBBdata)
## boundary (singular) fit: see ?isSingular
summary(riMod)
## Linear mixed model fit by REML ['lmerMod']
## Formula: ScoreDifferential ~ Percent_of_3pt_Shots + (Percent_of_3pt_Shots |
## Season)
## Data: CBBdata
##
## REML criterion at convergence: 767530.1
##
## Scaled residuals:
##     Min      1Q  Median      3Q     Max
## -4.5842 -0.6571  0.0144  0.6604  6.2242
##
## Random effects:
##  Groups   Name                 Variance Std.Dev. Corr
##  Season   (Intercept)            1.404   1.185
##           Percent_of_3pt_Shots  12.486   3.534   -1.00
##  Residual                      228.107  15.103
## Number of obs: 92832, groups: Season, 18
##
## Fixed effects:
##                      Estimate Std. Error t value
## (Intercept)            0.7879     0.3336   2.362
## Percent_of_3pt_Shots  -2.4772     0.9776  -2.534
##
## Correlation of Fixed Effects:
##             (Intr)
## Prcnt_f_3_S -0.989
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see ?isSingular
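From the random effects, we can see that the slope for percent of three point shots taken has a standard deviation of 3.534 across seasons, so the effect of three point shooting on score differential varies by roughly 3.5 points per unit from one season to another. The fixed effect (-2.4772) is the average slope across all seasons. Note that the fit is flagged as singular and the estimated correlation between the random intercept and the random slope is -1.00, so these random-effect estimates should be read with some caution.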
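The singular-fit message can be inspected with `lme4::isSingular()`. One commonly suggested thing to try is centering the predictor, since an intercept defined at 0% three point shots lies far outside the observed data and tends to be strongly correlated with the slope. The sketch below uses simulated data with made-up values and a hypothetical centered column `Pct3_centered`; it is not the approach used in this report, and centering is not guaranteed to remove the warning.

```r
# Sketch on simulated data; requires the lme4 package used above.
if (requireNamespace("lme4", quietly = TRUE)) {
  set.seed(7)
  # Made-up data: 8 seasons, 150 games each
  sim <- data.frame(
    Season = factor(rep(2013:2020, each = 150)),
    Percent_of_3pt_Shots = runif(8 * 150, min = 0.2, max = 0.5)
  )
  sim$ScoreDifferential <- -2 * sim$Percent_of_3pt_Shots +
    rnorm(nrow(sim), sd = 15)

  # Same random-effects structure as riMod: correlated intercept and slope
  fit_raw <- lme4::lmer(
    ScoreDifferential ~ Percent_of_3pt_Shots + (Percent_of_3pt_Shots | Season),
    data = sim
  )
  # TRUE when a variance is estimated near 0 or a correlation near +/-1
  print(lme4::isSingular(fit_raw))

  # Hypothetical alternative: center the predictor so the intercept refers
  # to a team with an average share of three point attempts
  sim$Pct3_centered <- sim$Percent_of_3pt_Shots - mean(sim$Percent_of_3pt_Shots)
  fit_centered <- lme4::lmer(
    ScoreDifferential ~ Pct3_centered + (Pct3_centered | Season),
    data = sim
  )
  print(lme4::isSingular(fit_centered))
  print(lme4::VarCorr(fit_centered))
}
```

If the fit is still singular after centering, that usually indicates the data do not support estimating both a random intercept and a random slope per season, and a simpler random-effects structure may be more appropriate.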
ranef(riMod)
## $Season
## (Intercept) Percent_of_3pt_Shots
## 2003 0.96694843 -2.8835578
## 2004 1.29962312 -3.8756342
## 2005 1.66380100 -4.9616569
## 2006 0.83250242 -2.4826235
## 2007 1.06095691 -3.1639025
## 2008 0.14560209 -0.4342031
## 2009 0.68192582 -2.0335857
## 2010 -0.10605862 0.3162797
## 2011 0.32623993 -0.9728871
## 2012 -0.48073425 1.4336080
## 2013 -0.73592288 2.1946115
## 2014 0.46434165 -1.3847233
## 2015 0.06512541 -0.1942119
## 2016 -0.53450384 1.5939555
## 2017 -1.22020662 3.6388045
## 2018 -1.53484501 4.5770945
## 2019 -1.76650442 5.2679310
## 2020 -1.12829112 3.3647013
##
## with conditional variances for "Season"
MuMIn::r.squaredGLMM(riMod)
## Warning: 'r.squaredGLMM' now calculates a revised statistic. See the help page.
## R2m R2c
## [1,] 0.0002540765 0.0007751381
library(sjPlot)
## Warning: package 'sjPlot' was built under R version 4.0.3
## Install package "strengejacke" from GitHub (`devtools::install_github("strengejacke/strengejacke")`) to load all sj-packages at once!
plot_model(riMod, type = "re") +
theme_minimal()
This visualization provides more evidence that the season affects the relationship between the percentage of three point shots taken and the point differential. You can also see that in 2009 and 2020 (the seasons immediately after the three point line was moved back) the slopes decreased from the prior year. This means that teams that shot a high percentage of three point shots were less successful the year after the rule change compared to the prior year.
Create a Mixed Model for Women
riMod_w <- lmer(ScoreDifferential ~ Percent_of_3pt_Shots + (Percent_of_3pt_Shots|Season),
data = CBBdata_w)
## boundary (singular) fit: see ?isSingular
summary(riMod_w)
## Linear mixed model fit by REML ['lmerMod']
## Formula: ScoreDifferential ~ Percent_of_3pt_Shots + (Percent_of_3pt_Shots |
## Season)
## Data: CBBdata_w
##
## REML criterion at convergence: 489773.9
##
## Scaled residuals:
##     Min      1Q  Median      3Q     Max
## -4.9396 -0.6494 -0.0259  0.6578  5.9858
##
## Random effects:
##  Groups   Name                 Variance Std.Dev. Corr
##  Season   (Intercept)            0.1701  0.4124
##           Percent_of_3pt_Shots   4.0538  2.0134  -1.00
##  Residual                      325.6500 18.0458
## Number of obs: 56793, groups: Season, 11
##
## Fixed effects:
##                      Estimate Std. Error t value
## (Intercept)            0.2320     0.2636   0.880
## Percent_of_3pt_Shots  -1.1632     0.9561  -1.217
##
## Correlation of Fixed Effects:
##             (Intr)
## Prcnt_f_3_S -0.943
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see ?isSingular
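From the random effects, we can see that the slope for percent of three point shots taken has a standard deviation of 2.0134 across seasons, which is less season-to-season variation than in the men’s model (3.534). As with the men’s model, the fit is flagged as singular and the correlation between the random intercept and the random slope is estimated at -1.00.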
ranef(riMod_w)
## $Season
## (Intercept) Percent_of_3pt_Shots
## 2010 0.120483883 -0.58815871
## 2011 0.377502170 -1.84282895
## 2012 0.310199827 -1.51428328
## 2013 0.219811777 -1.07304153
## 2014 -0.158508656 0.77378189
## 2015 -0.714102802 3.48599141
## 2016 0.009267562 -0.04524088
## 2017 0.199595639 -0.97435366
## 2018 -0.063475484 0.30986434
## 2019 -0.214929848 1.04920972
## 2020 -0.085844067 0.41905966
##
## with conditional variances for "Season"
MuMIn::r.squaredGLMM(riMod_w)
## R2m R2c
## [1,] 4.409967e-05 0.0002847297
plot_model(riMod_w, type = "re") +
theme_minimal()
You can see that as the seasons go on, the slope of our line generally increases, although not in every season. This is consistent with our findings from our previous women’s models. The women’s three point line was moved back in 2009, which we do not have data for, but it seems that this might have had an effect on the relationship between percentage of three point shots taken and score differential, as the first four seasons in our data (2010 to 2013) all have negative random slopes. This indicates that teams that shot a large percentage of threes were less successful than teams that did not. From 2014 to 2020, most seasons (all except 2016 and 2017) have positive random slopes, which suggests that teams were beginning to adjust to the deeper three point line. Keep in mind that these values are deviations from the overall slope of -1.1632, so a positive value means a less negative relationship rather than necessarily a positive one.
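To turn the random-effect deviations into actual season slopes, the overall (fixed) slope has to be added to each season’s deviation. The numbers below are copied from the `summary(riMod_w)` and `ranef(riMod_w)` output above; the variable names are ours.

```r
# Values copied from summary(riMod_w) and ranef(riMod_w) above
fixed_slope_w <- -1.1632
slope_deviation_w <- c(
  "2010" = -0.58815871, "2011" = -1.84282895, "2012" = -1.51428328,
  "2013" = -1.07304153, "2014" = 0.77378189, "2015" = 3.48599141,
  "2016" = -0.04524088, "2017" = -0.97435366, "2018" = 0.30986434,
  "2019" = 1.04920972, "2020" = 0.41905966
)

# Season-specific slope = overall (fixed) slope + that season's deviation
season_slope_w <- fixed_slope_w + slope_deviation_w
print(round(season_slope_w, 3))

# Seasons whose estimated slope is above zero
positive_seasons_w <- names(season_slope_w)[season_slope_w > 0]
print(positive_seasons_w)
```

With the fitted model object available, `coef(riMod_w)$Season` returns these combined per-season coefficients directly, so the numbers do not need to be copied by hand.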
The results that we obtained matter for a few reasons. We can see that whenever the NCAA moved the three point line back, it initially had an impact on the percentage of three pointers teams were taking, but as time went on players adjusted to the new distance. This resulted in a gradual increase in the percentage of three pointers taken after the initial year of the rule change. The NCAA could use these insights whenever it is contemplating whether or not to move the three point line back again. Our insights tell us that such a move will change the game in the short term to favor more diverse shot selection, but in the long run teams adjust to the new line and shoot more three point field goals. We believe the NCAA is fighting a battle that it cannot win, because teams realize that the three point shot is highly effective and will continue to build their strategy around it regardless of where the line is. Therefore, the NCAA should embrace three point shooting in college basketball.