library(janitor)  # provides clean_names()
NCAAChamps <- read.csv("NCAA Tournament Champions by Year - Sheet1.csv")
NCAAChamps <- clean_names(NCAAChamps)
View(NCAAChamps)
CBB2020 <- read.csv("CBBStats2020.csv")
CBB2020 <- clean_names(CBB2020)
View(CBB2020)
In 2020, the Men’s (and Women’s) NCAA Basketball tournaments were cancelled because of Covid-19. I have two datasets: one I compiled with a few stats for every champion since 1985, and one I found with every major stat for every Division I program from the 2020 season. I chose these datasets because I am a big college basketball fan and wanted to use the 2020 data, together with the stats of every champion since 1985, to predict who the national champion might have been.
Before importing the “CBB2020” dataset, I manually added a column that lists what conference every team is from. This is so I can eliminate some teams later on, when I begin to predict the champion.
Before working with the dataset, I want to create two new columns that will match two columns I put in my “NCAAChamps” dataset. “CBB2020” gives the total number of points scored and total number of points allowed. I want to create a column with the average number of points scored per game and the average number of points allowed per game. I rounded this to one decimal place, as that is what these stats were rounded to when I found the data for my “NCAAChamps” dataset.
PtsScored<-CBB2020$team_points
PtsAllowed<-CBB2020$opponent_points
Games<-CBB2020$games_played
PPG<-round(PtsScored/Games, 1)
PAPG<-round(PtsAllowed/Games, 1)
CBB2020$PPG<-PPG
CBB2020$PAPG<-PAPG
head(CBB2020)
## school conference games_played wins losses
## 1 Abilene Christian Southland 31 20 11
## 2 Air Force Mountain West 32 12 20
## 3 Akron Mid-American 31 24 7
## 4 Alabama SEC 31 16 15
## 5 Alabama A&M Southwestern Athletic 30 8 22
## 6 Alabama State Southwestern Athletic 32 8 24
## win_loss_ratio strength_of_record strength_of_schedule conf_wins conf_losses
## 1 0.645 -2.87 -6.87 15 5
## 2 0.375 -0.37 3.02 5 13
## 3 0.774 7.15 -0.40 14 4
## 4 0.516 11.12 8.12 8 10
## 5 0.267 -18.88 -8.85 5 13
## 6 0.250 -16.15 -6.53 7 11
## home_wins home_losses road_wins road_losses team_points opponent_points
## 1 13 3 7 8 2352 2024
## 2 8 7 3 9 2338 2386
## 3 15 2 8 4 2348 2062
## 4 10 5 4 8 2542 2449
## 5 6 5 2 16 1845 2146
## 6 4 5 2 16 1965 2273
## minutes_played field_goals field_goal_attempts field_goal x3_points
## 1 1260 814 1813 0.449 209
## 2 1280 802 1759 0.456 275
## 3 1240 788 1789 0.440 287
## 4 1260 854 1956 0.437 334
## 5 1205 656 1727 0.380 154
## 6 1280 683 1763 0.387 208
## x3_point_attempts x3_point free_throws free_throw_attempts free_throw
## 1 676 0.309 515 695 0.741
## 2 734 0.375 459 628 0.731
## 3 795 0.361 485 628 0.772
## 4 957 0.349 500 721 0.693
## 5 548 0.281 379 565 0.671
## 6 687 0.303 391 629 0.622
## offensive_rebounds total_rebounds assists steals blocks turnovers
## 1 337 1042 461 293 81 436
## 2 237 1040 469 161 43 395
## 3 302 1169 405 158 91 397
## 4 361 1220 441 196 136 461
## 5 286 1035 326 174 63 391
## 6 295 1031 325 200 70 512
## personal_fouls PPG PAPG
## 1 661 75.9 65.3
## 2 534 73.1 74.6
## 3 548 75.7 66.5
## 4 622 82.0 79.0
## 5 538 61.5 71.5
## 6 638 61.4 71.0
head(NCAAChamps)
## year school region seed conference wins losses points_per_game
## 1 1985 Villanova Southeast 8 Big East 25 10 68.7
## 2 1986 Louisville West 2 Metro 32 7 79.4
## 3 1987 Indiana Midwest 1 Big Ten 30 4 82.5
## 4 1988 Kansas Midwest 5 Big Eight 27 11 75.3
## 5 1989 Michigan Southeast 3 Big Ten 30 7 91.7
## 6 1990 UNLV West 1 Big West 35 5 93.5
## points_allowed_per_game
## 1 63.9
## 2 69.1
## 3 70.9
## 4 67.9
## 5 74.8
## 6 78.5
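As a quick sanity check on the derived columns, the first row can be recomputed by hand from the totals printed above. This is only a sketch: the numbers are copied from the Abilene Christian row of the output, and the variable names are illustrative rather than part of the original analysis.

```r
# Totals copied from the first row of head(CBB2020) (Abilene Christian).
team_points     <- 2352
opponent_points <- 2024
games_played    <- 31

# Same arithmetic as the PPG and PAPG columns: total / games, one decimal.
ppg_check  <- round(team_points / games_played, 1)
papg_check <- round(opponent_points / games_played, 1)

print(c(PPG = ppg_check, PAPG = papg_check))
```

The two values should agree with the PPG and PAPG shown for row 1.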
Now, I am going to create some graphs to visualize some of the data from the 2020 season.
plot(x=CBB2020$strength_of_schedule, y=CBB2020$strength_of_record, main="Strength of Schedule vs Strength of Record", xlab="SOS", ylab="SOR")
abline(lm(CBB2020$strength_of_record ~ CBB2020$strength_of_schedule))
Strength of schedule is a metric that looks at how tough a team’s schedule is. It is calculated by looking at the records of the opponents on that team’s schedule. A higher number means that a team had a relatively difficult schedule. Strength of record adds an extra layer to this by looking at how a team actually performed, relative to the difficulty of its schedule. A higher number means that a team performed well. A lot of the time, a team with a really high Strength of Record will also have a high Strength of Schedule, but the same is not necessarily true in reverse.
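The relationship between the two metrics can also be put into numbers. The sketch below uses only the six teams visible in the `head(CBB2020)` output, so it is a small illustration and not the result for all 353 programs; the variable names are illustrative.

```r
# SOS and SOR for the six teams shown in head(CBB2020); not the full dataset.
sos <- c(-6.87, 3.02, -0.40, 8.12, -8.85, -6.53)
sor <- c(-2.87, -0.37, 7.15, 11.12, -18.88, -16.15)

# Pearson correlation: values near 1 mean the two metrics rise together.
r <- cor(sos, sor)

# The same kind of model that abline(lm(...)) draws, fitted on this sample.
fit <- lm(sor ~ sos)
slope <- unname(coef(fit)["sos"])

print(r)
print(slope)
```

Running `cor(CBB2020$strength_of_schedule, CBB2020$strength_of_record)` would give the figure for the whole season.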
boxplot(CBB2020$PPG, main="Points Per Game Distribution", ylab="PPG", col="lightblue")
boxplot(CBB2020$PAPG, main="Points Allowed Per Game Distribution", ylab="PAPG", col="lightgreen")
## Who are these outliers?
print(CBB2020[CBB2020$PPG>83, c("school", "wins", "losses", "PPG", "PAPG")])
## school wins losses PPG PAPG
## 102 Gonzaga 31 2 87.4 67.8
print(CBB2020[CBB2020$PPG<60, c("school", "wins", "losses", "PPG", "PAPG")])
## school wins losses PPG PAPG
## 15 Arkansas-Pine Bluff 4 26 53.8 68.3
## 83 Fairfield 12 20 58.0 62.9
## 91 Fordham 9 22 58.6 61.7
## 135 Kennesaw State 1 28 55.2 75.2
## 163 Maryland-Eastern Shore 5 27 57.5 71.2
## 330 Virginia 23 7 57.0 52.4
print(CBB2020[CBB2020$PAPG>82, c("school", "wins", "losses", "PPG", "PAPG")])
## school wins losses PPG PAPG
## 43 Central Arkansas 10 21 74.7 82.4
## 67 Delaware State 6 26 73.4 83.0
## 114 Houston Christian 4 25 80.1 93.9
## 179 Mississippi Valley State 3 27 68.3 89.7
## 258 Samford 10 23 74.2 82.2
## 296 Tennessee-Martin 9 20 75.7 82.1
print(CBB2020[CBB2020$PAPG<58, c("school", "wins", "losses", "PPG", "PAPG")])
## school wins losses PPG PAPG
## 142 Liberty 30 4 68.6 53.8
## 330 Virginia 23 7 57.0 52.4
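The cutoffs used above (83, 60, 82, 58) were read off the boxplots by eye. One alternative, sketched here on made-up numbers, is to let `boxplot.stats()` return the flagged points directly. The `ppg` vector is hypothetical and is not taken from the dataset.

```r
# Hypothetical PPG values, invented for illustration (not the real dataset).
ppg <- c(53.8, 66.0, 68.3, 70.0, 71.1, 72.0, 74.6, 76.0, 87.4)

# boxplot.stats() applies the same 1.5 * IQR rule that boxplot() uses
# to decide which points are drawn beyond the whiskers.
outliers <- boxplot.stats(ppg)$out

print(outliers)
```

On the real data, the same idea would be `CBB2020[CBB2020$PPG %in% boxplot.stats(CBB2020$PPG)$out, ]`, which avoids hard-coding thresholds.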
binWidths<-seq(0,35,2)
hist(CBB2020$wins, col="powderblue", main="Win Distribution", xlab="Wins", breaks=binWidths)
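Note that `seq(0, 35, 2)` ends at 34, and `hist()` needs the breaks to span every value in the data. A minimal sketch of deriving the upper break from the data instead of fixing it by hand is shown below; the `wins` vector is hypothetical.

```r
# Hypothetical win totals, invented for illustration.
wins <- c(1, 8, 12, 20, 20, 24, 31)

# Round the largest value up to the next multiple of the bin width so
# every observation falls inside the outermost break.
bin_width <- 2
upper <- ceiling(max(wins) / bin_width) * bin_width
breaks <- seq(0, upper, by = bin_width)

# plot = FALSE returns the bin counts without drawing anything.
h <- hist(wins, breaks = breaks, plot = FALSE)
print(h$counts)
```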
## Predicting a 2020 National Champion
To begin, I am going to eliminate any program whose strength of record (SOR) is not positive.
SOR2020<-CBB2020[CBB2020$strength_of_record>0,]
View(SOR2020)
This brings our list of teams from 353 to 168. Now, I am going to eliminate any team with 20 or fewer wins. If we look at the champions since 1985, they won an average of just under 27 games before the NCAA tournament. To calculate this, I subtract 6 from the win column and take the average: a team has to win 6 tournament games to win a national championship, so removing those wins leaves its pre-tournament win total.
NCAAChamps$Wins<-NCAAChamps$wins-6
mean(NCAAChamps$Wins)
## [1] 26.875
Wins2020<-SOR2020[SOR2020$wins>20,]
View(Wins2020)
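The filter above uses a strict `>`, so a team with exactly 20 wins is removed. The sketch below shows the difference between the strict and inclusive comparisons on a made-up table; the team names and win totals are hypothetical.

```r
# Hypothetical teams, invented for illustration.
teams <- data.frame(
  school = c("Team A", "Team B", "Team C"),
  wins   = c(19, 20, 21),
  stringsAsFactors = FALSE
)

# Strict inequality: a team with exactly 20 wins is dropped.
strict <- teams[teams$wins > 20, ]

# Inclusive version: a team with exactly 20 wins is kept.
inclusive <- teams[teams$wins >= 20, ]

print(strict$school)
print(inclusive$school)
```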
This brings our list from 168 to 69 teams. Next, I am going to compare the summary statistics (the five-number summary plus the mean) of PPG and PAPG for the 2020 teams with those of the national champions since 1985.
print("5 Number Summary: PPG 2020")
## [1] "5 Number Summary: PPG 2020"
summary(CBB2020$PPG)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 53.80 68.30 71.10 71.07 74.60 87.40
print("5 Number Summary: PAPG 2020")
## [1] "5 Number Summary: PAPG 2020"
summary(CBB2020$PAPG)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 52.40 66.30 69.40 69.67 72.40 93.90
print("5 Number Summary: PPG Champs Since 1985")
## [1] "5 Number Summary: PPG Champs Since 1985"
summary(NCAAChamps$points_per_game)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 68.70 77.85 80.30 81.74 86.83 93.50
print("5 Number Summary: PAPG Champs Since 1985")
## [1] "5 Number Summary: PAPG Champs Since 1985"
summary(NCAAChamps$points_allowed_per_game)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 56.10 63.48 67.55 67.17 70.67 78.50
Specifically, I am going to look at the mean. As we can see, national champions have averaged 81.74 PPG, about 10.7 points higher than the 71.07 averaged by all teams in 2020. We can use this to narrow down our dataset when trying to predict a national champion.
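The gap between the two groups can be worked out directly from the means in the `summary()` output. This is a small sketch with the four means typed in by hand; the variable names are illustrative.

```r
# Means copied from the summary() output above.
champ_mean_ppg     <- 81.74
all_2020_mean_ppg  <- 71.07
champ_mean_papg    <- 67.17
all_2020_mean_papg <- 69.67

# Champions score more and allow fewer points than the 2020 field average.
offense_gap <- champ_mean_ppg - all_2020_mean_ppg
defense_gap <- all_2020_mean_papg - champ_mean_papg

print(round(c(offense = offense_gap, defense = defense_gap), 2))
```

The offensive gap is much larger than the defensive one, which is why PPG is the more useful column for narrowing down the list.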
Now, I am going to eliminate any program that did not score more than 80 points per game in 2020.
PPG80<-Wins2020[Wins2020$PPG>80,]
View(PPG80)
This brings our list to just 7 teams:

- Duke (25-6), ACC, 82.5 ppg and 68.0 papg
- Eastern Washington (23-8), Big Sky, 80.9 ppg and 72.9 papg
- Gonzaga (31-2), West Coast, 87.4 ppg and 67.8 papg
- LSU (21-10), SEC, 80.5 ppg and 73.3 papg
- Stephen F. Austin (28-3), Southland, 80.6 ppg and 76.0 papg
- Winthrop (24-10), Big South, 81.3 ppg and 71.5 papg
- Wright State (25-7), Horizon, 80.6 ppg and 70.8 papg
From this list, I selected the 2 teams that I think had the highest likelihood of winning the championship in 2020: Duke and Gonzaga. I eliminated Eastern Washington, Stephen F. Austin, Winthrop, and Wright State for being from small conferences, and I eliminated LSU because they had 10 losses and a high PAPG.
To make my final decision between the two teams, I am going to look at the difference between PPG and PAPG in the championship winners.
ChampPPG<-NCAAChamps$points_per_game
ChampPAPG<-NCAAChamps$points_allowed_per_game
difference<-ChampPPG-ChampPAPG
NCAAChamps$diff<-difference
mean(NCAAChamps$diff)
## [1] 14.575
max(NCAAChamps$diff)
## [1] 22
On average, an NCAA champion scores about 14.6 more points per game than it allows. Duke’s difference of 14.5 is right at that average, while Gonzaga’s is 19.6, close to the largest margin of any champion in the dataset (22). Because of the bigger gap between Gonzaga’s PPG and PAPG, I think Gonzaga would have won the 2020 NCAA Championship.
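The margins quoted above can be reproduced for all seven remaining teams. The sketch below types in the PPG and PAPG values from the seven-team list and ranks the teams by scoring margin; the `finalists` data frame and its column names are illustrative, not objects from the original analysis.

```r
# PPG and PAPG copied from the seven-team list above.
finalists <- data.frame(
  school = c("Duke", "Eastern Washington", "Gonzaga", "LSU",
             "Stephen F. Austin", "Winthrop", "Wright State"),
  ppg  = c(82.5, 80.9, 87.4, 80.5, 80.6, 81.3, 80.6),
  papg = c(68.0, 72.9, 67.8, 73.3, 76.0, 71.5, 70.8),
  stringsAsFactors = FALSE
)

# Scoring margin: points scored minus points allowed, per game.
finalists$margin <- round(finalists$ppg - finalists$papg, 1)

# Sort from largest margin to smallest.
ranked <- finalists[order(-finalists$margin), ]
print(ranked[, c("school", "margin")])
```

Gonzaga and Duke come out first and second, with every other team below a 10-point margin.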