library(janitor)  # provides clean_names()

NCAAChamps <- read.csv("NCAA Tournament Champions by Year - Sheet1.csv")
NCAAChamps <- clean_names(NCAAChamps)
View(NCAAChamps)

CBB2020 <- read.csv("CBBStats2020.csv")
CBB2020 <- clean_names(CBB2020)
View(CBB2020)

Why did I choose these datasets?

In 2020, the Men’s (and Women’s) NCAA Basketball tournaments were cancelled because of COVID-19. I have two datasets: one I built with a few stats for every champion since 1985, and one I found with every major stat for every Division I program from the 2020 season. I chose these datasets because I am a big college basketball fan and wanted to use the 2020 data to see whether I could predict who the national champion might have been, given the stats of every champion since 1985.

Before importing the “CBB2020” dataset, I manually added a column that lists what conference every team is from. This is so I can eliminate some teams later on, when I begin to predict the champion.

Before working with the dataset, I want to create two new columns that match two columns in my “NCAAChamps” dataset. “CBB2020” gives each team's total points scored and total points allowed, so I divide each by games played to get the average points scored per game (PPG) and the average points allowed per game (PAPG). I round both to one decimal place, because that is how these stats were rounded in the source I used for my “NCAAChamps” dataset.
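
As a quick check of the arithmetic, here is a minimal sketch of the per-game calculation. The helper name `per_game` is hypothetical and not part of the original analysis; it also returns NA for a team with zero games instead of dividing by zero. The inputs are Abilene Christian's 2020 totals as they appear in CBB2020.

```r
# Hypothetical helper (not part of the original analysis): per-game average
# rounded to one decimal place, NA when no games were played.
per_game <- function(total, games) {
  ifelse(games > 0, round(total / games, 1), NA_real_)
}

# Abilene Christian: 2352 points scored, 2024 allowed, 31 games
ppg  <- per_game(2352, 31)  # points scored per game
papg <- per_game(2024, 31)  # points allowed per game
c(PPG = ppg, PAPG = papg)
```

These should agree with the PPG and PAPG values that the analysis itself reports for that team.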

PtsScored<-CBB2020$team_points
PtsAllowed<-CBB2020$opponent_points
Games<-CBB2020$games_played

PPG<-round(PtsScored/Games, 1)
PAPG<-round(PtsAllowed/Games, 1)

CBB2020$PPG<-PPG
CBB2020$PAPG<-PAPG
head(CBB2020)
##              school            conference games_played wins losses
## 1 Abilene Christian             Southland           31   20     11
## 2         Air Force         Mountain West           32   12     20
## 3             Akron          Mid-American           31   24      7
## 4           Alabama                   SEC           31   16     15
## 5       Alabama A&M Southwestern Athletic           30    8     22
## 6     Alabama State Southwestern Athletic           32    8     24
##   win_loss_ratio strength_of_record strength_of_schedule conf_wins conf_losses
## 1          0.645              -2.87                -6.87        15           5
## 2          0.375              -0.37                 3.02         5          13
## 3          0.774               7.15                -0.40        14           4
## 4          0.516              11.12                 8.12         8          10
## 5          0.267             -18.88                -8.85         5          13
## 6          0.250             -16.15                -6.53         7          11
##   home_wins home_losses road_wins road_losses team_points opponent_points
## 1        13           3         7           8        2352            2024
## 2         8           7         3           9        2338            2386
## 3        15           2         8           4        2348            2062
## 4        10           5         4           8        2542            2449
## 5         6           5         2          16        1845            2146
## 6         4           5         2          16        1965            2273
##   minutes_played field_goals field_goal_attempts field_goal x3_points
## 1           1260         814                1813      0.449       209
## 2           1280         802                1759      0.456       275
## 3           1240         788                1789      0.440       287
## 4           1260         854                1956      0.437       334
## 5           1205         656                1727      0.380       154
## 6           1280         683                1763      0.387       208
##   x3_point_attempts x3_point free_throws free_throw_attempts free_throw
## 1               676    0.309         515                 695      0.741
## 2               734    0.375         459                 628      0.731
## 3               795    0.361         485                 628      0.772
## 4               957    0.349         500                 721      0.693
## 5               548    0.281         379                 565      0.671
## 6               687    0.303         391                 629      0.622
##   offensive_rebounds total_rebounds assists steals blocks turnovers
## 1                337           1042     461    293     81       436
## 2                237           1040     469    161     43       395
## 3                302           1169     405    158     91       397
## 4                361           1220     441    196    136       461
## 5                286           1035     326    174     63       391
## 6                295           1031     325    200     70       512
##   personal_fouls  PPG PAPG
## 1            661 75.9 65.3
## 2            534 73.1 74.6
## 3            548 75.7 66.5
## 4            622 82.0 79.0
## 5            538 61.5 71.5
## 6            638 61.4 71.0
head(NCAAChamps)
##   year     school    region seed conference wins losses points_per_game
## 1 1985  Villanova Southeast    8   Big East   25     10            68.7
## 2 1986 Louisville      West    2      Metro   32      7            79.4
## 3 1987    Indiana   Midwest    1    Big Ten   30      4            82.5
## 4 1988     Kansas   Midwest    5  Big Eight   27     11            75.3
## 5 1989   Michigan Southeast    3    Big Ten   30      7            91.7
## 6 1990       UNLV      West    1   Big West   35      5            93.5
##   points_allowed_per_game
## 1                    63.9
## 2                    69.1
## 3                    70.9
## 4                    67.9
## 5                    74.8
## 6                    78.5

Now, I am going to create some graphs to visualize some of the data from the 2020 season.

plot(x=CBB2020$strength_of_schedule, y=CBB2020$strength_of_record, main="Strength of Schedule vs Strength of Record", xlab="SOS", ylab="SOR")

abline(lm(CBB2020$strength_of_record ~ CBB2020$strength_of_schedule))

What are Strength of Schedule and Strength of Record?

Strength of Schedule (SOS) is a metric that measures how tough a team’s schedule is. It is calculated from the records of the opponents on that team’s schedule, and a higher number means a relatively difficult schedule. Strength of Record (SOR) adds an extra layer by looking at how a team actually performed relative to the difficulty of its schedule, and a higher number means the team performed well. A team with a really high SOR will often also have a high SOS, but the reverse is not necessarily true: a team can play a hard schedule and lose most of those games.
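
To put a number on the relationship the scatterplot shows, one option is the correlation and the slope of the fitted line. The sketch below uses only the first six teams from head(CBB2020), which is far too few rows to draw any conclusion; it is only meant to show the calls. The variable names `sos`, `sor`, `r`, and `slope` are my own.

```r
# SOS and SOR for the first six teams printed by head(CBB2020)
sos <- c(-6.87, 3.02, -0.40, 8.12, -8.85, -6.53)
sor <- c(-2.87, -0.37, 7.15, 11.12, -18.88, -16.15)

# Pearson correlation: values near 1 mean the two metrics rise together
r <- cor(sos, sor)

# Slope of the least-squares line: change in SOR per one-point change in SOS
fit <- lm(sor ~ sos)
slope <- unname(coef(fit)["sos"])

r
slope
```

On the full dataset the same calls would be cor(CBB2020$strength_of_schedule, CBB2020$strength_of_record) and the lm() fit already passed to abline().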

boxplot(CBB2020$PPG, main="Points Per Game Distribution", ylab="PPG", col="lightblue")

boxplot(CBB2020$PAPG, main="Points Allowed Per Game Distribution", ylab="PAPG", col="lightgreen")

## Who are these outliers?

print(CBB2020[CBB2020$PPG>83, c("school", "wins", "losses", "PPG", "PAPG")])
##      school wins losses  PPG PAPG
## 102 Gonzaga   31      2 87.4 67.8
print(CBB2020[CBB2020$PPG<60, c("school", "wins", "losses", "PPG", "PAPG")])
##                     school wins losses  PPG PAPG
## 15     Arkansas-Pine Bluff    4     26 53.8 68.3
## 83               Fairfield   12     20 58.0 62.9
## 91                 Fordham    9     22 58.6 61.7
## 135         Kennesaw State    1     28 55.2 75.2
## 163 Maryland-Eastern Shore    5     27 57.5 71.2
## 330               Virginia   23      7 57.0 52.4
print(CBB2020[CBB2020$PAPG>82, c("school", "wins", "losses", "PPG", "PAPG")])
##                       school wins losses  PPG PAPG
## 43          Central Arkansas   10     21 74.7 82.4
## 67            Delaware State    6     26 73.4 83.0
## 114        Houston Christian    4     25 80.1 93.9
## 179 Mississippi Valley State    3     27 68.3 89.7
## 258                  Samford   10     23 74.2 82.2
## 296         Tennessee-Martin    9     20 75.7 82.1
print(CBB2020[CBB2020$PAPG<58, c("school", "wins", "losses", "PPG", "PAPG")])
##       school wins losses  PPG PAPG
## 142  Liberty   30      4 68.6 53.8
## 330 Virginia   23      7 57.0 52.4
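
The cutoffs above (83, 60, 82, 58) were read off the boxplots by eye. A minimal sketch of an alternative is to ask R for the points the boxplot itself flags, using boxplot.stats(). The helper name `boxplot_outliers` is hypothetical, and the toy data frame contains made-up values, not rows from CBB2020.

```r
# Hypothetical helper: rows whose value in `col` is flagged by the boxplot
# rule (more than 1.5 times the hinge spread beyond the hinges, which is the
# boxplot() default).
boxplot_outliers <- function(df, col) {
  flagged <- boxplot.stats(df[[col]])$out
  df[df[[col]] %in% flagged, , drop = FALSE]
}

# Made-up values for illustration only
toy <- data.frame(school = c("A", "B", "C", "D", "E"),
                  PPG = c(70, 71, 72, 73, 95))
result <- boxplot_outliers(toy, "PPG")
result
```

With the real data this would be boxplot_outliers(CBB2020, "PPG") or boxplot_outliers(CBB2020, "PAPG"); the flagged teams may differ slightly from the hand-picked cutoffs.
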
binWidths<-seq(0,35,2)
hist(CBB2020$wins, col="powderblue", main="Win Distribution", xlab="Wins", breaks=binWidths)

## Predicting a 2020 National Champion

To begin, I am going to eliminate any program with a negative SOR.

SOR2020<-CBB2020[CBB2020$strength_of_record>0,]
View(SOR2020)

This brings our list of teams from 353 to 168. Now, I am going to eliminate any team with 20 or fewer wins. Champions since 1985 averaged about 27 wins (26.875) before the NCAA tournament began. To calculate this, I subtract 6 from the win column and take the average: a team has to win 6 tournament games to win a national championship, so removing those wins gives its pre-tournament win total.

NCAAChamps$Wins<-NCAAChamps$wins-6
mean(NCAAChamps$Wins)
## [1] 26.875
Wins2020<-SOR2020[SOR2020$wins>20,]
View(Wins2020)

This brings our list from 168 to 69 teams. Next, I am going to compare the five-number summaries (plus the mean, which summary() also reports) of PPG and PAPG for the 2020 teams with those of the national champions since 1985.

print("5 Number Summary: PPG 2020")
## [1] "5 Number Summary: PPG 2020"
summary(CBB2020$PPG)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   53.80   68.30   71.10   71.07   74.60   87.40
print("5 Number Summary: PAPG 2020")
## [1] "5 Number Summary: PAPG 2020"
summary(CBB2020$PAPG)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   52.40   66.30   69.40   69.67   72.40   93.90
print("5 Number Summary: PPG Champs Since 1985")
## [1] "5 Number Summary: PPG Champs Since 1985"
summary(NCAAChamps$points_per_game)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   68.70   77.85   80.30   81.74   86.83   93.50
print("5 Number Summary: PAPG Champs Since 1985")
## [1] "5 Number Summary: PAPG Champs Since 1985"
summary(NCAAChamps$points_allowed_per_game)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   56.10   63.48   67.55   67.17   70.67   78.50
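
Note that summary() prints six values, because it includes the mean. If a strict five-number summary is wanted, base R's fivenum() returns exactly five (minimum, lower hinge, median, upper hinge, maximum); its hinges can differ slightly from the quartiles that summary() reports. The sketch below uses only the six champions shown by head(NCAAChamps), and the labels are my own.

```r
# PPG for the first six champions printed by head(NCAAChamps)
champ_ppg <- c(68.7, 79.4, 82.5, 75.3, 91.7, 93.5)

five <- fivenum(champ_ppg)
names(five) <- c("min", "lower_hinge", "median", "upper_hinge", "max")
five
```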

Specifically, I am going to look at the mean. National champions have averaged 81.74 PPG, about 10.7 points higher than the 71.07 average across all teams in 2020. We can use this gap to narrow down our dataset when trying to predict a national champion.

Now, I am going to eliminate any program that did not score more than 80 points per game in 2020.

PPG80<-Wins2020[Wins2020$PPG>80,]
View(PPG80)

This brings our list to just 7 teams. I listed them out below:

- Duke (25-6), ACC: 82.5 PPG and 68.0 PAPG
- Eastern Washington (23-8), Big Sky: 80.9 PPG and 72.9 PAPG
- Gonzaga (31-2), West Coast: 87.4 PPG and 67.8 PAPG
- LSU (21-10), SEC: 80.5 PPG and 73.3 PAPG
- Stephen F. Austin (28-3), Southland: 80.6 PPG and 76.0 PAPG
- Winthrop (24-10), Big South: 81.3 PPG and 71.5 PAPG
- Wright State (25-7), Horizon: 80.6 PPG and 70.8 PAPG
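
The three cuts so far (positive SOR, more than 20 wins, more than 80 PPG) can also be written as a single step. This is a sketch, not the code used above: the helper name `title_contenders` and its parameter names are hypothetical, and the toy data frame mixes three rows from head(CBB2020) with one made-up row so that something survives the filter.

```r
# Hypothetical helper combining the three cuts; the default thresholds mirror
# the ones used in this analysis. subset() drops rows where the condition is
# NA, whereas df[cond, ] would return all-NA rows for them.
title_contenders <- function(df, min_sor = 0, min_wins = 20, min_ppg = 80) {
  subset(df, strength_of_record > min_sor & wins > min_wins & PPG > min_ppg)
}

# Three rows from head(CBB2020) plus one made-up row ("Hypothetical U")
toy <- data.frame(
  school = c("Abilene Christian", "Akron", "Alabama", "Hypothetical U"),
  strength_of_record = c(-2.87, 7.15, 11.12, 9.00),
  wins = c(20, 24, 16, 27),
  PPG = c(75.9, 75.7, 82.0, 81.0)
)
survivors <- title_contenders(toy)
survivors$school
```

Keeping the thresholds as arguments makes it easy to see how sensitive the final list is to each cutoff, for example by calling title_contenders(CBB2020, min_ppg = 78).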

From this list, I selected the 2 teams that I think had the highest likelihood of winning the championship in 2020: Duke and Gonzaga. I eliminated Eastern Washington, Stephen F. Austin, Winthrop, and Wright State for being from small conferences, and I eliminated LSU because they had 10 losses and a high PAPG.

To make my final decision between the two teams, I am going to look at the difference between PPG and PAPG in the championship winners.

ChampPPG<-NCAAChamps$points_per_game
ChampPAPG<-NCAAChamps$points_allowed_per_game
difference<-ChampPPG-ChampPAPG

NCAAChamps$diff<-difference

mean(NCAAChamps$diff)
## [1] 14.575
max(NCAAChamps$diff)
## [1] 22

The average NCAA champion scores about 14.6 more points per game than it allows, and the largest margin for any champion since 1985 is 22. Duke’s margin of 14.5 is almost exactly that average, while Gonzaga’s margin of 19.6 is well above it. Because of the bigger gap between Gonzaga’s PPG and PAPG, I think Gonzaga would have won the 2020 NCAA Championship.
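
The comparison for the two finalists can be sketched as follows. The PPG and PAPG values come from the list of seven teams, and 14.575 is the mean(NCAAChamps$diff) output above; the data frame and column names are my own.

```r
# PPG and PAPG for the two finalists, from the list of seven teams
finalists <- data.frame(school = c("Duke", "Gonzaga"),
                        PPG = c(82.5, 87.4),
                        PAPG = c(68.0, 67.8))

# Scoring margin, rounded to one decimal to avoid floating-point noise
finalists$margin <- round(finalists$PPG - finalists$PAPG, 1)

# Mean champion margin, copied from the mean(NCAAChamps$diff) output
champ_mean_margin <- 14.575
finalists$above_champ_mean <- finalists$margin > champ_mean_margin
finalists
```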