First things first, close any project you are in and start a new project entitled something creative like “marchmadness”.

Next, open an R script file, then paste in and run the following three lines. They point R to the proper folders, load the packages we need most, and load some data.

.libPaths(c("/home/rstudioshared", "/home/rstudioshared/shared_files/packages"))
library(dplyr); library(ggplot2); library(tidyr)
load('/home/rstudioshared/shared_files/data/march_workspace.RData')

You’ll notice that this creates a number of data sets in your workspace. These are the data sets that we’ll need for this project.

Today we’ll be working primarily with “mergedtourney”.

The Tournament Data

Here is an image of an NCAA tournament bracket.

mergedtourney contains 18 years of NCAA tournament games, with the ranks of the winning and losing teams and the score of each game.

We can start our project by building models that predict the probability of either of two teams winning each game based solely on their rankings.

A Partial Table

table(mergedtourney$wrank, mergedtourney$lrank)[1:4,1:4]
##    
##      1  2  3  4
##   1 10 18 10 24
##   2 14  1 19  1
##   3  8 12  0  2
##   4 10  3  3  1

The rows show the winning team’s ranking, the columns show the losing team’s ranking, and each cell gives the number of games that fall into that category.

A #1 seed beat a #4 seed 24 times but a #4 seed only beat a #1 seed 10 times. In other words, #1 seeds have won 24/34 = 70.6% of the 34 meetings between these two seeds.
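As a quick check of that arithmetic, here is a small sketch that rebuilds the partial table by hand and computes the share. The object names (counts, share) are made up for this example; with the real data you would index the full table() result instead.

```r
# Counts copied from the partial table above
# (rows = winning team's rank, columns = losing team's rank).
counts <- matrix(c(10, 18, 10, 24,
                   14,  1, 19,  1,
                    8, 12,  0,  2,
                   10,  3,  3,  1),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(wrank = 1:4, lrank = 1:4))

wins_1_over_4 <- counts["1", "4"]   # games where a #1 beat a #4
wins_4_over_1 <- counts["4", "1"]   # games where a #4 beat a #1

# Share of the 1-vs-4 meetings won by the #1 seed
share <- wins_1_over_4 / (wins_1_over_4 + wins_4_over_1)
round(100 * share, 1)   # 70.6
```

The same two-cell lookup works for any pair of rankings, as long as both cells exist in the table.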

Data for a Rankings Model

These three columns show team 1’s rank, team 2’s rank and whether team 1 won the game:

nrow(mergedtourney)
## [1] 1156
head(mergedtourney[,c("rank1", "rank2", "win")])
##   rank1 rank2 win
## 1    16     1   0
## 2     1     9   1
## 3     8     9   0
## 4     9     8   0
## 5    11     3   0
## 6    14     3   0

The Difference in Rankings

Let’s predict whether team 1 won the game based on the difference in rankings:

model1 <- lm(win~I(rank2-rank1), data=mergedtourney)

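“I()” lets you do arithmetic on variables inside a model formula. Without it, “-” would be read as formula syntax (drop a term) rather than as subtraction.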
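To see what I() changes, here is a minimal sketch on a toy data frame. The six rows are the ones printed by head() earlier; the names toy, without_I and with_I are invented for the illustration.

```r
# Toy data: the six rows shown by head(mergedtourney[...]) above.
toy <- data.frame(rank1 = c(16, 1, 8, 9, 11, 14),
                  rank2 = c(1, 9, 9, 8, 3, 3),
                  win   = c(0, 1, 0, 0, 0, 0))

# Without I(), "- rank1" means "remove the rank1 term",
# so the model only uses rank2.
without_I <- lm(win ~ rank2 - rank1, data = toy)
names(coef(without_I))

# With I(), the subtraction is done first and the
# difference becomes a single predictor.
with_I <- lm(win ~ I(rank2 - rank1), data = toy)
names(coef(with_I))
```

Comparing the two sets of coefficient names shows that only the second model actually uses the difference in rankings.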

Summary of the Simple Model

(model1 <- lm(win~I(rank2-rank1), data=mergedtourney))
## 
## Call:
## lm(formula = win ~ I(rank2 - rank1), data = mergedtourney)
## 
## Coefficients:
##      (Intercept)  I(rank2 - rank1)  
##          0.49984           0.03316

So, we predict the chance of winning as roughly: .50 + .033*(difference in rankings)
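To get a feel for what that formula gives, here is a small sketch that plugs in the two coefficients printed above. The helper predict_win() is made up for this example; with the fitted model you would call predict() instead.

```r
# Coefficients copied from the model summary above.
intercept <- 0.49984
slope     <- 0.03316

# Hypothetical helper: predicted chance that team 1 wins.
predict_win <- function(rank1, rank2) {
  intercept + slope * (rank2 - rank1)
}

predict_win(1, 16)   # a 1 against a 16: very close to 1
predict_win(1, 4)    # a 1 against a 4: about 0.6
predict_win(16, 1)   # a 16 against a 1: very close to 0
predict_win(8, 8)    # equal ranks: the intercept, about 0.5
```

Because this is a straight line, nothing forces the output to stay between 0 and 1. With ranks from 1 to 16 the largest difference is 15, so these particular coefficients stay just inside that range.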

Making Predictions and Creating a Submission for Kaggle

mergedtest$pred <- predict(model1, mergedtest)
write.csv(mergedtest[,c("id", "pred")], "simplemodel.csv", row.names=FALSE)
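Before uploading, it can be worth checking that the file looks the way you expect. The sketch below uses a tiny made-up stand-in for mergedtest (the ids and predictions are invented) and writes to a temporary file rather than to simplemodel.csv.

```r
# Hypothetical stand-in for mergedtest; the real one comes from
# march_workspace.RData and its ids will look different.
toy_test <- data.frame(id   = c("game_a", "game_b", "game_c"),
                       pred = c(0.99724, 0.59932, 0.46668),
                       stringsAsFactors = FALSE)

# Every prediction should be a usable probability.
all_valid <- all(toy_test$pred >= 0 & toy_test$pred <= 1)

# Write to a temporary file and read it back to confirm the layout.
out_file <- tempfile(fileext = ".csv")
write.csv(toy_test[, c("id", "pred")], out_file, row.names = FALSE)
check <- read.csv(out_file, stringsAsFactors = FALSE)

names(check)   # only the two columns, no row-name column
nrow(check)    # one row per game
```

With row.names=FALSE the file contains just the id and pred columns, which is why that argument matters here.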

We could get fancier…

model2 <- lm(win~I(rank2-rank1)+I((rank2-rank1)^2), data=mergedtourney)

model3 <- lm(win ~ I( log(rank2)-log(rank1) ), data=mergedtourney)

Although often a simple solution is the best.

To see what kinds of predictions these models make, use “rankpred”:

rankpred$pred1 <- predict(model1, rankpred)
rankpred$pred2 <- predict(model2, rankpred)
rankpred$pred3 <- predict(model3, rankpred)
rankpred$pred4 <- pnorm((log(rankpred$rank2+.8)-log(rankpred$rank1+.8))/1)

Then open the table to compare the four sets of predictions side by side:

View(rankpred)
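The pred4 column is not a fitted model at all, just a hand-built formula. Here is a sketch of the same formula wrapped in a function so you can experiment with it. The function name pred4_fn, the argument names offset and scale, and the 16-by-16 grid are assumptions made for this example; they stand in for the .8, the 1 and the rankpred table used above.

```r
# Same formula as pred4 above, with the two constants exposed.
pred4_fn <- function(rank1, rank2, offset = 0.8, scale = 1) {
  pnorm((log(rank2 + offset) - log(rank1 + offset)) / scale)
}

pred4_fn(8, 8)                     # equal ranks give exactly 0.5
pred4_fn(1, 16) + pred4_fn(16, 1)  # the two sides of a game sum to 1

# Hypothetical grid of every matchup, assuming ranks run from 1 to 16.
grid <- expand.grid(rank1 = 1:16, rank2 = 1:16)
grid$pred4 <- pred4_fn(grid$rank1, grid$rank2)
range(grid$pred4)                  # always strictly between 0 and 1
```

Unlike the linear models, pnorm() can never return a value outside 0 to 1, and changing offset or scale gives you a whole family of curves to try.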

More Models

Try building the best models you can using just team rankings.

You can use a linear model: lm()

or just make up something that you think might work, as pred4 does above.