First things first, close any project you are in and start a new project entitled something creative like “marchmadness”.
Next, open a new R script file, then paste in and run the following three lines. They point R to the proper package folders, load the packages we need most, and load some data.
.libPaths(c("/home/rstudioshared", "/home/rstudioshared/shared_files/packages"))
library(dplyr); library(ggplot2); library(tidyr)
load('/home/rstudioshared/shared_files/data/march_workspace.RData')
You’ll notice that this creates a number of data sets in your workspace. These are the data sets that we’ll need for this project.
Today we’ll be working primarily with “mergedtourney”.
Here is an image of an NCAA tournament bracket.
mergedtourney contains data on 18 years of NCAA tournaments with the rank of the winning and losing team from each game along with the score of each game.
We can start our project by building models that predict the probability of either of two teams winning each game based solely on their rankings.
table(mergedtourney$wrank, mergedtourney$lrank)[1:4,1:4]
##
##       1  2  3  4
##   1  10 18 10 24
##   2  14  1 19  1
##   3   8 12  0  2
##   4  10  3  3  1
The rows show the winning team’s ranking, the columns show the losing team’s ranking, and each cell gives the number of games that fall into that category.
A #1 seed beat a #4 seed 24 times but a #4 seed only beat a #1 seed 10 times. In other words, #1 seeds have won 24/34 = 70.6% of the 34 meetings between these two seeds.
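The same arithmetic can be done in R. This is a minimal sketch that retypes the 4x4 corner of the table by hand so it runs on its own; `seed_win_rate` is a hypothetical helper name, not something in the workspace.

```r
# Counts copied from the 4x4 corner of the table shown above
# (rows = winning seed, columns = losing seed)
wins <- matrix(
  c(10, 18, 10, 24,
    14,  1, 19,  1,
     8, 12,  0,  2,
    10,  3,  3,  1),
  nrow = 4, byrow = TRUE,
  dimnames = list(winner = as.character(1:4), loser = as.character(1:4))
)

# Hypothetical helper: share of meetings won by the better (lower) seed
seed_win_rate <- function(tab, better, worse) {
  w <- tab[better, worse]   # games the better seed won
  l <- tab[worse, better]   # games the worse seed won
  w / (w + l)
}

rate_1v4 <- seed_win_rate(wins, 1, 4)
round(100 * rate_1v4, 1)    # 70.6
```

With the real data you could pass `table(mergedtourney$wrank, mergedtourney$lrank)` in place of `wins`.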
mergedtourney also records each game as team 1 versus team 2. These three columns show team 1’s rank, team 2’s rank and whether team 1 won the game (1) or lost it (0):
nrow(mergedtourney)
## [1] 1156
head(mergedtourney[,c("rank1", "rank2", "win")])
##   rank1 rank2 win
## 1    16     1   0
## 2     1     9   1
## 3     8     9   0
## 4     9     8   0
## 5    11     3   0
## 6    14     3   0
Let’s predict whether team 1 won the game based on the difference in rankings:
model1 <- lm(win~I(rank2-rank1), data=mergedtourney)
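“I()” lets you use arithmetic on variables inside a formula. Without it, “-” would be read as formula syntax (drop a term) rather than as subtraction.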
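To see what “I()” changes, here is a small sketch on made-up toy data (the `toy` data frame is invented for illustration and is not part of the workspace). It fits the same formula with and without “I()” and compares which terms end up in each model.

```r
# Toy data, made up purely for illustration (not from mergedtourney)
toy <- data.frame(
  win   = c(1, 0, 1, 0, 1, 0),
  rank1 = c(1, 16, 3, 9, 2, 12),
  rank2 = c(16, 1, 14, 8, 15, 5)
)

# With I(): one predictor, the arithmetic difference rank2 - rank1
with_I <- lm(win ~ I(rank2 - rank1), data = toy)
names(coef(with_I))      # "(Intercept)" "I(rank2 - rank1)"

# Without I(): "-" is formula syntax meaning "remove this term",
# so the model keeps rank2 alone and rank1 never enters
without_I <- lm(win ~ rank2 - rank1, data = toy)
names(coef(without_I))   # "(Intercept)" "rank2"
```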
(model1 <- lm(win~I(rank2-rank1), data=mergedtourney))
##
## Call:
## lm(formula = win ~ I(rank2 - rank1), data = mergedtourney)
##
## Coefficients:
##      (Intercept)  I(rank2 - rank1)
##          0.49984           0.03316
So, we predict team 1’s chance of winning as roughly: 0.50 + 0.033 * (rank2 - rank1)
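As a worked example, the sketch below plugs a few matchups into that formula using the coefficients printed above. `linear_pred` is a hypothetical helper that mimics what `predict(model1, ...)` computes; it is only here so the arithmetic can be checked by hand.

```r
# Coefficients copied from the printed model1 output above
intercept <- 0.49984
slope     <- 0.03316

# Hypothetical helper reproducing the model1 formula by hand
linear_pred <- function(rank1, rank2) {
  intercept + slope * (rank2 - rank1)
}

linear_pred(1, 16)   # 0.49984 + 0.03316 * 15  = 0.99724
linear_pred(8, 9)    # 0.49984 + 0.03316 * 1   = 0.53300
linear_pred(16, 1)   # 0.49984 - 0.03316 * 15  = 0.00244
```

Because the intercept is almost exactly 0.5, swapping the two teams gives predictions that sum to very nearly 1.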
mergedtest$pred <- predict(model1, mergedtest)
write.csv(mergedtest[,c("id", "pred")], "simplemodel.csv", row.names=FALSE)
model2 <- lm(win~I(rank2-rank1)+I((rank2-rank1)^2), data=mergedtourney)
model3 <- lm(win ~ I( log(rank2)-log(rank1) ), data=mergedtourney)
More flexible models like the two above are worth trying, although often a simple solution is the best.
rankpred$pred1 <- predict(model1, rankpred)
rankpred$pred2 <- predict(model2, rankpred)
rankpred$pred3 <- predict(model3, rankpred)
rankpred$pred4 <- pnorm((log(rankpred$rank2+.8)-log(rankpred$rank1+.8))/1)
View(rankpred)
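Unlike the first three, pred4 is not a fitted model: it is a hand-made formula that pushes the difference in log rankings through pnorm(). The sketch below rebuilds it on a stand-in grid so it runs on its own. It assumes rankpred is a data frame with rank1 and rank2 columns; `grid`, `probit_pred` and the `scale` argument (our name for the "/1" divisor) are hypothetical names introduced for illustration.

```r
# Hypothetical stand-in for rankpred: every pairing of seeds 1 to 16
grid <- expand.grid(rank1 = 1:16, rank2 = 1:16)

# Same form as pred4, with the divisor exposed as a tunable `scale`.
# The 0.8 offset is taken from the pred4 line above.
probit_pred <- function(rank1, rank2, scale = 1, offset = 0.8) {
  pnorm((log(rank2 + offset) - log(rank1 + offset)) / scale)
}

grid$scale1 <- probit_pred(grid$rank1, grid$rank2, scale = 1)
grid$scale2 <- probit_pred(grid$rank1, grid$rank2, scale = 2)

# Compare the two settings for a 1 seed vs a 16 seed and an 8 seed vs a 9 seed
subset(grid, (rank1 == 1 & rank2 == 16) | (rank1 == 8 & rank2 == 9))
```

A larger `scale` pulls every prediction toward 0.5, so it is one knob you can tune when experimenting.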
Try building the best models you can using just team rankings.
You can use a linear model with lm(), or just make up something that you think might work.
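If you do make something up, two quick sanity checks are worth running: predictions should stay between 0 and 1, and swapping the two teams should give the complementary probability. Here is a minimal sketch with one invented rule; `madeup_pred` and its steepness `k = 0.15` are arbitrary illustrations, not fitted or recommended values.

```r
# Hypothetical hand-made rule: a logistic curve in the seed difference.
# The steepness k = 0.15 is an arbitrary starting guess, not a fitted value.
madeup_pred <- function(rank1, rank2, k = 0.15) {
  plogis(k * (rank2 - rank1))
}

# Every pairing of seeds 1 to 16
matchups  <- expand.grid(rank1 = 1:16, rank2 = 1:16)
p         <- madeup_pred(matchups$rank1, matchups$rank2)
p_swapped <- madeup_pred(matchups$rank2, matchups$rank1)

range(p)                      # should stay inside [0, 1]
max(abs(p + p_swapped - 1))   # should be numerically zero
```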