First, select Project/New Project from your RStudio menu and create a new project in a new directory called “march_madness” (or whatever else you'd prefer to call it).
Go to: https://sites.google.com/a/saintannsny.org/probability-and-statistics/files/march_workspace.RData
to download an .RData file containing the data. Now upload this .RData file into your project folder.
Load the .RData file that you just uploaded:
load("march_workspace.RData")
You'll notice that this creates a number of data sets in your workspace. These are the data sets that we'll need for this project.
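If you'd like to check which objects an .RData file contains before (or after) loading it, one option is to load it into a scratch environment and look at the names that load() returns. The sketch below is only an illustration: peek_rdata is a hypothetical helper, and the demo uses a throwaway temporary file with made-up toy objects rather than the real workspace.

```r
# Hypothetical helper: load an .RData file into a scratch environment and
# return the names of the objects it holds, leaving the workspace untouched.
peek_rdata <- function(path) {
  scratch <- new.env()
  sort(load(path, envir = scratch))  # load() returns the restored object names
}

# Demo on a temporary file with two toy objects (not the real data).
demo_file <- tempfile(fileext = ".RData")
toy_a <- data.frame(rank1 = 1, rank2 = 16)
toy_b <- 1:3
save(toy_a, toy_b, file = demo_file)

demo_names <- peek_rdata(demo_file)
print(demo_names)

# With the real file in your project folder you would call:
# peek_rdata("march_workspace.RData")
```

Calling ls() after load("march_workspace.RData") gives similar information for the workspace itself.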
Today we'll be working primarily with “mergedtourney”.
mergedtourney contains data on 18 years of NCAA tournaments with the rank of the winning and losing team from each game along with the score of each game.
We can start our project by building models that predict the probability of either of two teams winning each game based solely on their rankings.
table(mergedtourney$wrank, mergedtourney$lrank)[1:4,1:4]
     1  2  3  4
  1 10 18 10 24
  2 14  1 19  1
  3  8 12  0  2
  4 10  3  3  1
The rows show the winning team's ranking, the columns show the losing team's ranking, and each cell gives the number of games that fall into that category.
A #1 seed beat a #4 seed 24 times but a #4 seed only beat a #1 seed 10 times. In other words, #1 seeds have won 24/34 = 70.6% of the 34 meetings between these two seeds.
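The same arithmetic can be done for any pair of rankings straight from the table. The sketch below retypes the 4-by-4 corner of the table by hand so that it runs on its own; win_share is a hypothetical helper name, and with the real data you could pass in table(mergedtourney$wrank, mergedtourney$lrank) instead.

```r
# The 4x4 corner of the table above: rows = winner's rank, columns = loser's rank.
counts <- matrix(c(10, 18, 10, 24,
                   14,  1, 19,  1,
                    8, 12,  0,  2,
                   10,  3,  3,  1),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(wrank = as.character(1:4),
                                 lrank = as.character(1:4)))

# Hypothetical helper: share of the meetings between ranks a and b won by rank a.
win_share <- function(tab, a, b) {
  wins_a <- tab[as.character(a), as.character(b)]  # games a won against b
  wins_b <- tab[as.character(b), as.character(a)]  # games b won against a
  wins_a / (wins_a + wins_b)
}

share_1_vs_4 <- win_share(counts, 1, 4)
round(share_1_vs_4, 3)  # 24 / (24 + 10) = 0.706
```

Note that the helper only makes sense for two different rankings; for a equal to b it would count the same cell twice.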
These three columns show team 1's rank, team 2's rank, and whether team 1 won the game:
head(mergedtourney[,c("rank1", "rank2", "win")])
  rank1 rank2 win
1    16     1   0
2     1     9   1
3     8     9   0
4     9     8   0
5    11     3   0
6    14     3   0
Let's predict whether team 1 won the game based on the difference in rankings:
model1 <- lm(win~I(rank2-rank1), data=mergedtourney)
“I()” lets you use arithmetic on variables inside a model formula, so lm() treats rank2-rank1 as a single predictor.
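To see why the I() wrapper matters, it can help to fit the model with and without it. Inside a formula, a bare minus sign means "drop this term" rather than "subtract". The sketch below uses a toy data frame made of the six games printed by head() above, purely so that it runs on its own; the fitted numbers from six games mean nothing.

```r
# Toy data: the six games shown by head() above (illustration only).
toy_games <- data.frame(rank1 = c(16, 1, 8, 9, 11, 14),
                        rank2 = c(1, 9, 9, 8, 3, 3),
                        win   = c(0, 1, 0, 0, 0, 0))

# With I(): the difference rank2 - rank1 is computed and used as one predictor.
with_I <- lm(win ~ I(rank2 - rank1), data = toy_games)

# Without I(): the formula reads "rank2, with the rank1 term removed",
# so only rank2 ends up in the model.
without_I <- lm(win ~ rank2 - rank1, data = toy_games)

names(coef(with_I))     # "(Intercept)" "I(rank2 - rank1)"
names(coef(without_I))  # "(Intercept)" "rank2"
```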
(model1 <- lm(win~I(rank2-rank1), data=mergedtourney))
Call:
lm(formula = win ~ I(rank2 - rank1), data = mergedtourney)
Coefficients:
     (Intercept)  I(rank2 - rank1)
          0.4998            0.0332
So we predict team 1's chance of winning as roughly: .50 + .033*(rank2 - rank1)
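As a quick check of that formula, here is a small sketch that plugs the printed coefficients in by hand. predict_win is a hypothetical helper, and the clamp to the range 0 to 1 is an added assumption (a straight line is not guaranteed to stay within valid probabilities), not something model1 does. Because the coefficients are rounded to four digits, the results can differ slightly in the last digits from what predict(model1, ...) returns.

```r
# Coefficients copied from the printed model1 output (rounded to 4 digits).
intercept <- 0.4998
slope     <- 0.0332

# Hypothetical helper: predicted chance that team 1 wins, given both rankings.
# The clamp to [0, 1] is an assumption added here, not part of lm().
predict_win <- function(rank1, rank2) {
  raw <- intercept + slope * (rank2 - rank1)
  pmin(pmax(raw, 0), 1)
}

p_1_vs_16 <- predict_win(1, 16)  # 0.4998 + 0.0332 * 15
p_16_vs_1 <- predict_win(16, 1)  # 0.4998 - 0.0332 * 15
p_8_vs_9  <- predict_win(8, 9)   # 0.4998 + 0.0332 * 1
print(c(p_1_vs_16, p_16_vs_1, p_8_vs_9))
```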
# Add model1's predictions to mergedtest, then write the id and pred columns to a CSV file
mergedtest$pred <- predict(model1, mergedtest)
write.csv(mergedtest[,c("id", "pred")], "simplemodel.csv", row.names=FALSE)
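# model2 adds a squared term to the ranking difference
model2 <- lm(win~I(rank2-rank1)+I((rank2-rank1)^2), data=mergedtourney)
# model3 uses the difference in log rankings instead
model3 <- lm(win ~ I( log(rank2)-log(rank1) ), data=mergedtourney)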
Often, though, a simple solution is the best.
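The comparison that follows uses a data frame called rankpred, which this walkthrough never builds (presumably it comes from the loaded workspace). If you need to recreate something like it, here is a minimal sketch under the assumption that it holds one row for every pairing of rankings 1 through 16 with rank1 <= rank2. The name rankpred_sketch is made up so that it won't overwrite the real object.

```r
# Assumption: one row per pairing of rankings 1..16 with rank1 <= rank2.
grid <- expand.grid(rank2 = 1:16, rank1 = 1:16)  # rank2 varies fastest
rankpred_sketch <- grid[grid$rank1 <= grid$rank2, c("rank1", "rank2")]
rownames(rankpred_sketch) <- NULL                # renumber rows 1, 2, 3, ...

nrow(rankpred_sketch)     # 16 * 17 / 2 = 136 pairings
head(rankpred_sketch, 3)
```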
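# Predictions from the three fitted models for every pairing in rankpred
rankpred$pred1 <- predict(model1, rankpred)
rankpred$pred2 <- predict(model2, rankpred)
rankpred$pred3 <- predict(model3, rankpred)
# pred4 is a hand-made model: the normal CDF of the difference in log(rank + 0.8)
rankpred$pred4 <- pnorm((log(rankpred$rank2+.8)-log(rankpred$rank1+.8))/1)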
rankpred
rank1 rank2 pred1 pred2 pred3 pred4
1 1 1 0.4998 0.4963 0.4919 0.5000
2 1 2 0.5330 0.5295 0.6190 0.6707
3 1 3 0.5661 0.5628 0.6933 0.7725
4 1 4 0.5993 0.5962 0.7460 0.8367
5 1 5 0.6325 0.6298 0.7869 0.8790
6 1 6 0.6656 0.6635 0.8203 0.9081
7 1 7 0.6988 0.6973 0.8485 0.9287
8 1 8 0.7319 0.7313 0.8730 0.9437
9 1 9 0.7651 0.7654 0.8946 0.9549
10 1 10 0.7982 0.7996 0.9139 0.9634
11 1 11 0.8314 0.8339 0.9314 0.9700
12 1 12 0.8646 0.8683 0.9473 0.9751
13 1 13 0.8977 0.9029 0.9620 0.9792
14 1 14 0.9309 0.9376 0.9755 0.9824
15 1 15 0.9640 0.9725 0.9882 0.9851
16 1 16 0.9972 1.0074 1.0000 0.9872
17 2 2 0.4998 0.4963 0.4919 0.5000
18 2 3 0.5330 0.5295 0.5662 0.6200
19 2 4 0.5661 0.5628 0.6190 0.7051
20 2 5 0.5993 0.5962 0.6599 0.7668
21 2 6 0.6325 0.6298 0.6933 0.8125
22 2 7 0.6656 0.6635 0.7215 0.8472
23 2 8 0.6988 0.6973 0.7460 0.8739
24 2 9 0.7319 0.7313 0.7676 0.8949
25 2 10 0.7651 0.7654 0.7869 0.9115
26 2 11 0.7982 0.7996 0.8043 0.9249
27 2 12 0.8314 0.8339 0.8203 0.9357
28 2 13 0.8646 0.8683 0.8349 0.9446
29 2 14 0.8977 0.9029 0.8485 0.9520
30 2 15 0.9309 0.9376 0.8612 0.9582
31 2 16 0.9640 0.9725 0.8730 0.9634
32 3 3 0.4998 0.4963 0.4919 0.5000
33 3 4 0.5330 0.5295 0.5447 0.5924
34 3 5 0.5661 0.5628 0.5856 0.6638
35 3 6 0.5993 0.5962 0.6190 0.7197
36 3 7 0.6325 0.6298 0.6472 0.7640
37 3 8 0.6656 0.6635 0.6717 0.7995
38 3 9 0.6988 0.6973 0.6933 0.8283
39 3 10 0.7319 0.7313 0.7126 0.8519
40 3 11 0.7651 0.7654 0.7300 0.8714
41 3 12 0.7982 0.7996 0.7460 0.8877
42 3 13 0.8314 0.8339 0.7606 0.9014
43 3 14 0.8646 0.8683 0.7742 0.9130
44 3 15 0.8977 0.9029 0.7869 0.9229
45 3 16 0.9309 0.9376 0.7987 0.9314
46 4 4 0.4998 0.4963 0.4919 0.5000
47 4 5 0.5330 0.5295 0.5328 0.5750
48 4 6 0.5661 0.5628 0.5662 0.6362
49 4 7 0.5993 0.5962 0.5945 0.6863
50 4 8 0.6325 0.6298 0.6190 0.7278
51 4 9 0.6656 0.6635 0.6405 0.7623
52 4 10 0.6988 0.6973 0.6599 0.7913
53 4 11 0.7319 0.7313 0.6773 0.8158
54 4 12 0.7651 0.7654 0.6933 0.8367
55 4 13 0.7982 0.7996 0.7079 0.8545
56 4 14 0.8314 0.8339 0.7215 0.8699
57 4 15 0.8646 0.8683 0.7342 0.8833
58 4 16 0.8977 0.9029 0.7460 0.8949
59 5 5 0.4998 0.4963 0.4919 0.5000
60 5 6 0.5330 0.5295 0.5254 0.5632
61 5 7 0.5661 0.5628 0.5536 0.6165
62 5 8 0.5993 0.5962 0.5781 0.6616
63 5 9 0.6325 0.6298 0.5997 0.7000
64 5 10 0.6656 0.6635 0.6190 0.7329
65 5 11 0.6988 0.6973 0.6364 0.7612
66 5 12 0.7319 0.7313 0.6524 0.7857
67 5 13 0.7651 0.7654 0.6670 0.8070
68 5 14 0.7982 0.7996 0.6806 0.8256
69 5 15 0.8314 0.8339 0.6933 0.8419
70 5 16 0.8646 0.8683 0.7051 0.8562
71 6 6 0.4998 0.4963 0.4919 0.5000
72 6 7 0.5330 0.5295 0.5202 0.5546
73 6 8 0.5661 0.5628 0.5447 0.6017
74 6 9 0.5993 0.5962 0.5662 0.6426
75 6 10 0.6325 0.6298 0.5856 0.6782
76 6 11 0.6656 0.6635 0.6030 0.7092
77 6 12 0.6988 0.6973 0.6190 0.7365
78 6 13 0.7319 0.7313 0.6336 0.7604
79 6 14 0.7651 0.7654 0.6472 0.7816
80 6 15 0.7982 0.7996 0.6599 0.8004
81 6 16 0.8314 0.8339 0.6717 0.8171
82 7 7 0.4998 0.4963 0.4919 0.5000
83 7 8 0.5330 0.5295 0.5164 0.5480
84 7 9 0.5661 0.5628 0.5380 0.5903
85 7 10 0.5993 0.5962 0.5573 0.6276
86 7 11 0.6325 0.6298 0.5748 0.6606
87 7 12 0.6656 0.6635 0.5907 0.6898
88 7 13 0.6988 0.6973 0.6054 0.7158
89 7 14 0.7319 0.7313 0.6190 0.7391
90 7 15 0.7651 0.7654 0.6316 0.7599
91 7 16 0.7982 0.7996 0.6434 0.7785
92 8 8 0.4998 0.4963 0.4919 0.5000
93 8 9 0.5330 0.5295 0.5135 0.5429
94 8 10 0.5661 0.5628 0.5328 0.5811
95 8 11 0.5993 0.5962 0.5503 0.6154
96 8 12 0.6325 0.6298 0.5662 0.6461
97 8 13 0.6656 0.6635 0.5809 0.6736
98 8 14 0.6988 0.6973 0.5945 0.6984
99 8 15 0.7319 0.7313 0.6071 0.7208
100 8 16 0.7651 0.7654 0.6190 0.7411
101 9 9 0.4998 0.4963 0.4919 0.5000
102 9 10 0.5330 0.5295 0.5113 0.5387
103 9 11 0.5661 0.5628 0.5287 0.5737
104 9 12 0.5993 0.5962 0.5447 0.6053
105 9 13 0.6325 0.6298 0.5593 0.6339
106 9 14 0.6656 0.6635 0.5729 0.6599
107 9 15 0.6988 0.6973 0.5856 0.6835
108 9 16 0.7319 0.7313 0.5974 0.7051
109 10 10 0.4998 0.4963 0.4919 0.5000
110 10 11 0.5330 0.5295 0.5094 0.5353
111 10 12 0.5661 0.5628 0.5254 0.5675
112 10 13 0.5993 0.5962 0.5400 0.5968
113 10 14 0.6325 0.6298 0.5536 0.6236
114 10 15 0.6656 0.6635 0.5662 0.6482
115 10 16 0.6988 0.6973 0.5781 0.6707
116 11 11 0.4998 0.4963 0.4919 0.5000
117 11 12 0.5330 0.5295 0.5079 0.5324
118 11 13 0.5661 0.5628 0.5226 0.5622
119 11 14 0.5993 0.5962 0.5361 0.5896
120 11 15 0.6325 0.6298 0.5488 0.6148
121 11 16 0.6656 0.6635 0.5606 0.6381
122 12 12 0.4998 0.4963 0.4919 0.5000
123 12 13 0.5330 0.5295 0.5066 0.5300
124 12 14 0.5661 0.5628 0.5202 0.5577
125 12 15 0.5993 0.5962 0.5328 0.5834
126 12 16 0.6325 0.6298 0.5447 0.6072
127 13 13 0.4998 0.4963 0.4919 0.5000
128 13 14 0.5330 0.5295 0.5055 0.5279
129 13 15 0.5661 0.5628 0.5182 0.5538
130 13 16 0.5993 0.5962 0.5300 0.5780
131 14 14 0.4998 0.4963 0.4919 0.5000
132 14 15 0.5330 0.5295 0.5046 0.5261
133 14 16 0.5661 0.5628 0.5164 0.5504
134 15 15 0.4998 0.4963 0.4919 0.5000
135 15 16 0.5330 0.5295 0.5038 0.5245
136 16 16 0.4998 0.4963 0.4919 0.5000
Try building the best models you can using just team rankings.
You can use a linear model, lm(),
or just make up something that you think might work.
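To compare a made-up model with a fitted one, you need some way to score predictions. The sketch below is one possibility, not the method this project prescribes: it scores predicted probabilities by mean squared error against the 0/1 outcomes, which is an assumption about how you might judge "best". made_up_model and mean_sq_error are hypothetical names, the formula has the same shape as pred4 above, and the toy data are just the six games printed by head() earlier. With the real data you would pass in mergedtourney's columns instead.

```r
# Toy data: the six games shown by head() earlier (illustration only).
toy_games <- data.frame(rank1 = c(16, 1, 8, 9, 11, 14),
                        rank2 = c(1, 9, 9, 8, 3, 3),
                        win   = c(0, 1, 0, 0, 0, 0))

# Hypothetical hand-made model with the same shape as pred4;
# offset and scale are knobs you can tune.
made_up_model <- function(rank1, rank2, offset = 0.8, scale = 1) {
  pnorm((log(rank2 + offset) - log(rank1 + offset)) / scale)
}

# Mean squared error between predicted probabilities and 0/1 outcomes
# (lower is better).
mean_sq_error <- function(pred, win) mean((pred - win)^2)

score_made_up   <- mean_sq_error(made_up_model(toy_games$rank1, toy_games$rank2),
                                 toy_games$win)
score_coin_flip <- mean_sq_error(rep(0.5, nrow(toy_games)), toy_games$win)
print(c(made_up = score_made_up, coin_flip = score_coin_flip))
```

Always predicting 0.5 scores exactly 0.25 under this measure, so that is a handy baseline for any model to beat.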