March Madness - Week 1

Starting a Project

First, select Project/New Project from your RStudio menu and create a new project in a new directory entitled “march_madness” (or whatever else you'd prefer to call it).

Go to: https://sites.google.com/a/saintannsny.org/probability-and-statistics/files/march_workspace.RData

to download an .RData file containing the data. Now upload this .RData file into your project folder.

Loading the Workspace

Load the .RData file that you just uploaded:

load("march_workspace.RData")

You'll notice that this creates a number of data sets in your workspace. These are the data sets that we'll need for this project.
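If you want to check exactly which objects were created, note that load() returns their names (invisibly). The sketch below uses a temporary file and two made-up objects (toy_seeds and toy_games are hypothetical stand-ins, not part of march_workspace.RData) so that it runs anywhere:

```r
# Save two made-up objects to a temporary .RData file.
tmp <- file.path(tempdir(), "toy_workspace.RData")
toy_seeds <- 1:16
toy_games <- data.frame(rank1 = c(1, 8), rank2 = c(16, 9), win = c(1, 0))
save(toy_seeds, toy_games, file = tmp)

# Remove them, then load the file back in.
rm(toy_seeds, toy_games)
loaded <- load(tmp)   # load() returns the names of the objects it created

print(loaded)         # names of the restored objects
str(toy_games)        # a quick look at one of them
```

With the real file, the same pattern would be loaded <- load("march_workspace.RData") followed by print(loaded).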

Today we'll be working primarily with “mergedtourney”.

The Tournament Data

mergedtourney contains data on 18 years of NCAA tournaments: the rank of the winning and losing team in each game, along with the score of each game.

We can start our project by building models that predict the probability of either of two teams winning each game based solely on their rankings.

A Partial Table

table(mergedtourney$wrank, mergedtourney$lrank)[1:4,1:4]


     1  2  3  4
  1 10 18 10 24
  2 14  1 19  1
  3  8 12  0  2
  4 10  3  3  1

The rows show the winning team's ranking, the columns show the losing team's ranking, and each cell gives the number of games that fall into that category.

A #1 seed beat a #4 seed 24 times, but a #4 seed beat a #1 seed only 10 times. In other words, #1 seeds have won 24/34 = 70.6% of the 34 meetings between these two seeds.
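The same arithmetic can be done in R. This is a minimal sketch that re-types the 4x4 counts from the partial table into a matrix (tab, share_1 and the other names are made up for illustration); with the real data you would index the output of table() in the same way:

```r
# Counts copied from the partial table (rows = winner's rank, columns = loser's rank).
tab <- matrix(
  c(10, 18, 10, 24,
    14,  1, 19,  1,
     8, 12,  0,  2,
    10,  3,  3,  1),
  nrow = 4, byrow = TRUE,
  dimnames = list(wrank = as.character(1:4), lrank = as.character(1:4))
)

wins_1_over_4 <- tab["1", "4"]   # games a #1 seed won against a #4 seed
wins_4_over_1 <- tab["4", "1"]   # games a #4 seed won against a #1 seed

# Share of the 1-vs-4 meetings won by the #1 seed.
share_1 <- wins_1_over_4 / (wins_1_over_4 + wins_4_over_1)
round(100 * share_1, 1)          # 70.6
```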

Data for a Rankings Model

These three columns show team 1's rank, team 2's rank and whether team 1 won the game:

head(mergedtourney[,c("rank1", "rank2", "win")])

  rank1 rank2 win
1    16     1   0
2     1     9   1
3     8     9   0
4     9     8   0
5    11     3   0
6    14     3   0

The Difference in Rankings

Let's predict whether team 1 won the game based on the difference in rankings:

model1 <- lm(win~I(rank2-rank1), data=mergedtourney)

“I()” lets you create functions of variables inside a formula. Without it, the “-” in a formula means “remove this term” rather than “subtract”.
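To see what I() changes, compare the coefficient names with and without it. The data frame below is made up purely to show how the formula is parsed; it is not taken from mergedtourney:

```r
# Made-up toy data, only to show how the formula is parsed.
toy <- data.frame(
  rank1 = c(1, 8, 3, 12, 5, 16),
  rank2 = c(16, 9, 14, 5, 7, 2),
  win   = c(1, 0, 1, 0, 1, 0)
)

# Without I(): "-" drops rank1 from the model, so only rank2 is a predictor.
without_I <- names(coef(lm(win ~ rank2 - rank1, data = toy)))
print(without_I)

# With I(): the subtraction is computed first and used as a single predictor.
with_I <- names(coef(lm(win ~ I(rank2 - rank1), data = toy)))
print(with_I)
```

The first model has the terms (Intercept) and rank2; the second has (Intercept) and I(rank2 - rank1).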

Summary of the Simple Model

(model1 <- lm(win~I(rank2-rank1), data=mergedtourney))


Call:
lm(formula = win ~ I(rank2 - rank1), data = mergedtourney)

Coefficients:
     (Intercept)  I(rank2 - rank1)  
          0.4998            0.0332

So we predict team 1's chance of winning as roughly: 0.50 + 0.033*(rank2 - rank1)
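As a quick check, the formula can be written as a small function. This sketch uses the coefficients as rounded in the printed output (0.4998 and 0.0332), so its results can differ from predict(model1, ...) in the last digit or two; predict_win is a made-up helper name:

```r
# Coefficients as rounded in the printed model output.
intercept <- 0.4998
slope     <- 0.0332

# Predicted chance that team 1 wins, given both teams' ranks.
predict_win <- function(rank1, rank2) {
  intercept + slope * (rank2 - rank1)
}

predict_win(8, 8)    # equal ranks: just the intercept
predict_win(1, 16)   # team 1 is the much better seed
predict_win(16, 1)   # team 1 is the much worse seed
```

Because the intercept is almost exactly 0.5, the predictions for a matchup and its mirror image add up to very nearly 1.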

Creating a Submission for Kaggle

mergedtest$pred <- predict(model1, mergedtest)
write.csv(mergedtest[,c("id", "pred")], "simplemodel.csv", row.names=FALSE)
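One caution before submitting: lm() does not know that it is predicting a probability, so its predictions can fall below 0 or above 1. A possible fix is to clip them into a safe range before writing the file. The sketch below uses made-up prediction values, and the bounds 0.01 and 0.99 are an arbitrary assumption rather than anything required by Kaggle:

```r
# Made-up raw predictions, including values outside [0, 1].
raw_pred <- c(-0.02, 0.35, 0.9972, 1.0074)

# Hypothetical bounds; pick whatever range you think is sensible.
lower <- 0.01
upper <- 0.99

# pmax() raises anything below `lower`; pmin() lowers anything above `upper`.
clipped <- pmin(pmax(raw_pred, lower), upper)
print(clipped)
```

On the real data, the same line would be applied to mergedtest$pred before calling write.csv().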

We could get fancier...

model2 <- lm(win~I(rank2-rank1)+I((rank2-rank1)^2), data=mergedtourney)

model3 <- lm(win ~ I( log(rank2)-log(rank1) ), data=mergedtourney)

Although often a simple solution is the best.

To see what kinds of predictions these models make, use "rankpred":

rankpred$pred1 <- predict(model1, rankpred)
rankpred$pred2 <- predict(model2, rankpred)
rankpred$pred3 <- predict(model3, rankpred)
rankpred$pred4 <- pnorm((log(rankpred$rank2+.8)-log(rankpred$rank1+.8))/1)
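Note that pred4 is not fitted with lm(); it is a hand-written formula that passes a difference of log ranks through the normal CDF, pnorm(). A minimal sketch of the same formula as a reusable function, where seed_prob, offset and scale are made-up names for the 0.8 and the divisor 1 used above:

```r
# Hand-made model: normal CDF of the difference in log(rank + offset).
seed_prob <- function(rank1, rank2, offset = 0.8, scale = 1) {
  pnorm((log(rank2 + offset) - log(rank1 + offset)) / scale)
}

seed_prob(5, 5)   # equal ranks give exactly 0.5
seed_prob(1, 2)   # about 0.67
seed_prob(2, 1)   # the mirror-image matchup

# The two sides of any matchup always add up to 1.
seed_prob(1, 2) + seed_prob(2, 1)
```

Unlike the linear models, this formula always stays strictly between 0 and 1. A larger scale pulls every prediction toward 0.5, and a smaller one makes the model more confident.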

Printing rankpred shows the four sets of predictions side by side:

rankpred

    rank1 rank2  pred1  pred2  pred3  pred4
1       1     1 0.4998 0.4963 0.4919 0.5000
2       1     2 0.5330 0.5295 0.6190 0.6707
3       1     3 0.5661 0.5628 0.6933 0.7725
4       1     4 0.5993 0.5962 0.7460 0.8367
5       1     5 0.6325 0.6298 0.7869 0.8790
6       1     6 0.6656 0.6635 0.8203 0.9081
7       1     7 0.6988 0.6973 0.8485 0.9287
8       1     8 0.7319 0.7313 0.8730 0.9437
9       1     9 0.7651 0.7654 0.8946 0.9549
10      1    10 0.7982 0.7996 0.9139 0.9634
11      1    11 0.8314 0.8339 0.9314 0.9700
12      1    12 0.8646 0.8683 0.9473 0.9751
13      1    13 0.8977 0.9029 0.9620 0.9792
14      1    14 0.9309 0.9376 0.9755 0.9824
15      1    15 0.9640 0.9725 0.9882 0.9851
16      1    16 0.9972 1.0074 1.0000 0.9872
17      2     2 0.4998 0.4963 0.4919 0.5000
18      2     3 0.5330 0.5295 0.5662 0.6200
19      2     4 0.5661 0.5628 0.6190 0.7051
20      2     5 0.5993 0.5962 0.6599 0.7668
21      2     6 0.6325 0.6298 0.6933 0.8125
22      2     7 0.6656 0.6635 0.7215 0.8472
23      2     8 0.6988 0.6973 0.7460 0.8739
24      2     9 0.7319 0.7313 0.7676 0.8949
25      2    10 0.7651 0.7654 0.7869 0.9115
26      2    11 0.7982 0.7996 0.8043 0.9249
27      2    12 0.8314 0.8339 0.8203 0.9357
28      2    13 0.8646 0.8683 0.8349 0.9446
29      2    14 0.8977 0.9029 0.8485 0.9520
30      2    15 0.9309 0.9376 0.8612 0.9582
31      2    16 0.9640 0.9725 0.8730 0.9634
32      3     3 0.4998 0.4963 0.4919 0.5000
33      3     4 0.5330 0.5295 0.5447 0.5924
34      3     5 0.5661 0.5628 0.5856 0.6638
35      3     6 0.5993 0.5962 0.6190 0.7197
36      3     7 0.6325 0.6298 0.6472 0.7640
37      3     8 0.6656 0.6635 0.6717 0.7995
38      3     9 0.6988 0.6973 0.6933 0.8283
39      3    10 0.7319 0.7313 0.7126 0.8519
40      3    11 0.7651 0.7654 0.7300 0.8714
41      3    12 0.7982 0.7996 0.7460 0.8877
42      3    13 0.8314 0.8339 0.7606 0.9014
43      3    14 0.8646 0.8683 0.7742 0.9130
44      3    15 0.8977 0.9029 0.7869 0.9229
45      3    16 0.9309 0.9376 0.7987 0.9314
46      4     4 0.4998 0.4963 0.4919 0.5000
47      4     5 0.5330 0.5295 0.5328 0.5750
48      4     6 0.5661 0.5628 0.5662 0.6362
49      4     7 0.5993 0.5962 0.5945 0.6863
50      4     8 0.6325 0.6298 0.6190 0.7278
51      4     9 0.6656 0.6635 0.6405 0.7623
52      4    10 0.6988 0.6973 0.6599 0.7913
53      4    11 0.7319 0.7313 0.6773 0.8158
54      4    12 0.7651 0.7654 0.6933 0.8367
55      4    13 0.7982 0.7996 0.7079 0.8545
56      4    14 0.8314 0.8339 0.7215 0.8699
57      4    15 0.8646 0.8683 0.7342 0.8833
58      4    16 0.8977 0.9029 0.7460 0.8949
59      5     5 0.4998 0.4963 0.4919 0.5000
60      5     6 0.5330 0.5295 0.5254 0.5632
61      5     7 0.5661 0.5628 0.5536 0.6165
62      5     8 0.5993 0.5962 0.5781 0.6616
63      5     9 0.6325 0.6298 0.5997 0.7000
64      5    10 0.6656 0.6635 0.6190 0.7329
65      5    11 0.6988 0.6973 0.6364 0.7612
66      5    12 0.7319 0.7313 0.6524 0.7857
67      5    13 0.7651 0.7654 0.6670 0.8070
68      5    14 0.7982 0.7996 0.6806 0.8256
69      5    15 0.8314 0.8339 0.6933 0.8419
70      5    16 0.8646 0.8683 0.7051 0.8562
71      6     6 0.4998 0.4963 0.4919 0.5000
72      6     7 0.5330 0.5295 0.5202 0.5546
73      6     8 0.5661 0.5628 0.5447 0.6017
74      6     9 0.5993 0.5962 0.5662 0.6426
75      6    10 0.6325 0.6298 0.5856 0.6782
76      6    11 0.6656 0.6635 0.6030 0.7092
77      6    12 0.6988 0.6973 0.6190 0.7365
78      6    13 0.7319 0.7313 0.6336 0.7604
79      6    14 0.7651 0.7654 0.6472 0.7816
80      6    15 0.7982 0.7996 0.6599 0.8004
81      6    16 0.8314 0.8339 0.6717 0.8171
82      7     7 0.4998 0.4963 0.4919 0.5000
83      7     8 0.5330 0.5295 0.5164 0.5480
84      7     9 0.5661 0.5628 0.5380 0.5903
85      7    10 0.5993 0.5962 0.5573 0.6276
86      7    11 0.6325 0.6298 0.5748 0.6606
87      7    12 0.6656 0.6635 0.5907 0.6898
88      7    13 0.6988 0.6973 0.6054 0.7158
89      7    14 0.7319 0.7313 0.6190 0.7391
90      7    15 0.7651 0.7654 0.6316 0.7599
91      7    16 0.7982 0.7996 0.6434 0.7785
92      8     8 0.4998 0.4963 0.4919 0.5000
93      8     9 0.5330 0.5295 0.5135 0.5429
94      8    10 0.5661 0.5628 0.5328 0.5811
95      8    11 0.5993 0.5962 0.5503 0.6154
96      8    12 0.6325 0.6298 0.5662 0.6461
97      8    13 0.6656 0.6635 0.5809 0.6736
98      8    14 0.6988 0.6973 0.5945 0.6984
99      8    15 0.7319 0.7313 0.6071 0.7208
100     8    16 0.7651 0.7654 0.6190 0.7411
101     9     9 0.4998 0.4963 0.4919 0.5000
102     9    10 0.5330 0.5295 0.5113 0.5387
103     9    11 0.5661 0.5628 0.5287 0.5737
104     9    12 0.5993 0.5962 0.5447 0.6053
105     9    13 0.6325 0.6298 0.5593 0.6339
106     9    14 0.6656 0.6635 0.5729 0.6599
107     9    15 0.6988 0.6973 0.5856 0.6835
108     9    16 0.7319 0.7313 0.5974 0.7051
109    10    10 0.4998 0.4963 0.4919 0.5000
110    10    11 0.5330 0.5295 0.5094 0.5353
111    10    12 0.5661 0.5628 0.5254 0.5675
112    10    13 0.5993 0.5962 0.5400 0.5968
113    10    14 0.6325 0.6298 0.5536 0.6236
114    10    15 0.6656 0.6635 0.5662 0.6482
115    10    16 0.6988 0.6973 0.5781 0.6707
116    11    11 0.4998 0.4963 0.4919 0.5000
117    11    12 0.5330 0.5295 0.5079 0.5324
118    11    13 0.5661 0.5628 0.5226 0.5622
119    11    14 0.5993 0.5962 0.5361 0.5896
120    11    15 0.6325 0.6298 0.5488 0.6148
121    11    16 0.6656 0.6635 0.5606 0.6381
122    12    12 0.4998 0.4963 0.4919 0.5000
123    12    13 0.5330 0.5295 0.5066 0.5300
124    12    14 0.5661 0.5628 0.5202 0.5577
125    12    15 0.5993 0.5962 0.5328 0.5834
126    12    16 0.6325 0.6298 0.5447 0.6072
127    13    13 0.4998 0.4963 0.4919 0.5000
128    13    14 0.5330 0.5295 0.5055 0.5279
129    13    15 0.5661 0.5628 0.5182 0.5538
130    13    16 0.5993 0.5962 0.5300 0.5780
131    14    14 0.4998 0.4963 0.4919 0.5000
132    14    15 0.5330 0.5295 0.5046 0.5261
133    14    16 0.5661 0.5628 0.5164 0.5504
134    15    15 0.4998 0.4963 0.4919 0.5000
135    15    16 0.5330 0.5295 0.5038 0.5245
136    16    16 0.4998 0.4963 0.4919 0.5000

More Models

Try building the best models you can using just team rankings.

You can use a linear model with lm(), or just make up something that you think might work.
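As a starting point for a made-up model, here is a sketch that builds its own grid of the 136 matchups with rank1 <= rank2 (so it runs without the workspace) and adds a column from a hand-written rule. The rule rank2 / (rank1 + rank2) and the names grid, made_up and pred5 are invented for illustration and are not a recommendation:

```r
# All pairs of ranks 1..16, keeping only those with rank1 <= rank2.
grid <- expand.grid(rank2 = 1:16, rank1 = 1:16)
grid <- grid[grid$rank1 <= grid$rank2, c("rank1", "rank2")]
rownames(grid) <- NULL

# A made-up rule: team 1's chance is team 2's share of the two ranks.
made_up <- function(rank1, rank2) {
  rank2 / (rank1 + rank2)
}

grid$pred5 <- made_up(grid$rank1, grid$rank2)

nrow(grid)   # 136 matchups
head(grid)
```

With the workspace loaded, the same function could be applied to rankpred$rank1 and rankpred$rank2 to compare it against pred1 through pred4.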