Trying out first Kaggle Competition: March Madness. Although it's officially over, I'm doing this as an exercise to see how the process works.
In doing this exercise, I followed a tutorial from the statsguys' blog.
First, I downloaded all the data files from Kaggle. Next, I will store the datasets:
# setwd('Dropbox/Data Science/Kaggle/March Madness/') Read files
regSeason <- read.csv("regular_season_results.csv", header = TRUE, stringsAsFactors = FALSE)
seasons <- read.csv("seasons.csv", header = TRUE, stringsAsFactors = FALSE)
teams <- read.csv("teams.csv", header = TRUE, stringsAsFactors = FALSE)
tourneyRes <- read.csv("tourney_results.csv", header = TRUE, stringsAsFactors = FALSE)
tourneySeeds <- read.csv("tourney_seeds.csv", header = TRUE, stringsAsFactors = FALSE)
tourneySlots <- read.csv("tourney_slots.csv", header = TRUE, stringsAsFactors = FALSE)
Next, examine the data:
head(regSeason)
## season daynum wteam wscore lteam lscore wloc numot
## 1 A 16 511 91 647 57 H NA
## 2 A 16 515 75 812 67 H NA
## 3 A 16 606 87 658 67 H NA
## 4 A 16 670 73 573 65 H NA
## 5 A 16 721 99 632 68 H NA
## 6 A 16 842 102 597 86 A NA
tail(regSeason)
## season daynum wteam wscore lteam lscore wloc numot
## 91219 S 131 842 88 709 67 H 0
## 91220 S 132 592 61 640 60 N 0
## 91221 S 132 671 69 670 55 N 0
## 91222 S 132 779 65 825 61 N 0
## 91223 S 132 811 82 605 81 N 1
## 91224 S 132 830 72 577 63 N 0
According to Kaggle: This file identifies the game-by-game results for all 18 seasons of historical data, from season A (1995-6) through season R (2012-3). Each year, it includes all games played from daynum 0 through 132 (which by definition is “Selection Sunday”, the day that tournament pairings are announced). Each row in the file represents a single game played.
season - this is the one-letter identifier of the season, - corresponding to the “season” column in the “seasons.csv” file.
daynum - this integer always ranges from 0 to 132, and tells you what day the game was played on. It represents an offset from the
dayzero date in the “seasons.csv” file. For example, the first game in the file was daynum=16. Combined with the fact from the “season.csv”“ file that day zero was 10/30/1995, that means the first game was played 16 days later, or 11/15/1995. There are no teams that ever played more than one game on a given date, so you can use this fact if you need a unique key. In order to accomplish this uniqueness, we had to adjust one game's date. In March 2008, the SEC postseason tournament had to reschedule one game (Georgia-Kentucky) to a subsequent day, so Georgia had to actually play two games on the same day. In order to enforce this uniqueness, we moved the game date for the Georgia-Kentucky game back to its original date.
wteam - this identifies the id number of the team that won the game, as listed in the "teams.csv” file. No matter whether the game was won by the home team or visiting team, “wteam” always identifies the winning team.
wscore - this identifies the number of points scored by the winning team.
lteam - this identifies the id number of the team that lost the game.
lscore - this identifies the number of points scored by the l osing team.
numot - this indicates the number of overtime periods in the game, an integer 0 or higher. Note that this information is only available for season J (2004-5) or later. For seasons A thru I, all games will have numot=NA.
wloc - this identifies the “location” of the winning team. If the winning team was the home team, this value will be “H”. If the winning team was the visiting team, this value will be “A”. If it was played on a neutral court, then this value will be “N”. Sometimes it is unclear whether the site should be considered neutral, since it is near one team's home court, or even on their court during a tournament, but for this determination we have simply used the Kenneth Massey data in its current state, where the “@” sign is either listed with the winning team, the losing team, or neither team.
head(seasons)
## season years dayzero regionW regionX regionY regionZ
## 1 A 1995-1996 10/30/1995 East Midwest Southeast West
## 2 B 1996-1997 10/28/1996 East Southeast Midwest West
## 3 C 1997-1998 10/27/1997 East West Midwest South
## 4 D 1998-1999 10/26/1998 East Midwest South West
## 5 E 1999-2000 11/01/1999 Midwest West East South
## 6 F 2000-2001 10/30/2000 East West Midwest South
According to Kaggle: This file identifies the 18 different seasons included in the historical data, along with certain season-level properties.
season - indicates the letter used to uniquely identify each season. For instance, season Q represents the 2011-2012 season, meaning the college basketball season that started in late 2011 and ended with the final tournament during March/April 2012.
years - indicates the years spanned by each season. For instance, you can see that season Q was played during the years 2011-2012.
“dayzero” - tells you the date corresponding to daynum=0 during that season. All game dates have been aligned upon a common scale so that the championship game of the final tournament is on daynum=154. Working backward, the national semifinals are always on daynum=152, the “play-in” games are on days 134/135, Selection Sunday is on day 132, and so on. All game data includes the day number in order to make it easier to perform date calculations. If you really want to know the exact date a game was played on, you can combine the game's “daynum” with the season's “dayzero”. For instance, since day zero during the 2011-2012 season was 10/31/2011, if we know that the earliest regular season games that year were played on daynum=7, they were therefore played on 11/07/2011.
regionW/X/Y/Z - by convention, the four regions in the final tournament are always named W, X, Y, and Z. Whichever region's name comes first alphabetically, that region will be Region W. And whichever Region plays against Region W in the national semifinals, that will be Region X. For the other two regions, whichever region's name comes first alphabetically, that region will be Region Y, and the other will be Region Z. This allows us to identify the regions and brackets in a standardized way in other files. For instance, during the 2012 tournament, the four regions were East, Midwest, South, and West. Being the first alphabetically, East becomes W. Since the East regional champion (Ohio State) played against the Midwest regional champion (Kansas) in the national semifinals, that makes Midwest be region X. For the other two (South and West), since South comes first alphabetically, that makes South Y and therefore West is Z. So for this season, the W/X/Y/Z are East,Midwest,South,West.
head(teams)
## id name
## 1 501 Abilene Chr
## 2 502 Air Force
## 3 503 Akron
## 4 504 Alabama
## 5 505 Alabama A&M
## 6 506 Alabama St
From Kaggle: This file identifies the 356 different college teams that are present in at least one of the seasons from 1995-1996 through 2013-2014. The other data files that identify teams, such as when game-by-game results are listed or tournament seeds are listed, will always reference the teams by their id number rather than by their name. This makes the files more compact and also eliminates any possible spelling issues. This is the only file that actually contains the text names of the college teams.
id - this number uniquely identifies the college team. All id values are three digit numbers, starting with 501, and the id's are assigned in alphabetical order, so for instance 501 is “Abilene Chr”, and 502 is “Air Force”, … and 855 is “Youngstown State”, the last team name alphabetically. The one exception to the alphabetical sequencing is Incarnate Word (#856), which was added recently. Having numbers so high avoids any possible confusion with game scores and ensures that all team id's will be exactly three digits long. There are four teams that are not present in the historical data, but nevertheless exist in the teams file: 501 (Abilene Chr), 609 (Grand Canyon), 656 (MA Lowell), and 856 (Incarnate Word). These three teams are present for the first time in the 2013-2014 data (the current season), and so they have already been assigned id numbers.
name - this is the text name of the team, as determined in Kenneth Massey's historical game data. If a team name changed from one year to the next, then the new name is applied historically as well. For instance, the name of one college team used to be “MD Baltimore Co”, and is now “UMBC”, but since we know those are the same team, that is just represented as team 813, with a current team name of UMBC. Therefore, if you want to know what a team's results were in previous years, you don't have to look for different spellings of the team name, but instead you can just look for games played by that same team id in the previous years.
head(tourneyRes)
## season daynum wteam wscore lteam lscore numot
## 1 A 136 515 86 729 80 NA
## 2 A 136 559 68 555 59 NA
## 3 A 136 576 75 666 63 NA
## 4 A 136 581 75 577 60 NA
## 5 A 136 604 81 551 74 NA
## 6 A 136 629 74 539 64 NA
From Kaggle: This file identifies the game-by-game NCAA tournament results for all 18 seasons of historical data, from season A (1995-6) through season R (2012-3). The data is formatted exactly like the “regular season results” data, except that all games are assumed to be on a neutral court, and therefore the “wloc” column is not included. Note that these games also include the play-in games (which always occurred on day 134/135) for those years that had play-in games.
head(tourneySeeds)
## season seed team
## 1 A W01 663
## 2 A W02 603
## 3 A W03 796
## 4 A W04 660
## 5 A W05 729
## 6 A W06 708
From Kaggle: This file identifies the seeds for the final 64 teams in each NCAA tournament, for all 18 seasons of historical data. The losers of play-in games are not listed, though their games are included in the “tourney_results.csv” file. Therefore there are exactly 64 rows for each year, and a total of 64x18=1152 rows.
season - this is the one-letter identifier of the season, corresponding to the “season” column in the “seasons.csv” file
“seed” - this is a three-character identifier of the seed, where the first character is either W, X, Y, or Z (identifying the region the team was in) and the next two digits (either 01, 02, …, 15, or 16) tells you the seed within the region. For example, the first record in the file is seed W01, which means we are looking at the #1 seed in the W region (which we can see from the “seasons.csv” file was the East region). This seed is also referenced in the “tourney_slots.csv” file that tells us which bracket slots face which other bracket slots in which rounds.
team - this identifies the id number of the team, as specified in the “teams.csv” file.
head(tourneySlots)
## season slot strongseed weakseed
## 1 A R1W1 W01 W16
## 2 A R1W2 W02 W15
## 3 A R1W3 W03 W14
## 4 A R1W4 W04 W13
## 5 A R1W5 W05 W12
## 6 A R1W6 W06 W11
From Kaggle: This file identifies the mechanism by which teams are paired against each other, depending upon their seeds. Because of the existence of play-in games for particular seed numbers, the pairings have small differences from year to year. If there were N teams in the tournament during a particular year, there were N-1 teams eliminated (leaving one champion) and therefore N-1 games played, as well as N-1 slots in the tournament bracket, and thus there will be N-1 records in this file for that season. There were 64 tournament teams in seasons A-E, 65 tournament teams in seasons F-O, and 68 tournament teams in seasons P-R.
season - this is the one-letter identifier of the season, corresponding to the “season” column in the “seasons.csv” file
“slot” - this uniquely identifies one of the tournament games. For play-in games, it is a three-character string identifying the seed fulfilled by the winning team, such as W16 or Z13. For regular tournament games, it is a four-character string, where the first two characters tell you which round the game is (R1, R2, R3, R4, R5, or R6) and the second two characters tell you the expected seed of the favored team. Thus the first row is R1W1, identifying the Round 1 game played in the W bracket, where the favored team is the 1 seed. As a further example, the R2W1 slot indicates the Round 2 game that would have the 1 seed from the W bracket, assuming that all favored teams have won up to that point. The slot names are different for the final two rounds, where R5WX identifies the national semifinal game between the winners of regions W and X, and R5YZ identifies the national semifinal game between the winners of regions Y and Z, and R6CH identifies the championship game. The “slot” value is used in other columns in order to represent the advancement and pairings of winners of previous games.
strongseed - this indicates the expected stronger-seeded team that plays in this game. For Round 1 games, a team seed is identified in this column (as listed in the “seed” column in the “tourney_seeds.csv” file), whereas for subsequent games, a slot is identified in this column. In the first record of this file (slot R1W1), we see that seed W01 is the “strongseed”, which during the 1996 tournament would have been Massachusetts. Whereas for games from Round 2 or later, rather than a team seed, we will see a “slot” referenced in this column. So in the 33rd record of this file (slot R2W1), it tells us that the winners of slots R1W1 and R1W8 will face each other in Round 2. Of course, in the last few games of the tournament - the national semifinals and finals - it's not really meaningful to talk about a “strong seed” or “weak seed”, but those games are represented in the same format for the sake of uniformity.
weakseed - this indicates the expected weaker-seeded team that plays in this game, assuming all favored teams have won so far. For Round 1 games, a team seed is identified in this column (as listed in the “seed” column in the “tourney seeds.csv” file), whereas for subsequent games, a slot is identified in this column. So in the first record of this file (slot R1W1), we see that seed W16 is the “weakseed”. So we can look this up in the “tourney_seeds.csv” file to see that in season A, the W16 seed was team 809, which we can then find (from the “teams.csv” file) to be UCF. Thus we know that this game was Massachusetts against UCF during the 1996 tournament. And just like the “strongseed” column, for games from Round 2 or later, rather than a team seed, we will see a “slot” referenced in this column.
From Kaggle:
In stage one of this two-stage competition, participants will build and test their models against the previous five tournaments. In the second stage, participants will predict the outcome of the 2014 tournament.
In the first stage of the competition, we are asked to make predictions on every possible first-round tournament matchups between every team for seasons N, O, P, Q, and R (each alphabet represents a season). Taking season N as an example, there were a total of 65 teams, so you have to make predictions for:
Team 1 vs. Team 2, Team 1 vs. Team 3,…Team 1 vs. Team 65 and then…
Team 2 vs. Team 3, Team 2 vs. Team 4,… Team 2 vs. Team 65 and so on…
until every combination is listed for each season N-R
Here's a snapshot of the sample submission file:

Code for creating the submission file:
create a custom function submissionFile() which creates columns in the form SEASONLETTER_TEAMID_TEAMID
The format is a list of every possible matchup between the tournament teams. Since team1 vs. team2 is the same as team2 vs. team1, we only include the game pairs where team1 has the lower team id. For example, in a tournament of 68 teams (64 + 4 play-in teams), you will predict (68*67)/2 = 2278 matchups.
submissionFile <- function(season) {
playoffTeams <- sort(tourneySeeds$team[which(tourneySeeds$season == season)])
numTeams <- length(playoffTeams)
matrix <- matrix(nrow = numTeams, ncol = numTeams)
for (i in c(1:numTeams)) {
for (j in c(1:numTeams)) {
matrix[i, j] <- paste(season, "_", playoffTeams[i], "_", playoffTeams[j],
sep = "")
}
}
keep <- upper.tri(matrix, diag = F)
idcol <- vector()
for (i in c(1:numTeams)) {
for (j in c(1:numTeams)) {
if (keep[i, j] == T) {
idcol <- c(idcol, matrix[i, j])
}
}
}
form <- data.frame(Matchup = idcol, Win = NA)
return(form)
}
sub_file <- data.frame()
for (i in LETTERS[19:19]) {
sub_file <- rbind(sub_file, submissionFile(i))
# 14-18th letters: N to R, 19th is S
}
To test the submission process, we simply guess 50% for every possible matchup. Then store this in a file called “sub1.csv”:
colnames(sub_file) <- c("id", "pred")
sub_file$pred <- 0.5
write.csv(sub_file, file = "sub1.csv", row.names = FALSE)
So here we go, first submission resulted in a ranking that fits in the “benchmark 50%” spot as expected!
The following methodology is how the statsguys performed their logistic model.
Using playoffs as the response variables and the regular seasons win/loss as the predictors.
The first step is to create a data frame in R which lists these match ups in one column and the result of the game in another column. The following does this for season A:
season_matches. season_matches and we concatenate the season “A” with the teamID for the winning team and the losing team. train_data_frame along with the result of the game. So in the end, it'd look something like: A_515_729 for two teams that played in the “A” season.season_matches <- tourneyRes[which(tourneyRes$season == "A"), ]
team <- vector()
result <- vector()
for (i in c(1:nrow(season_matches))) {
row <- season_matches[i, ]
if (row$wteam < row$lteam) {
vector <- paste("A", "_", row$wteam, "_", row$lteam, sep = "")
team <- c(team, vector)
result <- c(result, 1)
} else {
oth <- paste("A", "_", row$lteam, "_", row$wteam, sep = "")
team <- c(team, oth)
result <- c(result, 0)
}
}
train_data_frame <- data.frame(Matchup = team, Win = result)
Taking a look at the training data set created:
head(train_data_frame)
## Matchup Win
## 1 A_515_729 1
## 2 A_555_559 0
## 3 A_576_666 1
## 4 A_577_581 0
## 5 A_551_604 0
## 6 A_539_629 0
Next, create a dataset which is organized by teamID rather than season. Doing this just for Season “A”:
# install.packages('stringr')
library("stringr")
# Selecting and sorting the playoff teamIDs least to greatest for season A
playoff_teams <- sort(tourneySeeds$team[which(tourneySeeds$season == "A")])
# Selecting the seeds for season A
playoff_seeds <- tourneySeeds[which(tourneySeeds$season == "A"), ]
# Selecting the regular season statistics for season A
season <- regSeason[which(regSeason$season == "A"), ]
# Wins by team (Regular Season wins against playoff teams)
win_freq_table <- as.data.frame(table(season$wteam))
wins_by_team <- win_freq_table[win_freq_table$Var1 %in% playoff_teams, ]
# Losses by team (Regular Season wins against playoff teams)
loss_freq_table <- as.data.frame(table(season$lteam))
loss_by_team <- loss_freq_table[loss_freq_table$Var1 %in% playoff_teams, ]
# Total Win Percentage (against playoff teams)
gamesplayed <- as.vector(wins_by_team$Freq + loss_by_team$Freq)
total_winpct <- round(wins_by_team$Freq/gamesplayed, digits = 3)
total_winpct_by_team <- as.data.frame(cbind(as.vector(loss_by_team$Var1), total_winpct))
colnames(total_winpct_by_team) <- c("Var1", "Freq")
# Num of wins in last 6 games
wins_last_six_games_by_team <- data.frame()
for (i in playoff_teams) {
games <- season[which(season$wteam == i | season$lteam == i), ]
numwins <- sum(tail(games$wteam) == i)
put <- c(i, numwins)
wins_last_six_games_by_team <- rbind(wins_last_six_games_by_team, put)
}
colnames(wins_last_six_games_by_team) <- c("Var1", "Freq")
# Seed in tournament
pattern <- "[A-Z]([0-9][0-9])"
team_seeds <- as.data.frame(str_match(playoff_seeds$seed, pattern))
seeds <- as.numeric(team_seeds$V2)
playoff_seeds$seed <- seeds
seed_col <- vector()
for (i in playoff_teams) {
val <- match(i, playoff_seeds$team)
seed_col <- c(seed_col, playoff_seeds$seed[val])
}
team_seed <- data.frame(Var1 = playoff_teams, Freq = seed_col)
# Combining columns together
team_metrics <- data.frame()
team_metrics <- cbind(total_winpct_by_team, wins_last_six_games_by_team$Freq,
team_seed$Freq)
colnames(team_metrics) <- c("TEAMID", "A_TWPCT", "A_WST6", "A_SEED")
Taking a look at the data frames created:
head(wins_by_team)
## Var1 Freq
## 8 511 24
## 11 515 18
## 15 519 17
## 20 527 18
## 23 530 22
## 31 539 17
head(team_metrics)
## TEAMID A_TWPCT A_WST6 A_SEED
## 1 511 0.8 5 3
## 2 515 0.6 2 12
## 3 519 0.63 5 14
## 4 527 0.643 2 11
## 5 530 0.759 5 8
## 6 539 0.63 4 12
NOTE: For the actual competition, the statsguys used the following predictors (only highlighted ones were shown in example above).
To get a better picture of what we have now, here is a summary (we have DF 1,2, and 3 below):
In order to arrive at DF4 above: 1. Sort DF 1 by the teamIDs defined in the column titled “A_ID” of DF 4 2. Sort DF 2 by the teamIDs defined in the column titled “B_ID” of DF 4 3. Combine the columns from the two data frames in 1 and 2 to create the training set
In order to do so easily, the statsguys created blog_utility.R to create the combined dataframe (specifically the train_frame_model() function):
source("blog_utility.R")
trainData <- data.frame()
for (i in LETTERS[1:13]) {
trainData <- rbind(trainData, train_frame_model(i))
}
head(trainData)
## Matchup Win HomeID AwayID A_TWPCT A_WST6 A_SEED B_TWPCT B_WST6 B_SEED
## 2 A_515_729 1 515 729 0.6 2 12 0.778 3 5
## 10 A_555_559 0 555 559 0.517 5 16 0.938 6 1
## 12 A_576_666 1 576 666 0.893 6 12 0.759 4 5
## 13 A_577_581 0 577 581 0.6 4 8 0.815 5 9
## 9 A_551_604 0 551 604 0.643 3 9 0.679 5 8
## 6 A_539_629 0 539 629 0.63 4 12 0.733 5 5
The statsguys use Rpart package rather than the tree package that we learned in STATS216.
library("rpart")
train_rpart <- rpart(Win ~ A_WST6 + A_SEED + B_WST6 + B_SEED, data = trainData,
method = "class")
Once again, the statsguys created a useful function for creating the test data set (test_frame_model() in blog_utility.R which they provided):
testData <- data.frame()
for (i in LETTERS[19:19]) {
testData <- rbind(testData, test_frame_model(i))
}
## Warning: longer object length is not a multiple of shorter object length
## Warning: number of rows of result is not a multiple of vector length (arg 1)
head(testData)
## Matchup Win HomeID AwayID A_TWPCT A_WST6 A_SEED B_TWPCT B_WST6 B_SEED
## 1 S_507_509 NA 507 509 0.562 5 16 0.625 5 15
## 2 S_507_511 NA 507 511 0.562 5 16 0.882 4 1
## 3 S_507_512 NA 507 512 0.562 5 16 0.656 2 10
## 4 S_507_521 NA 507 521 0.562 5 16 0.667 5 6
## 5 S_507_536 NA 507 536 0.562 5 16 0.667 5 10
## 6 S_507_538 NA 507 538 0.562 5 16 0.367 3 16
Now we can finally make prediction using the model. Create a submission file for Kaggle:
predictions_rpart <- predict(train_rpart, newdata = testData, type = "prob")
predictions <- predictions_rpart[, 1]
subfile <- data.frame(id = testData$Matchup, pred = predictions)
write.csv(subfile, file = "tree_model.csv", row.names = FALSE)
The second submission resulted in a ranking that is below the “benchmark 50%” (ranking around 200) spot as expected!
Let's try to make some improvements!!