Kaggle March Madness Competition

Trying out first Kaggle Competition: March Madness. Although it's officially over, I'm doing this as an exercise to see how the process works.

In doing this exercise, I followed a tutorial from the statsguys' blog.

Reading the data

First, I downloaded all the data files from Kaggle. Next, I will store the datasets:

# setwd('Dropbox/Data Science/Kaggle/March Madness/') Read files
regSeason <- read.csv("regular_season_results.csv", header = TRUE, stringsAsFactors = FALSE)
seasons <- read.csv("seasons.csv", header = TRUE, stringsAsFactors = FALSE)
teams <- read.csv("teams.csv", header = TRUE, stringsAsFactors = FALSE)
tourneyRes <- read.csv("tourney_results.csv", header = TRUE, stringsAsFactors = FALSE)
tourneySeeds <- read.csv("tourney_seeds.csv", header = TRUE, stringsAsFactors = FALSE)
tourneySlots <- read.csv("tourney_slots.csv", header = TRUE, stringsAsFactors = FALSE)

Next, examine the data:

head(regSeason)

##   season daynum wteam wscore lteam lscore wloc numot
## 1      A     16   511     91   647     57    H    NA
## 2      A     16   515     75   812     67    H    NA
## 3      A     16   606     87   658     67    H    NA
## 4      A     16   670     73   573     65    H    NA
## 5      A     16   721     99   632     68    H    NA
## 6      A     16   842    102   597     86    A    NA

tail(regSeason)

##       season daynum wteam wscore lteam lscore wloc numot
## 91219      S    131   842     88   709     67    H     0
## 91220      S    132   592     61   640     60    N     0
## 91221      S    132   671     69   670     55    N     0
## 91222      S    132   779     65   825     61    N     0
## 91223      S    132   811     82   605     81    N     1
## 91224      S    132   830     72   577     63    N     0

According to Kaggle: This file identifies the game-by-game results for all 18 seasons of historical data, from season A (1995-6) through season R (2012-3). Each year, it includes all games played from daynum 0 through 132 (which by definition is “Selection Sunday”, the day that tournament pairings are announced). Each row in the file represents a single game played.

season - this is the one-letter identifier of the season, - corresponding to the “season” column in the “seasons.csv” file.
daynum - this integer always ranges from 0 to 132, and tells you what day the game was played on. It represents an offset from the
dayzero date in the “seasons.csv” file. For example, the first game in the file was daynum=16. Combined with the fact from the “season.csv”“ file that day zero was 10/30/1995, that means the first game was played 16 days later, or 11/15/1995. There are no teams that ever played more than one game on a given date, so you can use this fact if you need a unique key. In order to accomplish this uniqueness, we had to adjust one game's date. In March 2008, the SEC postseason tournament had to reschedule one game (Georgia-Kentucky) to a subsequent day, so Georgia had to actually play two games on the same day. In order to enforce this uniqueness, we moved the game date for the Georgia-Kentucky game back to its original date.
wteam - this identifies the id number of the team that won the game, as listed in the "teams.csv” file. No matter whether the game was won by the home team or visiting team, “wteam” always identifies the winning team.
wscore - this identifies the number of points scored by the winning team.
lteam - this identifies the id number of the team that lost the game.
lscore - this identifies the number of points scored by the l osing team.
numot - this indicates the number of overtime periods in the game, an integer 0 or higher. Note that this information is only available for season J (2004-5) or later. For seasons A thru I, all games will have numot=NA.
wloc - this identifies the “location” of the winning team. If the winning team was the home team, this value will be “H”. If the winning team was the visiting team, this value will be “A”. If it was played on a neutral court, then this value will be “N”. Sometimes it is unclear whether the site should be considered neutral, since it is near one team's home court, or even on their court during a tournament, but for this determination we have simply used the Kenneth Massey data in its current state, where the “@” sign is either listed with the winning team, the losing team, or neither team.

head(seasons)

##   season     years    dayzero regionW   regionX   regionY regionZ
## 1      A 1995-1996 10/30/1995    East   Midwest Southeast    West
## 2      B 1996-1997 10/28/1996    East Southeast   Midwest    West
## 3      C 1997-1998 10/27/1997    East      West   Midwest   South
## 4      D 1998-1999 10/26/1998    East   Midwest     South    West
## 5      E 1999-2000 11/01/1999 Midwest      West      East   South
## 6      F 2000-2001 10/30/2000    East      West   Midwest   South

According to Kaggle: This file identifies the 18 different seasons included in the historical data, along with certain season-level properties.

season - indicates the letter used to uniquely identify each season. For instance, season Q represents the 2011-2012 season, meaning the college basketball season that started in late 2011 and ended with the final tournament during March/April 2012.
years - indicates the years spanned by each season. For instance, you can see that season Q was played during the years 2011-2012. “dayzero” - tells you the date corresponding to daynum=0 during that season. All game dates have been aligned upon a common scale so that the championship game of the final tournament is on daynum=154. Working backward, the national semifinals are always on daynum=152, the “play-in” games are on days 134/135, Selection Sunday is on day 132, and so on. All game data includes the day number in order to make it easier to perform date calculations. If you really want to know the exact date a game was played on, you can combine the game's “daynum” with the season's “dayzero”. For instance, since day zero during the 2011-2012 season was 10/31/2011, if we know that the earliest regular season games that year were played on daynum=7, they were therefore played on 11/07/2011.
regionW/X/Y/Z - by convention, the four regions in the final tournament are always named W, X, Y, and Z. Whichever region's name comes first alphabetically, that region will be Region W. And whichever Region plays against Region W in the national semifinals, that will be Region X. For the other two regions, whichever region's name comes first alphabetically, that region will be Region Y, and the other will be Region Z. This allows us to identify the regions and brackets in a standardized way in other files. For instance, during the 2012 tournament, the four regions were East, Midwest, South, and West. Being the first alphabetically, East becomes W. Since the East regional champion (Ohio State) played against the Midwest regional champion (Kansas) in the national semifinals, that makes Midwest be region X. For the other two (South and West), since South comes first alphabetically, that makes South Y and therefore West is Z. So for this season, the W/X/Y/Z are East,Midwest,South,West.

head(teams)

##    id        name
## 1 501 Abilene Chr
## 2 502   Air Force
## 3 503       Akron
## 4 504     Alabama
## 5 505 Alabama A&M
## 6 506  Alabama St

From Kaggle: This file identifies the 356 different college teams that are present in at least one of the seasons from 1995-1996 through 2013-2014. The other data files that identify teams, such as when game-by-game results are listed or tournament seeds are listed, will always reference the teams by their id number rather than by their name. This makes the files more compact and also eliminates any possible spelling issues. This is the only file that actually contains the text names of the college teams.

id - this number uniquely identifies the college team. All id values are three digit numbers, starting with 501, and the id's are assigned in alphabetical order, so for instance 501 is “Abilene Chr”, and 502 is “Air Force”, … and 855 is “Youngstown State”, the last team name alphabetically. The one exception to the alphabetical sequencing is Incarnate Word (#856), which was added recently. Having numbers so high avoids any possible confusion with game scores and ensures that all team id's will be exactly three digits long. There are four teams that are not present in the historical data, but nevertheless exist in the teams file: 501 (Abilene Chr), 609 (Grand Canyon), 656 (MA Lowell), and 856 (Incarnate Word). These three teams are present for the first time in the 2013-2014 data (the current season), and so they have already been assigned id numbers.
name - this is the text name of the team, as determined in Kenneth Massey's historical game data. If a team name changed from one year to the next, then the new name is applied historically as well. For instance, the name of one college team used to be “MD Baltimore Co”, and is now “UMBC”, but since we know those are the same team, that is just represented as team 813, with a current team name of UMBC. Therefore, if you want to know what a team's results were in previous years, you don't have to look for different spellings of the team name, but instead you can just look for games played by that same team id in the previous years.

head(tourneyRes)

##   season daynum wteam wscore lteam lscore numot
## 1      A    136   515     86   729     80    NA
## 2      A    136   559     68   555     59    NA
## 3      A    136   576     75   666     63    NA
## 4      A    136   581     75   577     60    NA
## 5      A    136   604     81   551     74    NA
## 6      A    136   629     74   539     64    NA

From Kaggle: This file identifies the game-by-game NCAA tournament results for all 18 seasons of historical data, from season A (1995-6) through season R (2012-3). The data is formatted exactly like the “regular season results” data, except that all games are assumed to be on a neutral court, and therefore the “wloc” column is not included. Note that these games also include the play-in games (which always occurred on day 134/135) for those years that had play-in games.

head(tourneySeeds)

##   season seed team
## 1      A  W01  663
## 2      A  W02  603
## 3      A  W03  796
## 4      A  W04  660
## 5      A  W05  729
## 6      A  W06  708

From Kaggle: This file identifies the seeds for the final 64 teams in each NCAA tournament, for all 18 seasons of historical data. The losers of play-in games are not listed, though their games are included in the “tourney_results.csv” file. Therefore there are exactly 64 rows for each year, and a total of 64x18=1152 rows.

season - this is the one-letter identifier of the season, corresponding to the “season” column in the “seasons.csv” file “seed” - this is a three-character identifier of the seed, where the first character is either W, X, Y, or Z (identifying the region the team was in) and the next two digits (either 01, 02, …, 15, or 16) tells you the seed within the region. For example, the first record in the file is seed W01, which means we are looking at the #1 seed in the W region (which we can see from the “seasons.csv” file was the East region). This seed is also referenced in the “tourney_slots.csv” file that tells us which bracket slots face which other bracket slots in which rounds.
team - this identifies the id number of the team, as specified in the “teams.csv” file.

head(tourneySlots)

##   season slot strongseed weakseed
## 1      A R1W1        W01      W16
## 2      A R1W2        W02      W15
## 3      A R1W3        W03      W14
## 4      A R1W4        W04      W13
## 5      A R1W5        W05      W12
## 6      A R1W6        W06      W11

From Kaggle: This file identifies the mechanism by which teams are paired against each other, depending upon their seeds. Because of the existence of play-in games for particular seed numbers, the pairings have small differences from year to year. If there were N teams in the tournament during a particular year, there were N-1 teams eliminated (leaving one champion) and therefore N-1 games played, as well as N-1 slots in the tournament bracket, and thus there will be N-1 records in this file for that season. There were 64 tournament teams in seasons A-E, 65 tournament teams in seasons F-O, and 68 tournament teams in seasons P-R.

season - this is the one-letter identifier of the season, corresponding to the “season” column in the “seasons.csv” file “slot” - this uniquely identifies one of the tournament games. For play-in games, it is a three-character string identifying the seed fulfilled by the winning team, such as W16 or Z13. For regular tournament games, it is a four-character string, where the first two characters tell you which round the game is (R1, R2, R3, R4, R5, or R6) and the second two characters tell you the expected seed of the favored team. Thus the first row is R1W1, identifying the Round 1 game played in the W bracket, where the favored team is the 1 seed. As a further example, the R2W1 slot indicates the Round 2 game that would have the 1 seed from the W bracket, assuming that all favored teams have won up to that point. The slot names are different for the final two rounds, where R5WX identifies the national semifinal game between the winners of regions W and X, and R5YZ identifies the national semifinal game between the winners of regions Y and Z, and R6CH identifies the championship game. The “slot” value is used in other columns in order to represent the advancement and pairings of winners of previous games.
strongseed - this indicates the expected stronger-seeded team that plays in this game. For Round 1 games, a team seed is identified in this column (as listed in the “seed” column in the “tourney_seeds.csv” file), whereas for subsequent games, a slot is identified in this column. In the first record of this file (slot R1W1), we see that seed W01 is the “strongseed”, which during the 1996 tournament would have been Massachusetts. Whereas for games from Round 2 or later, rather than a team seed, we will see a “slot” referenced in this column. So in the 33rd record of this file (slot R2W1), it tells us that the winners of slots R1W1 and R1W8 will face each other in Round 2. Of course, in the last few games of the tournament - the national semifinals and finals - it's not really meaningful to talk about a “strong seed” or “weak seed”, but those games are represented in the same format for the sake of uniformity.
weakseed - this indicates the expected weaker-seeded team that plays in this game, assuming all favored teams have won so far. For Round 1 games, a team seed is identified in this column (as listed in the “seed” column in the “tourney seeds.csv” file), whereas for subsequent games, a slot is identified in this column. So in the first record of this file (slot R1W1), we see that seed W16 is the “weakseed”. So we can look this up in the “tourney_seeds.csv” file to see that in season A, the W16 seed was team 809, which we can then find (from the “teams.csv” file) to be UCF. Thus we know that this game was Massachusetts against UCF during the 1996 tournament. And just like the “strongseed” column, for games from Round 2 or later, rather than a team seed, we will see a “slot” referenced in this column.

Goal of the Competition

From Kaggle:

In stage one of this two-stage competition, participants will build and test their models against the previous five tournaments. In the second stage, participants will predict the outcome of the 2014 tournament.

In the first stage of the competition, we are asked to make predictions on every possible first-round tournament matchups between every team for seasons N, O, P, Q, and R (each alphabet represents a season). Taking season N as an example, there were a total of 65 teams, so you have to make predictions for:

Team 1 vs. Team 2, Team 1 vs. Team 3,…Team 1 vs. Team 65 and then…
Team 2 vs. Team 3, Team 2 vs. Team 4,… Team 2 vs. Team 65 and so on…
until every combination is listed for each season N-R

Here's a snapshot of the sample submission file:

Code for creating the submission file:

create a custom function submissionFile() which creates columns in the form SEASONLETTER_TEAMID_TEAMID
The format is a list of every possible matchup between the tournament teams. Since team1 vs. team2 is the same as team2 vs. team1, we only include the game pairs where team1 has the lower team id. For example, in a tournament of 68 teams (64 + 4 play-in teams), you will predict (68*67)/2 = 2278 matchups.

submissionFile <- function(season) {
    playoffTeams <- sort(tourneySeeds$team[which(tourneySeeds$season == season)])
    numTeams <- length(playoffTeams)
    matrix <- matrix(nrow = numTeams, ncol = numTeams)
    for (i in c(1:numTeams)) {
        for (j in c(1:numTeams)) {
            matrix[i, j] <- paste(season, "_", playoffTeams[i], "_", playoffTeams[j], 
                sep = "")
        }
    }
    keep <- upper.tri(matrix, diag = F)
    idcol <- vector()
    for (i in c(1:numTeams)) {
        for (j in c(1:numTeams)) {
            if (keep[i, j] == T) {
                idcol <- c(idcol, matrix[i, j])
            }
        }
    }
    form <- data.frame(Matchup = idcol, Win = NA)
    return(form)
}
sub_file <- data.frame()
for (i in LETTERS[19:19]) {
    sub_file <- rbind(sub_file, submissionFile(i))
    # 14-18th letters: N to R, 19th is S
}

First Submission

To test the submission process, we simply guess 50% for every possible matchup. Then store this in a file called “sub1.csv”:

colnames(sub_file) <- c("id", "pred")
sub_file$pred <- 0.5
write.csv(sub_file, file = "sub1.csv", row.names = FALSE)

So here we go, first submission resulted in a ranking that fits in the “benchmark 50%” spot as expected!

Sample Model and How to Create Training Data Set

The following methodology is how the statsguys performed their logistic model.

Using playoffs as the response variables and the regular seasons win/loss as the predictors.

They provided past 18 seasons and 18 playoffs. And since we only need to predict 1 season, we can use 17 seasons as training set.

The first step is to create a data frame in R which lists these match ups in one column and the result of the game in another column. The following does this for season A:

first, select the games from season A and name that data frame season_matches.
next, loop through each row of season_matches and we concatenate the season “A” with the teamID for the winning team and the losing team.
finally, place these newly formed strings into a new data frame train_data_frame along with the result of the game. So in the end, it'd look something like: A_515_729 for two teams that played in the “A” season.

season_matches <- tourneyRes[which(tourneyRes$season == "A"), ]
team <- vector()
result <- vector()
for (i in c(1:nrow(season_matches))) {
    row <- season_matches[i, ]
    if (row$wteam < row$lteam) {
        vector <- paste("A", "_", row$wteam, "_", row$lteam, sep = "")
        team <- c(team, vector)
        result <- c(result, 1)
    } else {
        oth <- paste("A", "_", row$lteam, "_", row$wteam, sep = "")
        team <- c(team, oth)
        result <- c(result, 0)
    }
}
train_data_frame <- data.frame(Matchup = team, Win = result)

Taking a look at the training data set created:

head(train_data_frame)

##     Matchup Win
## 1 A_515_729   1
## 2 A_555_559   0
## 3 A_576_666   1
## 4 A_577_581   0
## 5 A_551_604   0
## 6 A_539_629   0

Next, create a dataset which is organized by teamID rather than season. Doing this just for Season “A”:

# install.packages('stringr')
library("stringr")
# Selecting and sorting the playoff teamIDs least to greatest for season A
playoff_teams <- sort(tourneySeeds$team[which(tourneySeeds$season == "A")])

# Selecting the seeds for season A
playoff_seeds <- tourneySeeds[which(tourneySeeds$season == "A"), ]

# Selecting the regular season statistics for season A
season <- regSeason[which(regSeason$season == "A"), ]

# Wins by team (Regular Season wins against playoff teams)
win_freq_table <- as.data.frame(table(season$wteam))
wins_by_team <- win_freq_table[win_freq_table$Var1 %in% playoff_teams, ]

# Losses by team (Regular Season wins against playoff teams)
loss_freq_table <- as.data.frame(table(season$lteam))
loss_by_team <- loss_freq_table[loss_freq_table$Var1 %in% playoff_teams, ]

# Total Win Percentage (against playoff teams)
gamesplayed <- as.vector(wins_by_team$Freq + loss_by_team$Freq)
total_winpct <- round(wins_by_team$Freq/gamesplayed, digits = 3)
total_winpct_by_team <- as.data.frame(cbind(as.vector(loss_by_team$Var1), total_winpct))
colnames(total_winpct_by_team) <- c("Var1", "Freq")

# Num of wins in last 6 games
wins_last_six_games_by_team <- data.frame()
for (i in playoff_teams) {
    games <- season[which(season$wteam == i | season$lteam == i), ]
    numwins <- sum(tail(games$wteam) == i)
    put <- c(i, numwins)
    wins_last_six_games_by_team <- rbind(wins_last_six_games_by_team, put)
}
colnames(wins_last_six_games_by_team) <- c("Var1", "Freq")

# Seed in tournament
pattern <- "[A-Z]([0-9][0-9])"
team_seeds <- as.data.frame(str_match(playoff_seeds$seed, pattern))
seeds <- as.numeric(team_seeds$V2)
playoff_seeds$seed <- seeds
seed_col <- vector()
for (i in playoff_teams) {
    val <- match(i, playoff_seeds$team)
    seed_col <- c(seed_col, playoff_seeds$seed[val])
}
team_seed <- data.frame(Var1 = playoff_teams, Freq = seed_col)

# Combining columns together
team_metrics <- data.frame()
team_metrics <- cbind(total_winpct_by_team, wins_last_six_games_by_team$Freq, 
    team_seed$Freq)
colnames(team_metrics) <- c("TEAMID", "A_TWPCT", "A_WST6", "A_SEED")

Taking a look at the data frames created:

head(wins_by_team)

##    Var1 Freq
## 8   511   24
## 11  515   18
## 15  519   17
## 20  527   18
## 23  530   22
## 31  539   17

head(team_metrics)

##   TEAMID A_TWPCT A_WST6 A_SEED
## 1    511     0.8      5      3
## 2    515     0.6      2     12
## 3    519    0.63      5     14
## 4    527   0.643      2     11
## 5    530   0.759      5      8
## 6    539    0.63      4     12

NOTE: For the actual competition, the statsguys used the following predictors (only highlighted ones were shown in example above).

Away Wins Winning Percentage
Wins by margin less than 2
Losses by margin less than 2
Wins by margin greater than 7
Losses by margin greater than 7
Win Percentage in last 4 weeks
Win Percentage against playoff teams
Number of wins in last 6 games of the season
Seed in Tournament
Wins in Tournament

To get a better picture of what we have now, here is a summary (we have DF 1,2, and 3 below):

In order to arrive at DF4 above: 1. Sort DF 1 by the teamIDs defined in the column titled “A_ID” of DF 4 2. Sort DF 2 by the teamIDs defined in the column titled “B_ID” of DF 4 3. Combine the columns from the two data frames in 1 and 2 to create the training set

In order to do so easily, the statsguys created blog_utility.R to create the combined dataframe (specifically the train_frame_model() function):

source("blog_utility.R")
trainData <- data.frame()
for (i in LETTERS[1:13]) {
    trainData <- rbind(trainData, train_frame_model(i))
}

head(trainData)

##      Matchup Win HomeID AwayID A_TWPCT A_WST6 A_SEED B_TWPCT B_WST6 B_SEED
## 2  A_515_729   1    515    729     0.6      2     12   0.778      3      5
## 10 A_555_559   0    555    559   0.517      5     16   0.938      6      1
## 12 A_576_666   1    576    666   0.893      6     12   0.759      4      5
## 13 A_577_581   0    577    581     0.6      4      8   0.815      5      9
## 9  A_551_604   0    551    604   0.643      3      9   0.679      5      8
## 6  A_539_629   0    539    629    0.63      4     12   0.733      5      5

Applying Classification Tree Model

The statsguys use Rpart package rather than the tree package that we learned in STATS216.

library("rpart")
train_rpart <- rpart(Win ~ A_WST6 + A_SEED + B_WST6 + B_SEED, data = trainData, 
    method = "class")

Once again, the statsguys created a useful function for creating the test data set (test_frame_model() in blog_utility.R which they provided):

testData <- data.frame()
for (i in LETTERS[19:19]) {
    testData <- rbind(testData, test_frame_model(i))
}

## Warning: longer object length is not a multiple of shorter object length
## Warning: number of rows of result is not a multiple of vector length (arg 1)


head(testData)

##     Matchup Win HomeID AwayID A_TWPCT A_WST6 A_SEED B_TWPCT B_WST6 B_SEED
## 1 S_507_509  NA    507    509   0.562      5     16   0.625      5     15
## 2 S_507_511  NA    507    511   0.562      5     16   0.882      4      1
## 3 S_507_512  NA    507    512   0.562      5     16   0.656      2     10
## 4 S_507_521  NA    507    521   0.562      5     16   0.667      5      6
## 5 S_507_536  NA    507    536   0.562      5     16   0.667      5     10
## 6 S_507_538  NA    507    538   0.562      5     16   0.367      3     16

Now we can finally make prediction using the model. Create a submission file for Kaggle:

predictions_rpart <- predict(train_rpart, newdata = testData, type = "prob")
predictions <- predictions_rpart[, 1]
subfile <- data.frame(id = testData$Matchup, pred = predictions)
write.csv(subfile, file = "tree_model.csv", row.names = FALSE)

The second submission resulted in a ranking that is below the “benchmark 50%” (ranking around 200) spot as expected!

Let's try to make some improvements!!