Kaggle March Madness Competition

Trying out first Kaggle Competition: March Madness. Although it's officially over, I'm doing this as an exercise to see how the process works.

In doing this exercise, I followed a tutorial from the statsguys' blog.

Reading the data

First, I downloaded all the data files from Kaggle. Next, I will store the datasets:

# setwd('Dropbox/Data Science/Kaggle/March Madness/') Read files
regSeason <- read.csv("regular_season_results.csv", header = TRUE, stringsAsFactors = FALSE)
seasons <- read.csv("seasons.csv", header = TRUE, stringsAsFactors = FALSE)
teams <- read.csv("teams.csv", header = TRUE, stringsAsFactors = FALSE)
tourneyRes <- read.csv("tourney_results.csv", header = TRUE, stringsAsFactors = FALSE)
tourneySeeds <- read.csv("tourney_seeds.csv", header = TRUE, stringsAsFactors = FALSE)
tourneySlots <- read.csv("tourney_slots.csv", header = TRUE, stringsAsFactors = FALSE)

Next, examine the data:

head(regSeason)
##   season daynum wteam wscore lteam lscore wloc numot
## 1      A     16   511     91   647     57    H    NA
## 2      A     16   515     75   812     67    H    NA
## 3      A     16   606     87   658     67    H    NA
## 4      A     16   670     73   573     65    H    NA
## 5      A     16   721     99   632     68    H    NA
## 6      A     16   842    102   597     86    A    NA
tail(regSeason)
##       season daynum wteam wscore lteam lscore wloc numot
## 91219      S    131   842     88   709     67    H     0
## 91220      S    132   592     61   640     60    N     0
## 91221      S    132   671     69   670     55    N     0
## 91222      S    132   779     65   825     61    N     0
## 91223      S    132   811     82   605     81    N     1
## 91224      S    132   830     72   577     63    N     0

According to Kaggle: This file identifies the game-by-game results for all 18 seasons of historical data, from season A (1995-6) through season R (2012-3). Each year, it includes all games played from daynum 0 through 132 (which by definition is “Selection Sunday”, the day that tournament pairings are announced). Each row in the file represents a single game played.

head(seasons)
##   season     years    dayzero regionW   regionX   regionY regionZ
## 1      A 1995-1996 10/30/1995    East   Midwest Southeast    West
## 2      B 1996-1997 10/28/1996    East Southeast   Midwest    West
## 3      C 1997-1998 10/27/1997    East      West   Midwest   South
## 4      D 1998-1999 10/26/1998    East   Midwest     South    West
## 5      E 1999-2000 11/01/1999 Midwest      West      East   South
## 6      F 2000-2001 10/30/2000    East      West   Midwest   South

According to Kaggle: This file identifies the 18 different seasons included in the historical data, along with certain season-level properties.

head(teams)
##    id        name
## 1 501 Abilene Chr
## 2 502   Air Force
## 3 503       Akron
## 4 504     Alabama
## 5 505 Alabama A&M
## 6 506  Alabama St

From Kaggle: This file identifies the 356 different college teams that are present in at least one of the seasons from 1995-1996 through 2013-2014. The other data files that identify teams, such as when game-by-game results are listed or tournament seeds are listed, will always reference the teams by their id number rather than by their name. This makes the files more compact and also eliminates any possible spelling issues. This is the only file that actually contains the text names of the college teams.

head(tourneyRes)
##   season daynum wteam wscore lteam lscore numot
## 1      A    136   515     86   729     80    NA
## 2      A    136   559     68   555     59    NA
## 3      A    136   576     75   666     63    NA
## 4      A    136   581     75   577     60    NA
## 5      A    136   604     81   551     74    NA
## 6      A    136   629     74   539     64    NA

From Kaggle: This file identifies the game-by-game NCAA tournament results for all 18 seasons of historical data, from season A (1995-6) through season R (2012-3). The data is formatted exactly like the “regular season results” data, except that all games are assumed to be on a neutral court, and therefore the “wloc” column is not included. Note that these games also include the play-in games (which always occurred on day 134/135) for those years that had play-in games.

head(tourneySeeds)
##   season seed team
## 1      A  W01  663
## 2      A  W02  603
## 3      A  W03  796
## 4      A  W04  660
## 5      A  W05  729
## 6      A  W06  708

From Kaggle: This file identifies the seeds for the final 64 teams in each NCAA tournament, for all 18 seasons of historical data. The losers of play-in games are not listed, though their games are included in the “tourney_results.csv” file. Therefore there are exactly 64 rows for each year, and a total of 64x18=1152 rows.

head(tourneySlots)
##   season slot strongseed weakseed
## 1      A R1W1        W01      W16
## 2      A R1W2        W02      W15
## 3      A R1W3        W03      W14
## 4      A R1W4        W04      W13
## 5      A R1W5        W05      W12
## 6      A R1W6        W06      W11

From Kaggle: This file identifies the mechanism by which teams are paired against each other, depending upon their seeds. Because of the existence of play-in games for particular seed numbers, the pairings have small differences from year to year. If there were N teams in the tournament during a particular year, there were N-1 teams eliminated (leaving one champion) and therefore N-1 games played, as well as N-1 slots in the tournament bracket, and thus there will be N-1 records in this file for that season. There were 64 tournament teams in seasons A-E, 65 tournament teams in seasons F-O, and 68 tournament teams in seasons P-R.

Goal of the Competition

From Kaggle:

In stage one of this two-stage competition, participants will build and test their models against the previous five tournaments. In the second stage, participants will predict the outcome of the 2014 tournament.

In the first stage of the competition, we are asked to make predictions on every possible first-round tournament matchups between every team for seasons N, O, P, Q, and R (each alphabet represents a season). Taking season N as an example, there were a total of 65 teams, so you have to make predictions for:

Here's a snapshot of the sample submission file:

Code for creating the submission file:

submissionFile <- function(season) {
    playoffTeams <- sort(tourneySeeds$team[which(tourneySeeds$season == season)])
    numTeams <- length(playoffTeams)
    matrix <- matrix(nrow = numTeams, ncol = numTeams)
    for (i in c(1:numTeams)) {
        for (j in c(1:numTeams)) {
            matrix[i, j] <- paste(season, "_", playoffTeams[i], "_", playoffTeams[j], 
                sep = "")
        }
    }
    keep <- upper.tri(matrix, diag = F)
    idcol <- vector()
    for (i in c(1:numTeams)) {
        for (j in c(1:numTeams)) {
            if (keep[i, j] == T) {
                idcol <- c(idcol, matrix[i, j])
            }
        }
    }
    form <- data.frame(Matchup = idcol, Win = NA)
    return(form)
}
sub_file <- data.frame()
for (i in LETTERS[19:19]) {
    sub_file <- rbind(sub_file, submissionFile(i))
    # 14-18th letters: N to R, 19th is S
}

First Submission

To test the submission process, we simply guess 50% for every possible matchup. Then store this in a file called “sub1.csv”:

colnames(sub_file) <- c("id", "pred")
sub_file$pred <- 0.5
write.csv(sub_file, file = "sub1.csv", row.names = FALSE)

So here we go, first submission resulted in a ranking that fits in the “benchmark 50%” spot as expected!

Sample Model and How to Create Training Data Set

The following methodology is how the statsguys performed their logistic model.

Using playoffs as the response variables and the regular seasons win/loss as the predictors.

The first step is to create a data frame in R which lists these match ups in one column and the result of the game in another column. The following does this for season A:

  1. first, select the games from season A and name that data frame season_matches.
  2. next, loop through each row of season_matches and we concatenate the season “A” with the teamID for the winning team and the losing team.
  3. finally, place these newly formed strings into a new data frame train_data_frame along with the result of the game. So in the end, it'd look something like: A_515_729 for two teams that played in the “A” season.
season_matches <- tourneyRes[which(tourneyRes$season == "A"), ]
team <- vector()
result <- vector()
for (i in c(1:nrow(season_matches))) {
    row <- season_matches[i, ]
    if (row$wteam < row$lteam) {
        vector <- paste("A", "_", row$wteam, "_", row$lteam, sep = "")
        team <- c(team, vector)
        result <- c(result, 1)
    } else {
        oth <- paste("A", "_", row$lteam, "_", row$wteam, sep = "")
        team <- c(team, oth)
        result <- c(result, 0)
    }
}
train_data_frame <- data.frame(Matchup = team, Win = result)

Taking a look at the training data set created:

head(train_data_frame)
##     Matchup Win
## 1 A_515_729   1
## 2 A_555_559   0
## 3 A_576_666   1
## 4 A_577_581   0
## 5 A_551_604   0
## 6 A_539_629   0

Next, create a dataset which is organized by teamID rather than season. Doing this just for Season “A”:

# install.packages('stringr')
library("stringr")
# Selecting and sorting the playoff teamIDs least to greatest for season A
playoff_teams <- sort(tourneySeeds$team[which(tourneySeeds$season == "A")])

# Selecting the seeds for season A
playoff_seeds <- tourneySeeds[which(tourneySeeds$season == "A"), ]

# Selecting the regular season statistics for season A
season <- regSeason[which(regSeason$season == "A"), ]

# Wins by team (Regular Season wins against playoff teams)
win_freq_table <- as.data.frame(table(season$wteam))
wins_by_team <- win_freq_table[win_freq_table$Var1 %in% playoff_teams, ]

# Losses by team (Regular Season wins against playoff teams)
loss_freq_table <- as.data.frame(table(season$lteam))
loss_by_team <- loss_freq_table[loss_freq_table$Var1 %in% playoff_teams, ]

# Total Win Percentage (against playoff teams)
gamesplayed <- as.vector(wins_by_team$Freq + loss_by_team$Freq)
total_winpct <- round(wins_by_team$Freq/gamesplayed, digits = 3)
total_winpct_by_team <- as.data.frame(cbind(as.vector(loss_by_team$Var1), total_winpct))
colnames(total_winpct_by_team) <- c("Var1", "Freq")

# Num of wins in last 6 games
wins_last_six_games_by_team <- data.frame()
for (i in playoff_teams) {
    games <- season[which(season$wteam == i | season$lteam == i), ]
    numwins <- sum(tail(games$wteam) == i)
    put <- c(i, numwins)
    wins_last_six_games_by_team <- rbind(wins_last_six_games_by_team, put)
}
colnames(wins_last_six_games_by_team) <- c("Var1", "Freq")

# Seed in tournament
pattern <- "[A-Z]([0-9][0-9])"
team_seeds <- as.data.frame(str_match(playoff_seeds$seed, pattern))
seeds <- as.numeric(team_seeds$V2)
playoff_seeds$seed <- seeds
seed_col <- vector()
for (i in playoff_teams) {
    val <- match(i, playoff_seeds$team)
    seed_col <- c(seed_col, playoff_seeds$seed[val])
}
team_seed <- data.frame(Var1 = playoff_teams, Freq = seed_col)

# Combining columns together
team_metrics <- data.frame()
team_metrics <- cbind(total_winpct_by_team, wins_last_six_games_by_team$Freq, 
    team_seed$Freq)
colnames(team_metrics) <- c("TEAMID", "A_TWPCT", "A_WST6", "A_SEED")

Taking a look at the data frames created:

head(wins_by_team)
##    Var1 Freq
## 8   511   24
## 11  515   18
## 15  519   17
## 20  527   18
## 23  530   22
## 31  539   17
head(team_metrics)
##   TEAMID A_TWPCT A_WST6 A_SEED
## 1    511     0.8      5      3
## 2    515     0.6      2     12
## 3    519    0.63      5     14
## 4    527   0.643      2     11
## 5    530   0.759      5      8
## 6    539    0.63      4     12

NOTE: For the actual competition, the statsguys used the following predictors (only highlighted ones were shown in example above).

To get a better picture of what we have now, here is a summary (we have DF 1,2, and 3 below):

In order to arrive at DF4 above: 1. Sort DF 1 by the teamIDs defined in the column titled “A_ID” of DF 4 2. Sort DF 2 by the teamIDs defined in the column titled “B_ID” of DF 4 3. Combine the columns from the two data frames in 1 and 2 to create the training set

In order to do so easily, the statsguys created blog_utility.R to create the combined dataframe (specifically the train_frame_model() function):

source("blog_utility.R")
trainData <- data.frame()
for (i in LETTERS[1:13]) {
    trainData <- rbind(trainData, train_frame_model(i))
}
head(trainData)
##      Matchup Win HomeID AwayID A_TWPCT A_WST6 A_SEED B_TWPCT B_WST6 B_SEED
## 2  A_515_729   1    515    729     0.6      2     12   0.778      3      5
## 10 A_555_559   0    555    559   0.517      5     16   0.938      6      1
## 12 A_576_666   1    576    666   0.893      6     12   0.759      4      5
## 13 A_577_581   0    577    581     0.6      4      8   0.815      5      9
## 9  A_551_604   0    551    604   0.643      3      9   0.679      5      8
## 6  A_539_629   0    539    629    0.63      4     12   0.733      5      5

Applying Classification Tree Model

The statsguys use Rpart package rather than the tree package that we learned in STATS216.

library("rpart")
train_rpart <- rpart(Win ~ A_WST6 + A_SEED + B_WST6 + B_SEED, data = trainData, 
    method = "class")

Once again, the statsguys created a useful function for creating the test data set (test_frame_model() in blog_utility.R which they provided):

testData <- data.frame()
for (i in LETTERS[19:19]) {
    testData <- rbind(testData, test_frame_model(i))
}
## Warning: longer object length is not a multiple of shorter object length
## Warning: number of rows of result is not a multiple of vector length (arg 1)

head(testData)
##     Matchup Win HomeID AwayID A_TWPCT A_WST6 A_SEED B_TWPCT B_WST6 B_SEED
## 1 S_507_509  NA    507    509   0.562      5     16   0.625      5     15
## 2 S_507_511  NA    507    511   0.562      5     16   0.882      4      1
## 3 S_507_512  NA    507    512   0.562      5     16   0.656      2     10
## 4 S_507_521  NA    507    521   0.562      5     16   0.667      5      6
## 5 S_507_536  NA    507    536   0.562      5     16   0.667      5     10
## 6 S_507_538  NA    507    538   0.562      5     16   0.367      3     16

Now we can finally make prediction using the model. Create a submission file for Kaggle:

predictions_rpart <- predict(train_rpart, newdata = testData, type = "prob")
predictions <- predictions_rpart[, 1]
subfile <- data.frame(id = testData$Matchup, pred = predictions)
write.csv(subfile, file = "tree_model.csv", row.names = FALSE)

The second submission resulted in a ranking that is below the “benchmark 50%” (ranking around 200) spot as expected!

Let's try to make some improvements!!