Predicting March Madness

Overview

This project attempts to choose the top 8 teams in the NCAA division 1 basketball tournament. This will be accomplished by utlizing machine learning packages available in R. Two different approaches will be used to predict the top 8 teams:

  1. Using defensive teams statistics combined with an overall ranking based on strength of schedule

    • The response variable will be the ranking of the teams. This ranking is calculated based off of each team’s performance throughout the entire year. The best team will have a value of 1, ranging up to a value of 351 for the worst team. The explanatory variables considered will be the defensive statistics that have been collected for each team over multiple years.
  2. Using offensive team statistics with a ranking system based on NCAA Tournament success

    • The response variable will be a ranking value based on a team’s success in the NCAA Tournament. The team winning the NCAA Tournament will receive the most points with 64, down to 1 point for participating in the Round of 64. The explanatory variables considered in this approach will be the offensive statistics that have been collected for each team over multiple years.

Season averages will be used instead of game by game statistics in this project. There is a total of 10 year’s worth of data from 2006 - 2015. The training data will based on the seasons from 2006 - 2014, while the test data will be based on the 2015 season.

Three models will be used to predict the ranks of the teams per method:

  1. Defensive Method
    • Model Tree
    • Decision Tree
    • KNN (K-Nearest Neighbor)
  2. Offensive Method
    • KNN (K-Nearest Neighbor)
    • SVM (Support Vector Machine)
    • Random Forest

The R packages required for this project are:

  • RCurl
  • dplyr
  • stringr
  • RMySQL
  • XML
  • kernlab
  • RWeka
  • C50
  • class
  • splitstackshape
  • htmltab
  • rvest
  • knitr
  • tidyr
  • caret

Data Sources

College Basketball Season and Division Data - GitHub

The team data for this project was collected off of Github. The Github account had all of the games played by every Division 1, 2, and 3 game for the 2002 through 2015 basketball season. The same Github repository also has the divisions for all of the teams plus game data for each season.

The GitHub repository can be found here with CSV files for:

  • Games by Season from 2007 - 2016

  • Schools By Division

Team Statistics - www.teamrankings.com

All of the statistics relating to the college basketball teams were web scraped from teamrankings.com. Team statistics for seasons from 2006 to 2015 where extracted from this site and ultimately stored in a cloud MySQL database.

Statistics across the following 14 categories were captured and combined for each team per season:

  • Scoring
  • Shooting
  • Rebounding
  • Blocks & Steals
  • Assists & Turnovers
  • Fouls
  • Scoring Defense
  • Shooting Defense
  • Opponent Rebounding
  • Opponent Block & Steals
  • Opponent Assists & Turnovers
  • Opponent Fouls
  • Other
  • Winning Percentage
TeamRankings

TeamRankings

Data Collection and Preparation

Combine Season and Division Data

The Github data for season combines teams from a variety of divisions. It has data for Division 1 teams vs Divsion 4 teams and a lot of other combinations. Since the NCAA basketball tournament being considered is for division 1 teams, the division data and the season data had to be combined into one chunk of data. Then the entire data frame can be subset to only those games that are with two division 1 teams. Any other combination will be dropped. Some of the teams have different names from year to year (Connecticut is UCONN or Alcorn St. is Alcorn etc.) so in order for the matching to work properly, all the team names have to be the same, so some of the names in certain seasons had to be changed to match the divisonal data.

#Function: Data
#Input: 
#   div, data frame of the divisions for the teams
#   result, season data for the teams
#Output:
#   returns a data frame that has only division 1 games, all other games were dropped.
#This function will take the season data and the divisional data and combine them together. This gives the ability to pick only division 1 games.
data <- function(div, result)
{
  names(div) <- c("team_id", "division")
  jointeam <- left_join(result, div, by = "team_id")
  almost <- jointeam %>% select(year, team_name, opponent_name, opponent_id, team_score, opponent_score, division)
  colnames(almost)[colnames(almost)=="division"] <- "team_division"
  colnames(almost)[colnames(almost)=="opponent_id"] <- "team_id"
  joinopp <- left_join(almost, div, by = "team_id")
  colnames(joinopp)[colnames(joinopp)=="division"] <- "opp_division"
  final <- joinopp %>% select(year, team_name, opponent_name, team_score, opponent_score, team_division, opp_division)
  
  D1 <- subset(final, team_division == "I"&opp_division == "I")
  
  return(D1)
}

#Change the names of the teams to match between the divisions and the seasonal data. These teams are common amongst all of the years
Data <- data(div,result)
Data$team_name <- as.character(Data$team_name)
Data$team_division <- as.character(Data$team_division)
Data$opponent_name <- as.character(Data$opponent_name)
Data$opp_division <- as.character(Data$opp_division)
Data$opponent_name[Data$opponent_name == "Alcorn St."] <- "Alcorn"
Data$opponent_name[Data$opponent_name == "Connecticut"] <- "UConn"
Data$opponent_name[Data$opponent_name == "Md.-East. Shore"] <- "UMES"
Data$opponent_name[Data$opponent_name == "St. Francis (NY)"] <- "St. Francis Brooklyn"
Data$opponent_name[Data$opponent_name == "UNC Wilmington"] <- "UNCW"
Data$opponent_name[Data$opponent_name == "Cal St. Northridge"] <- "CSUN"
Data$opponent_name[Data$opponent_name == "S.C. Upstate"] <- "USC Upstate"
Data$opponent_name[Data$opponent_name == "Neb. Omaha"] <- "Omaha"
Data$opponent_name[Data$opponent_name == "CSU Bakersfield"] <- "Bakersfield"
Data$opponent_name[Data$opponent_name == "SIU Edwardsville"] <- "SIUE"
Data$opponent_name[Data$opponent_name == "Fla. Gulf Coast"] <- "FGCU"

#The stats for Utah Vallet stats were missing. I hade to create their offesnive stats to all the matching with the divisnal data.
#This only happens for the 2004 season.
if (Data$year == 2004)
{
  uvu <- subset(Data, Data$opponent_name == "Utah Valley")
  i <- 1
  for (i in 1:NROW(uvu))
  {
    temp <- uvu$opponent_name[i]
    uvu$opponent_name[i] <- uvu$team_name[i]
    uvu$team_name[i] <- temp
    
    temp <- uvu$opponent_score[i]
    uvu$opponent_score[i] <- uvu$team_score[i]
    uvu$team_score[i] <- temp
    i <- i + 1
  }
  Data <- rbind(Data, uvu)
}

#The 2016 season data had a tab character (/n) in the teams name. I had to split the list on the tab character, then extract the names of each team. Team names come in a wide variety, ranging from 1 to 4 name with a wide variety of punctuation in between. This is only for the 2016 season.

if (Data$year == 2016)
{
  names <- unlist(str_split(Data$opponent_name, '\n'))
  
  i <- 2
  j <- 1
  for (j in 1:(NROW(Data)))
  {
    Data$opponent_name[j] <- names[i]
    i <- i + 2
    j <- j + 1
  }
  Data$opponent_name <- unlist(str_extract_all(Data$opponent_name, "[[:alpha:]]{1,}[[:punct:]]{0,}[[:alpha:]]{0,}[[:space:]]{0,}[[:alpha:]]{0,}[[:punct:]]{0,}[[:alpha:]]{0,}[[:space:]]{0,}[[:alpha:]]{0,}[[:punct:]]{0,}[[:alpha:]]{0,1}[[:space:]]{0,}[[:alpha:]]{0,}[[:punct]]{0,}[[:alpha:]]{0,}[[:punct:]]{0,}"))
  Data$opponent_name[Data$opponent_name == "Stephen F. Austin"] <- "SFA"
  Data$opponent_name[Data$opponent_name == "Texas-Arlington"] <- "UT Arlington"
  Data$opponent_name[Data$opponent_name == "Citadel"] <- "The Citadel"
}

Seasonal Rankings

The response variable being used for this process is the end of the year rankings. Instead of using the rankings that are given by the coaches, this project will calculate its own based on the Massy ranking system. This sytem has been used for the official rankings of ESPN and was created by a student as a part of a graduate research project. Rankings are created on a game by game basis, so they can be updated as the season goes on. This system compares the result of each game to the difference in the scores.

col <- as.character(unique(Data$team_name))

matrix <- matrix(0, ncol = length(col), nrow = NROW(Data), byrow=TRUE)
dimnames(matrix) <- list(c(1:NROW(Data)), col)

#Function: Schedule
#Input:
#  matrix, an empty matrix to hold the information for the ranking system. It is a square matrix with the dimensions being the number of teams for that season
#  Data, a data frame that contains the seaonsal data for division 1
#OUtput:
#  matrix that contains the results for each game for each team during hte season.
#This function will create a matrix that is used for input into the ranking formula. Each column of the martix contain the team name. Each row of the matrix will contain a game. When two teams play against each other, the winner will get a 1 on the intersection and the loser will get a -1. The we would move onto the next game. This will continue for every game that was played for that season. 

schedule <- function(matrix, Data)
{
  i <- 1
  for (i in 1:NROW(Data))
  {
    if (as.numeric(Data[i,4]) > as.numeric(Data[i,5]))
    {
      matrix[i,Data[i,2]] = 1
      matrix[i,Data[i,3]] = -1
    } 
    else   
    {
      matrix[i,Data[i,2]] = - 1
      matrix[i,Data[i,3]] =  1
    }
    i <- i + 1
  }
  return(matrix)
}

Team <-schedule(matrix, Data)

#Function: scoredifferential
#Input:
#  Data, data frame containing the seasonal data
#Output:
# diff, a 1 by number of games matrix that conatins the difference in the score of each game
#This is a function that will calculate the score differential for each game that is played throughout the season. If team A beat team B 70-60. Then the entry will be 10, If team A lost to team B 60-70, then the entry would be -10.

scoredifferential <- function(Data)
{
  
  i <- 1
  diff <- matrix(0, nrow = NROW(Data), ncol = 1)
  
  for (i in 1 : NROW(Data))
  {
    
    diff[i] <- as.numeric(Data[i,4])-as.numeric(Data[i,5])
    i <- i + 1
  }
  return(diff)
}

diff <- scoredifferential(Data)
absdiff <- abs(diff)

#Transpose the matrix
tteam <- t(Team)
#Multiply the transpose by the original matrix
A <- tteam %*% Team
#Multiply the transpose matrix by the matrix taht contains the differences
B <- tteam %*% absdiff

#Make the last rowm of the socre matrix 1's and the last ebtry of the socre difference matrix a 0. If this is not done, then there is no solution due to the properties of the matrix.
A[NROW(A),] = 1
B[NROW(B),1] = 0

#Solve the linear equation to get the rankings of the teams.
rankings <- solve(A,B)

#order the teams by the rankings 
rankings <- as.data.frame(rankings)
ranks <- rankings[order(-rankings[1]), , drop = FALSE]
ranks$rank <- c(1:NROW(ranks))

Massy Ranking Algorithm

The table below shows the results of the Massy Ranking algorithm:

Y2015 2015_Rank Y2014 2014_Rank Y2013 2013_Rank Y2012 2012_Rank Y2011 2011_Rank
Kansas 23.65358 Kentucky 29.59169 Louisville 25.14265 Indiana 25.44025 Kentucky 25.41010
Michigan St. 23.19914 Wisconsin 24.85308 Arizona 23.76682 Louisville 25.31585 Ohio St. 24.63799
West Virginia 22.30338 Arizona 24.64689 Florida 21.11844 Florida 24.23370 Kansas 22.40911
North Carolina 22.11190 Duke 24.33190 Kansas 20.79188 Duke 21.93963 Michigan St. 22.05064
Villanova 21.91290 Virginia 23.07023 Wisconsin 20.11526 Michigan 21.80616 North Carolina 22.02033
Louisville 21.63333 Villanova 22.90716 Duke 20.01823 Kansas 21.31411 Syracuse 21.80872
Virginia 21.23472 Utah 21.43638 Oklahoma St. 19.54340 Gonzaga 21.05962 Missouri 20.45130
Duke 20.81610 Gonzaga 21.33082 Villanova 19.41018 Syracuse 20.80983 Wisconsin 19.73984
Oklahoma 20.58445 North Carolina 21.32353 Michigan St. 19.24373 Ohio St. 20.74490 Indiana 19.31590
Miami (FL) 20.14680 Ohio St. 20.88407 Michigan 19.09298 Pittsburgh 19.38626 Florida 18.93787
Purdue 20.04783 Oklahoma 19.95099 Iowa St. 19.08060 Wisconsin 19.27926 Wichita St. 18.09982
Indiana 19.98950 Kansas 19.71435 Virginia 18.88161 Michigan St. 19.17142 Duke 17.73333
Kentucky 19.64098 Baylor 19.35853 UCLA 18.83495 Arizona 17.79115 Marquette 17.62773
Xavier 19.56783 Iowa St. 19.20751 Creighton 18.51829 Miami (FL) 17.47762 Memphis 17.53664
SMU 19.24612 Louisville 19.06722 Kentucky 18.46476 Minnesota 17.39826 Georgetown 17.49761
Arizona 19.22254 Notre Dame 18.83728 Iowa 18.30519 VCU 17.38770 Baylor 17.22571
Vanderbilt 19.06473 Michigan St. 17.34379 Ohio St. 18.29454 Creighton 16.99766 Louisville 16.87967
Iowa 18.43319 Texas 17.25475 Pittsburgh 17.98544 Georgetown 16.34374 Vanderbilt 16.39438
Texas A&M 18.42961 Butler 17.08722 Tennessee 17.74630 Oklahoma St. 16.08799 New Mexico 16.33073
Maryland 18.00047 West Virginia 16.83346 UConn 17.67079 Missouri 15.82460 Saint Louis 15.83155

Processing Team Statistics

The 14 categories of team statistics were web scraped from www.teamrankings.com using a combination of the rvest and htmltab packages in R. In total, there are 115 NCAA college basketball statistics available on teamrankings.com per team per year. Additionaly, each statistics includes the team’s rank within the season for a given statistic.

baseURL <- "https://www.teamrankings.com/ncb/stats/"
url_content  <- read_html(baseURL)

stat.links <- url_content %>%  
              html_nodes("*") %>% 
              html_nodes(xpath = "./a") %>% 
              html_attr("href") %>%  
              str_match("/ncaa-basketball/stat/.+")   

stat.links <- stat.links[complete.cases(stat.links),]

The URL for each statistic is constructed to create 115 callable URLs which can be used with a date parameter for each season.

##  [1] "/ncaa-basketball/stat/points-per-game"         
##  [2] "/ncaa-basketball/stat/average-scoring-margin"  
##  [3] "/ncaa-basketball/stat/offensive-efficiency"    
##  [4] "/ncaa-basketball/stat/floor-percentage"        
##  [5] "/ncaa-basketball/stat/1st-half-points-per-game"
##  [6] "/ncaa-basketball/stat/2nd-half-points-per-game"
##  [7] "/ncaa-basketball/stat/overtime-points-per-game"
##  [8] "/ncaa-basketball/stat/average-1st-half-margin" 
##  [9] "/ncaa-basketball/stat/average-2nd-half-margin" 
## [10] "/ncaa-basketball/stat/average-overtime-margin"
baseStatURL <- "https://www.teamrankings.com"

season_year_end <- 2016

allStatsURL <- sprintf("%s%s%s", baseStatURL, stat.links, sprintf("?date=%s-04-06", season_year_end)) 

An example of the URLs constructured for the team statistics for the 2015 season is shown below:

Team Stats URLs

https://www.teamrankings.com/ncaa-basketball/stat/points-per-game?date=2016-04-06
https://www.teamrankings.com/ncaa-basketball/stat/average-scoring-margin?date=2016-04-06
https://www.teamrankings.com/ncaa-basketball/stat/offensive-efficiency?date=2016-04-06
https://www.teamrankings.com/ncaa-basketball/stat/floor-percentage?date=2016-04-06
https://www.teamrankings.com/ncaa-basketball/stat/1st-half-points-per-game?date=2016-04-06 https://www.teamrankings.com/ncaa-basketball/stat/2nd-half-points-per-game?date=2016-04-06 https://www.teamrankings.com/ncaa-basketball/stat/overtime-points-per-game?date=2016-04-06 https://www.teamrankings.com/ncaa-basketball/stat/average-1st-half-margin?date=2016-04-06
https://www.teamrankings.com/ncaa-basketball/stat/average-2nd-half-margin?date=2016-04-06
https://www.teamrankings.com/ncaa-basketball/stat/average-overtime-margin?date=2016-04-06

## -----------------------------------------------------------
## Function: get_team_stats
##
## -----------------------------------------------------------
  
get_team_stats <- function(statURL, year, createCSV = TRUE) {

        for (i in 1:length(statURL)) {     
            
            # Keep the rank, team, and stat for the given year
            statHTMLtab <- htmltab(doc = statURL[i], which = 1)[, 1:3]
            
            # parse the current stat name from the URL string
            statname <- str_replace_all(str_extract(statURL[i], "(?<=\\/stat\\/).*?.*?(?=\\?)"), "-", "_")
            
            # convert double-underscores to a single underscore
            statname <- str_replace_all(statname, "__", "_") 
            
            stats.df <- statHTMLtab %>% gather(key = Year, value = "stat", -Team, -Rank) 
            
            # convert the year from a character to integer
            stats.df$Year <- as.integer(stats.df$Year)

            # replace % signs in the statistic value
            stats.df$stat <- as.numeric(gsub("%|--", "", stats.df$stat))
            stats.df$Rank <- as.integer(stats.df$Rank)
        
            # rename the rank column 
            names(stats.df)[names(stats.df) == "Rank"] <- sprintf("%s_%s", statname, "rank")
            
            # rename the stat column to the stat being extracted in the URL
            names(stats.df)[names(stats.df) == "stat"] <- statname
            
            # reorder the columns
            stats.df <- stats.df[, c(3, 2, 4, 1)]
            
            
            if (i == 1) {return.df <- stats.df}
            else {
                
                return.df <- return.df %>% left_join(stats.df, by = c("Year" = "Year", "Team" = "Team"))
            }
            
        }
    
        if (createCSV) { 
            
            # write the team stats to a csv file
            write.csv(return.df, sprintf("team_stats_%s.csv", year - 1), row.names = FALSE)
        }
     
       # return the stats dataframe
       return (return.df)
    
}    

## -----------------------------------------------------------
## Function: load_RMySQL_DB
##
## -----------------------------------------------------------

load_RMySQL_DB <- function(tablename, df, overwrite = TRUE, append = FALSE) {
    
    # establish the connection to the skill DB on db4free.net
    bbmetricDB = dbConnect(MySQL(), user=proj_user, password=proj_pwd, dbname=proj_db, host=proj_host)
    
    dbWriteTable(bbmetricDB, 
                 name = tablename, 
                 value = df,
                 overwrite = overwrite, 
                 append = append, 
                 row.names = FALSE)
    
    dbDisconnect(bbmetricDB)
}

Using the function get_team_stats, teams statistics for the NCAA basketball seasons (starting) from 2006 - 2015

team.stats.2016 <- get_team_stats(allStatsURL, season_year_end, createCSV = FALSE) 

The team statistics for the 2015 - 2016 season (partial listing):

team_name source_team_name season_start_year season_end_year assist_per_turnover_ratio assist_per_turnover_ratio_rank assists_per_fgm assists_per_fgm_rank assists_per_game assists_per_game_rank
North Carolina N Carolina 2015 2016 1.647 3 0.573 58 17.7 4
Michigan St. Michigan St 2015 2016 1.737 1 0.713 1 20.5 1
Mississippi St. Miss State 2015 2016 1.099 137 0.507 210 13.9 122
West Virginia W Virginia 2015 2016 1.024 179 0.541 130 14.6 78
Virginia Tech VA Tech 2015 2016 0.955 231 0.486 253 12.2 236
Florida St. Florida St 2015 2016 1.041 170 0.478 274 13.3 160
Ole Miss Mississippi 2015 2016 1.067 155 0.518 183 12.9 188
North Carolina St. NC State 2015 2016 1.052 163 0.442 332 11.6 278
Saint Joseph’s St Josephs 2015 2016 1.487 11 0.547 114 14.8 67
Fresno St. Fresno St 2015 2016 1.210 84 0.494 240 12.9 185
George Washington Geo Wshgtn 2015 2016 1.256 62 0.546 115 14.1 109
Colorado St. Colorado St 2015 2016 1.126 119 0.513 192 13.1 173
South Carolina S Carolina 2015 2016 0.996 201 0.520 172 13.5 149
UTEP TX El Paso 2015 2016 1.237 70 0.573 59 15.5 41
Boise St. Boise State 2015 2016 1.130 114 0.508 209 13.1 175
Iowa St. Iowa State 2015 2016 1.446 21 0.531 149 16.5 16
Utah Valley Utah Val St 2015 2016 1.074 149 0.557 90 14.0 112
Western Ky. W Kentucky 2015 2016 1.000 196 0.496 234 13.2 169
San Francisco San Fransco 2015 2016 0.960 229 0.462 303 12.3 229
SMU S Methodist 2015 2016 1.427 24 0.615 23 17.6 5

This process of assembling team statistics was repeated to collect 10 years of statistics. Each year’s combined team statistics were loaded into a cloud MySQL database, described below.

Collecting NCAA Tournament Results

NCAA Tournament results from collected from 2007 - 2016. For each year, the teams making the Round of 64, Round of 32, Sweet 16, Elite 8, Final 4, Championship Game, and Champion were assembled and stored.

The process to collect the NCAA Tournament bracket teams was somewhat manual since the winners of each round are most commonly represented in a bracket graphic versus as a list of teams.

The site http://www.cbssports.com/collegebasketball/ncaa-tournament/history/yearbyyear provided the winners of each round up to 2012 but stopped beyond this tournament year.

NCAA Bracket Winners

NCAA Bracket Winners

Each round of winners was parsed using the function below, which uses the splitstatshape R package to create a dataframe of the teams in each bracket.

suppressWarnings(suppressMessages(library(splitstackshape)))

parse_bracket_results <- function(team_string) {

        str(team_string)
        
        team_string.new <- 
          strsplit(
            str_replace_all(s, "No. |[0-9]|(OT)|OT|\\(|\\)", "") , "\n")
        
        f <- as.data.frame(team_string.new); 
        
        f <- as.data.frame(str_c(str_replace(as.character(f[, 1]), "(.),(.)", "|"), "|"))
        colnames(f) <- "team"             
                    
        
        f <- as.data.frame(cSplit(f, "team", "|", direction = "long"))
        f$team <- as.character(f$team)
        
        f <- f %>% 
              select(team) %>% 
              filter(team != "") %>% 
              mutate(team = str_replace_all(team, " ,$| $", ""))

        return (f)

}

Sample of the NCAA bracket winners for 2016:

tournament_year round_64 round_32 sweet_16 elite_8 final_4 championship_game champion
2016 Kansas Kansas Kansas Kansas Villanova Villanova Villanova
2016 Villanova Connecticut Maryland Villanova Oklahoma North Carolina
2016 Miami Maryland Miami (FL) Oregon North Carolina
2016 California Hawaii Villanova Oklahoma Syracuse
2016 Maryland Wichita State Oregon North Carolina
2016 Arizona Miami (FL) Duke Notre Dame
2016 Iowa Iowa Texas A&M Virginia
2016 Colorado Villanova Oklahoma Syracuse
2016 Connecticut Oregon North Carolina
2016 Temple Saint Joseph’s Indiana
2016 Wichita State Yale Notre Dame
2016 South Dakota State Duke Wisconsin
2016 Hawaii Northern Iowa Virginia
2016 Buffalo Texas A&M Iowa State
2016 UNC Asheville VCU Gonzaga
2016 Austin Peay Oklahoma Syracuse
2016 Oregon North Carolina
2016 Oklahoma Providence
2016 Texas A&M Indiana
2016 Duke Kentucky
2016 Baylor Notre Dame
2016 Texas Stephen F. Austin
2016 Oregon State Wisconsin
2016 Saint Joseph’s Xavier
2016 Cincinnati Virginia
2016 VCU Butler
2016 Northern Iowa Arkansas-Little Rock
2016 Yale Iowa State
2016 UNC Wilmington Gonzaga
2016 Green Bay Utah
2016 Cal State Bakersfield Syracuse
2016 Holy Cross Middle Tennessee
2016 North Carolina
2016 Xavier
2016 West Virginia
2016 Kentucky
2016 Indiana
2016 Notre Dame
2016 Wisconsin
2016 USC
2016 Providence
2016 Pittsburgh
2016 Michigan
2016 Chattanooga
2016 Stony Brook
2016 Stephen F. Austin
2016 Weber State
2016 Florida Gulf Coast
2016 Virginia
2016 Michigan State
2016 Utah
2016 Iowa State
2016 Purdue
2016 Seton Hall
2016 Dayton
2016 Texas Tech
2016 Butler
2016 Syracuse
2016 Gonzaga
2016 Arkansas-Little Rock
2016 Iona
2016 Fresno State
2016 Middle Tennessee
2016 Hampton

The NCAA bracket winners for the 2013, 2014, 2015, and 2016 tournaments were manually extracted from Wikipedia:

https://en.wikipedia.org/wiki/2013_NCAA_Men%27s_Division_I_Basketball_Tournament

The bracket results for ten seasons can be found here in CSV format. These files were subsequently loaded into a MySQL database using the load_RMySQL_DB function

MySQL Database

To facilitate centralized data storage and access for this project, a MySQL database on dbfree.net was chosen as a cloud repository for the team statistics and bracket results.

MySQL DB

MySQL DB

The team statistics assembled for each year were staged in the MySQL database and combined into a single, consolidated team statistics table. The SQL for the table DDL and the dynamically executed insert statement can be found here.

The code to insert the team statistics by year is shown below:

## Insert Team Stats By Year 

suppressWarnings(suppressMessages(library(readr)))

url <- "https://raw.githubusercontent.com/kfolsom98/DATA607/master/Final-Project/sql/insert_team_stats_year.sql"

ins <- read_file(url)

stat.tables  <-    dbGetQuery(bbmetricDB, "SELECT DISTINCT TABLE_NAME 
                                             FROM INFORMATION_SCHEMA.COLUMNS 
                                            WHERE TABLE_SCHEMA = 'bbmetric' 
                                              AND TABLE_NAME like 'team_stats_%'")

dbGetQuery(bbmetricDB, "TRUNCATE TABLE team_stats")

for (i in 1:nrow(stat.tables)) {
    
    insert <- str_replace(ins, "<<!table_name>>", stat.tables[i, ])
    
    print(paste0("Preparing insert for ", stat.tables[i, ]))
          
    res <- dbSendQuery(bbmetricDB, insert)

}


dbSendQuery(bbmetricDB, "select count(*) from team_stats")

The NCAA bracket result data was combined locally and uploaded to the MySQL database with this process:

## Load the ncaa bracket data ##
 
 suppressWarnings(suppressMessages(library(data.table)))

 df <- data.frame()
 
 files <- list.files(pattern="ncaa*.*.csv")
 
 for (f in files)  df<-rbind(Dataf, fread(f))
 
 ## push the brackets per year to the DB
 
 dbSendQuery(bbmetricDB, "TRUNCATE TABLE ncaa_brackets")
 
 load_RMySQL_DB("ncaa_brackets", df, overwrite = FALSE, append = TRUE)
 
 #close the connection
 dbDisconnect(bbmetricDB)

One significant challenge encountered while processing the team statistics and bracket winners was that the same team could be entered with multiple variations of a name. To standardize the team names, the source team name was mapped to a consistent team name while being inserted into the team_stats combined table.

To illustrate the approach, the following example shows the mappings for the team St. Mary’s (CA) and all the variations seen in the source team stats data.

old_name new_name
St Marys St. Mary’s (CA)
St. Mary’s St. Mary’s (CA)
Saint Mary’s, Calif. St. Mary’s (CA)
Saint Mary’s Calif. St. Mary’s (CA)
St. Mary’s (Calif.) St. Mary’s (CA)
Saint Mary’s St. Mary’s (CA)

The final step in preparing the combined team statistics table is to apply the NCAA bracket winners for each year. A team reaching a specific level in the bracket (e.g. Sweet 16 or Final 4) for a specific year will be given a value of 1, as an indicator variable. Otherwise, the team will have a value of 0 for each bracket level.

 ## UPDATE - Reset Bracket Values  
 
 dbGetQuery(bbmetricDB, " update team_stats ts
                             set ts.ncaa_round_64 = 0,
                                 ts.ncaa_round_32 = 0,
                                 ts.ncaa_sweet_16 = 0,
                                 ts.ncaa_elite_8  = 0, 
                                 ts.ncaa_final_4  = 0, 
                                 ts.ncaa_championship_game  = 0, 
                                 ts.ncaa_champion = 0
            ") 
 
## UPDATE  - Set the bracket column value to 1 for each team in the given bracket for the given year  
 
update_cols <- c("round_64","round_32", "sweet_16", 
                 "elite_8", "final_4",
                 "championship_game", 
                 "champion")

 
for (i in 1:length(update_cols)) {
    
   update <- 
    sprintf(" update team_stats ts
               inner join 
                     (
                      select nb.tournament_year, 
                             ifnull(map.new_name, nb.%s) team_name
                        from ncaa_brackets nb  left outer join team_name_map map on nb.%s = map.old_name) br 
                  on ts.season_end_year = br.tournament_year and ts.team_name = br.team_name
                 set ts.ncaa_%s = 1", update_cols[i], update_cols[i], update_cols[i])
 
    dbGetQuery(bbmetricDB, update)
                      
}

Extracting Defensive Stats - MySQL Database

#Get the column names for the query
column_names <- read.csv("https://raw.githubusercontent.com/DanielBrooks39/IS607/master/Final_Project/Column_names.csv", sep=",", header=FALSE, stringsAsFactors = FALSE)
years <- c("2015", "2014", "2013", "2012", "2011", "2010", "2009", "2008", "2007", "2006")

#Function: getdata
#Input:
#  years, the years for the database table
#  column_names, the names of the columns that are in the database
#  username, the username to connect tot the database
#  password, the password for teh database
#  database, the database name to connect to the database
#  hotname, the hostname where the database resides
#Output:
#  Writes a csv to the working directoy for every table that is specified by the query.
#This function will create a query for each of the years that we wabt in the database. It will connect to the database and poull back the desired columns for each year that we want. It will take each year that is qurries and writes it to a csv on the working directory.
getdata <- function(years, column_names, username, password, database, hostname)
{
  con <- dbConnect(MySQL(), user=username, password = password, dbname = database, host = hostname)

  base_table <- "team_stats_"
  
  i <- 1
  j <- 1
  
  for (i in 1 : NROW(years))
  {
    switch(years[i],
           "2015" = data <- data.frame(double(351)), 
           "2014" = data <- data.frame(double(351)),
           "2013" = data <- data.frame(double(351)),
           "2012" = data <- data.frame(double(347)),
           "2011" = data <- data.frame(double(345)),
           "2010" = data <- data.frame(double(345)),
           "2009" = data <- data.frame(double(345)),
           "2008" = data <- data.frame(double(341)),
           "2007" = data <- data.frame(double(341)),
           "2006" = data <- data.frame(double(336))
          )
    for (j in 1 : NROW(column_names))
    {
      query <- dbGetQuery(con, paste("select", " ", column_names$V1[j], " ", "from", " ", base_table, years[i], sep=""))
      data[,j] <- query
      j <- j + 1 
    }
    write.csv(data, paste("data", years[i], ".csv", sep = ""), append = TRUE, col.names = TRUE)
    i <- i + 1
  }
  dbDisconnect(con)
}

getdata(years, column_names, username, password, database, hostname)

Manipulate Query Data

The database data is read in from the CSV’s that were created earlier. The CSV’s are read in one at a time and stored in their separate data frames. There is also a mismatch between the database team names and the names that are used in the rankings. The teams names need to be the same in order to join the two tables together and and get everything in the same data frame.

data2014 <- read.csv("data2014.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)
colnames(data2014)[2] <- "V1"
data2015 <- read.csv("data2015.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)
colnames(data2015)[2] <- "V1"

Modeling

Predictions Based on Defensive Statistics

The following section will cover the processes used to model the defensive statistics to predict the top 8 teams. The models used will be Model Tree, Decision Tree, and KNN. The following lists the defensive variables used.

Defensive Variables
Variable Variable Variable
Team_name opponent_two_point_rate opponent_turnover_pct
opponent_points_per_game opponent_fta_per_fga opponent_personal_fouls_per_game
opponent_average_scoring_margin opponent_ftm_per_100_possessions opponent_personal_fouls_per_possession
defensive_efficiency opponent_free_throw_rate opponent_personal_foul_pct
opponent_floor_percentage opponent_non_blocked_2_pt_pct opponent_win_pct_close_games
opponent_1st_half_points_per_game opponent_offensive_rebounds_per_game opponent_win_pct_all_games
opponent_2nd_half_points_per_game opponent_defensive_rebounds_per_game opponent_effective_possession_ratio
opponent_overtime_points_per_game opponent_offensive_rebounding_pct games_played
opponent_points_from_2_pointers opponent_defensive_rebounding_pct
opponent_points_from_3_pointers opponent_blocks_per_game
opponent_percent_of_points_from_2_pointers opponent_steals_per_game
opponent_percent_of_points_from_3_pointers opponent_block_pct
opponent_percent_of_points_from_free_throws opponent_steals_perpossession
opponent_shooting_pct opponent_steal_pct
opponent_effective_field_goal_pct opponent_assists_per_game
opponent_three_point_pct opponent_turnovers_per_game
opponent_two_point_pct opponent_assist_per_turnover_ratio
opponent_free_throw_pct opponent_assists_per_fgm
opponent_true_shooting_percentage opponent_assists_per_possession
opponent_three_point_rate opponent_turnovers_per_possession

Training and Test Data

For the machine learning algorithms to work, it needs to be trained before it can predict anything. For this data, the training data will be the year of 2006-2014. The test data will be the most current year(2015). The training data is used to give the model a baseline and allow it to make groups or trees (depedning on the model type). Then the model can predict the rankings or the class of the test data.

traindata <- rbind(table2006, table2007, table2008, table2009, table2010, table2011, table2012,table2013, table2014)
traindata <- na.omit(traindata)
traindata <- traindata[,-1]
traindata$rank <- factor(traindata$rank)

testdata <- rbind(table2015)
testdata <- na.omit(testdata)
testnames <- testdata[1]
testdata <- testdata[,-1]
testdata$rank <- factor(testdata$rank)

Model Tree

The first model that will be used to calculate the top eight teams in division one basketball woud be the model tree. This is a more robust decision tree. Basically, it takes the training dataset and comes up with a set of rules. Each of those rules are a node on the tree. The difference between a model tree and a decision tree is, a model tree gives an equation for each node of a tree. Each rull that is decided on by the model will give a specific formula. That formula is then used to forecast the categories of the new data, based on what nodes the new data hits. For the case of this model, there was only one rule that was decided by the model. The equation is:

\[ rank = -14.831 * opponent_points_per_game + 13.6137 * opponent_average_scoring_margin + 852.236 * defensive_efficiency - 3.8714 * opponent_floor_percentage + 4.5235 * opponent_1st_half_points_per_game + 5.0138 * opponent_2nd_half_points_per_game + 7.6484 * opponent_overtime_points_per_game + 10.7681 * opponent_points_from_2_pointers + 10.0136 * opponent_points_from_3_pointers + 8.4694 * opponent_percent_of_points_from_2_pointers + 22.4938 * opponent_percent_of_points_from_3_pointers + 10.4896 * opponent_percent_of_points_from_free_throws - 33.1115 * opponent_shooting_pct - 3.455 * opponent_effective_field_goal_pct - 3.8401 * opponent_three_point_pct + 23.4914 * opponent_two_point_pct + 3.9098 * opponent_true_shooting_percentage - 16.814 * opponent_three_point_rate + 814.8023 * opponent_fta_per_fga - 10.9082 * opponent_ftm_per_100_possessions - 5.4238 * opponent_non_blocked_2_pt_pct - 8.2394 * opponent_offensive_rebounds_per_game + 3.6402 * opponent_offensive_rebounding_pct - 37.5087 * opponent_blocks_per_game + 16.5483 * opponent_steals_per_game + 16.4092 * opponent_block_pct - 28.4589 * opponent_steals_perpossession + 19.2984 * opponent_steal_pct + 5.6475 * opponent_assists_per_game - 9.1663 * opponent_turnovers_per_game + 243.8074 * opponent_assists_per_fgm - 1320.9676 * opponent_assists_per_possession + 52.0246 * opponent_turnovers_per_possession - 62.0717 * opponent_turnover_pct + 17.4931 * opponent_personal_fouls_per_game - 2.5374 * opponent_personal_fouls_per_possession - 12.8002 * opponent_personal_foul_pct - 98.1536 * opponent_win_pct_all_games - 1181.6373 * opponent_effective_possession_ratio - 8.2799 * games_played + 994.0719 \]

This can be used to get the new order of the teams for the new data.

====Summary==== Value
Correlation coefficient: 0.891
Mean absolute error: 35.658
Root mean squared error: 44.7633
Relative absolute error: 41.7723 %
Root relative squared error: 45.4043 %
Total Number of Instances: 2952

The summary data is showing that the correlation between this equation and the rank is rather high. Implying that this is a good equation to use for ranking of the teams.

suppressWarnings(suppressMessages(library(RWeka)))

traindata$rank <- as.numeric(traindata$rank)
testdata$rank <- as.numeric(testdata$rank)

model_M5p <- M5P(rank~., data = traindata)
pred <- predict(model_M5p, testdata[,-1])
pred <- as.data.frame(pred)
pred$team <- testnames
sort <- pred[with(pred, order(pred)), ]

Decision Tree

A decision tree follows the same idea as a model tree. The data is broken apart into “rules”. Each rule is a node of a tree. Depending on which side of the tree the data falls upon, determines what category the data will end up in. This kind of tree doesn’t have any equations associated with it, so it will be used for classification and not used to atualy predict the value of the team.

Classification Tree Value
Number of Samples: 2952
Number of Predictors 47
Tree Size: 1213

We can see that there was 2952 samples that were used to make the tree. There was 47 Predictors that were used in the model and the tree has 1213 nodes in it.

We can also see the break down of what stats the tree uses. The more important stats get used the most in the model. Here are some of the stats used and the amount they were used in the model:

Attribute % Usage
opponent_average_scoring_margin 100%
games_played 92.65%
opponent_points_per_game 36.79%
opponent_overtime_points_per_game 36.52%
opponent_free_throw_pct 33.16%
opponent_defensive_rebounding_pct 28.22%
opponent_blocks_per_game 26.56%
opponent_floor_percentage 25.68%
opponent_turnover_pct 23.78%
opponent_assists_per_possession 21.54%
opponent_win_pct_all_games 21.00%
suppressWarnings(suppressMessages(library(C50)))

trainmodel <- C5.0(traindata[,-1], traindata[,1, trials = 3])
testmodel <- predict(trainmodel, testdata[,-1], type="class")
rank <- as.data.frame(testmodel)

rank$teams <- testnames
names(rank) <- c("ranks", "teams")
sort_teams <- rank[with(rank, order(ranks)), ]

KNN (K-Nearest Neighbor) Model

The KNN model is the third model that is being used in this project. This model takes the number of categories, in our case, the rank, and separates the data into groups. In our case there are around 351 groups. The data is then placed into their groups and their stats act like a coordinate (in two dimensions x and y). The new data is then brought in and the euclidean distance is calculated between the new data and all the points that were in the training set. The closest point to the new point is how the category gets determined.

suppressWarnings(suppressMessages(library(class)))

model <- knn(traindata[,-1], testdata[,-1], traindata[,1], k = 3 )
rank <- as.data.frame(model)
rank$team <- testnames
names(rank) <- c("ranks", "teams")
sort_teams_knn <- rank[with(rank, order(ranks)), ]

Defensive Prediction Results

These are the results that were obtained by the machine learning algorithms. It shows the break down of the top ten teams by the model that predicted them. It then shows the ranking of the teams with the three models combined. This is calculated by the 1st place team getting 10 pts and the last place team getting 1. This is repeated for all of the models. The points are totaled up and then the teams are ranked by the total points. The last part comapres the predictied results with the actual results. It comapares where the top 8 teams finished in the actual tournament to where the models predicted they would finish.

Rank KNN Model Tree Decision Tree
1 Villanova Michigan St. Villanova
2 Kansas Villanova North Carolina
3 SMU North Carolina Indiana
4 Syracuse Purdue Kentucky
5 Miami (FL.) Kansas Virgina
6 Gonzaga Wichita St. Michigan St.
7 Arizona Virgina Louisville
8 Oklahoma Indiana UCONN
9 BYU Gonzaga Kansas
10 Yale St. Mary’s (CA) Xavier
Rank Teams Points
1 Villanova 29
2 North Carolina 17
3 Kansas 17
4 Michigan St. 15
5 Indiana 11
6 Virgina 10
7 SMU 8
8 Syracuse 7
Tournament Place Actual Teams Predicted
Champ Villanova Villanova
Finals North Carolina North Carolina
Final 4 Syracuse Kansas
Final 4 Oklahoma Michigan St
Elite 8 Virgina Virgina
Elite 8 Oregon Indiana
Elite 8 Kansas SMU
Elite 8 Notre Dame Syracuse

Predictions Based on Offensive Statistics

The following section will cover the process of modeling the offensive statistics to predict the top 8 teams. The models used will be KNN, SVM, and Random Forest. As a precursor to modeling, the offensive statistics are retrieved from the cloud MySQL database.

bbmetricDB <- DBconnect()

stats  <- dbGetQuery(bbmetricDB, 
 "                              
  select  
         team_name,                season_start_year,
         points_per_game,  average_scoring_margin,   offensive_efficiency,
         floor_percentage, overtime_points_per_game, average_1st_half_margin,
         average_2nd_half_margin, average_overtime_margin, points_from_2_pointers,
         points_from_3_pointers, percent_of_points_from_2_pointers, 
         percent_of_points_from_3_pointers,
         percent_of_points_from_free_throws, shooting_pct, effective_field_goal_pct,
         three_point_pct,  two_point_pct,  free_throw_pct,
         true_shooting_percentage, field_goals_made_per_game, field_goals_attempted_per_game,
         three_pointers_made_per_game, three_pointers_attempted_per_game,
         free_throws_made_per_game, free_throws_attempted_per_game, three_point_rate,
         two_point_rate, fta_per_fga, ftm_per_100_possessions,  free_throw_rate,
         non_blocked_2_pt_pct, offensive_rebounds_per_game, offensive_rebounding_pct,
         assists_per_game, turnovers_per_game,  turnovers_per_possession,
         assist_per_turnover_ratio, assists_per_fgm, assists_per_possession,
         turnover_pct,games_played, possessions_per_game,
         extra_chances_per_game, effective_possession_ratio, win_pct_all_games,
         ncaa_round_64,   ncaa_round_32, ncaa_sweet_16,
         ncaa_elite_8,   ncaa_final_4,   ncaa_championship_game,
         ncaa_champion,
         ncaa_round_64 +
         ncaa_round_32 * 2 +
         ncaa_sweet_16 * 4 +
         ncaa_elite_8 * 8 + 
         ncaa_final_4 * 16 +
         ncaa_championship_game * 32 +
         ncaa_champion  * 64 as weighted_rank
   from team_stats")

The following offensive statistics will be used to to predict the top 8 teams of the 2016 NCAA tournament. Not all statistics will be important but are included from comprehesiveness. The model will determine the importance of each variable during the training process.

Offensive Variables
Variable Variable Variable
team_name true_shooting_percentage assists_per_possession
season_start_year field_goals_made_per_game turnover_pct
points_per_game field_goals_attempted_per_game games_played
average_scoring_margin three_pointers_made_per_game possessions_per_game
offensive_efficiency three_pointers_attempted_per_game extra_chances_per_game
floor_percentage free_throws_made_per_game effective_possession_ratio
overtime_points_per_game free_throws_attempted_per_game win_pct_all_games
average_1st_half_margin three_point_rate ncaa_round_64
average_2nd_half_margin two_point_rate ncaa_round_32
average_overtime_margin fta_per_fga ncaa_sweet_16
points_from_2_pointers ftm_per_100_possessions ncaa_elite_8
points_from_3_pointers free_throw_rate ncaa_final_4
percent_of_points_from_2_pointers non_blocked_2_pt_pct ncaa_championship_game
percent_of_points_from_3_pointers offensive_rebounds_per_game ncaa_champion
percent_of_points_from_free_throws offensive_rebounding_pct weighted_rank
shooting_pct assists_per_game
effective_field_goal_pct turnovers_per_game
three_point_pct turnovers_per_possession
two_point_pct assist_per_turnover_ratio
free_throw_pct assists_per_fgm

The predictor variable will be weighted_rank, which is calculated based on a team’s advancement in the NCAA tournament.

Bracket Points Assigned
Round of 64 1
Round of 32 2
Sweet 16 4
Elite 8 8
Final 4 16
Championship Game 32
Champion 64

Training and Test Data

Create a offensive training dataset using the team statistics for years prior to 2015.

# create a training dataset based on all years prior to 2015
stats_train <- subset(stats, season_start_year < 2015) # take all years prior to 2015

# store the team, year, ncaa_tournament indicators, and the weigted bracket rank
stats_train_results <- select(stats_train, team_name, season_start_year, ncaa_round_64:weighted_rank)

Create the test dataset using the team statistics for the 2015 NCAA basketball season.

# create a training dataset based on all years prior to 2015
stats_test <- filter(stats, season_start_year == 2015, ncaa_round_64 == 1) 


# store the team, year, ncaa_tournament indicators, and the weigted bracket rank
stats_test_results <- select(stats_test, team_name, season_start_year, ncaa_round_64:weighted_rank)

Finally, remove the columns from the training dataset which will not be used in models

stats_train <- select(stats_train, -(team_name:season_start_year), -(ncaa_round_32:ncaa_champion))

str(stats_train)
## 'data.frame':    3124 obs. of  47 variables:
##  $ points_per_game                   : num  85.7 74.5 78.6 76.9 72.1 70.8 74.5 73.5 74.3 70.4 ...
##  $ average_scoring_margin            : num  17.1 12.5 9.2 8.4 9 6.1 5.5 8 4.5 7.8 ...
##  $ offensive_efficiency              : num  1.15 1.13 1.1 1.08 1.11 ...
##  $ floor_percentage                  : num  56.4 53.3 53.3 51.2 49.5 52.7 51 51.5 53.9 50.4 ...
##  $ overtime_points_per_game          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ average_1st_half_margin           : num  8.3 7.3 4.9 4.9 7.3 1.7 4 3.8 3 6.3 ...
##  $ average_2nd_half_margin           : num  9.1 5 4.3 3.4 2.4 4.2 1.3 4.2 1.6 1.2 ...
##  $ average_overtime_margin           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ points_from_2_pointers            : num  51.2 38.1 45.9 38.6 30.3 42.1 39.9 35.9 39.2 39.5 ...
##  $ points_from_3_pointers            : num  17 22.1 18.8 23.6 30.6 14.2 20.3 22.1 17.8 14.2 ...
##  $ percent_of_points_from_2_pointers : num  59.7 51.1 58.4 50.2 42 59.5 53.6 48.9 52.7 56.2 ...
##  $ percent_of_points_from_3_pointers : num  19.8 29.6 24 30.6 42.4 20 27.3 30.1 24 20.2 ...
##  $ percent_of_points_from_free_throws: num  20.5 19.3 17.7 19.1 15.5 20.3 19.1 21 23.4 23.6 ...
##  $ shooting_pct                      : num  49.9 47.2 49 45.9 45.5 46.7 48.1 46.8 47.8 42.2 ...
##  $ effective_field_goal_pct          : num  54.4 53.8 54.2 52.6 54.7 50.9 54.2 53.6 53.3 46.3 ...
##  $ three_point_pct                   : num  35.8 36.1 37.2 35.5 37.4 36.4 38 39.2 34.9 32.1 ...
##  $ two_point_pct                     : num  54.6 53.6 53.6 52.2 53.4 49.8 52.8 50.9 53.8 45.7 ...
##  $ free_throw_pct                    : num  71.2 70.3 68.1 66.5 71.3 66.2 75.8 69.8 72.4 62.8 ...
##  $ true_shooting_percentage          : num  115 114 113 110 114 ...
##  $ field_goals_made_per_game         : num  31.2 26.4 29.2 27.2 25.3 25.8 26.7 25.3 25.5 24.5 ...
##  $ field_goals_attempted_per_game    : num  62.6 55.8 59.7 59.2 55.7 55.3 55.6 54.1 53.4 58 ...
##  $ three_pointers_made_per_game      : num  5.7 7.4 6.3 7.9 10.2 4.7 6.8 7.4 5.9 4.7 ...
##  $ three_pointers_attempted_per_game : num  15.8 20.4 16.9 22.1 27.3 13 17.8 18.8 17 14.8 ...
##  $ free_throws_made_per_game         : num  17.6 14.4 13.9 14.7 11.2 14.4 14.2 15.4 17.4 16.6 ...
##  $ free_throws_attempted_per_game    : num  24.7 20.4 20.4 22.1 15.7 21.8 18.8 22.1 24 26.4 ...
##  $ three_point_rate                  : num  25.2 36.5 28.3 37.4 49 23.5 32 34.8 31.9 25.5 ...
##  $ two_point_rate                    : num  74.8 63.5 71.7 62.6 51 76.5 68 65.2 68.1 74.5 ...
##  $ fta_per_fga                       : num  0.395 0.366 0.342 0.374 0.282 0.394 0.337 0.408 0.449 0.455 ...
##  $ ftm_per_100_possessions           : num  23.7 21.7 19.5 20.6 17.2 ...
##  $ free_throw_rate                   : num  28.2 26.5 24.1 26.4 21.1 28.4 23.7 28.2 30.5 30.8 ...
##  $ non_blocked_2_pt_pct              : num  59.8 57.7 57.7 58.6 58.3 53.2 57.9 57.6 59.5 50.6 ...
##  $ offensive_rebounds_per_game       : num  13.4 10.9 13.5 12.4 9.4 10.3 10.3 10.4 11.7 15.5 ...
##  $ offensive_rebounding_pct          : num  39.2 34.1 41.1 35.2 29.4 32.2 34.7 32.6 38 41.5 ...
##  $ assists_per_game                  : num  18.3 14.9 15.3 14.6 17.4 12.9 12.4 16.2 14.6 13 ...
##  $ turnovers_per_game                : num  13.8 11.3 15.7 14 11.3 10.8 14.6 13.2 13.7 14.9 ...
##  $ turnovers_per_possession          : num  18.5 17.1 22.1 19.6 17.3 16.3 21.2 19.4 20.5 21.3 ...
##  $ assist_per_turnover_ratio         : num  1.328 1.325 0.978 1.045 1.546 ...
##  $ assists_per_fgm                   : num  0.586 0.567 0.525 0.538 0.687 0.499 0.463 0.639 0.574 0.529 ...
##  $ assists_per_possession            : num  0.246 0.226 0.216 0.205 0.267 0.194 0.18 0.239 0.219 0.185 ...
##  $ turnover_pct                      : num  15.7 14.7 18.5 16.7 15.1 14.1 18.4 16.8 17.5 17.4 ...
##  $ games_played                      : num  38 39 32 35 35 33 35 32 33 31 ...
##  $ possessions_per_game              : num  74.4 66.1 71.1 71.3 65.1 66.3 68.9 67.7 66.8 70.1 ...
##  $ extra_chances_per_game            : num  4.7 2.6 4.2 0.2 2.8 3.2 0.5 3.2 -2.9 3.3 ...
##  $ effective_possession_ratio        : num  0.995 0.994 0.969 0.977 0.971 ...
##  $ win_pct_all_games                 : num  0.816 0.897 0.625 0.6 0.743 0.636 0.629 0.656 0.636 0.548 ...
##  $ ncaa_round_64                     : int  1 1 1 0 0 1 0 0 1 0 ...
##  $ weighted_rank                     : num  15 63 1 0 0 3 0 0 3 0 ...

KNN Model

set.seed(600)


ctrl <- trainControl(method="repeatedcv", repeats = 5, number = 10)

knnFit  <-   train(weighted_rank ~ .,
                   data   = stats_train,
                   method = "knn",
                   trControl = ctrl,
                   preProcess = c("center","scale", "pca"),
                   tuneLength = 20)

knnFit
## k-Nearest Neighbors 
## 
## 3124 samples
##   46 predictor
## 
## Pre-processing: centered (46), scaled (46), principal component
##  signal extraction (46) 
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 2812, 2811, 2811, 2812, 2811, 2811, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared 
##    5  7.390945  0.1878696
##    7  7.005912  0.2339936
##    9  6.773718  0.2798062
##   11  6.660547  0.3043377
##   13  6.601636  0.3186818
##   15  6.510144  0.3424952
##   17  6.507288  0.3440947
##   19  6.497940  0.3507688
##   21  6.496347  0.3507279
##   23  6.506862  0.3500910
##   25  6.506733  0.3509608
##   27  6.497975  0.3546472
##   29  6.492643  0.3573096
##   31  6.514821  0.3530110
##   33  6.515490  0.3518267
##   35  6.533359  0.3459206
##   37  6.535085  0.3462279
##   39  6.541223  0.3464458
##   41  6.550911  0.3457281
##   43  6.551048  0.3469615
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final value used for the model was k = 29.

Let’s see which variables were determined to be the most important in the KNN model:

dotPlot(varImp(knnFit), main = "KNN Model - Most Relevant Variables")

Random Forest

set.seed(600)

rfFit <-train(weighted_rank ~.,
                 data=stats_train,
                 method="rf",
                 trControl=trainControl(method="cv",number=5),
                 prox=TRUE, importance = TRUE,
                 allowParallel=TRUE)
                
rfFit
## Random Forest 
## 
## 3124 samples
##   46 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 2499, 2499, 2499, 2500, 2499 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared 
##    2    6.528543  0.3757743
##   24    6.098787  0.4619901
##   46    6.273775  0.4454502
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final value used for the model was mtry = 24.
print(rfFit$finalModel)
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry, importance = TRUE,      proximity = TRUE, allowParallel = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 24
## 
##           Mean of squared residuals: 36.51865
##                     % Var explained: 44.67
dotPlot(varImp(rfFit), main = "Random Forest Model - Most Relevant Variables")

Support Vector Machine (SVM)

set.seed(600)

SVMFit <- train(weighted_rank ~., data = stats_train, method="svmRadial", trControl = ctrl)

SVMFit
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 3124 samples
##   46 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 2812, 2811, 2811, 2812, 2811, 2811, ... 
## Resampling results across tuning parameters:
## 
##   C     RMSE      Rsquared 
##   0.25  6.893490  0.3641109
##   0.50  6.741980  0.3966425
##   1.00  6.587426  0.4223910
## 
## Tuning parameter 'sigma' was held constant at a value of 0.01659096
## RMSE was used to select the optimal model using  the smallest value.
## The final values used for the model were sigma = 0.01659096 and C = 1.
dotPlot(varImp(SVMFit), main = "SVM Model - Most Relevant Variables")

Predict

To predict the top 8 teams in the NCAA tournament using the KNN, Random Forest, and SVM models, provide the test dataset to the three models.

AllModels <- list(knn = knnFit,
                  rf  = rfFit,
                  svm = SVMFit)


AllModelPred <- predict(AllModels, newdata = stats_test)

stats_test_results <- cbind(stats_test_results, as.data.frame(AllModelPred))

Compare and store the models’ predictions against the actual winners:

# Top 8 teams based on kNN
knn_top8 <- stats_test_results %>% 
                select(team_name, knn) %>%
                arrange(desc(knn)) %>% 
                slice(1:8) %>% 
                mutate(rank = row_number())
                
# Top 8 teams based on SVM
svm_top8 <- stats_test_results %>% 
                select(team_name, svm) %>%
                arrange(desc(svm)) %>% 
                slice(1:8) %>% 
                mutate(rank = row_number())            

# Top 8 teams based on Random Forest
rf_top8 <- stats_test_results %>% 
                select(team_name, rf) %>%
                arrange(desc(rf)) %>% 
                slice(1:8) %>% 
                mutate(rank = row_number())

top8.disp <-
          knn_top8 %>% 
          inner_join(svm_top8, by="rank") %>% 
          inner_join(rf_top8, by="rank") %>%
          select(rank, team_name.x, team_name.y, team_name) %>%
          rename(Rank = rank, KNN = team_name.x, SVM = team_name.y, RF = team_name)

Offensive Prediction Results

These are the results that were obtained by the machine learning algorithms - kNN, SVM, and Random Forest.

Rank KNN SVM RF
1 North Carolina North Carolina Villanova
2 Kentucky Kentucky North Carolina
3 West Virginia Oregon Michigan St.
4 Stephen F. Austin Kansas Oklahoma
5 Gonzaga Villanova Kansas
6 Michigan St. West Virginia Virginia
7 Miami (FL) Notre Dame Oregon
8 Kansas Virginia Texas A&M

Blending the rankings from each model:

top8 <- rbindlist(list(knn_top8, svm_top8, rf_top8))

top8$wt <-  abs(top8$rank - 8 ) + 1

top8.blended <- 
                top8 %>% 
                group_by(team_name) %>% 
                summarise(points = sum(wt)) %>% 
                arrange(desc(points)) %>% 
                slice(1:8) %>% 
                mutate(rank = row_number(desc(points))) %>%
                select(rank, team_name, points)

kable(top8.blended)
rank team_name points
1 North Carolina 23
2 Kentucky 14
3 Villanova 12
4 Kansas 10
5 West Virginia 9
6 Michigan St. 9
7 Oregon 8
8 Stephen F. Austin 5

Note. The blended results identified four teams correctly in the Elite 8 round. However, it did not place any of them in rank order correctly. In comparison, the Random Forest model identified 6 teams in the Elite 8 and correctly placed three teams in rank order. The Random Forest model will be selected as the best model from an offensive perspective.

Conclusions

Prediction Results Summary

Tournament Place Actual Teams Defensive Prediction Offensive Prediction (Random Forest)
Champ Villanova Villanova Villanova
Finals North Carolina North Carolina North Carolina
Final 4 Syracuse Kansas Michigan St.
Final 4 Oklahoma Michigan St Oklahoma
Elite 8 Virginia Virgina Kansas
Elite 8 Oregon Indiana Virginia
Elite 8 Kansas SMU Oregon
Elite 8 Notre Dame Syracuse Texas A&M

Overall, the results of the models are encouraging. By design, this project separately models offensive statistics from defensive statistics. Even with this limitation, the models were able to correctly identify the championship game and the overall champion. Additionally, the defensive model predicted Virgina making it to the Elite 8. It did correctly predict Syracuse and Kansas as the top 8 teams; it just placed them in the wrong rounds. Defensive prediction results with three exactly correct and another 2 partially correct out of 8 is a good indicator.

Within the offensive models, the Random Forest model proves to be the most accurate of the three models used. This model correctly identified 6 teams among the Elite 8 with Villanova, North Carolina, and Oklahoma being placed correctly in rank order.

It could be that the teams in the top 8 that were not predicted based on the defensive model either had a huge offensive presence and/or were ranked very low on defense. These factors would keep them lower on this list and would not be predicted properly. A similar consideration would apply to the offensive models from a defensive perspective.

Combining the offensive and defensive statistics together would be the next step in the modeling process for this topic.