Predicting March Madness

Overview

This project attempts to choose the top 8 teams in the NCAA division 1 basketball tournament. This will be accomplished by utlizing machine learning packages available in R. Two different approaches will be used to predict the top 8 teams:

Using defensive teams statistics combined with an overall ranking based on strength of schedule
- The response variable will be the ranking of the teams. This ranking is calculated based off of each team’s performance throughout the entire year. The best team will have a value of 1, ranging up to a value of 351 for the worst team. The explanatory variables considered will be the defensive statistics that have been collected for each team over multiple years.
Using offensive team statistics with a ranking system based on NCAA Tournament success
- The response variable will be a ranking value based on a team’s success in the NCAA Tournament. The team winning the NCAA Tournament will receive the most points with 64, down to 1 point for participating in the Round of 64. The explanatory variables considered in this approach will be the offensive statistics that have been collected for each team over multiple years.

Season averages will be used instead of game by game statistics in this project. There is a total of 10 year’s worth of data from 2006 - 2015. The training data will based on the seasons from 2006 - 2014, while the test data will be based on the 2015 season.

Three models will be used to predict the ranks of the teams per method:

Defensive Method
- Model Tree
- Decision Tree
- KNN (K-Nearest Neighbor)
Offensive Method
- KNN (K-Nearest Neighbor)
- SVM (Support Vector Machine)
- Random Forest

The R packages required for this project are:

RCurl
dplyr
stringr
RMySQL
XML
kernlab
RWeka
C50
class
splitstackshape
htmltab
rvest
knitr
tidyr
caret

Data Sources

College Basketball Season and Division Data - GitHub

The team data for this project was collected off of Github. The Github account had all of the games played by every Division 1, 2, and 3 game for the 2002 through 2015 basketball season. The same Github repository also has the divisions for all of the teams plus game data for each season.

The GitHub repository can be found here with CSV files for:

Games by Season from 2007 - 2016
Schools By Division

Team Statistics - www.teamrankings.com

All of the statistics relating to the college basketball teams were web scraped from teamrankings.com. Team statistics for seasons from 2006 to 2015 where extracted from this site and ultimately stored in a cloud MySQL database.

Statistics across the following 14 categories were captured and combined for each team per season:

Scoring
Shooting
Rebounding
Blocks & Steals
Assists & Turnovers
Fouls
Scoring Defense
Shooting Defense
Opponent Rebounding
Opponent Block & Steals
Opponent Assists & Turnovers
Opponent Fouls
Other
Winning Percentage

TeamRankings

NCAA Tournament Results

The teams reaching the variaous brackets within the NCAA Tournament are needed for the offensive modeling approach. This data was collected from two sites.

CBS Sports

http://www.cbssports.com/collegebasketball/ncaa-tournament/history/yearbyyear

Year	URL
2007	http://www.cbssports.com/collegebasketball/ncaa-tournament/history/yearbyyear/2007
2008	http://www.cbssports.com/collegebasketball/ncaa-tournament/history/yearbyyear/2008
2009	http://www.cbssports.com/collegebasketball/ncaa-tournament/history/yearbyyear/2009
2010	http://www.cbssports.com/collegebasketball/ncaa-tournament/history/yearbyyear/2010
2011	http://www.cbssports.com/collegebasketball/ncaa-tournament/history/yearbyyear/2011
2012	http://www.cbssports.com/collegebasketball/ncaa-tournament/history/yearbyyear/2012

Wikipedia

NCAA Men’s Division Basketball Tournament

Year	URL
2013	https://en.wikipedia.org/wiki/2013_NCAA_Men%27s_Division_I_Basketball_Tournament
2014	https://en.wikipedia.org/wiki/2014_NCAA_Men%27s_Division_I_Basketball_Tournament
2015	https://en.wikipedia.org/wiki/2015_NCAA_Men%27s_Division_I_Basketball_Tournament
2016	https://en.wikipedia.org/wiki/2016_NCAA_Men%27s_Division_I_Basketball_Tournament

Data Collection and Preparation

Combine Season and Division Data

The Github data for season combines teams from a variety of divisions. It has data for Division 1 teams vs Divsion 4 teams and a lot of other combinations. Since the NCAA basketball tournament being considered is for division 1 teams, the division data and the season data had to be combined into one chunk of data. Then the entire data frame can be subset to only those games that are with two division 1 teams. Any other combination will be dropped. Some of the teams have different names from year to year (Connecticut is UCONN or Alcorn St. is Alcorn etc.) so in order for the matching to work properly, all the team names have to be the same, so some of the names in certain seasons had to be changed to match the divisonal data.

#Function: Data
#Input: 
#   div, data frame of the divisions for the teams
#   result, season data for the teams
#Output:
#   returns a data frame that has only division 1 games, all other games were dropped.
#This function will take the season data and the divisional data and combine them together. This gives the ability to pick only division 1 games.
data <- function(div, result)
{
  names(div) <- c("team_id", "division")
  jointeam <- left_join(result, div, by = "team_id")
  almost <- jointeam %>% select(year, team_name, opponent_name, opponent_id, team_score, opponent_score, division)
  colnames(almost)[colnames(almost)=="division"] <- "team_division"
  colnames(almost)[colnames(almost)=="opponent_id"] <- "team_id"
  joinopp <- left_join(almost, div, by = "team_id")
  colnames(joinopp)[colnames(joinopp)=="division"] <- "opp_division"
  final <- joinopp %>% select(year, team_name, opponent_name, team_score, opponent_score, team_division, opp_division)
  
  D1 <- subset(final, team_division == "I"&opp_division == "I")
  
  return(D1)
}

#Change the names of the teams to match between the divisions and the seasonal data. These teams are common amongst all of the years
Data <- data(div,result)
Data$team_name <- as.character(Data$team_name)
Data$team_division <- as.character(Data$team_division)
Data$opponent_name <- as.character(Data$opponent_name)
Data$opp_division <- as.character(Data$opp_division)
Data$opponent_name[Data$opponent_name == "Alcorn St."] <- "Alcorn"
Data$opponent_name[Data$opponent_name == "Connecticut"] <- "UConn"
Data$opponent_name[Data$opponent_name == "Md.-East. Shore"] <- "UMES"
Data$opponent_name[Data$opponent_name == "St. Francis (NY)"] <- "St. Francis Brooklyn"
Data$opponent_name[Data$opponent_name == "UNC Wilmington"] <- "UNCW"
Data$opponent_name[Data$opponent_name == "Cal St. Northridge"] <- "CSUN"
Data$opponent_name[Data$opponent_name == "S.C. Upstate"] <- "USC Upstate"
Data$opponent_name[Data$opponent_name == "Neb. Omaha"] <- "Omaha"
Data$opponent_name[Data$opponent_name == "CSU Bakersfield"] <- "Bakersfield"
Data$opponent_name[Data$opponent_name == "SIU Edwardsville"] <- "SIUE"
Data$opponent_name[Data$opponent_name == "Fla. Gulf Coast"] <- "FGCU"

#The stats for Utah Vallet stats were missing. I hade to create their offesnive stats to all the matching with the divisnal data.
#This only happens for the 2004 season.
if (Data$year == 2004)
{
  uvu <- subset(Data, Data$opponent_name == "Utah Valley")
  i <- 1
  for (i in 1:NROW(uvu))
  {
    temp <- uvu$opponent_name[i]
    uvu$opponent_name[i] <- uvu$team_name[i]
    uvu$team_name[i] <- temp
    
    temp <- uvu$opponent_score[i]
    uvu$opponent_score[i] <- uvu$team_score[i]
    uvu$team_score[i] <- temp
    i <- i + 1
  }
  Data <- rbind(Data, uvu)
}

#The 2016 season data had a tab character (/n) in the teams name. I had to split the list on the tab character, then extract the names of each team. Team names come in a wide variety, ranging from 1 to 4 name with a wide variety of punctuation in between. This is only for the 2016 season.

if (Data$year == 2016)
{
  names <- unlist(str_split(Data$opponent_name, '\n'))
  
  i <- 2
  j <- 1
  for (j in 1:(NROW(Data)))
  {
    Data$opponent_name[j] <- names[i]
    i <- i + 2
    j <- j + 1
  }
  Data$opponent_name <- unlist(str_extract_all(Data$opponent_name, "[[:alpha:]]{1,}[[:punct:]]{0,}[[:alpha:]]{0,}[[:space:]]{0,}[[:alpha:]]{0,}[[:punct:]]{0,}[[:alpha:]]{0,}[[:space:]]{0,}[[:alpha:]]{0,}[[:punct:]]{0,}[[:alpha:]]{0,1}[[:space:]]{0,}[[:alpha:]]{0,}[[:punct]]{0,}[[:alpha:]]{0,}[[:punct:]]{0,}"))
  Data$opponent_name[Data$opponent_name == "Stephen F. Austin"] <- "SFA"
  Data$opponent_name[Data$opponent_name == "Texas-Arlington"] <- "UT Arlington"
  Data$opponent_name[Data$opponent_name == "Citadel"] <- "The Citadel"
}

Seasonal Rankings

The response variable being used for this process is the end of the year rankings. Instead of using the rankings that are given by the coaches, this project will calculate its own based on the Massy ranking system. This sytem has been used for the official rankings of ESPN and was created by a student as a part of a graduate research project. Rankings are created on a game by game basis, so they can be updated as the season goes on. This system compares the result of each game to the difference in the scores.

col <- as.character(unique(Data$team_name))

matrix <- matrix(0, ncol = length(col), nrow = NROW(Data), byrow=TRUE)
dimnames(matrix) <- list(c(1:NROW(Data)), col)

#Function: Schedule
#Input:
#  matrix, an empty matrix to hold the information for the ranking system. It is a square matrix with the dimensions being the number of teams for that season
#  Data, a data frame that contains the seaonsal data for division 1
#OUtput:
#  matrix that contains the results for each game for each team during hte season.
#This function will create a matrix that is used for input into the ranking formula. Each column of the martix contain the team name. Each row of the matrix will contain a game. When two teams play against each other, the winner will get a 1 on the intersection and the loser will get a -1. The we would move onto the next game. This will continue for every game that was played for that season. 

schedule <- function(matrix, Data)
{
  i <- 1
  for (i in 1:NROW(Data))
  {
    if (as.numeric(Data[i,4]) > as.numeric(Data[i,5]))
    {
      matrix[i,Data[i,2]] = 1
      matrix[i,Data[i,3]] = -1
    } 
    else   
    {
      matrix[i,Data[i,2]] = - 1
      matrix[i,Data[i,3]] =  1
    }
    i <- i + 1
  }
  return(matrix)
}

Team <-schedule(matrix, Data)

#Function: scoredifferential
#Input:
#  Data, data frame containing the seasonal data
#Output:
# diff, a 1 by number of games matrix that conatins the difference in the score of each game
#This is a function that will calculate the score differential for each game that is played throughout the season. If team A beat team B 70-60. Then the entry will be 10, If team A lost to team B 60-70, then the entry would be -10.

scoredifferential <- function(Data)
{
  
  i <- 1
  diff <- matrix(0, nrow = NROW(Data), ncol = 1)
  
  for (i in 1 : NROW(Data))
  {
    
    diff[i] <- as.numeric(Data[i,4])-as.numeric(Data[i,5])
    i <- i + 1
  }
  return(diff)
}

diff <- scoredifferential(Data)
absdiff <- abs(diff)

#Transpose the matrix
tteam <- t(Team)
#Multiply the transpose by the original matrix
A <- tteam %*% Team
#Multiply the transpose matrix by the matrix taht contains the differences
B <- tteam %*% absdiff

#Make the last rowm of the socre matrix 1's and the last ebtry of the socre difference matrix a 0. If this is not done, then there is no solution due to the properties of the matrix.
A[NROW(A),] = 1
B[NROW(B),1] = 0

#Solve the linear equation to get the rankings of the teams.
rankings <- solve(A,B)

#order the teams by the rankings 
rankings <- as.data.frame(rankings)
ranks <- rankings[order(-rankings[1]), , drop = FALSE]
ranks$rank <- c(1:NROW(ranks))

Massy Ranking Algorithm

The table below shows the results of the Massy Ranking algorithm:

Y2015	2015_Rank	Y2014	2014_Rank	Y2013	2013_Rank	Y2012	2012_Rank	Y2011	2011_Rank
Kansas	23.65358	Kentucky	29.59169	Louisville	25.14265	Indiana	25.44025	Kentucky	25.41010
Michigan St.	23.19914	Wisconsin	24.85308	Arizona	23.76682	Louisville	25.31585	Ohio St.	24.63799
West Virginia	22.30338	Arizona	24.64689	Florida	21.11844	Florida	24.23370	Kansas	22.40911
North Carolina	22.11190	Duke	24.33190	Kansas	20.79188	Duke	21.93963	Michigan St.	22.05064
Villanova	21.91290	Virginia	23.07023	Wisconsin	20.11526	Michigan	21.80616	North Carolina	22.02033
Louisville	21.63333	Villanova	22.90716	Duke	20.01823	Kansas	21.31411	Syracuse	21.80872
Virginia	21.23472	Utah	21.43638	Oklahoma St.	19.54340	Gonzaga	21.05962	Missouri	20.45130
Duke	20.81610	Gonzaga	21.33082	Villanova	19.41018	Syracuse	20.80983	Wisconsin	19.73984
Oklahoma	20.58445	North Carolina	21.32353	Michigan St.	19.24373	Ohio St.	20.74490	Indiana	19.31590
Miami (FL)	20.14680	Ohio St.	20.88407	Michigan	19.09298	Pittsburgh	19.38626	Florida	18.93787
Purdue	20.04783	Oklahoma	19.95099	Iowa St.	19.08060	Wisconsin	19.27926	Wichita St.	18.09982
Indiana	19.98950	Kansas	19.71435	Virginia	18.88161	Michigan St.	19.17142	Duke	17.73333
Kentucky	19.64098	Baylor	19.35853	UCLA	18.83495	Arizona	17.79115	Marquette	17.62773
Xavier	19.56783	Iowa St.	19.20751	Creighton	18.51829	Miami (FL)	17.47762	Memphis	17.53664
SMU	19.24612	Louisville	19.06722	Kentucky	18.46476	Minnesota	17.39826	Georgetown	17.49761
Arizona	19.22254	Notre Dame	18.83728	Iowa	18.30519	VCU	17.38770	Baylor	17.22571
Vanderbilt	19.06473	Michigan St.	17.34379	Ohio St.	18.29454	Creighton	16.99766	Louisville	16.87967
Iowa	18.43319	Texas	17.25475	Pittsburgh	17.98544	Georgetown	16.34374	Vanderbilt	16.39438
Texas A&M	18.42961	Butler	17.08722	Tennessee	17.74630	Oklahoma St.	16.08799	New Mexico	16.33073
Maryland	18.00047	West Virginia	16.83346	UConn	17.67079	Missouri	15.82460	Saint Louis	15.83155

Processing Team Statistics

The 14 categories of team statistics were web scraped from www.teamrankings.com using a combination of the rvest and htmltab packages in R. In total, there are 115 NCAA college basketball statistics available on teamrankings.com per team per year. Additionaly, each statistics includes the team’s rank within the season for a given statistic.

baseURL <- "https://www.teamrankings.com/ncb/stats/"
url_content  <- read_html(baseURL)

stat.links <- url_content %>%  
              html_nodes("*") %>% 
              html_nodes(xpath = "./a") %>% 
              html_attr("href") %>%  
              str_match("/ncaa-basketball/stat/.+")   

stat.links <- stat.links[complete.cases(stat.links),]

The URL for each statistic is constructed to create 115 callable URLs which can be used with a date parameter for each season.

##  [1] "/ncaa-basketball/stat/points-per-game"         
##  [2] "/ncaa-basketball/stat/average-scoring-margin"  
##  [3] "/ncaa-basketball/stat/offensive-efficiency"    
##  [4] "/ncaa-basketball/stat/floor-percentage"        
##  [5] "/ncaa-basketball/stat/1st-half-points-per-game"
##  [6] "/ncaa-basketball/stat/2nd-half-points-per-game"
##  [7] "/ncaa-basketball/stat/overtime-points-per-game"
##  [8] "/ncaa-basketball/stat/average-1st-half-margin" 
##  [9] "/ncaa-basketball/stat/average-2nd-half-margin" 
## [10] "/ncaa-basketball/stat/average-overtime-margin"

baseStatURL <- "https://www.teamrankings.com"

season_year_end <- 2016

allStatsURL <- sprintf("%s%s%s", baseStatURL, stat.links, sprintf("?date=%s-04-06", season_year_end))

An example of the URLs constructured for the team statistics for the 2015 season is shown below:

Team Stats URLs

https://www.teamrankings.com/ncaa-basketball/stat/points-per-game?date=2016-04-06
https://www.teamrankings.com/ncaa-basketball/stat/average-scoring-margin?date=2016-04-06
https://www.teamrankings.com/ncaa-basketball/stat/offensive-efficiency?date=2016-04-06
https://www.teamrankings.com/ncaa-basketball/stat/floor-percentage?date=2016-04-06
https://www.teamrankings.com/ncaa-basketball/stat/1st-half-points-per-game?date=2016-04-06 https://www.teamrankings.com/ncaa-basketball/stat/2nd-half-points-per-game?date=2016-04-06 https://www.teamrankings.com/ncaa-basketball/stat/overtime-points-per-game?date=2016-04-06 https://www.teamrankings.com/ncaa-basketball/stat/average-1st-half-margin?date=2016-04-06
https://www.teamrankings.com/ncaa-basketball/stat/average-2nd-half-margin?date=2016-04-06
https://www.teamrankings.com/ncaa-basketball/stat/average-overtime-margin?date=2016-04-06

## -----------------------------------------------------------
## Function: get_team_stats
##
## -----------------------------------------------------------
  
get_team_stats <- function(statURL, year, createCSV = TRUE) {

        for (i in 1:length(statURL)) {     
            
            # Keep the rank, team, and stat for the given year
            statHTMLtab <- htmltab(doc = statURL[i], which = 1)[, 1:3]
            
            # parse the current stat name from the URL string
            statname <- str_replace_all(str_extract(statURL[i], "(?<=\\/stat\\/).*?.*?(?=\\?)"), "-", "_")
            
            # convert double-underscores to a single underscore
            statname <- str_replace_all(statname, "__", "_") 
            
            stats.df <- statHTMLtab %>% gather(key = Year, value = "stat", -Team, -Rank) 
            
            # convert the year from a character to integer
            stats.df$Year <- as.integer(stats.df$Year)

            # replace % signs in the statistic value
            stats.df$stat <- as.numeric(gsub("%|--", "", stats.df$stat))
            stats.df$Rank <- as.integer(stats.df$Rank)
        
            # rename the rank column 
            names(stats.df)[names(stats.df) == "Rank"] <- sprintf("%s_%s", statname, "rank")
            
            # rename the stat column to the stat being extracted in the URL
            names(stats.df)[names(stats.df) == "stat"] <- statname
            
            # reorder the columns
            stats.df <- stats.df[, c(3, 2, 4, 1)]
            
            
            if (i == 1) {return.df <- stats.df}
            else {
                
                return.df <- return.df %>% left_join(stats.df, by = c("Year" = "Year", "Team" = "Team"))
            }
            
        }
    
        if (createCSV) { 
            
            # write the team stats to a csv file
            write.csv(return.df, sprintf("team_stats_%s.csv", year - 1), row.names = FALSE)
        }
     
       # return the stats dataframe
       return (return.df)
    
}    

## -----------------------------------------------------------
## Function: load_RMySQL_DB
##
## -----------------------------------------------------------

load_RMySQL_DB <- function(tablename, df, overwrite = TRUE, append = FALSE) {
    
    # establish the connection to the skill DB on db4free.net
    bbmetricDB = dbConnect(MySQL(), user=proj_user, password=proj_pwd, dbname=proj_db, host=proj_host)
    
    dbWriteTable(bbmetricDB, 
                 name = tablename, 
                 value = df,
                 overwrite = overwrite, 
                 append = append, 
                 row.names = FALSE)
    
    dbDisconnect(bbmetricDB)
}

Using the function get_team_stats, teams statistics for the NCAA basketball seasons (starting) from 2006 - 2015

team.stats.2016 <- get_team_stats(allStatsURL, season_year_end, createCSV = FALSE)

The team statistics for the 2015 - 2016 season (partial listing):

team_name	source_team_name	season_start_year	season_end_year	assist_per_turnover_ratio	assist_per_turnover_ratio_rank	assists_per_fgm	assists_per_fgm_rank	assists_per_game	assists_per_game_rank
North Carolina	N Carolina	2015	2016	1.647	3	0.573	58	17.7	4
Michigan St.	Michigan St	2015	2016	1.737	1	0.713	1	20.5	1
Mississippi St.	Miss State	2015	2016	1.099	137	0.507	210	13.9	122
West Virginia	W Virginia	2015	2016	1.024	179	0.541	130	14.6	78
Virginia Tech	VA Tech	2015	2016	0.955	231	0.486	253	12.2	236
Florida St.	Florida St	2015	2016	1.041	170	0.478	274	13.3	160
Ole Miss	Mississippi	2015	2016	1.067	155	0.518	183	12.9	188
North Carolina St.	NC State	2015	2016	1.052	163	0.442	332	11.6	278
Saint Joseph’s	St Josephs	2015	2016	1.487	11	0.547	114	14.8	67
Fresno St.	Fresno St	2015	2016	1.210	84	0.494	240	12.9	185
George Washington	Geo Wshgtn	2015	2016	1.256	62	0.546	115	14.1	109
Colorado St.	Colorado St	2015	2016	1.126	119	0.513	192	13.1	173
South Carolina	S Carolina	2015	2016	0.996	201	0.520	172	13.5	149
UTEP	TX El Paso	2015	2016	1.237	70	0.573	59	15.5	41
Boise St.	Boise State	2015	2016	1.130	114	0.508	209	13.1	175
Iowa St.	Iowa State	2015	2016	1.446	21	0.531	149	16.5	16
Utah Valley	Utah Val St	2015	2016	1.074	149	0.557	90	14.0	112
Western Ky.	W Kentucky	2015	2016	1.000	196	0.496	234	13.2	169
San Francisco	San Fransco	2015	2016	0.960	229	0.462	303	12.3	229
SMU	S Methodist	2015	2016	1.427	24	0.615	23	17.6	5

This process of assembling team statistics was repeated to collect 10 years of statistics. Each year’s combined team statistics were loaded into a cloud MySQL database, described below.

Collecting NCAA Tournament Results

NCAA Tournament results from collected from 2007 - 2016. For each year, the teams making the Round of 64, Round of 32, Sweet 16, Elite 8, Final 4, Championship Game, and Champion were assembled and stored.

The process to collect the NCAA Tournament bracket teams was somewhat manual since the winners of each round are most commonly represented in a bracket graphic versus as a list of teams.

The site http://www.cbssports.com/collegebasketball/ncaa-tournament/history/yearbyyear provided the winners of each round up to 2012 but stopped beyond this tournament year.

NCAA Bracket Winners

Each round of winners was parsed using the function below, which uses the splitstatshape R package to create a dataframe of the teams in each bracket.

suppressWarnings(suppressMessages(library(splitstackshape)))

parse_bracket_results <- function(team_string) {

        str(team_string)
        
        team_string.new <- 
          strsplit(
            str_replace_all(s, "No. |[0-9]|(OT)|OT|\\(|\\)", "") , "\n")
        
        f <- as.data.frame(team_string.new); 
        
        f <- as.data.frame(str_c(str_replace(as.character(f[, 1]), "(.),(.)", "|"), "|"))
        colnames(f) <- "team"             
                    
        
        f <- as.data.frame(cSplit(f, "team", "|", direction = "long"))
        f$team <- as.character(f$team)
        
        f <- f %>% 
              select(team) %>% 
              filter(team != "") %>% 
              mutate(team = str_replace_all(team, " ,$| $", ""))

        return (f)

}

Sample of the NCAA bracket winners for 2016:

tournament_year	round_64	round_32	sweet_16	elite_8	final_4	championship_game	champion
2016	Kansas	Kansas	Kansas	Kansas	Villanova	Villanova	Villanova
2016	Villanova	Connecticut	Maryland	Villanova	Oklahoma	North Carolina
2016	Miami	Maryland	Miami (FL)	Oregon	North Carolina
2016	California	Hawaii	Villanova	Oklahoma	Syracuse
2016	Maryland	Wichita State	Oregon	North Carolina
2016	Arizona	Miami (FL)	Duke	Notre Dame
2016	Iowa	Iowa	Texas A&M	Virginia
2016	Colorado	Villanova	Oklahoma	Syracuse
2016	Connecticut	Oregon	North Carolina
2016	Temple	Saint Joseph’s	Indiana
2016	Wichita State	Yale	Notre Dame
2016	South Dakota State	Duke	Wisconsin
2016	Hawaii	Northern Iowa	Virginia
2016	Buffalo	Texas A&M	Iowa State
2016	UNC Asheville	VCU	Gonzaga
2016	Austin Peay	Oklahoma	Syracuse
2016	Oregon	North Carolina
2016	Oklahoma	Providence
2016	Texas A&M	Indiana
2016	Duke	Kentucky
2016	Baylor	Notre Dame
2016	Texas	Stephen F. Austin
2016	Oregon State	Wisconsin
2016	Saint Joseph’s	Xavier
2016	Cincinnati	Virginia
2016	VCU	Butler
2016	Northern Iowa	Arkansas-Little Rock
2016	Yale	Iowa State
2016	UNC Wilmington	Gonzaga
2016	Green Bay	Utah
2016	Cal State Bakersfield	Syracuse
2016	Holy Cross	Middle Tennessee
2016	North Carolina
2016	Xavier
2016	West Virginia
2016	Kentucky
2016	Indiana
2016	Notre Dame
2016	Wisconsin
2016	USC
2016	Providence
2016	Pittsburgh
2016	Michigan
2016	Chattanooga
2016	Stony Brook
2016	Stephen F. Austin
2016	Weber State
2016	Florida Gulf Coast
2016	Virginia
2016	Michigan State
2016	Utah
2016	Iowa State
2016	Purdue
2016	Seton Hall
2016	Dayton
2016	Texas Tech
2016	Butler
2016	Syracuse
2016	Gonzaga
2016	Arkansas-Little Rock
2016	Iona
2016	Fresno State
2016	Middle Tennessee
2016	Hampton

The NCAA bracket winners for the 2013, 2014, 2015, and 2016 tournaments were manually extracted from Wikipedia:

https://en.wikipedia.org/wiki/2013_NCAA_Men%27s_Division_I_Basketball_Tournament

The bracket results for ten seasons can be found here in CSV format. These files were subsequently loaded into a MySQL database using the load_RMySQL_DB function

MySQL Database

To facilitate centralized data storage and access for this project, a MySQL database on dbfree.net was chosen as a cloud repository for the team statistics and bracket results.

MySQL DB

The team statistics assembled for each year were staged in the MySQL database and combined into a single, consolidated team statistics table. The SQL for the table DDL and the dynamically executed insert statement can be found here.

The code to insert the team statistics by year is shown below:

## Insert Team Stats By Year 

suppressWarnings(suppressMessages(library(readr)))

url <- "https://raw.githubusercontent.com/kfolsom98/DATA607/master/Final-Project/sql/insert_team_stats_year.sql"

ins <- read_file(url)

stat.tables  <-    dbGetQuery(bbmetricDB, "SELECT DISTINCT TABLE_NAME 
                                             FROM INFORMATION_SCHEMA.COLUMNS 
                                            WHERE TABLE_SCHEMA = 'bbmetric' 
                                              AND TABLE_NAME like 'team_stats_%'")

dbGetQuery(bbmetricDB, "TRUNCATE TABLE team_stats")

for (i in 1:nrow(stat.tables)) {
    
    insert <- str_replace(ins, "<<!table_name>>", stat.tables[i, ])
    
    print(paste0("Preparing insert for ", stat.tables[i, ]))
          
    res <- dbSendQuery(bbmetricDB, insert)

}


dbSendQuery(bbmetricDB, "select count(*) from team_stats")

The NCAA bracket result data was combined locally and uploaded to the MySQL database with this process:

## Load the ncaa bracket data ##
 
 suppressWarnings(suppressMessages(library(data.table)))

 df <- data.frame()
 
 files <- list.files(pattern="ncaa*.*.csv")
 
 for (f in files)  df<-rbind(Dataf, fread(f))
 
 ## push the brackets per year to the DB
 
 dbSendQuery(bbmetricDB, "TRUNCATE TABLE ncaa_brackets")
 
 load_RMySQL_DB("ncaa_brackets", df, overwrite = FALSE, append = TRUE)
 
 #close the connection
 dbDisconnect(bbmetricDB)

One significant challenge encountered while processing the team statistics and bracket winners was that the same team could be entered with multiple variations of a name. To standardize the team names, the source team name was mapped to a consistent team name while being inserted into the team_stats combined table.

To illustrate the approach, the following example shows the mappings for the team St. Mary’s (CA) and all the variations seen in the source team stats data.

old_name	new_name
St Marys	St. Mary’s (CA)
St. Mary’s	St. Mary’s (CA)
Saint Mary’s, Calif.	St. Mary’s (CA)
Saint Mary’s Calif.	St. Mary’s (CA)
St. Mary’s (Calif.)	St. Mary’s (CA)
Saint Mary’s	St. Mary’s (CA)

The final step in preparing the combined team statistics table is to apply the NCAA bracket winners for each year. A team reaching a specific level in the bracket (e.g. Sweet 16 or Final 4) for a specific year will be given a value of 1, as an indicator variable. Otherwise, the team will have a value of 0 for each bracket level.

 ## UPDATE - Reset Bracket Values  
 
 dbGetQuery(bbmetricDB, " update team_stats ts
                             set ts.ncaa_round_64 = 0,
                                 ts.ncaa_round_32 = 0,
                                 ts.ncaa_sweet_16 = 0,
                                 ts.ncaa_elite_8  = 0, 
                                 ts.ncaa_final_4  = 0, 
                                 ts.ncaa_championship_game  = 0, 
                                 ts.ncaa_champion = 0
            ") 
 
## UPDATE  - Set the bracket column value to 1 for each team in the given bracket for the given year  
 
update_cols <- c("round_64","round_32", "sweet_16", 
                 "elite_8", "final_4",
                 "championship_game", 
                 "champion")

 
for (i in 1:length(update_cols)) {
    
   update <- 
    sprintf(" update team_stats ts
               inner join 
                     (
                      select nb.tournament_year, 
                             ifnull(map.new_name, nb.%s) team_name
                        from ncaa_brackets nb  left outer join team_name_map map on nb.%s = map.old_name) br 
                  on ts.season_end_year = br.tournament_year and ts.team_name = br.team_name
                 set ts.ncaa_%s = 1", update_cols[i], update_cols[i], update_cols[i])
 
    dbGetQuery(bbmetricDB, update)
                      
}

Extracting Defensive Stats - MySQL Database

#Get the column names for the query
column_names <- read.csv("https://raw.githubusercontent.com/DanielBrooks39/IS607/master/Final_Project/Column_names.csv", sep=",", header=FALSE, stringsAsFactors = FALSE)
years <- c("2015", "2014", "2013", "2012", "2011", "2010", "2009", "2008", "2007", "2006")

#Function: getdata
#Input:
#  years, the years for the database table
#  column_names, the names of the columns that are in the database
#  username, the username to connect tot the database
#  password, the password for teh database
#  database, the database name to connect to the database
#  hotname, the hostname where the database resides
#Output:
#  Writes a csv to the working directoy for every table that is specified by the query.
#This function will create a query for each of the years that we wabt in the database. It will connect to the database and poull back the desired columns for each year that we want. It will take each year that is qurries and writes it to a csv on the working directory.
getdata <- function(years, column_names, username, password, database, hostname)
{
  con <- dbConnect(MySQL(), user=username, password = password, dbname = database, host = hostname)

  base_table <- "team_stats_"
  
  i <- 1
  j <- 1
  
  for (i in 1 : NROW(years))
  {
    switch(years[i],
           "2015" = data <- data.frame(double(351)), 
           "2014" = data <- data.frame(double(351)),
           "2013" = data <- data.frame(double(351)),
           "2012" = data <- data.frame(double(347)),
           "2011" = data <- data.frame(double(345)),
           "2010" = data <- data.frame(double(345)),
           "2009" = data <- data.frame(double(345)),
           "2008" = data <- data.frame(double(341)),
           "2007" = data <- data.frame(double(341)),
           "2006" = data <- data.frame(double(336))
          )
    for (j in 1 : NROW(column_names))
    {
      query <- dbGetQuery(con, paste("select", " ", column_names$V1[j], " ", "from", " ", base_table, years[i], sep=""))
      data[,j] <- query
      j <- j + 1 
    }
    write.csv(data, paste("data", years[i], ".csv", sep = ""), append = TRUE, col.names = TRUE)
    i <- i + 1
  }
  dbDisconnect(con)
}

getdata(years, column_names, username, password, database, hostname)

Manipulate Query Data

The database data is read in from the CSV’s that were created earlier. The CSV’s are read in one at a time and stored in their separate data frames. There is also a mismatch between the database team names and the names that are used in the rankings. The teams names need to be the same in order to join the two tables together and and get everything in the same data frame.

data2014 <- read.csv("data2014.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)
colnames(data2014)[2] <- "V1"
data2015 <- read.csv("data2015.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)
colnames(data2015)[2] <- "V1"

Modeling

Predictions Based on Defensive Statistics

The following section will cover the processes used to model the defensive statistics to predict the top 8 teams. The models used will be Model Tree, Decision Tree, and KNN. The following lists the defensive variables used.

Defensive Variables
Variable	Variable	Variable
Team_name	opponent_two_point_rate	opponent_turnover_pct
opponent_points_per_game	opponent_fta_per_fga	opponent_personal_fouls_per_game
opponent_average_scoring_margin	opponent_ftm_per_100_possessions	opponent_personal_fouls_per_possession
defensive_efficiency	opponent_free_throw_rate	opponent_personal_foul_pct
opponent_floor_percentage	opponent_non_blocked_2_pt_pct	opponent_win_pct_close_games
opponent_1st_half_points_per_game	opponent_offensive_rebounds_per_game	opponent_win_pct_all_games
opponent_2nd_half_points_per_game	opponent_defensive_rebounds_per_game	opponent_effective_possession_ratio
opponent_overtime_points_per_game	opponent_offensive_rebounding_pct	games_played
opponent_points_from_2_pointers	opponent_defensive_rebounding_pct
opponent_points_from_3_pointers	opponent_blocks_per_game
opponent_percent_of_points_from_2_pointers	opponent_steals_per_game
opponent_percent_of_points_from_3_pointers	opponent_block_pct
opponent_percent_of_points_from_free_throws	opponent_steals_perpossession
opponent_shooting_pct	opponent_steal_pct
opponent_effective_field_goal_pct	opponent_assists_per_game
opponent_three_point_pct	opponent_turnovers_per_game
opponent_two_point_pct	opponent_assist_per_turnover_ratio
opponent_free_throw_pct	opponent_assists_per_fgm
opponent_true_shooting_percentage	opponent_assists_per_possession
opponent_three_point_rate	opponent_turnovers_per_possession

Training and Test Data

For the machine learning algorithms to work, it needs to be trained before it can predict anything. For this data, the training data will be the year of 2006-2014. The test data will be the most current year(2015). The training data is used to give the model a baseline and allow it to make groups or trees (depedning on the model type). Then the model can predict the rankings or the class of the test data.

traindata <- rbind(table2006, table2007, table2008, table2009, table2010, table2011, table2012,table2013, table2014)
traindata <- na.omit(traindata)
traindata <- traindata[,-1]
traindata$rank <- factor(traindata$rank)

testdata <- rbind(table2015)
testdata <- na.omit(testdata)
testnames <- testdata[1]
testdata <- testdata[,-1]
testdata$rank <- factor(testdata$rank)

Model Tree

The first model that will be used to calculate the top eight teams in division one basketball woud be the model tree. This is a more robust decision tree. Basically, it takes the training dataset and comes up with a set of rules. Each of those rules are a node on the tree. The difference between a model tree and a decision tree is, a model tree gives an equation for each node of a tree. Each rull that is decided on by the model will give a specific formula. That formula is then used to forecast the categories of the new data, based on what nodes the new data hits. For the case of this model, there was only one rule that was decided by the model. The equation is:

\[ rank = -14.831 * opponent_points_per_game + 13.6137 * opponent_average_scoring_margin + 852.236 * defensive_efficiency - 3.8714 * opponent_floor_percentage + 4.5235 * opponent_1st_half_points_per_game + 5.0138 * opponent_2nd_half_points_per_game + 7.6484 * opponent_overtime_points_per_game + 10.7681 * opponent_points_from_2_pointers + 10.0136 * opponent_points_from_3_pointers + 8.4694 * opponent_percent_of_points_from_2_pointers + 22.4938 * opponent_percent_of_points_from_3_pointers + 10.4896 * opponent_percent_of_points_from_free_throws - 33.1115 * opponent_shooting_pct - 3.455 * opponent_effective_field_goal_pct - 3.8401 * opponent_three_point_pct + 23.4914 * opponent_two_point_pct + 3.9098 * opponent_true_shooting_percentage - 16.814 * opponent_three_point_rate + 814.8023 * opponent_fta_per_fga - 10.9082 * opponent_ftm_per_100_possessions - 5.4238 * opponent_non_blocked_2_pt_pct - 8.2394 * opponent_offensive_rebounds_per_game + 3.6402 * opponent_offensive_rebounding_pct - 37.5087 * opponent_blocks_per_game + 16.5483 * opponent_steals_per_game + 16.4092 * opponent_block_pct - 28.4589 * opponent_steals_perpossession + 19.2984 * opponent_steal_pct + 5.6475 * opponent_assists_per_game - 9.1663 * opponent_turnovers_per_game + 243.8074 * opponent_assists_per_fgm - 1320.9676 * opponent_assists_per_possession + 52.0246 * opponent_turnovers_per_possession - 62.0717 * opponent_turnover_pct + 17.4931 * opponent_personal_fouls_per_game - 2.5374 * opponent_personal_fouls_per_possession - 12.8002 * opponent_personal_foul_pct - 98.1536 * opponent_win_pct_all_games - 1181.6373 * opponent_effective_possession_ratio - 8.2799 * games_played + 994.0719 \]

This can be used to get the new order of the teams for the new data.

====Summary====	Value
Correlation coefficient:	0.891
Mean absolute error:	35.658
Root mean squared error:	44.7633
Relative absolute error:	41.7723 %
Root relative squared error:	45.4043 %
Total Number of Instances:	2952

The summary data is showing that the correlation between this equation and the rank is rather high. Implying that this is a good equation to use for ranking of the teams.

suppressWarnings(suppressMessages(library(RWeka)))

traindata$rank <- as.numeric(traindata$rank)
testdata$rank <- as.numeric(testdata$rank)

model_M5p <- M5P(rank~., data = traindata)
pred <- predict(model_M5p, testdata[,-1])
pred <- as.data.frame(pred)
pred$team <- testnames
sort <- pred[with(pred, order(pred)), ]

Decision Tree

A decision tree follows the same idea as a model tree. The data is broken apart into “rules”. Each rule is a node of a tree. Depending on which side of the tree the data falls upon, determines what category the data will end up in. This kind of tree doesn’t have any equations associated with it, so it will be used for classification and not used to atualy predict the value of the team.

Classification Tree	Value
Number of Samples:	2952
Number of Predictors	47
Tree Size:	1213

We can see that there was 2952 samples that were used to make the tree. There was 47 Predictors that were used in the model and the tree has 1213 nodes in it.

We can also see the break down of what stats the tree uses. The more important stats get used the most in the model. Here are some of the stats used and the amount they were used in the model:

Attribute	% Usage
opponent_average_scoring_margin	100%
games_played	92.65%
opponent_points_per_game	36.79%
opponent_overtime_points_per_game	36.52%
opponent_free_throw_pct	33.16%
opponent_defensive_rebounding_pct	28.22%
opponent_blocks_per_game	26.56%
opponent_floor_percentage	25.68%
opponent_turnover_pct	23.78%
opponent_assists_per_possession	21.54%
opponent_win_pct_all_games	21.00%

suppressWarnings(suppressMessages(library(C50)))

trainmodel <- C5.0(traindata[,-1], traindata[,1, trials = 3])
testmodel <- predict(trainmodel, testdata[,-1], type="class")
rank <- as.data.frame(testmodel)

rank$teams <- testnames
names(rank) <- c("ranks", "teams")
sort_teams <- rank[with(rank, order(ranks)), ]

KNN (K-Nearest Neighbor) Model

The KNN model is the third model that is being used in this project. This model takes the number of categories, in our case, the rank, and separates the data into groups. In our case there are around 351 groups. The data is then placed into their groups and their stats act like a coordinate (in two dimensions x and y). The new data is then brought in and the euclidean distance is calculated between the new data and all the points that were in the training set. The closest point to the new point is how the category gets determined.

suppressWarnings(suppressMessages(library(class)))

model <- knn(traindata[,-1], testdata[,-1], traindata[,1], k = 3 )
rank <- as.data.frame(model)
rank$team <- testnames
names(rank) <- c("ranks", "teams")
sort_teams_knn <- rank[with(rank, order(ranks)), ]

Defensive Prediction Results

These are the results that were obtained by the machine learning algorithms. It shows the break down of the top ten teams by the model that predicted them. It then shows the ranking of the teams with the three models combined. This is calculated by the 1st place team getting 10 pts and the last place team getting 1. This is repeated for all of the models. The points are totaled up and then the teams are ranked by the total points. The last part comapres the predictied results with the actual results. It comapares where the top 8 teams finished in the actual tournament to where the models predicted they would finish.

Rank	KNN	Model Tree	Decision Tree
1	Villanova	Michigan St.	Villanova
2	Kansas	Villanova	North Carolina
3	SMU	North Carolina	Indiana
4	Syracuse	Purdue	Kentucky
5	Miami (FL.)	Kansas	Virgina
6	Gonzaga	Wichita St.	Michigan St.
7	Arizona	Virgina	Louisville
8	Oklahoma	Indiana	UCONN
9	BYU	Gonzaga	Kansas
10	Yale	St. Mary’s (CA)	Xavier

Rank	Teams	Points
1	Villanova	29
2	North Carolina	17
3	Kansas	17
4	Michigan St.	15
5	Indiana	11
6	Virgina	10
7	SMU	8
8	Syracuse	7

Tournament Place	Actual Teams	Predicted
Champ	Villanova	Villanova
Finals	North Carolina	North Carolina
Final 4	Syracuse	Kansas
Final 4	Oklahoma	Michigan St
Elite 8	Virgina	Virgina
Elite 8	Oregon	Indiana
Elite 8	Kansas	SMU
Elite 8	Notre Dame	Syracuse

Predictions Based on Offensive Statistics

The following section will cover the process of modeling the offensive statistics to predict the top 8 teams. The models used will be KNN, SVM, and Random Forest. As a precursor to modeling, the offensive statistics are retrieved from the cloud MySQL database.

bbmetricDB <- DBconnect()

stats  <- dbGetQuery(bbmetricDB, 
 "                              
  select  
         team_name,                season_start_year,
         points_per_game,  average_scoring_margin,   offensive_efficiency,
         floor_percentage, overtime_points_per_game, average_1st_half_margin,
         average_2nd_half_margin, average_overtime_margin, points_from_2_pointers,
         points_from_3_pointers, percent_of_points_from_2_pointers, 
         percent_of_points_from_3_pointers,
         percent_of_points_from_free_throws, shooting_pct, effective_field_goal_pct,
         three_point_pct,  two_point_pct,  free_throw_pct,
         true_shooting_percentage, field_goals_made_per_game, field_goals_attempted_per_game,
         three_pointers_made_per_game, three_pointers_attempted_per_game,
         free_throws_made_per_game, free_throws_attempted_per_game, three_point_rate,
         two_point_rate, fta_per_fga, ftm_per_100_possessions,  free_throw_rate,
         non_blocked_2_pt_pct, offensive_rebounds_per_game, offensive_rebounding_pct,
         assists_per_game, turnovers_per_game,  turnovers_per_possession,
         assist_per_turnover_ratio, assists_per_fgm, assists_per_possession,
         turnover_pct,games_played, possessions_per_game,
         extra_chances_per_game, effective_possession_ratio, win_pct_all_games,
         ncaa_round_64,   ncaa_round_32, ncaa_sweet_16,
         ncaa_elite_8,   ncaa_final_4,   ncaa_championship_game,
         ncaa_champion,
         ncaa_round_64 +
         ncaa_round_32 * 2 +
         ncaa_sweet_16 * 4 +
         ncaa_elite_8 * 8 + 
         ncaa_final_4 * 16 +
         ncaa_championship_game * 32 +
         ncaa_champion  * 64 as weighted_rank
   from team_stats")

The following offensive statistics will be used to to predict the top 8 teams of the 2016 NCAA tournament. Not all statistics will be important but are included from comprehesiveness. The model will determine the importance of each variable during the training process.

Offensive Variables
Variable	Variable	Variable
team_name	true_shooting_percentage	assists_per_possession
season_start_year	field_goals_made_per_game	turnover_pct
points_per_game	field_goals_attempted_per_game	games_played
average_scoring_margin	three_pointers_made_per_game	possessions_per_game
offensive_efficiency	three_pointers_attempted_per_game	extra_chances_per_game
floor_percentage	free_throws_made_per_game	effective_possession_ratio
overtime_points_per_game	free_throws_attempted_per_game	win_pct_all_games
average_1st_half_margin	three_point_rate	ncaa_round_64
average_2nd_half_margin	two_point_rate	ncaa_round_32
average_overtime_margin	fta_per_fga	ncaa_sweet_16
points_from_2_pointers	ftm_per_100_possessions	ncaa_elite_8
points_from_3_pointers	free_throw_rate	ncaa_final_4
percent_of_points_from_2_pointers	non_blocked_2_pt_pct	ncaa_championship_game
percent_of_points_from_3_pointers	offensive_rebounds_per_game	ncaa_champion
percent_of_points_from_free_throws	offensive_rebounding_pct	weighted_rank
shooting_pct	assists_per_game
effective_field_goal_pct	turnovers_per_game
three_point_pct	turnovers_per_possession
two_point_pct	assist_per_turnover_ratio
free_throw_pct	assists_per_fgm

The predictor variable will be weighted_rank, which is calculated based on a team’s advancement in the NCAA tournament.

Bracket	Points Assigned
Round of 64	1
Round of 32	2
Sweet 16	4
Elite 8	8
Final 4	16
Championship Game	32
Champion	64

Training and Test Data

Create a offensive training dataset using the team statistics for years prior to 2015.

# create a training dataset based on all years prior to 2015
stats_train <- subset(stats, season_start_year < 2015) # take all years prior to 2015

# store the team, year, ncaa_tournament indicators, and the weigted bracket rank
stats_train_results <- select(stats_train, team_name, season_start_year, ncaa_round_64:weighted_rank)

Create the test dataset using the team statistics for the 2015 NCAA basketball season.

# create a training dataset based on all years prior to 2015
stats_test <- filter(stats, season_start_year == 2015, ncaa_round_64 == 1) 


# store the team, year, ncaa_tournament indicators, and the weigted bracket rank
stats_test_results <- select(stats_test, team_name, season_start_year, ncaa_round_64:weighted_rank)

Finally, remove the columns from the training dataset which will not be used in models

stats_train <- select(stats_train, -(team_name:season_start_year), -(ncaa_round_32:ncaa_champion))

str(stats_train)

## 'data.frame':    3124 obs. of  47 variables:
##  $ points_per_game                   : num  85.7 74.5 78.6 76.9 72.1 70.8 74.5 73.5 74.3 70.4 ...
##  $ average_scoring_margin            : num  17.1 12.5 9.2 8.4 9 6.1 5.5 8 4.5 7.8 ...
##  $ offensive_efficiency              : num  1.15 1.13 1.1 1.08 1.11 ...
##  $ floor_percentage                  : num  56.4 53.3 53.3 51.2 49.5 52.7 51 51.5 53.9 50.4 ...
##  $ overtime_points_per_game          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ average_1st_half_margin           : num  8.3 7.3 4.9 4.9 7.3 1.7 4 3.8 3 6.3 ...
##  $ average_2nd_half_margin           : num  9.1 5 4.3 3.4 2.4 4.2 1.3 4.2 1.6 1.2 ...
##  $ average_overtime_margin           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ points_from_2_pointers            : num  51.2 38.1 45.9 38.6 30.3 42.1 39.9 35.9 39.2 39.5 ...
##  $ points_from_3_pointers            : num  17 22.1 18.8 23.6 30.6 14.2 20.3 22.1 17.8 14.2 ...
##  $ percent_of_points_from_2_pointers : num  59.7 51.1 58.4 50.2 42 59.5 53.6 48.9 52.7 56.2 ...
##  $ percent_of_points_from_3_pointers : num  19.8 29.6 24 30.6 42.4 20 27.3 30.1 24 20.2 ...
##  $ percent_of_points_from_free_throws: num  20.5 19.3 17.7 19.1 15.5 20.3 19.1 21 23.4 23.6 ...
##  $ shooting_pct                      : num  49.9 47.2 49 45.9 45.5 46.7 48.1 46.8 47.8 42.2 ...
##  $ effective_field_goal_pct          : num  54.4 53.8 54.2 52.6 54.7 50.9 54.2 53.6 53.3 46.3 ...
##  $ three_point_pct                   : num  35.8 36.1 37.2 35.5 37.4 36.4 38 39.2 34.9 32.1 ...
##  $ two_point_pct                     : num  54.6 53.6 53.6 52.2 53.4 49.8 52.8 50.9 53.8 45.7 ...
##  $ free_throw_pct                    : num  71.2 70.3 68.1 66.5 71.3 66.2 75.8 69.8 72.4 62.8 ...
##  $ true_shooting_percentage          : num  115 114 113 110 114 ...
##  $ field_goals_made_per_game         : num  31.2 26.4 29.2 27.2 25.3 25.8 26.7 25.3 25.5 24.5 ...
##  $ field_goals_attempted_per_game    : num  62.6 55.8 59.7 59.2 55.7 55.3 55.6 54.1 53.4 58 ...
##  $ three_pointers_made_per_game      : num  5.7 7.4 6.3 7.9 10.2 4.7 6.8 7.4 5.9 4.7 ...
##  $ three_pointers_attempted_per_game : num  15.8 20.4 16.9 22.1 27.3 13 17.8 18.8 17 14.8 ...
##  $ free_throws_made_per_game         : num  17.6 14.4 13.9 14.7 11.2 14.4 14.2 15.4 17.4 16.6 ...
##  $ free_throws_attempted_per_game    : num  24.7 20.4 20.4 22.1 15.7 21.8 18.8 22.1 24 26.4 ...
##  $ three_point_rate                  : num  25.2 36.5 28.3 37.4 49 23.5 32 34.8 31.9 25.5 ...
##  $ two_point_rate                    : num  74.8 63.5 71.7 62.6 51 76.5 68 65.2 68.1 74.5 ...
##  $ fta_per_fga                       : num  0.395 0.366 0.342 0.374 0.282 0.394 0.337 0.408 0.449 0.455 ...
##  $ ftm_per_100_possessions           : num  23.7 21.7 19.5 20.6 17.2 ...
##  $ free_throw_rate                   : num  28.2 26.5 24.1 26.4 21.1 28.4 23.7 28.2 30.5 30.8 ...
##  $ non_blocked_2_pt_pct              : num  59.8 57.7 57.7 58.6 58.3 53.2 57.9 57.6 59.5 50.6 ...
##  $ offensive_rebounds_per_game       : num  13.4 10.9 13.5 12.4 9.4 10.3 10.3 10.4 11.7 15.5 ...
##  $ offensive_rebounding_pct          : num  39.2 34.1 41.1 35.2 29.4 32.2 34.7 32.6 38 41.5 ...
##  $ assists_per_game                  : num  18.3 14.9 15.3 14.6 17.4 12.9 12.4 16.2 14.6 13 ...
##  $ turnovers_per_game                : num  13.8 11.3 15.7 14 11.3 10.8 14.6 13.2 13.7 14.9 ...
##  $ turnovers_per_possession          : num  18.5 17.1 22.1 19.6 17.3 16.3 21.2 19.4 20.5 21.3 ...
##  $ assist_per_turnover_ratio         : num  1.328 1.325 0.978 1.045 1.546 ...
##  $ assists_per_fgm                   : num  0.586 0.567 0.525 0.538 0.687 0.499 0.463 0.639 0.574 0.529 ...
##  $ assists_per_possession            : num  0.246 0.226 0.216 0.205 0.267 0.194 0.18 0.239 0.219 0.185 ...
##  $ turnover_pct                      : num  15.7 14.7 18.5 16.7 15.1 14.1 18.4 16.8 17.5 17.4 ...
##  $ games_played                      : num  38 39 32 35 35 33 35 32 33 31 ...
##  $ possessions_per_game              : num  74.4 66.1 71.1 71.3 65.1 66.3 68.9 67.7 66.8 70.1 ...
##  $ extra_chances_per_game            : num  4.7 2.6 4.2 0.2 2.8 3.2 0.5 3.2 -2.9 3.3 ...
##  $ effective_possession_ratio        : num  0.995 0.994 0.969 0.977 0.971 ...
##  $ win_pct_all_games                 : num  0.816 0.897 0.625 0.6 0.743 0.636 0.629 0.656 0.636 0.548 ...
##  $ ncaa_round_64                     : int  1 1 1 0 0 1 0 0 1 0 ...
##  $ weighted_rank                     : num  15 63 1 0 0 3 0 0 3 0 ...

KNN Model

set.seed(600)


ctrl <- trainControl(method="repeatedcv", repeats = 5, number = 10)

knnFit  <-   train(weighted_rank ~ .,
                   data   = stats_train,
                   method = "knn",
                   trControl = ctrl,
                   preProcess = c("center","scale", "pca"),
                   tuneLength = 20)

knnFit

## k-Nearest Neighbors 
## 
## 3124 samples
##   46 predictor
## 
## Pre-processing: centered (46), scaled (46), principal component
##  signal extraction (46) 
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 2812, 2811, 2811, 2812, 2811, 2811, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared 
##    5  7.390945  0.1878696
##    7  7.005912  0.2339936
##    9  6.773718  0.2798062
##   11  6.660547  0.3043377
##   13  6.601636  0.3186818
##   15  6.510144  0.3424952
##   17  6.507288  0.3440947
##   19  6.497940  0.3507688
##   21  6.496347  0.3507279
##   23  6.506862  0.3500910
##   25  6.506733  0.3509608
##   27  6.497975  0.3546472
##   29  6.492643  0.3573096
##   31  6.514821  0.3530110
##   33  6.515490  0.3518267
##   35  6.533359  0.3459206
##   37  6.535085  0.3462279
##   39  6.541223  0.3464458
##   41  6.550911  0.3457281
##   43  6.551048  0.3469615
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final value used for the model was k = 29.

Let’s see which variables were determined to be the most important in the KNN model:

dotPlot(varImp(knnFit), main = "KNN Model - Most Relevant Variables")

Random Forest

set.seed(600)

rfFit <-train(weighted_rank ~.,
                 data=stats_train,
                 method="rf",
                 trControl=trainControl(method="cv",number=5),
                 prox=TRUE, importance = TRUE,
                 allowParallel=TRUE)
                
rfFit

## Random Forest 
## 
## 3124 samples
##   46 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 2499, 2499, 2499, 2500, 2499 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared 
##    2    6.528543  0.3757743
##   24    6.098787  0.4619901
##   46    6.273775  0.4454502
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final value used for the model was mtry = 24.

print(rfFit$finalModel)

## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry, importance = TRUE,      proximity = TRUE, allowParallel = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 24
## 
##           Mean of squared residuals: 36.51865
##                     % Var explained: 44.67

dotPlot(varImp(rfFit), main = "Random Forest Model - Most Relevant Variables")

Support Vector Machine (SVM)

set.seed(600)

SVMFit <- train(weighted_rank ~., data = stats_train, method="svmRadial", trControl = ctrl)

SVMFit

## Support Vector Machines with Radial Basis Function Kernel 
## 
## 3124 samples
##   46 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 2812, 2811, 2811, 2812, 2811, 2811, ... 
## Resampling results across tuning parameters:
## 
##   C     RMSE      Rsquared 
##   0.25  6.893490  0.3641109
##   0.50  6.741980  0.3966425
##   1.00  6.587426  0.4223910
## 
## Tuning parameter 'sigma' was held constant at a value of 0.01659096
## RMSE was used to select the optimal model using  the smallest value.
## The final values used for the model were sigma = 0.01659096 and C = 1.

dotPlot(varImp(SVMFit), main = "SVM Model - Most Relevant Variables")

Predict

To predict the top 8 teams in the NCAA tournament using the KNN, Random Forest, and SVM models, provide the test dataset to the three models.

AllModels <- list(knn = knnFit,
                  rf  = rfFit,
                  svm = SVMFit)


AllModelPred <- predict(AllModels, newdata = stats_test)

stats_test_results <- cbind(stats_test_results, as.data.frame(AllModelPred))

Compare and store the models’ predictions against the actual winners:

# Top 8 teams based on kNN
knn_top8 <- stats_test_results %>% 
                select(team_name, knn) %>%
                arrange(desc(knn)) %>% 
                slice(1:8) %>% 
                mutate(rank = row_number())
                
# Top 8 teams based on SVM
svm_top8 <- stats_test_results %>% 
                select(team_name, svm) %>%
                arrange(desc(svm)) %>% 
                slice(1:8) %>% 
                mutate(rank = row_number())            

# Top 8 teams based on Random Forest
rf_top8 <- stats_test_results %>% 
                select(team_name, rf) %>%
                arrange(desc(rf)) %>% 
                slice(1:8) %>% 
                mutate(rank = row_number())

top8.disp <-
          knn_top8 %>% 
          inner_join(svm_top8, by="rank") %>% 
          inner_join(rf_top8, by="rank") %>%
          select(rank, team_name.x, team_name.y, team_name) %>%
          rename(Rank = rank, KNN = team_name.x, SVM = team_name.y, RF = team_name)

Offensive Prediction Results

These are the results that were obtained by the machine learning algorithms - kNN, SVM, and Random Forest.

Rank	KNN	SVM	RF
1	North Carolina	North Carolina	Villanova
2	Kentucky	Kentucky	North Carolina
3	West Virginia	Oregon	Michigan St.
4	Stephen F. Austin	Kansas	Oklahoma
5	Gonzaga	Villanova	Kansas
6	Michigan St.	West Virginia	Virginia
7	Miami (FL)	Notre Dame	Oregon
8	Kansas	Virginia	Texas A&M

Blending the rankings from each model:

top8 <- rbindlist(list(knn_top8, svm_top8, rf_top8))

top8$wt <-  abs(top8$rank - 8 ) + 1

top8.blended <- 
                top8 %>% 
                group_by(team_name) %>% 
                summarise(points = sum(wt)) %>% 
                arrange(desc(points)) %>% 
                slice(1:8) %>% 
                mutate(rank = row_number(desc(points))) %>%
                select(rank, team_name, points)

kable(top8.blended)

rank	team_name	points
1	North Carolina	23
2	Kentucky	14
3	Villanova	12
4	Kansas	10
5	West Virginia	9
6	Michigan St.	9
7	Oregon	8
8	Stephen F. Austin	5

Note. The blended results identified four teams correctly in the Elite 8 round. However, it did not place any of them in rank order correctly. In comparison, the Random Forest model identified 6 teams in the Elite 8 and correctly placed three teams in rank order. The Random Forest model will be selected as the best model from an offensive perspective.

Conclusions

Prediction Results Summary

Tournament Place	Actual Teams	Defensive Prediction	Offensive Prediction (Random Forest)
Champ	Villanova	Villanova	Villanova
Finals	North Carolina	North Carolina	North Carolina
Final 4	Syracuse	Kansas	Michigan St.
Final 4	Oklahoma	Michigan St	Oklahoma
Elite 8	Virginia	Virgina	Kansas
Elite 8	Oregon	Indiana	Virginia
Elite 8	Kansas	SMU	Oregon
Elite 8	Notre Dame	Syracuse	Texas A&M

Overall, the results of the models are encouraging. By design, this project separately models offensive statistics from defensive statistics. Even with this limitation, the models were able to correctly identify the championship game and the overall champion. Additionally, the defensive model predicted Virgina making it to the Elite 8. It did correctly predict Syracuse and Kansas as the top 8 teams; it just placed them in the wrong rounds. Defensive prediction results with three exactly correct and another 2 partially correct out of 8 is a good indicator.

Within the offensive models, the Random Forest model proves to be the most accurate of the three models used. This model correctly identified 6 teams among the Elite 8 with Villanova, North Carolina, and Oklahoma being placed correctly in rank order.

It could be that the teams in the top 8 that were not predicted based on the defensive model either had a huge offensive presence and/or were ranked very low on defense. These factors would keep them lower on this list and would not be predicted properly. A similar consideration would apply to the offensive models from a defensive perspective.

Combining the offensive and defensive statistics together would be the next step in the modeling process for this topic.

DATA 607: Final Project

Dan Brooks, Keith Folsom

May 11, 2016

Predicting March Madness

Overview

Data Sources

College Basketball Season and Division Data - GitHub

Team Statistics - www.teamrankings.com

NCAA Tournament Results

CBS Sports

Wikipedia

Data Collection and Preparation

Combine Season and Division Data

Seasonal Rankings

Massy Ranking Algorithm

Processing Team Statistics

Team Stats URLs

Collecting NCAA Tournament Results

MySQL Database

Extracting Defensive Stats - MySQL Database

Manipulate Query Data

Modeling

Predictions Based on Defensive Statistics

Training and Test Data

Model Tree

Decision Tree

KNN (K-Nearest Neighbor) Model

Defensive Prediction Results

Predictions Based on Offensive Statistics

Training and Test Data

KNN Model

Random Forest

Support Vector Machine (SVM)

Predict

Offensive Prediction Results

Conclusions

Prediction Results Summary