This project attempts to choose the top 8 teams in the NCAA division 1 basketball tournament. This will be accomplished by utlizing machine learning packages available in R. Two different approaches will be used to predict the top 8 teams:
Using defensive teams statistics combined with an overall ranking based on strength of schedule
Using offensive team statistics with a ranking system based on NCAA Tournament success
Season averages will be used instead of game by game statistics in this project. There is a total of 10 year’s worth of data from 2006 - 2015. The training data will based on the seasons from 2006 - 2014, while the test data will be based on the 2015 season.
Three models will be used to predict the ranks of the teams per method:
The R packages required for this project are:
The team data for this project was collected off of Github. The Github account had all of the games played by every Division 1, 2, and 3 game for the 2002 through 2015 basketball season. The same Github repository also has the divisions for all of the teams plus game data for each season.
The GitHub repository can be found here with CSV files for:
Games by Season from 2007 - 2016
Schools By Division
All of the statistics relating to the college basketball teams were web scraped from teamrankings.com. Team statistics for seasons from 2006 to 2015 where extracted from this site and ultimately stored in a cloud MySQL database.
Statistics across the following 14 categories were captured and combined for each team per season:
TeamRankings
The teams reaching the variaous brackets within the NCAA Tournament are needed for the offensive modeling approach. This data was collected from two sites.
http://www.cbssports.com/collegebasketball/ncaa-tournament/history/yearbyyear
NCAA Men’s Division Basketball Tournament
The Github data for season combines teams from a variety of divisions. It has data for Division 1 teams vs Divsion 4 teams and a lot of other combinations. Since the NCAA basketball tournament being considered is for division 1 teams, the division data and the season data had to be combined into one chunk of data. Then the entire data frame can be subset to only those games that are with two division 1 teams. Any other combination will be dropped. Some of the teams have different names from year to year (Connecticut is UCONN or Alcorn St. is Alcorn etc.) so in order for the matching to work properly, all the team names have to be the same, so some of the names in certain seasons had to be changed to match the divisonal data.
#Function: Data
#Input:
# div, data frame of the divisions for the teams
# result, season data for the teams
#Output:
# returns a data frame that has only division 1 games, all other games were dropped.
#This function will take the season data and the divisional data and combine them together. This gives the ability to pick only division 1 games.
data <- function(div, result)
{
names(div) <- c("team_id", "division")
jointeam <- left_join(result, div, by = "team_id")
almost <- jointeam %>% select(year, team_name, opponent_name, opponent_id, team_score, opponent_score, division)
colnames(almost)[colnames(almost)=="division"] <- "team_division"
colnames(almost)[colnames(almost)=="opponent_id"] <- "team_id"
joinopp <- left_join(almost, div, by = "team_id")
colnames(joinopp)[colnames(joinopp)=="division"] <- "opp_division"
final <- joinopp %>% select(year, team_name, opponent_name, team_score, opponent_score, team_division, opp_division)
D1 <- subset(final, team_division == "I"&opp_division == "I")
return(D1)
}
#Change the names of the teams to match between the divisions and the seasonal data. These teams are common amongst all of the years
Data <- data(div,result)
Data$team_name <- as.character(Data$team_name)
Data$team_division <- as.character(Data$team_division)
Data$opponent_name <- as.character(Data$opponent_name)
Data$opp_division <- as.character(Data$opp_division)
Data$opponent_name[Data$opponent_name == "Alcorn St."] <- "Alcorn"
Data$opponent_name[Data$opponent_name == "Connecticut"] <- "UConn"
Data$opponent_name[Data$opponent_name == "Md.-East. Shore"] <- "UMES"
Data$opponent_name[Data$opponent_name == "St. Francis (NY)"] <- "St. Francis Brooklyn"
Data$opponent_name[Data$opponent_name == "UNC Wilmington"] <- "UNCW"
Data$opponent_name[Data$opponent_name == "Cal St. Northridge"] <- "CSUN"
Data$opponent_name[Data$opponent_name == "S.C. Upstate"] <- "USC Upstate"
Data$opponent_name[Data$opponent_name == "Neb. Omaha"] <- "Omaha"
Data$opponent_name[Data$opponent_name == "CSU Bakersfield"] <- "Bakersfield"
Data$opponent_name[Data$opponent_name == "SIU Edwardsville"] <- "SIUE"
Data$opponent_name[Data$opponent_name == "Fla. Gulf Coast"] <- "FGCU"
#The stats for Utah Vallet stats were missing. I hade to create their offesnive stats to all the matching with the divisnal data.
#This only happens for the 2004 season.
if (Data$year == 2004)
{
uvu <- subset(Data, Data$opponent_name == "Utah Valley")
i <- 1
for (i in 1:NROW(uvu))
{
temp <- uvu$opponent_name[i]
uvu$opponent_name[i] <- uvu$team_name[i]
uvu$team_name[i] <- temp
temp <- uvu$opponent_score[i]
uvu$opponent_score[i] <- uvu$team_score[i]
uvu$team_score[i] <- temp
i <- i + 1
}
Data <- rbind(Data, uvu)
}
#The 2016 season data had a tab character (/n) in the teams name. I had to split the list on the tab character, then extract the names of each team. Team names come in a wide variety, ranging from 1 to 4 name with a wide variety of punctuation in between. This is only for the 2016 season.
if (Data$year == 2016)
{
names <- unlist(str_split(Data$opponent_name, '\n'))
i <- 2
j <- 1
for (j in 1:(NROW(Data)))
{
Data$opponent_name[j] <- names[i]
i <- i + 2
j <- j + 1
}
Data$opponent_name <- unlist(str_extract_all(Data$opponent_name, "[[:alpha:]]{1,}[[:punct:]]{0,}[[:alpha:]]{0,}[[:space:]]{0,}[[:alpha:]]{0,}[[:punct:]]{0,}[[:alpha:]]{0,}[[:space:]]{0,}[[:alpha:]]{0,}[[:punct:]]{0,}[[:alpha:]]{0,1}[[:space:]]{0,}[[:alpha:]]{0,}[[:punct]]{0,}[[:alpha:]]{0,}[[:punct:]]{0,}"))
Data$opponent_name[Data$opponent_name == "Stephen F. Austin"] <- "SFA"
Data$opponent_name[Data$opponent_name == "Texas-Arlington"] <- "UT Arlington"
Data$opponent_name[Data$opponent_name == "Citadel"] <- "The Citadel"
}
The response variable being used for this process is the end of the year rankings. Instead of using the rankings that are given by the coaches, this project will calculate its own based on the Massy ranking system. This sytem has been used for the official rankings of ESPN and was created by a student as a part of a graduate research project. Rankings are created on a game by game basis, so they can be updated as the season goes on. This system compares the result of each game to the difference in the scores.
col <- as.character(unique(Data$team_name))
matrix <- matrix(0, ncol = length(col), nrow = NROW(Data), byrow=TRUE)
dimnames(matrix) <- list(c(1:NROW(Data)), col)
#Function: Schedule
#Input:
# matrix, an empty matrix to hold the information for the ranking system. It is a square matrix with the dimensions being the number of teams for that season
# Data, a data frame that contains the seaonsal data for division 1
#OUtput:
# matrix that contains the results for each game for each team during hte season.
#This function will create a matrix that is used for input into the ranking formula. Each column of the martix contain the team name. Each row of the matrix will contain a game. When two teams play against each other, the winner will get a 1 on the intersection and the loser will get a -1. The we would move onto the next game. This will continue for every game that was played for that season.
schedule <- function(matrix, Data)
{
i <- 1
for (i in 1:NROW(Data))
{
if (as.numeric(Data[i,4]) > as.numeric(Data[i,5]))
{
matrix[i,Data[i,2]] = 1
matrix[i,Data[i,3]] = -1
}
else
{
matrix[i,Data[i,2]] = - 1
matrix[i,Data[i,3]] = 1
}
i <- i + 1
}
return(matrix)
}
Team <-schedule(matrix, Data)
#Function: scoredifferential
#Input:
# Data, data frame containing the seasonal data
#Output:
# diff, a 1 by number of games matrix that conatins the difference in the score of each game
#This is a function that will calculate the score differential for each game that is played throughout the season. If team A beat team B 70-60. Then the entry will be 10, If team A lost to team B 60-70, then the entry would be -10.
scoredifferential <- function(Data)
{
i <- 1
diff <- matrix(0, nrow = NROW(Data), ncol = 1)
for (i in 1 : NROW(Data))
{
diff[i] <- as.numeric(Data[i,4])-as.numeric(Data[i,5])
i <- i + 1
}
return(diff)
}
diff <- scoredifferential(Data)
absdiff <- abs(diff)
#Transpose the matrix
tteam <- t(Team)
#Multiply the transpose by the original matrix
A <- tteam %*% Team
#Multiply the transpose matrix by the matrix taht contains the differences
B <- tteam %*% absdiff
#Make the last rowm of the socre matrix 1's and the last ebtry of the socre difference matrix a 0. If this is not done, then there is no solution due to the properties of the matrix.
A[NROW(A),] = 1
B[NROW(B),1] = 0
#Solve the linear equation to get the rankings of the teams.
rankings <- solve(A,B)
#order the teams by the rankings
rankings <- as.data.frame(rankings)
ranks <- rankings[order(-rankings[1]), , drop = FALSE]
ranks$rank <- c(1:NROW(ranks))
The table below shows the results of the Massy Ranking algorithm:
Y2015 | 2015_Rank | Y2014 | 2014_Rank | Y2013 | 2013_Rank | Y2012 | 2012_Rank | Y2011 | 2011_Rank |
---|---|---|---|---|---|---|---|---|---|
Kansas | 23.65358 | Kentucky | 29.59169 | Louisville | 25.14265 | Indiana | 25.44025 | Kentucky | 25.41010 |
Michigan St. | 23.19914 | Wisconsin | 24.85308 | Arizona | 23.76682 | Louisville | 25.31585 | Ohio St. | 24.63799 |
West Virginia | 22.30338 | Arizona | 24.64689 | Florida | 21.11844 | Florida | 24.23370 | Kansas | 22.40911 |
North Carolina | 22.11190 | Duke | 24.33190 | Kansas | 20.79188 | Duke | 21.93963 | Michigan St. | 22.05064 |
Villanova | 21.91290 | Virginia | 23.07023 | Wisconsin | 20.11526 | Michigan | 21.80616 | North Carolina | 22.02033 |
Louisville | 21.63333 | Villanova | 22.90716 | Duke | 20.01823 | Kansas | 21.31411 | Syracuse | 21.80872 |
Virginia | 21.23472 | Utah | 21.43638 | Oklahoma St. | 19.54340 | Gonzaga | 21.05962 | Missouri | 20.45130 |
Duke | 20.81610 | Gonzaga | 21.33082 | Villanova | 19.41018 | Syracuse | 20.80983 | Wisconsin | 19.73984 |
Oklahoma | 20.58445 | North Carolina | 21.32353 | Michigan St. | 19.24373 | Ohio St. | 20.74490 | Indiana | 19.31590 |
Miami (FL) | 20.14680 | Ohio St. | 20.88407 | Michigan | 19.09298 | Pittsburgh | 19.38626 | Florida | 18.93787 |
Purdue | 20.04783 | Oklahoma | 19.95099 | Iowa St. | 19.08060 | Wisconsin | 19.27926 | Wichita St. | 18.09982 |
Indiana | 19.98950 | Kansas | 19.71435 | Virginia | 18.88161 | Michigan St. | 19.17142 | Duke | 17.73333 |
Kentucky | 19.64098 | Baylor | 19.35853 | UCLA | 18.83495 | Arizona | 17.79115 | Marquette | 17.62773 |
Xavier | 19.56783 | Iowa St. | 19.20751 | Creighton | 18.51829 | Miami (FL) | 17.47762 | Memphis | 17.53664 |
SMU | 19.24612 | Louisville | 19.06722 | Kentucky | 18.46476 | Minnesota | 17.39826 | Georgetown | 17.49761 |
Arizona | 19.22254 | Notre Dame | 18.83728 | Iowa | 18.30519 | VCU | 17.38770 | Baylor | 17.22571 |
Vanderbilt | 19.06473 | Michigan St. | 17.34379 | Ohio St. | 18.29454 | Creighton | 16.99766 | Louisville | 16.87967 |
Iowa | 18.43319 | Texas | 17.25475 | Pittsburgh | 17.98544 | Georgetown | 16.34374 | Vanderbilt | 16.39438 |
Texas A&M | 18.42961 | Butler | 17.08722 | Tennessee | 17.74630 | Oklahoma St. | 16.08799 | New Mexico | 16.33073 |
Maryland | 18.00047 | West Virginia | 16.83346 | UConn | 17.67079 | Missouri | 15.82460 | Saint Louis | 15.83155 |
The 14 categories of team statistics were web scraped from www.teamrankings.com using a combination of the rvest
and htmltab
packages in R. In total, there are 115 NCAA college basketball statistics available on teamrankings.com per team per year. Additionaly, each statistics includes the team’s rank within the season for a given statistic.
baseURL <- "https://www.teamrankings.com/ncb/stats/"
url_content <- read_html(baseURL)
stat.links <- url_content %>%
html_nodes("*") %>%
html_nodes(xpath = "./a") %>%
html_attr("href") %>%
str_match("/ncaa-basketball/stat/.+")
stat.links <- stat.links[complete.cases(stat.links),]
The URL for each statistic is constructed to create 115 callable URLs which can be used with a date parameter for each season.
## [1] "/ncaa-basketball/stat/points-per-game"
## [2] "/ncaa-basketball/stat/average-scoring-margin"
## [3] "/ncaa-basketball/stat/offensive-efficiency"
## [4] "/ncaa-basketball/stat/floor-percentage"
## [5] "/ncaa-basketball/stat/1st-half-points-per-game"
## [6] "/ncaa-basketball/stat/2nd-half-points-per-game"
## [7] "/ncaa-basketball/stat/overtime-points-per-game"
## [8] "/ncaa-basketball/stat/average-1st-half-margin"
## [9] "/ncaa-basketball/stat/average-2nd-half-margin"
## [10] "/ncaa-basketball/stat/average-overtime-margin"
baseStatURL <- "https://www.teamrankings.com"
season_year_end <- 2016
allStatsURL <- sprintf("%s%s%s", baseStatURL, stat.links, sprintf("?date=%s-04-06", season_year_end))
An example of the URLs constructured for the team statistics for the 2015 season is shown below:
https://www.teamrankings.com/ncaa-basketball/stat/points-per-game?date=2016-04-06
https://www.teamrankings.com/ncaa-basketball/stat/average-scoring-margin?date=2016-04-06
https://www.teamrankings.com/ncaa-basketball/stat/offensive-efficiency?date=2016-04-06
https://www.teamrankings.com/ncaa-basketball/stat/floor-percentage?date=2016-04-06
https://www.teamrankings.com/ncaa-basketball/stat/1st-half-points-per-game?date=2016-04-06 https://www.teamrankings.com/ncaa-basketball/stat/2nd-half-points-per-game?date=2016-04-06 https://www.teamrankings.com/ncaa-basketball/stat/overtime-points-per-game?date=2016-04-06 https://www.teamrankings.com/ncaa-basketball/stat/average-1st-half-margin?date=2016-04-06
https://www.teamrankings.com/ncaa-basketball/stat/average-2nd-half-margin?date=2016-04-06
https://www.teamrankings.com/ncaa-basketball/stat/average-overtime-margin?date=2016-04-06
## -----------------------------------------------------------
## Function: get_team_stats
##
## -----------------------------------------------------------
get_team_stats <- function(statURL, year, createCSV = TRUE) {
for (i in 1:length(statURL)) {
# Keep the rank, team, and stat for the given year
statHTMLtab <- htmltab(doc = statURL[i], which = 1)[, 1:3]
# parse the current stat name from the URL string
statname <- str_replace_all(str_extract(statURL[i], "(?<=\\/stat\\/).*?.*?(?=\\?)"), "-", "_")
# convert double-underscores to a single underscore
statname <- str_replace_all(statname, "__", "_")
stats.df <- statHTMLtab %>% gather(key = Year, value = "stat", -Team, -Rank)
# convert the year from a character to integer
stats.df$Year <- as.integer(stats.df$Year)
# replace % signs in the statistic value
stats.df$stat <- as.numeric(gsub("%|--", "", stats.df$stat))
stats.df$Rank <- as.integer(stats.df$Rank)
# rename the rank column
names(stats.df)[names(stats.df) == "Rank"] <- sprintf("%s_%s", statname, "rank")
# rename the stat column to the stat being extracted in the URL
names(stats.df)[names(stats.df) == "stat"] <- statname
# reorder the columns
stats.df <- stats.df[, c(3, 2, 4, 1)]
if (i == 1) {return.df <- stats.df}
else {
return.df <- return.df %>% left_join(stats.df, by = c("Year" = "Year", "Team" = "Team"))
}
}
if (createCSV) {
# write the team stats to a csv file
write.csv(return.df, sprintf("team_stats_%s.csv", year - 1), row.names = FALSE)
}
# return the stats dataframe
return (return.df)
}
## -----------------------------------------------------------
## Function: load_RMySQL_DB
##
## -----------------------------------------------------------
load_RMySQL_DB <- function(tablename, df, overwrite = TRUE, append = FALSE) {
# establish the connection to the skill DB on db4free.net
bbmetricDB = dbConnect(MySQL(), user=proj_user, password=proj_pwd, dbname=proj_db, host=proj_host)
dbWriteTable(bbmetricDB,
name = tablename,
value = df,
overwrite = overwrite,
append = append,
row.names = FALSE)
dbDisconnect(bbmetricDB)
}
Using the function get_team_stats
, teams statistics for the NCAA basketball seasons (starting) from 2006 - 2015
team.stats.2016 <- get_team_stats(allStatsURL, season_year_end, createCSV = FALSE)
The team statistics for the 2015 - 2016 season (partial listing):
team_name | source_team_name | season_start_year | season_end_year | assist_per_turnover_ratio | assist_per_turnover_ratio_rank | assists_per_fgm | assists_per_fgm_rank | assists_per_game | assists_per_game_rank |
---|---|---|---|---|---|---|---|---|---|
North Carolina | N Carolina | 2015 | 2016 | 1.647 | 3 | 0.573 | 58 | 17.7 | 4 |
Michigan St. | Michigan St | 2015 | 2016 | 1.737 | 1 | 0.713 | 1 | 20.5 | 1 |
Mississippi St. | Miss State | 2015 | 2016 | 1.099 | 137 | 0.507 | 210 | 13.9 | 122 |
West Virginia | W Virginia | 2015 | 2016 | 1.024 | 179 | 0.541 | 130 | 14.6 | 78 |
Virginia Tech | VA Tech | 2015 | 2016 | 0.955 | 231 | 0.486 | 253 | 12.2 | 236 |
Florida St. | Florida St | 2015 | 2016 | 1.041 | 170 | 0.478 | 274 | 13.3 | 160 |
Ole Miss | Mississippi | 2015 | 2016 | 1.067 | 155 | 0.518 | 183 | 12.9 | 188 |
North Carolina St. | NC State | 2015 | 2016 | 1.052 | 163 | 0.442 | 332 | 11.6 | 278 |
Saint Joseph’s | St Josephs | 2015 | 2016 | 1.487 | 11 | 0.547 | 114 | 14.8 | 67 |
Fresno St. | Fresno St | 2015 | 2016 | 1.210 | 84 | 0.494 | 240 | 12.9 | 185 |
George Washington | Geo Wshgtn | 2015 | 2016 | 1.256 | 62 | 0.546 | 115 | 14.1 | 109 |
Colorado St. | Colorado St | 2015 | 2016 | 1.126 | 119 | 0.513 | 192 | 13.1 | 173 |
South Carolina | S Carolina | 2015 | 2016 | 0.996 | 201 | 0.520 | 172 | 13.5 | 149 |
UTEP | TX El Paso | 2015 | 2016 | 1.237 | 70 | 0.573 | 59 | 15.5 | 41 |
Boise St. | Boise State | 2015 | 2016 | 1.130 | 114 | 0.508 | 209 | 13.1 | 175 |
Iowa St. | Iowa State | 2015 | 2016 | 1.446 | 21 | 0.531 | 149 | 16.5 | 16 |
Utah Valley | Utah Val St | 2015 | 2016 | 1.074 | 149 | 0.557 | 90 | 14.0 | 112 |
Western Ky. | W Kentucky | 2015 | 2016 | 1.000 | 196 | 0.496 | 234 | 13.2 | 169 |
San Francisco | San Fransco | 2015 | 2016 | 0.960 | 229 | 0.462 | 303 | 12.3 | 229 |
SMU | S Methodist | 2015 | 2016 | 1.427 | 24 | 0.615 | 23 | 17.6 | 5 |
This process of assembling team statistics was repeated to collect 10 years of statistics. Each year’s combined team statistics were loaded into a cloud MySQL database, described below.
NCAA Tournament results from collected from 2007 - 2016. For each year, the teams making the Round of 64, Round of 32, Sweet 16, Elite 8, Final 4, Championship Game, and Champion were assembled and stored.
The process to collect the NCAA Tournament bracket teams was somewhat manual since the winners of each round are most commonly represented in a bracket graphic versus as a list of teams.
The site http://www.cbssports.com/collegebasketball/ncaa-tournament/history/yearbyyear provided the winners of each round up to 2012 but stopped beyond this tournament year.
NCAA Bracket Winners
Each round of winners was parsed using the function below, which uses the splitstatshape
R package to create a dataframe of the teams in each bracket.
suppressWarnings(suppressMessages(library(splitstackshape)))
parse_bracket_results <- function(team_string) {
str(team_string)
team_string.new <-
strsplit(
str_replace_all(s, "No. |[0-9]|(OT)|OT|\\(|\\)", "") , "\n")
f <- as.data.frame(team_string.new);
f <- as.data.frame(str_c(str_replace(as.character(f[, 1]), "(.),(.)", "|"), "|"))
colnames(f) <- "team"
f <- as.data.frame(cSplit(f, "team", "|", direction = "long"))
f$team <- as.character(f$team)
f <- f %>%
select(team) %>%
filter(team != "") %>%
mutate(team = str_replace_all(team, " ,$| $", ""))
return (f)
}
Sample of the NCAA bracket winners for 2016:
tournament_year | round_64 | round_32 | sweet_16 | elite_8 | final_4 | championship_game | champion |
---|---|---|---|---|---|---|---|
2016 | Kansas | Kansas | Kansas | Kansas | Villanova | Villanova | Villanova |
2016 | Villanova | Connecticut | Maryland | Villanova | Oklahoma | North Carolina | |
2016 | Miami | Maryland | Miami (FL) | Oregon | North Carolina | ||
2016 | California | Hawaii | Villanova | Oklahoma | Syracuse | ||
2016 | Maryland | Wichita State | Oregon | North Carolina | |||
2016 | Arizona | Miami (FL) | Duke | Notre Dame | |||
2016 | Iowa | Iowa | Texas A&M | Virginia | |||
2016 | Colorado | Villanova | Oklahoma | Syracuse | |||
2016 | Connecticut | Oregon | North Carolina | ||||
2016 | Temple | Saint Joseph’s | Indiana | ||||
2016 | Wichita State | Yale | Notre Dame | ||||
2016 | South Dakota State | Duke | Wisconsin | ||||
2016 | Hawaii | Northern Iowa | Virginia | ||||
2016 | Buffalo | Texas A&M | Iowa State | ||||
2016 | UNC Asheville | VCU | Gonzaga | ||||
2016 | Austin Peay | Oklahoma | Syracuse | ||||
2016 | Oregon | North Carolina | |||||
2016 | Oklahoma | Providence | |||||
2016 | Texas A&M | Indiana | |||||
2016 | Duke | Kentucky | |||||
2016 | Baylor | Notre Dame | |||||
2016 | Texas | Stephen F. Austin | |||||
2016 | Oregon State | Wisconsin | |||||
2016 | Saint Joseph’s | Xavier | |||||
2016 | Cincinnati | Virginia | |||||
2016 | VCU | Butler | |||||
2016 | Northern Iowa | Arkansas-Little Rock | |||||
2016 | Yale | Iowa State | |||||
2016 | UNC Wilmington | Gonzaga | |||||
2016 | Green Bay | Utah | |||||
2016 | Cal State Bakersfield | Syracuse | |||||
2016 | Holy Cross | Middle Tennessee | |||||
2016 | North Carolina | ||||||
2016 | Xavier | ||||||
2016 | West Virginia | ||||||
2016 | Kentucky | ||||||
2016 | Indiana | ||||||
2016 | Notre Dame | ||||||
2016 | Wisconsin | ||||||
2016 | USC | ||||||
2016 | Providence | ||||||
2016 | Pittsburgh | ||||||
2016 | Michigan | ||||||
2016 | Chattanooga | ||||||
2016 | Stony Brook | ||||||
2016 | Stephen F. Austin | ||||||
2016 | Weber State | ||||||
2016 | Florida Gulf Coast | ||||||
2016 | Virginia | ||||||
2016 | Michigan State | ||||||
2016 | Utah | ||||||
2016 | Iowa State | ||||||
2016 | Purdue | ||||||
2016 | Seton Hall | ||||||
2016 | Dayton | ||||||
2016 | Texas Tech | ||||||
2016 | Butler | ||||||
2016 | Syracuse | ||||||
2016 | Gonzaga | ||||||
2016 | Arkansas-Little Rock | ||||||
2016 | Iona | ||||||
2016 | Fresno State | ||||||
2016 | Middle Tennessee | ||||||
2016 | Hampton |
The NCAA bracket winners for the 2013, 2014, 2015, and 2016 tournaments were manually extracted from Wikipedia:
https://en.wikipedia.org/wiki/2013_NCAA_Men%27s_Division_I_Basketball_Tournament
The bracket results for ten seasons can be found here in CSV format. These files were subsequently loaded into a MySQL database using the load_RMySQL_DB
function
To facilitate centralized data storage and access for this project, a MySQL database on dbfree.net was chosen as a cloud repository for the team statistics and bracket results.
MySQL DB
The team statistics assembled for each year were staged in the MySQL database and combined into a single, consolidated team statistics table. The SQL for the table DDL and the dynamically executed insert statement can be found here.
The code to insert the team statistics by year is shown below:
## Insert Team Stats By Year
suppressWarnings(suppressMessages(library(readr)))
url <- "https://raw.githubusercontent.com/kfolsom98/DATA607/master/Final-Project/sql/insert_team_stats_year.sql"
ins <- read_file(url)
stat.tables <- dbGetQuery(bbmetricDB, "SELECT DISTINCT TABLE_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'bbmetric'
AND TABLE_NAME like 'team_stats_%'")
dbGetQuery(bbmetricDB, "TRUNCATE TABLE team_stats")
for (i in 1:nrow(stat.tables)) {
insert <- str_replace(ins, "<<!table_name>>", stat.tables[i, ])
print(paste0("Preparing insert for ", stat.tables[i, ]))
res <- dbSendQuery(bbmetricDB, insert)
}
dbSendQuery(bbmetricDB, "select count(*) from team_stats")
The NCAA bracket result data was combined locally and uploaded to the MySQL database with this process:
## Load the ncaa bracket data ##
suppressWarnings(suppressMessages(library(data.table)))
df <- data.frame()
files <- list.files(pattern="ncaa*.*.csv")
for (f in files) df<-rbind(Dataf, fread(f))
## push the brackets per year to the DB
dbSendQuery(bbmetricDB, "TRUNCATE TABLE ncaa_brackets")
load_RMySQL_DB("ncaa_brackets", df, overwrite = FALSE, append = TRUE)
#close the connection
dbDisconnect(bbmetricDB)
One significant challenge encountered while processing the team statistics and bracket winners was that the same team could be entered with multiple variations of a name. To standardize the team names, the source team name was mapped to a consistent team name while being inserted into the team_stats
combined table.
To illustrate the approach, the following example shows the mappings for the team St. Mary’s (CA) and all the variations seen in the source team stats data.
old_name | new_name |
---|---|
St Marys | St. Mary’s (CA) |
St. Mary’s | St. Mary’s (CA) |
Saint Mary’s, Calif. | St. Mary’s (CA) |
Saint Mary’s Calif. | St. Mary’s (CA) |
St. Mary’s (Calif.) | St. Mary’s (CA) |
Saint Mary’s | St. Mary’s (CA) |
The final step in preparing the combined team statistics table is to apply the NCAA bracket winners for each year. A team reaching a specific level in the bracket (e.g. Sweet 16 or Final 4) for a specific year will be given a value of 1, as an indicator variable. Otherwise, the team will have a value of 0 for each bracket level.
## UPDATE - Reset Bracket Values
dbGetQuery(bbmetricDB, " update team_stats ts
set ts.ncaa_round_64 = 0,
ts.ncaa_round_32 = 0,
ts.ncaa_sweet_16 = 0,
ts.ncaa_elite_8 = 0,
ts.ncaa_final_4 = 0,
ts.ncaa_championship_game = 0,
ts.ncaa_champion = 0
")
## UPDATE - Set the bracket column value to 1 for each team in the given bracket for the given year
update_cols <- c("round_64","round_32", "sweet_16",
"elite_8", "final_4",
"championship_game",
"champion")
for (i in 1:length(update_cols)) {
update <-
sprintf(" update team_stats ts
inner join
(
select nb.tournament_year,
ifnull(map.new_name, nb.%s) team_name
from ncaa_brackets nb left outer join team_name_map map on nb.%s = map.old_name) br
on ts.season_end_year = br.tournament_year and ts.team_name = br.team_name
set ts.ncaa_%s = 1", update_cols[i], update_cols[i], update_cols[i])
dbGetQuery(bbmetricDB, update)
}
#Get the column names for the query
column_names <- read.csv("https://raw.githubusercontent.com/DanielBrooks39/IS607/master/Final_Project/Column_names.csv", sep=",", header=FALSE, stringsAsFactors = FALSE)
years <- c("2015", "2014", "2013", "2012", "2011", "2010", "2009", "2008", "2007", "2006")
#Function: getdata
#Input:
# years, the years for the database table
# column_names, the names of the columns that are in the database
# username, the username to connect tot the database
# password, the password for teh database
# database, the database name to connect to the database
# hotname, the hostname where the database resides
#Output:
# Writes a csv to the working directoy for every table that is specified by the query.
#This function will create a query for each of the years that we wabt in the database. It will connect to the database and poull back the desired columns for each year that we want. It will take each year that is qurries and writes it to a csv on the working directory.
getdata <- function(years, column_names, username, password, database, hostname)
{
con <- dbConnect(MySQL(), user=username, password = password, dbname = database, host = hostname)
base_table <- "team_stats_"
i <- 1
j <- 1
for (i in 1 : NROW(years))
{
switch(years[i],
"2015" = data <- data.frame(double(351)),
"2014" = data <- data.frame(double(351)),
"2013" = data <- data.frame(double(351)),
"2012" = data <- data.frame(double(347)),
"2011" = data <- data.frame(double(345)),
"2010" = data <- data.frame(double(345)),
"2009" = data <- data.frame(double(345)),
"2008" = data <- data.frame(double(341)),
"2007" = data <- data.frame(double(341)),
"2006" = data <- data.frame(double(336))
)
for (j in 1 : NROW(column_names))
{
query <- dbGetQuery(con, paste("select", " ", column_names$V1[j], " ", "from", " ", base_table, years[i], sep=""))
data[,j] <- query
j <- j + 1
}
write.csv(data, paste("data", years[i], ".csv", sep = ""), append = TRUE, col.names = TRUE)
i <- i + 1
}
dbDisconnect(con)
}
getdata(years, column_names, username, password, database, hostname)
The database data is read in from the CSV’s that were created earlier. The CSV’s are read in one at a time and stored in their separate data frames. There is also a mismatch between the database team names and the names that are used in the rankings. The teams names need to be the same in order to join the two tables together and and get everything in the same data frame.
data2014 <- read.csv("data2014.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)
colnames(data2014)[2] <- "V1"
data2015 <- read.csv("data2015.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)
colnames(data2015)[2] <- "V1"
The following section will cover the processes used to model the defensive statistics to predict the top 8 teams. The models used will be Model Tree, Decision Tree, and KNN. The following lists the defensive variables used.
Variable | Variable | Variable |
---|---|---|
Team_name | opponent_two_point_rate | opponent_turnover_pct |
opponent_points_per_game | opponent_fta_per_fga | opponent_personal_fouls_per_game |
opponent_average_scoring_margin | opponent_ftm_per_100_possessions | opponent_personal_fouls_per_possession |
defensive_efficiency | opponent_free_throw_rate | opponent_personal_foul_pct |
opponent_floor_percentage | opponent_non_blocked_2_pt_pct | opponent_win_pct_close_games |
opponent_1st_half_points_per_game | opponent_offensive_rebounds_per_game | opponent_win_pct_all_games |
opponent_2nd_half_points_per_game | opponent_defensive_rebounds_per_game | opponent_effective_possession_ratio |
opponent_overtime_points_per_game | opponent_offensive_rebounding_pct | games_played |
opponent_points_from_2_pointers | opponent_defensive_rebounding_pct | |
opponent_points_from_3_pointers | opponent_blocks_per_game | |
opponent_percent_of_points_from_2_pointers | opponent_steals_per_game | |
opponent_percent_of_points_from_3_pointers | opponent_block_pct | |
opponent_percent_of_points_from_free_throws | opponent_steals_perpossession | |
opponent_shooting_pct | opponent_steal_pct | |
opponent_effective_field_goal_pct | opponent_assists_per_game | |
opponent_three_point_pct | opponent_turnovers_per_game | |
opponent_two_point_pct | opponent_assist_per_turnover_ratio | |
opponent_free_throw_pct | opponent_assists_per_fgm | |
opponent_true_shooting_percentage | opponent_assists_per_possession | |
opponent_three_point_rate | opponent_turnovers_per_possession |
For the machine learning algorithms to work, it needs to be trained before it can predict anything. For this data, the training data will be the year of 2006-2014. The test data will be the most current year(2015). The training data is used to give the model a baseline and allow it to make groups or trees (depedning on the model type). Then the model can predict the rankings or the class of the test data.
traindata <- rbind(table2006, table2007, table2008, table2009, table2010, table2011, table2012,table2013, table2014)
traindata <- na.omit(traindata)
traindata <- traindata[,-1]
traindata$rank <- factor(traindata$rank)
testdata <- rbind(table2015)
testdata <- na.omit(testdata)
testnames <- testdata[1]
testdata <- testdata[,-1]
testdata$rank <- factor(testdata$rank)
The first model that will be used to calculate the top eight teams in division one basketball woud be the model tree. This is a more robust decision tree. Basically, it takes the training dataset and comes up with a set of rules. Each of those rules are a node on the tree. The difference between a model tree and a decision tree is, a model tree gives an equation for each node of a tree. Each rull that is decided on by the model will give a specific formula. That formula is then used to forecast the categories of the new data, based on what nodes the new data hits. For the case of this model, there was only one rule that was decided by the model. The equation is:
\[ rank = -14.831 * opponent_points_per_game + 13.6137 * opponent_average_scoring_margin + 852.236 * defensive_efficiency - 3.8714 * opponent_floor_percentage + 4.5235 * opponent_1st_half_points_per_game + 5.0138 * opponent_2nd_half_points_per_game + 7.6484 * opponent_overtime_points_per_game + 10.7681 * opponent_points_from_2_pointers + 10.0136 * opponent_points_from_3_pointers + 8.4694 * opponent_percent_of_points_from_2_pointers + 22.4938 * opponent_percent_of_points_from_3_pointers + 10.4896 * opponent_percent_of_points_from_free_throws - 33.1115 * opponent_shooting_pct - 3.455 * opponent_effective_field_goal_pct - 3.8401 * opponent_three_point_pct + 23.4914 * opponent_two_point_pct + 3.9098 * opponent_true_shooting_percentage - 16.814 * opponent_three_point_rate + 814.8023 * opponent_fta_per_fga - 10.9082 * opponent_ftm_per_100_possessions - 5.4238 * opponent_non_blocked_2_pt_pct - 8.2394 * opponent_offensive_rebounds_per_game + 3.6402 * opponent_offensive_rebounding_pct - 37.5087 * opponent_blocks_per_game + 16.5483 * opponent_steals_per_game + 16.4092 * opponent_block_pct - 28.4589 * opponent_steals_perpossession + 19.2984 * opponent_steal_pct + 5.6475 * opponent_assists_per_game - 9.1663 * opponent_turnovers_per_game + 243.8074 * opponent_assists_per_fgm - 1320.9676 * opponent_assists_per_possession + 52.0246 * opponent_turnovers_per_possession - 62.0717 * opponent_turnover_pct + 17.4931 * opponent_personal_fouls_per_game - 2.5374 * opponent_personal_fouls_per_possession - 12.8002 * opponent_personal_foul_pct - 98.1536 * opponent_win_pct_all_games - 1181.6373 * opponent_effective_possession_ratio - 8.2799 * games_played + 994.0719 \]
This can be used to get the new order of the teams for the new data.
====Summary==== | Value |
---|---|
Correlation coefficient: | 0.891 |
Mean absolute error: | 35.658 |
Root mean squared error: | 44.7633 |
Relative absolute error: | 41.7723 % |
Root relative squared error: | 45.4043 % |
Total Number of Instances: | 2952 |
The summary data is showing that the correlation between this equation and the rank is rather high. Implying that this is a good equation to use for ranking of the teams.
suppressWarnings(suppressMessages(library(RWeka)))
traindata$rank <- as.numeric(traindata$rank)
testdata$rank <- as.numeric(testdata$rank)
model_M5p <- M5P(rank~., data = traindata)
pred <- predict(model_M5p, testdata[,-1])
pred <- as.data.frame(pred)
pred$team <- testnames
sort <- pred[with(pred, order(pred)), ]
A decision tree follows the same idea as a model tree. The data is broken apart into “rules”. Each rule is a node of a tree. Depending on which side of the tree the data falls upon, determines what category the data will end up in. This kind of tree doesn’t have any equations associated with it, so it will be used for classification and not used to atualy predict the value of the team.
Classification Tree | Value |
---|---|
Number of Samples: | 2952 |
Number of Predictors | 47 |
Tree Size: | 1213 |
We can see that there was 2952 samples that were used to make the tree. There was 47 Predictors that were used in the model and the tree has 1213 nodes in it.
We can also see the break down of what stats the tree uses. The more important stats get used the most in the model. Here are some of the stats used and the amount they were used in the model:
Attribute | % Usage |
---|---|
opponent_average_scoring_margin | 100% |
games_played | 92.65% |
opponent_points_per_game | 36.79% |
opponent_overtime_points_per_game | 36.52% |
opponent_free_throw_pct | 33.16% |
opponent_defensive_rebounding_pct | 28.22% |
opponent_blocks_per_game | 26.56% |
opponent_floor_percentage | 25.68% |
opponent_turnover_pct | 23.78% |
opponent_assists_per_possession | 21.54% |
opponent_win_pct_all_games | 21.00% |
suppressWarnings(suppressMessages(library(C50)))
trainmodel <- C5.0(traindata[,-1], traindata[,1, trials = 3])
testmodel <- predict(trainmodel, testdata[,-1], type="class")
rank <- as.data.frame(testmodel)
rank$teams <- testnames
names(rank) <- c("ranks", "teams")
sort_teams <- rank[with(rank, order(ranks)), ]
The KNN model is the third model that is being used in this project. This model takes the number of categories, in our case, the rank, and separates the data into groups. In our case there are around 351 groups. The data is then placed into their groups and their stats act like a coordinate (in two dimensions x and y). The new data is then brought in and the euclidean distance is calculated between the new data and all the points that were in the training set. The closest point to the new point is how the category gets determined.
suppressWarnings(suppressMessages(library(class)))
model <- knn(traindata[,-1], testdata[,-1], traindata[,1], k = 3 )
rank <- as.data.frame(model)
rank$team <- testnames
names(rank) <- c("ranks", "teams")
sort_teams_knn <- rank[with(rank, order(ranks)), ]
These are the results that were obtained by the machine learning algorithms. It shows the break down of the top ten teams by the model that predicted them. It then shows the ranking of the teams with the three models combined. This is calculated by the 1st place team getting 10 pts and the last place team getting 1. This is repeated for all of the models. The points are totaled up and then the teams are ranked by the total points. The last part comapres the predictied results with the actual results. It comapares where the top 8 teams finished in the actual tournament to where the models predicted they would finish.
Rank | KNN | Model Tree | Decision Tree |
---|---|---|---|
1 | Villanova | Michigan St. | Villanova |
2 | Kansas | Villanova | North Carolina |
3 | SMU | North Carolina | Indiana |
4 | Syracuse | Purdue | Kentucky |
5 | Miami (FL.) | Kansas | Virgina |
6 | Gonzaga | Wichita St. | Michigan St. |
7 | Arizona | Virgina | Louisville |
8 | Oklahoma | Indiana | UCONN |
9 | BYU | Gonzaga | Kansas |
10 | Yale | St. Mary’s (CA) | Xavier |
Rank | Teams | Points |
---|---|---|
1 | Villanova | 29 |
2 | North Carolina | 17 |
3 | Kansas | 17 |
4 | Michigan St. | 15 |
5 | Indiana | 11 |
6 | Virgina | 10 |
7 | SMU | 8 |
8 | Syracuse | 7 |
Tournament Place | Actual Teams | Predicted |
---|---|---|
Champ | Villanova | Villanova |
Finals | North Carolina | North Carolina |
Final 4 | Syracuse | Kansas |
Final 4 | Oklahoma | Michigan St |
Elite 8 | Virgina | Virgina |
Elite 8 | Oregon | Indiana |
Elite 8 | Kansas | SMU |
Elite 8 | Notre Dame | Syracuse |
The following section will cover the process of modeling the offensive statistics to predict the top 8 teams. The models used will be KNN, SVM, and Random Forest. As a precursor to modeling, the offensive statistics are retrieved from the cloud MySQL database.
bbmetricDB <- DBconnect()
stats <- dbGetQuery(bbmetricDB,
"
select
team_name, season_start_year,
points_per_game, average_scoring_margin, offensive_efficiency,
floor_percentage, overtime_points_per_game, average_1st_half_margin,
average_2nd_half_margin, average_overtime_margin, points_from_2_pointers,
points_from_3_pointers, percent_of_points_from_2_pointers,
percent_of_points_from_3_pointers,
percent_of_points_from_free_throws, shooting_pct, effective_field_goal_pct,
three_point_pct, two_point_pct, free_throw_pct,
true_shooting_percentage, field_goals_made_per_game, field_goals_attempted_per_game,
three_pointers_made_per_game, three_pointers_attempted_per_game,
free_throws_made_per_game, free_throws_attempted_per_game, three_point_rate,
two_point_rate, fta_per_fga, ftm_per_100_possessions, free_throw_rate,
non_blocked_2_pt_pct, offensive_rebounds_per_game, offensive_rebounding_pct,
assists_per_game, turnovers_per_game, turnovers_per_possession,
assist_per_turnover_ratio, assists_per_fgm, assists_per_possession,
turnover_pct,games_played, possessions_per_game,
extra_chances_per_game, effective_possession_ratio, win_pct_all_games,
ncaa_round_64, ncaa_round_32, ncaa_sweet_16,
ncaa_elite_8, ncaa_final_4, ncaa_championship_game,
ncaa_champion,
ncaa_round_64 +
ncaa_round_32 * 2 +
ncaa_sweet_16 * 4 +
ncaa_elite_8 * 8 +
ncaa_final_4 * 16 +
ncaa_championship_game * 32 +
ncaa_champion * 64 as weighted_rank
from team_stats")
The following offensive statistics will be used to to predict the top 8 teams of the 2016 NCAA tournament. Not all statistics will be important but are included from comprehesiveness. The model will determine the importance of each variable during the training process.
Variable | Variable | Variable |
---|---|---|
team_name | true_shooting_percentage | assists_per_possession |
season_start_year | field_goals_made_per_game | turnover_pct |
points_per_game | field_goals_attempted_per_game | games_played |
average_scoring_margin | three_pointers_made_per_game | possessions_per_game |
offensive_efficiency | three_pointers_attempted_per_game | extra_chances_per_game |
floor_percentage | free_throws_made_per_game | effective_possession_ratio |
overtime_points_per_game | free_throws_attempted_per_game | win_pct_all_games |
average_1st_half_margin | three_point_rate | ncaa_round_64 |
average_2nd_half_margin | two_point_rate | ncaa_round_32 |
average_overtime_margin | fta_per_fga | ncaa_sweet_16 |
points_from_2_pointers | ftm_per_100_possessions | ncaa_elite_8 |
points_from_3_pointers | free_throw_rate | ncaa_final_4 |
percent_of_points_from_2_pointers | non_blocked_2_pt_pct | ncaa_championship_game |
percent_of_points_from_3_pointers | offensive_rebounds_per_game | ncaa_champion |
percent_of_points_from_free_throws | offensive_rebounding_pct | weighted_rank |
shooting_pct | assists_per_game | |
effective_field_goal_pct | turnovers_per_game | |
three_point_pct | turnovers_per_possession | |
two_point_pct | assist_per_turnover_ratio | |
free_throw_pct | assists_per_fgm |
The predictor variable will be weighted_rank
, which is calculated based on a team’s advancement in the NCAA tournament.
Bracket | Points Assigned |
---|---|
Round of 64 | 1 |
Round of 32 | 2 |
Sweet 16 | 4 |
Elite 8 | 8 |
Final 4 | 16 |
Championship Game | 32 |
Champion | 64 |
Create a offensive training dataset using the team statistics for years prior to 2015.
# create a training dataset based on all years prior to 2015
stats_train <- subset(stats, season_start_year < 2015) # take all years prior to 2015
# store the team, year, ncaa_tournament indicators, and the weigted bracket rank
stats_train_results <- select(stats_train, team_name, season_start_year, ncaa_round_64:weighted_rank)
Create the test dataset using the team statistics for the 2015 NCAA basketball season.
# create a training dataset based on all years prior to 2015
stats_test <- filter(stats, season_start_year == 2015, ncaa_round_64 == 1)
# store the team, year, ncaa_tournament indicators, and the weigted bracket rank
stats_test_results <- select(stats_test, team_name, season_start_year, ncaa_round_64:weighted_rank)
Finally, remove the columns from the training dataset which will not be used in models
stats_train <- select(stats_train, -(team_name:season_start_year), -(ncaa_round_32:ncaa_champion))
str(stats_train)
## 'data.frame': 3124 obs. of 47 variables:
## $ points_per_game : num 85.7 74.5 78.6 76.9 72.1 70.8 74.5 73.5 74.3 70.4 ...
## $ average_scoring_margin : num 17.1 12.5 9.2 8.4 9 6.1 5.5 8 4.5 7.8 ...
## $ offensive_efficiency : num 1.15 1.13 1.1 1.08 1.11 ...
## $ floor_percentage : num 56.4 53.3 53.3 51.2 49.5 52.7 51 51.5 53.9 50.4 ...
## $ overtime_points_per_game : num 0 0 0 0 0 0 0 0 0 0 ...
## $ average_1st_half_margin : num 8.3 7.3 4.9 4.9 7.3 1.7 4 3.8 3 6.3 ...
## $ average_2nd_half_margin : num 9.1 5 4.3 3.4 2.4 4.2 1.3 4.2 1.6 1.2 ...
## $ average_overtime_margin : num 0 0 0 0 0 0 0 0 0 0 ...
## $ points_from_2_pointers : num 51.2 38.1 45.9 38.6 30.3 42.1 39.9 35.9 39.2 39.5 ...
## $ points_from_3_pointers : num 17 22.1 18.8 23.6 30.6 14.2 20.3 22.1 17.8 14.2 ...
## $ percent_of_points_from_2_pointers : num 59.7 51.1 58.4 50.2 42 59.5 53.6 48.9 52.7 56.2 ...
## $ percent_of_points_from_3_pointers : num 19.8 29.6 24 30.6 42.4 20 27.3 30.1 24 20.2 ...
## $ percent_of_points_from_free_throws: num 20.5 19.3 17.7 19.1 15.5 20.3 19.1 21 23.4 23.6 ...
## $ shooting_pct : num 49.9 47.2 49 45.9 45.5 46.7 48.1 46.8 47.8 42.2 ...
## $ effective_field_goal_pct : num 54.4 53.8 54.2 52.6 54.7 50.9 54.2 53.6 53.3 46.3 ...
## $ three_point_pct : num 35.8 36.1 37.2 35.5 37.4 36.4 38 39.2 34.9 32.1 ...
## $ two_point_pct : num 54.6 53.6 53.6 52.2 53.4 49.8 52.8 50.9 53.8 45.7 ...
## $ free_throw_pct : num 71.2 70.3 68.1 66.5 71.3 66.2 75.8 69.8 72.4 62.8 ...
## $ true_shooting_percentage : num 115 114 113 110 114 ...
## $ field_goals_made_per_game : num 31.2 26.4 29.2 27.2 25.3 25.8 26.7 25.3 25.5 24.5 ...
## $ field_goals_attempted_per_game : num 62.6 55.8 59.7 59.2 55.7 55.3 55.6 54.1 53.4 58 ...
## $ three_pointers_made_per_game : num 5.7 7.4 6.3 7.9 10.2 4.7 6.8 7.4 5.9 4.7 ...
## $ three_pointers_attempted_per_game : num 15.8 20.4 16.9 22.1 27.3 13 17.8 18.8 17 14.8 ...
## $ free_throws_made_per_game : num 17.6 14.4 13.9 14.7 11.2 14.4 14.2 15.4 17.4 16.6 ...
## $ free_throws_attempted_per_game : num 24.7 20.4 20.4 22.1 15.7 21.8 18.8 22.1 24 26.4 ...
## $ three_point_rate : num 25.2 36.5 28.3 37.4 49 23.5 32 34.8 31.9 25.5 ...
## $ two_point_rate : num 74.8 63.5 71.7 62.6 51 76.5 68 65.2 68.1 74.5 ...
## $ fta_per_fga : num 0.395 0.366 0.342 0.374 0.282 0.394 0.337 0.408 0.449 0.455 ...
## $ ftm_per_100_possessions : num 23.7 21.7 19.5 20.6 17.2 ...
## $ free_throw_rate : num 28.2 26.5 24.1 26.4 21.1 28.4 23.7 28.2 30.5 30.8 ...
## $ non_blocked_2_pt_pct : num 59.8 57.7 57.7 58.6 58.3 53.2 57.9 57.6 59.5 50.6 ...
## $ offensive_rebounds_per_game : num 13.4 10.9 13.5 12.4 9.4 10.3 10.3 10.4 11.7 15.5 ...
## $ offensive_rebounding_pct : num 39.2 34.1 41.1 35.2 29.4 32.2 34.7 32.6 38 41.5 ...
## $ assists_per_game : num 18.3 14.9 15.3 14.6 17.4 12.9 12.4 16.2 14.6 13 ...
## $ turnovers_per_game : num 13.8 11.3 15.7 14 11.3 10.8 14.6 13.2 13.7 14.9 ...
## $ turnovers_per_possession : num 18.5 17.1 22.1 19.6 17.3 16.3 21.2 19.4 20.5 21.3 ...
## $ assist_per_turnover_ratio : num 1.328 1.325 0.978 1.045 1.546 ...
## $ assists_per_fgm : num 0.586 0.567 0.525 0.538 0.687 0.499 0.463 0.639 0.574 0.529 ...
## $ assists_per_possession : num 0.246 0.226 0.216 0.205 0.267 0.194 0.18 0.239 0.219 0.185 ...
## $ turnover_pct : num 15.7 14.7 18.5 16.7 15.1 14.1 18.4 16.8 17.5 17.4 ...
## $ games_played : num 38 39 32 35 35 33 35 32 33 31 ...
## $ possessions_per_game : num 74.4 66.1 71.1 71.3 65.1 66.3 68.9 67.7 66.8 70.1 ...
## $ extra_chances_per_game : num 4.7 2.6 4.2 0.2 2.8 3.2 0.5 3.2 -2.9 3.3 ...
## $ effective_possession_ratio : num 0.995 0.994 0.969 0.977 0.971 ...
## $ win_pct_all_games : num 0.816 0.897 0.625 0.6 0.743 0.636 0.629 0.656 0.636 0.548 ...
## $ ncaa_round_64 : int 1 1 1 0 0 1 0 0 1 0 ...
## $ weighted_rank : num 15 63 1 0 0 3 0 0 3 0 ...
set.seed(600)
ctrl <- trainControl(method="repeatedcv", repeats = 5, number = 10)
knnFit <- train(weighted_rank ~ .,
data = stats_train,
method = "knn",
trControl = ctrl,
preProcess = c("center","scale", "pca"),
tuneLength = 20)
knnFit
## k-Nearest Neighbors
##
## 3124 samples
## 46 predictor
##
## Pre-processing: centered (46), scaled (46), principal component
## signal extraction (46)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 2812, 2811, 2811, 2812, 2811, 2811, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared
## 5 7.390945 0.1878696
## 7 7.005912 0.2339936
## 9 6.773718 0.2798062
## 11 6.660547 0.3043377
## 13 6.601636 0.3186818
## 15 6.510144 0.3424952
## 17 6.507288 0.3440947
## 19 6.497940 0.3507688
## 21 6.496347 0.3507279
## 23 6.506862 0.3500910
## 25 6.506733 0.3509608
## 27 6.497975 0.3546472
## 29 6.492643 0.3573096
## 31 6.514821 0.3530110
## 33 6.515490 0.3518267
## 35 6.533359 0.3459206
## 37 6.535085 0.3462279
## 39 6.541223 0.3464458
## 41 6.550911 0.3457281
## 43 6.551048 0.3469615
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 29.
Let’s see which variables were determined to be the most important in the KNN model:
dotPlot(varImp(knnFit), main = "KNN Model - Most Relevant Variables")
set.seed(600)
rfFit <-train(weighted_rank ~.,
data=stats_train,
method="rf",
trControl=trainControl(method="cv",number=5),
prox=TRUE, importance = TRUE,
allowParallel=TRUE)
rfFit
## Random Forest
##
## 3124 samples
## 46 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 2499, 2499, 2499, 2500, 2499
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared
## 2 6.528543 0.3757743
## 24 6.098787 0.4619901
## 46 6.273775 0.4454502
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 24.
print(rfFit$finalModel)
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry, importance = TRUE, proximity = TRUE, allowParallel = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 24
##
## Mean of squared residuals: 36.51865
## % Var explained: 44.67
dotPlot(varImp(rfFit), main = "Random Forest Model - Most Relevant Variables")
set.seed(600)
SVMFit <- train(weighted_rank ~., data = stats_train, method="svmRadial", trControl = ctrl)
SVMFit
## Support Vector Machines with Radial Basis Function Kernel
##
## 3124 samples
## 46 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 2812, 2811, 2811, 2812, 2811, 2811, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared
## 0.25 6.893490 0.3641109
## 0.50 6.741980 0.3966425
## 1.00 6.587426 0.4223910
##
## Tuning parameter 'sigma' was held constant at a value of 0.01659096
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01659096 and C = 1.
dotPlot(varImp(SVMFit), main = "SVM Model - Most Relevant Variables")
To predict the top 8 teams in the NCAA tournament using the KNN, Random Forest, and SVM models, provide the test dataset to the three models.
AllModels <- list(knn = knnFit,
rf = rfFit,
svm = SVMFit)
AllModelPred <- predict(AllModels, newdata = stats_test)
stats_test_results <- cbind(stats_test_results, as.data.frame(AllModelPred))
Compare and store the models’ predictions against the actual winners:
# Top 8 teams based on kNN
knn_top8 <- stats_test_results %>%
select(team_name, knn) %>%
arrange(desc(knn)) %>%
slice(1:8) %>%
mutate(rank = row_number())
# Top 8 teams based on SVM
svm_top8 <- stats_test_results %>%
select(team_name, svm) %>%
arrange(desc(svm)) %>%
slice(1:8) %>%
mutate(rank = row_number())
# Top 8 teams based on Random Forest
rf_top8 <- stats_test_results %>%
select(team_name, rf) %>%
arrange(desc(rf)) %>%
slice(1:8) %>%
mutate(rank = row_number())
top8.disp <-
knn_top8 %>%
inner_join(svm_top8, by="rank") %>%
inner_join(rf_top8, by="rank") %>%
select(rank, team_name.x, team_name.y, team_name) %>%
rename(Rank = rank, KNN = team_name.x, SVM = team_name.y, RF = team_name)
These are the results that were obtained by the machine learning algorithms - kNN, SVM, and Random Forest.
Rank | KNN | SVM | RF |
---|---|---|---|
1 | North Carolina | North Carolina | Villanova |
2 | Kentucky | Kentucky | North Carolina |
3 | West Virginia | Oregon | Michigan St. |
4 | Stephen F. Austin | Kansas | Oklahoma |
5 | Gonzaga | Villanova | Kansas |
6 | Michigan St. | West Virginia | Virginia |
7 | Miami (FL) | Notre Dame | Oregon |
8 | Kansas | Virginia | Texas A&M |
Blending the rankings from each model:
top8 <- rbindlist(list(knn_top8, svm_top8, rf_top8))
top8$wt <- abs(top8$rank - 8 ) + 1
top8.blended <-
top8 %>%
group_by(team_name) %>%
summarise(points = sum(wt)) %>%
arrange(desc(points)) %>%
slice(1:8) %>%
mutate(rank = row_number(desc(points))) %>%
select(rank, team_name, points)
kable(top8.blended)
rank | team_name | points |
---|---|---|
1 | North Carolina | 23 |
2 | Kentucky | 14 |
3 | Villanova | 12 |
4 | Kansas | 10 |
5 | West Virginia | 9 |
6 | Michigan St. | 9 |
7 | Oregon | 8 |
8 | Stephen F. Austin | 5 |
Note. The blended results identified four teams correctly in the Elite 8 round. However, it did not place any of them in rank order correctly. In comparison, the Random Forest model identified 6 teams in the Elite 8 and correctly placed three teams in rank order. The Random Forest model will be selected as the best model from an offensive perspective.
Tournament Place | Actual Teams | Defensive Prediction | Offensive Prediction (Random Forest) |
---|---|---|---|
Champ | Villanova | Villanova | Villanova |
Finals | North Carolina | North Carolina | North Carolina |
Final 4 | Syracuse | Kansas | Michigan St. |
Final 4 | Oklahoma | Michigan St | Oklahoma |
Elite 8 | Virginia | Virgina | Kansas |
Elite 8 | Oregon | Indiana | Virginia |
Elite 8 | Kansas | SMU | Oregon |
Elite 8 | Notre Dame | Syracuse | Texas A&M |
Overall, the results of the models are encouraging. By design, this project separately models offensive statistics from defensive statistics. Even with this limitation, the models were able to correctly identify the championship game and the overall champion. Additionally, the defensive model predicted Virgina making it to the Elite 8. It did correctly predict Syracuse and Kansas as the top 8 teams; it just placed them in the wrong rounds. Defensive prediction results with three exactly correct and another 2 partially correct out of 8 is a good indicator.
Within the offensive models, the Random Forest model proves to be the most accurate of the three models used. This model correctly identified 6 teams among the Elite 8 with Villanova, North Carolina, and Oklahoma being placed correctly in rank order.
It could be that the teams in the top 8 that were not predicted based on the defensive model either had a huge offensive presence and/or were ranked very low on defense. These factors would keep them lower on this list and would not be predicted properly. A similar consideration would apply to the offensive models from a defensive perspective.
Combining the offensive and defensive statistics together would be the next step in the modeling process for this topic.