Import the necessary packages. This analysis includes Logistic Regression and Random Forest modeling.
library(tidyverse)
library(caret) ### confusion matrix
library(gam) ### Generalized Additive Models. our Logistic Regression
library(lubridate) ##Format and filter date
library(randomForest)
library(e1071)
library(ggcorrplot)
library(car)
Import the data.
ufc <- read.csv("data.csv")
For this project, I will build Logistic Regression, Random Forest, and SVM classification models to predict the winner of Red vs. Blue corner fights in the UFC using various fight statistics. In the UFC, the Red Corner fighter is typically the higher-ranked fighter on the roster. Red Corner fighters typically have more UFC fights and a higher rank because of their success in the promotion. The Blue Corner fighter is typically the challenger, often has fewer fights than the Red Corner fighter, and is considered the underdog. Matchmaking is decided by UFC executives and athletic commissions, and fights are often made based on a fighter’s popularity. In this article, UFC president Dana White comments on the matchmaking process: (https://mmajunkie.usatoday.com/2021/01/ufc-news-dana-white-discusses-matchmaking-strategy). For these reasons, the creation of fights in the UFC can be biased. Still, data can help us discern which skills make a fighter likely to win and which fighters will win based on their abilities and demographics alone. At the end of this analysis we conclude that it is possible to build an effective predictive model based on the results.
This dataset (data.csv) comes from the Kaggle user RAJEEV WARRIER. Link here: https://www.kaggle.com/datasets/rajeevw/ufcdata
This is a data set of every UFC fight in the history of the organisation until March 2021. Every row contains information about both fighters, fight details and the winner. The data was scraped from ufc stats website.
Each row is a compilation of both fighters’ stats. Fighters are represented by ‘red’ and ‘blue’ (for red and blue corner). For instance, the red fighter has the compiled average stats of all their fights except the current one. The stats include damage done by the red fighter to the opponent and damage done by the opponent to the fighter (represented by ‘opp’ in the columns) across all the fights this particular red fighter has had, except this one, as it has not occurred yet (in the data). The same information exists for the blue fighter. The target variable is ‘Winner’, which is the only column that tells you what happened in that fight.
Take, for example, UFC fighter Max Holloway’s fight record. Max Holloway is known for his boxing in the UFC. If we look at his Average Significant Strikes Landed (in the Red or Blue column, depending on his corner color), we see that his compiled average rises throughout most of his career. Going into his fight against Brian Ortega, Max had an avg_SIG_STR_landed of 134.28. During that fight, Max landed a high 290 Significant Strikes against Ortega. Those 290 strikes do not appear in that row of the dataset, but they were folded into Max’s average, which is why it rose to 212.14 going into his fight against Dustin Poirier.
ufc_max_holloway <- ufc %>% filter_all(any_vars(grepl("Max Holloway", .)))
max_holloway <- ufc_max_holloway[, c('R_fighter', 'B_fighter', 'R_avg_SIG_STR_landed', 'B_avg_SIG_STR_landed', 'Winner')]
max_holloway
## R_fighter B_fighter R_avg_SIG_STR_landed
## 1 Max Holloway Calvin Kattar 125.19643
## 2 Alexander Volkanovski Max Holloway 119.66406
## 3 Max Holloway Alexander Volkanovski 162.78574
## 4 Max Holloway Frankie Edgar 196.57148
## 5 Max Holloway Dustin Poirier 212.14296
## 6 Max Holloway Brian Ortega 134.28591
## 7 Max Holloway Jose Aldo 94.57182
## 8 Jose Aldo Max Holloway 59.71484
## 9 Max Holloway Anthony Pettis 76.28729
## 10 Max Holloway Ricardo Lamas 49.57458
## 11 Max Holloway Jeremy Stephens 42.14917
## 12 Max Holloway Charles Oliveira 70.29834
## 13 Cub Swanson Max Holloway 47.70312
## 14 Max Holloway Cole Miller 60.19336
## 15 Akira Corassani Max Holloway 23.87500
## 16 Max Holloway Clay Collard 54.77344
## 17 Max Holloway Andre Fili 62.54688
## 18 Max Holloway Will Chope 51.09375
## 19 Conor McGregor Max Holloway 21.00000
## 20 Dennis Bermudez Max Holloway 71.75000
## 21 Leonard Garcia Max Holloway 32.89062
## 22 Justin Lawrence Max Holloway 54.00000
## 23 Max Holloway Pat Schilling 11.00000
## 24 Dustin Poirier Max Holloway 47.00000
## B_avg_SIG_STR_landed Winner
## 1 79.51562 Red
## 2 148.39287 Red
## 3 82.32812 Blue
## 4 51.25104 Red
## 5 85.47866 Blue
## 6 32.09375 Red
## 7 57.35742 Red
## 8 85.14365 Blue
## 9 50.19727 Red
## 10 36.47266 Red
## 11 42.82710 Red
## 12 48.88184 Red
## 13 58.59668 Blue
## 14 34.56369 Red
## 15 89.38672 Blue
## 16 NA Red
## 17 55.00000 Red
## 18 NA Red
## 19 79.18750 Red
## 20 83.37500 Red
## 21 46.75000 Blue
## 22 64.50000 Blue
## 23 4.00000 Red
## 24 NA Red
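The Ortega example above can be checked with a line of arithmetic. The sketch below assumes, based only on the two rows quoted in the text, that each new average is the midpoint of the previous average and the latest fight's total. The dataset description does not state the formula, so the update rule and the helper name `update_avg` are assumptions for illustration.

```r
# Hypothetical helper: ASSUMED update rule (midpoint of the old average
# and the newest fight's total); not confirmed by the dataset's author.
update_avg <- function(prev_avg, fight_total) {
  (prev_avg + fight_total) / 2
}

avg_before_ortega <- 134.28591  # R_avg_SIG_STR_landed in the Ortega row
landed_vs_ortega  <- 290        # strikes landed in that fight, per the text

avg_before_poirier <- update_avg(avg_before_ortega, landed_vs_ortega)
print(round(avg_before_poirier, 2))
```

Under this assumption the result matches the 212.14 shown in the Poirier row, which is consistent with recent fights weighing more heavily than older ones in the compiled averages.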
Data Dictionary
This dataset (data.csv) is the partially processed file. All feature engineering has been included and every row is a compilation of info about each fighter up until that fight. The data has not been one hot encoded or processed for missing data. Further processing can be done.
*Note: Columns have already had percentages converted to decimals.
R_ and B_ prefix - signifies red and blue corner fighter stats respectively
opp prefix containing columns is the average damage done by the opponent on the fighter
KD - is number of knockdowns
SIG_STR is no. of significant strikes ‘landed of attempted’
SIG_STR_pct is significant strikes percentage
TOTAL_STR is total strikes ‘landed of attempted’
TD is no. of take downs
TD_pct is take down percentages
SUB_ATT is no. of submission attempts
PASS is no. times the guard was passed?
REV is the no. of Reversals landed
HEAD is no. of significant strikes to the head ‘landed of attempted’
BODY is no. of significant strikes to the body ‘landed of attempted’
CLINCH is no. of significant strikes in the clinch ‘landed of attempted’
GROUND is no. of significant strikes on the ground ‘landed of attempted’
win_by is method of win
last_round is last round of the fight (ex. if it was a KO in 1st, then this will be 1)
last_round_time is when the fight ended in the last round
Format is the format of the fight (3 rounds, 5 rounds etc.)
Referee is the name of the Ref
date is the date of the fight
location is the location in which the event took place
Fight_type is which weight class and whether it’s a title bout or not
Winner is the winner of the fight
Stance is the stance of the fighter (orthodox, southpaw, etc.)
Height_cms is the height in centimeter
Reach_cms is the reach of the fighter (arm span) in centimeter
Weight_lbs is the weight of the fighter in pounds (lbs)
age is the age of the fighter
title_bout Boolean value of whether it is title fight or not
weight_class is which weight class the fight is in (Bantamweight, heavyweight, Women’s flyweight, etc.)
no_of_rounds is the number of rounds the fight was scheduled for
current_lose_streak is the count of current consecutive losses of the fighter
current_win_streak is the count of current consecutive wins of the fighter
draw is the number of draws in the fighter’s ufc career
wins is the number of wins in the fighter’s ufc career
losses is the number of losses in the fighter’s ufc career
total_rounds_fought is the average of total rounds fought by the fighter
total_time_fought(seconds) is the count of total time spent fighting in seconds
total_title_bouts is the total number of title bouts taken part in by the fighter
win_by_Decision_Majority is the number of wins by majority judges decision in the fighter’s ufc career
win_by_Decision_Split is the number of wins by split judges decision in the fighter’s ufc career
win_by_Decision_Unanimous is the number of wins by unanimous judges decision in the fighter’s ufc career
win_by_KO/TKO is the number of wins by knockout in the fighter’s ufc career
win_by_Submission is the number of wins by submission in the fighter’s ufc career
win_by_TKO_Doctor_Stoppage is the number of wins by doctor stoppage in the fighter’s ufc career
Upper weight limits by division:
Heavyweight: 265 lb (120.2 kg)
Light Heavyweight: 205 lb (93.0 kg)
Middleweight: 185 lb (83.9 kg)
Welterweight: 170 lb (77.1 kg)
Lightweight: 155 lb (70.3 kg)
Featherweight: 145 lb (65.8 kg)
Bantamweight: 135 lb (61.2 kg)
Flyweight: 125 lb (56.7 kg)
Strawweight: 115 lb (52.2 kg)
It is mandatory that neither fighter weighs more than the upper limit of their respective division at the weigh-ins. Note: fighters cut weight to make the division they wish to fight in. Weigh-ins are held the day before the fight. Opponents always compete in the same weight class.
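The kilogram figures for each division can be checked directly from the pound limits. This is a small sketch using the exact definition of the pound (0.45359237 kg); the vector name `limits_lb` is only illustrative.

```r
# Upper weight limits in pounds, as listed in the data dictionary.
limits_lb <- c(Heavyweight = 265, LightHeavyweight = 205, Middleweight = 185,
               Welterweight = 170, Lightweight = 155, Featherweight = 145,
               Bantamweight = 135, Flyweight = 125, Strawweight = 115)

# 1 lb is defined as exactly 0.45359237 kg.
limits_kg <- round(limits_lb * 0.45359237, 1)
print(limits_kg)
```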
Here we observe the names and data types of our columns.
str(ufc)
## 'data.frame': 6012 obs. of 144 variables:
## $ R_fighter : chr "Adrian Yanez" "Trevin Giles" "Tai Tuivasa" "Cheyanne Buys" ...
## $ B_fighter : chr "Gustavo Lopez" "Roman Dolidze" "Harry Hunsucker" "Montserrat Conejo" ...
## $ Referee : chr "Chris Tognoni" "Herb Dean" "Herb Dean" "Mark Smith" ...
## $ date : chr "2021-03-20" "2021-03-20" "2021-03-20" "2021-03-20" ...
## $ location : chr "Las Vegas, Nevada, USA" "Las Vegas, Nevada, USA" "Las Vegas, Nevada, USA" "Las Vegas, Nevada, USA" ...
## $ Winner : chr "Red" "Red" "Red" "Blue" ...
## $ title_bout : chr "False" "False" "False" "False" ...
## $ weight_class : chr "Bantamweight" "Middleweight" "Heavyweight" "WomenStrawweight" ...
## $ B_avg_KD : num 0 0.5 NA NA 0.125 ...
## $ B_avg_opp_KD : num 0 0 NA NA 0 0 0.125 0 NA NA ...
## $ B_avg_SIG_STR_pct : num 0.42 0.66 NA NA 0.536 ...
## $ B_avg_opp_SIG_STR_pct : num 0.495 0.305 NA NA 0.579 ...
## $ B_avg_TD_pct : num 0.33 0.3 NA NA 0.185 ...
## $ B_avg_opp_TD_pct : num 0.36 0.5 NA NA 0.166 ...
## $ B_avg_SUB_ATT : num 0.5 1.5 NA NA 0.125 ...
## $ B_avg_opp_SUB_ATT : num 1 0 NA NA 0.188 ...
## $ B_avg_REV : num 0 0 NA NA 0.25 ...
## $ B_avg_opp_REV : num 0 0 NA NA 0 ...
## $ B_avg_SIG_STR_att : num 50 65.5 NA NA 109.2 ...
## $ B_avg_SIG_STR_landed : num 20 35 NA NA 57.9 ...
## $ B_avg_opp_SIG_STR_att : num 84 50 NA NA 50.6 ...
## $ B_avg_opp_SIG_STR_landed : num 45 16.5 NA NA 28.4 ...
## $ B_avg_TOTAL_STR_att : num 76.5 113.5 NA NA 170.4 ...
## $ B_avg_TOTAL_STR_landed : num 41 68.5 NA NA 105.6 ...
## $ B_avg_opp_TOTAL_STR_att : num 114 68.5 NA NA 74.4 ...
## $ B_avg_opp_TOTAL_STR_landed : num 64 29 NA NA 44.2 ...
## $ B_avg_TD_att : num 1.5 2.5 NA NA 5.38 ...
## $ B_avg_TD_landed : num 1 1.5 NA NA 1.5 ...
## $ B_avg_opp_TD_att : num 9 0.5 NA NA 2 ...
## $ B_avg_opp_TD_landed : num 6.5 0.5 NA NA 0.625 ...
## $ B_avg_HEAD_att : num 39.5 46 NA NA 77.4 ...
## $ B_avg_HEAD_landed : num 11 20 NA NA 31.4 ...
## $ B_avg_opp_HEAD_att : num 63 36 NA NA 41.6 ...
## $ B_avg_opp_HEAD_landed : num 27.5 7.5 NA NA 22.6 ...
## $ B_avg_BODY_att : num 7.5 12 NA NA 31.2 ...
## $ B_avg_BODY_landed : num 7 8 NA NA 26.2 ...
## $ B_avg_opp_BODY_att : num 12 8 NA NA 7.69 ...
## $ B_avg_opp_BODY_landed : num 9 3 NA NA 4.94 ...
## $ B_avg_LEG_att : num 3 7.5 NA NA 0.625 ...
## $ B_avg_LEG_landed : num 2 7 NA NA 0.375 ...
## $ B_avg_opp_LEG_att : num 9 6 NA NA 1.38 ...
## $ B_avg_opp_LEG_landed : num 8.5 6 NA NA 0.875 ...
## $ B_avg_DISTANCE_att : num 35 58 NA NA 33.6 ...
## $ B_avg_DISTANCE_landed : num 12.5 30 NA NA 11 ...
## $ B_avg_opp_DISTANCE_att : num 43.5 48 NA NA 32.1 ...
## $ B_avg_opp_DISTANCE_landed : num 17.5 15.5 NA NA 13.9 ...
## $ B_avg_CLINCH_att : num 10.5 0.5 NA NA 39.1 ...
## $ B_avg_CLINCH_landed : num 4.5 0.5 NA NA 28.8 ...
## $ B_avg_opp_CLINCH_att : num 4 0.5 NA NA 13.3 ...
## $ B_avg_opp_CLINCH_landed : num 3 0.5 NA NA 10.8 ...
## $ B_avg_GROUND_att : num 4.5 7 NA NA 36.6 ...
## $ B_avg_GROUND_landed : num 3 4.5 NA NA 18.1 ...
## $ B_avg_opp_GROUND_att : num 36.5 1.5 NA NA 5.25 ...
## $ B_avg_opp_GROUND_landed : num 24.5 0.5 NA NA 3.75 ...
## $ B_avg_CTRL_time.seconds. : num 34 220 NA NA 390 ...
## $ B_avg_opp_CTRL_time.seconds.: num 277.5 24.5 NA NA 156.3 ...
## $ B_total_time_fought.seconds.: num 532 578 NA NA 764 ...
## $ B_total_rounds_fought : int 4 4 0 0 11 10 28 23 0 0 ...
## $ B_total_title_bouts : int 0 0 0 0 1 0 0 0 0 0 ...
## $ B_current_win_streak : int 0 2 0 0 3 4 0 0 0 0 ...
## $ B_current_lose_streak : int 1 0 0 0 0 0 1 1 0 0 ...
## $ B_longest_win_streak : int 1 2 0 0 3 4 1 5 0 0 ...
## $ B_wins : int 1 2 0 0 4 4 4 8 0 0 ...
## $ B_losses : int 1 0 0 0 1 0 6 2 0 0 ...
## $ B_draw : int 0 0 0 0 0 0 0 0 0 0 ...
## $ B_win_by_Decision_Majority : int 0 0 0 0 0 0 1 0 0 0 ...
## $ B_win_by_Decision_Split : int 0 1 0 0 0 0 0 2 0 0 ...
## $ B_win_by_Decision_Unanimous : int 0 0 0 0 1 2 1 1 0 0 ...
## $ B_win_by_KO.TKO : int 0 1 0 0 2 0 2 3 0 0 ...
## $ B_win_by_Submission : int 1 0 0 0 1 2 0 2 0 0 ...
## $ B_win_by_TKO_Doctor_Stoppage: int 0 0 0 0 0 0 0 0 0 0 ...
## $ B_Stance : chr "Orthodox" "Orthodox" "Orthodox" "Southpaw" ...
## $ B_Height_cms : num 165 188 188 152 180 ...
## $ B_Reach_cms : num 170 193 190 155 183 ...
## $ B_Weight_lbs : num 135 205 241 115 135 145 170 185 135 125 ...
## $ R_avg_KD : num 1 1.031 0.547 NA 0 ...
## $ R_avg_opp_KD : num 0 0.0625 0.1875 NA 0.000977 ...
## $ R_avg_SIG_STR_pct : num 0.5 0.577 0.539 NA 0.403 ...
## $ R_avg_opp_SIG_STR_pct : num 0.46 0.381 0.599 NA 0.555 ...
## $ R_avg_TD_pct : num 0 0.406 0 NA 0.512 ...
## $ R_avg_opp_TD_pct : num 0 0.116 0.312 NA 0.629 ...
## $ R_avg_SUB_ATT : num 0 0.25 0 NA 0.231 ...
## $ R_avg_opp_SUB_ATT : num 0 1.1875 0.25 NA 0.0312 ...
## $ R_avg_REV : num 0 0.375 0 NA 0.0312 ...
## $ R_avg_opp_REV : num 0 0.25 0 NA 0.5 0.5 0.25 0 0 0.5 ...
## $ R_avg_SIG_STR_att : num 34 77.6 59.2 NA 109.3 ...
## $ R_avg_SIG_STR_landed : num 17 43.2 30.4 NA 44.4 ...
## $ R_avg_opp_SIG_STR_att : num 13 69.2 43.8 NA 148.8 ...
## $ R_avg_opp_SIG_STR_landed : num 6 27.6 24.8 NA 84.6 ...
## $ R_avg_TOTAL_STR_att : num 35 93.1 70.5 NA 137.2 ...
## $ R_avg_TOTAL_STR_landed : num 18 57.2 41.4 NA 70.2 ...
## $ R_avg_opp_TOTAL_STR_att : num 16 98.3 50.2 NA 172.5 ...
## $ R_avg_opp_TOTAL_STR_landed : num 9 52.5 30.9 NA 106.7 ...
## $ R_avg_TD_att : num 0 1.2812 0.0312 NA 2.2617 ...
## $ R_avg_TD_landed : num 0 0.781 0 NA 1.262 ...
## $ R_avg_opp_TD_att : num 3 4.69 2.84 NA 3.14 ...
## $ R_avg_opp_TD_landed : num 0 0.438 1.75 NA 1.771 ...
## $ R_avg_HEAD_att : num 32 71.1 42.5 NA 86.4 ...
## $ R_avg_HEAD_landed : num 15 38.1 16.8 NA 26 ...
## [list output truncated]
After observing the data types, we see that the majority of the features are numeric or integer, with a few character variables. Because we are using classification models, we will transform the ‘Winner’ column to numeric data later in this report.
First we remove all rows that contain null values. First-time fighters all have null values, as they have not yet fought in the UFC and do not have compiled fight stats. Removing all rows with nulls therefore removes every fight involving a first-time fighter, so our models will not predict winners for fights in which a fighter is making a UFC debut.
ufc = na.omit(ufc)
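Before dropping rows, it can help to count how many are affected. The sketch below uses a small made-up table (the fighter names and values are invented stand-ins, not rows from data.csv) to show how `complete.cases()` and `na.omit()` relate.

```r
# Toy stand-in for the fight table: a debut fighter has NA averages.
toy <- data.frame(
  R_fighter = c("A", "B", "C"),
  B_fighter = c("D", "E", "F"),
  R_avg_KD  = c(0.5, NA, 1.0),
  B_avg_KD  = c(0.0, 0.25, NA)
)

# Count rows with at least one missing value before dropping them.
n_incomplete <- sum(!complete.cases(toy))

# na.omit() keeps only the rows where every column is present.
toy_clean <- na.omit(toy)
cat("dropped", n_incomplete, "of", nrow(toy), "rows\n")
```

Running the same count on the full dataset would show how many fights are lost to debuts, which is worth knowing because it shrinks the training data.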
Use substring to extract the year from date and store the values as ‘year’.
ufc$year <- substr(ufc$date, 1, 4)
Eliminate all fights that end in a ‘Draw’.
ufc <- filter(ufc, Winner != 'Draw')
Here is where we create our dependent variable ‘win_dummy’. We will set Blue = 1 and Red = 0. This will be used in our classification models. This will also show us the proportion of Red vs Blue wins throughout the dataset. Below are a few insights into the distribution of win_dummy.
ufc$win_dummy = ifelse(ufc$Winner == 'Blue', 1,0)
The line graph below shows that Blue Wins were not recorded until 2010. Upon further research, before 2010 there were no red or blue corners. On the ufc stats website (http://ufcstats.com/statistics/events/completed), Red is the default value for whoever won their fight during this time period. When predicting Blue or Red being the winner, we must filter out all fights before 2010 so that we create a more balanced and accurate data set that reflects the UFC today. We also see Red wins more often than Blue since 2010.
# Multiple line plot
ggplot(ufc, aes(x = year, group=Winner)) +
geom_line(stat = "count", aes(color = Winner), size=1) +
scale_color_manual(values=c('Blue','Red')) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
Convert the data type of ‘date’ from Character to Date.
class(ufc$date)
## [1] "character"
ufc <- ufc %>% mutate(date = as.Date(date, format = "%Y-%m-%d"))
class(ufc$date)
## [1] "Date"
Below we observe the gap between Red vs Blue wins since 2010, expressed as percentage of wins for Red and Blue Corners, by year.
ufc_2010 <- ufc %>% filter(date >= "2010-01-01")
g <- ufc_2010 %>%
group_by(win_dummy, year) %>%
summarise(cnt = n()) %>%
group_by(year) %>%
mutate(freq = round(cnt / sum(cnt), 3))
## `summarise()` regrouping output by 'win_dummy' (override with `.groups` argument)
g <- as.data.frame(g)
g$win_dummy <- as.character(g$win_dummy)
g$freq <- as.numeric(g$freq)
g$year <- as.numeric(as.vector(g$year))
gg1 <- ggplot(data=g, aes(x=year, y=freq, group=win_dummy, fill=win_dummy)) +
geom_bar(stat="identity",
width = 0.5,
position=position_dodge())+
scale_fill_manual(values=c(
"red",
"darkblue")) +
theme(axis.line = element_line(linetype = "solid"),
panel.grid.minor = element_line(linetype = "blank"),
axis.text = element_text(colour = "gray18"),
panel.background = element_rect(fill = "gray64"),
plot.background = element_rect(fill = "aliceblue"))+labs(title = "Red vs Blue Win Percentage (Grouped by Year)",
x = "Year", y = "Percentage", fill = "Red/Blue")
gg1 + scale_x_continuous(breaks = c(2010, 2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021))
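The `summarise()` regrouping message in the output above can be avoided by stating the grouping explicitly. The following is a sketch on made-up data (the `toy` table is an assumption standing in for `ufc_2010`), using the `.groups` argument that the message itself points to.

```r
library(dplyr)

# Toy data standing in for ufc_2010.
toy <- data.frame(
  year      = c("2010", "2010", "2010", "2011", "2011"),
  win_dummy = c(0, 0, 1, 0, 1)
)

freq_by_year <- toy %>%
  group_by(year, win_dummy) %>%
  summarise(cnt = n(), .groups = "drop") %>%  # explicit .groups silences the message
  group_by(year) %>%
  mutate(freq = round(cnt / sum(cnt), 3)) %>%  # share of wins within each year
  ungroup()

print(freq_by_year)
```

Dropping the groups and then regrouping by year makes it clear that the frequencies are computed within each year rather than within each corner.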
Control time measures how long a fighter controls the opponent on the ground and does not allow the opponent to stand up.
From our graph below, we see that Red Fighter Average Control Time on the Ground (in seconds) has decreased steadily, even after 2010. We also see that Control Time was not recorded until 2000, although the original dataset includes fights back to 1997 and UFC fights go back to 1993. This suggests that fighters’ ground defense and wrestling have steadily improved over the years. We will use 2016-2021 data from here on.
R_avg_CTRL_time.seconds_history <- ufc %>%
group_by(year) %>%
summarise_at(vars(R_avg_CTRL_time.seconds.), list(name = mean))
R_avg_CTRL_time.seconds_history <- as.data.frame(R_avg_CTRL_time.seconds_history)
R_avg_CTRL_time.seconds_history$year <- as.numeric(R_avg_CTRL_time.seconds_history$year)
r_ctrl <- ggplot(R_avg_CTRL_time.seconds_history, aes(year,name, group=year)) +
geom_bar(stat='identity', fill='red')+labs(y = "Red Fighter Average Control Time on Ground (seconds)")
r_ctrl + scale_x_continuous(breaks = c(1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021)) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
Now we filter the dataset to fights from 2016 onward.
ufc_2 <- ufc %>% filter(date >= "2016-01-01")
Win counts for 2016-2021 (0 = Red, 1 = Blue):
table(ufc_2$win_dummy)
##
## 0 1
## 1068 820
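These counts translate directly into class proportions. The short calculation below uses only the two counts printed above; the variable names are illustrative.

```r
# Counts taken from table(ufc_2$win_dummy): 0 = Red, 1 = Blue.
wins <- c(Red = 1068, Blue = 820)

# Share of fights won by each corner.
share <- wins / sum(wins)
print(round(share, 3))
```

Red wins roughly 56.6% of these fights, so a model that always predicts Red would already be right about that often. Any classifier built on this subset should be judged against that baseline.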
The graph below shows the number of fights per weight class.
unique(ufc_2$weight_class)
## [1] "Bantamweight" "Middleweight" "WomenBantamweight"
## [4] "Lightweight" "Welterweight" "Flyweight"
## [7] "LightHeavyweight" "WomenStrawweight" "Featherweight"
## [10] "WomenFlyweight" "WomenFeatherweight" "Heavyweight"
## [13] "CatchWeight"
gg2 <- ggplot(data=ufc_2, aes(weight_class)) + geom_bar(fill = "purple")+labs(x = "Weight Class", y = "Number of Fights (2016-2021)")
gg2 + coord_flip()
R_avg_TD_att_wc <- ufc_2 %>%
group_by(weight_class) %>%
summarise_at(vars(R_avg_TD_att), list(name = mean))
R_avg_TD_att_wc <- as.data.frame(R_avg_TD_att_wc)
ggplot(data=R_avg_TD_att_wc, aes(x=weight_class,y=name)) + geom_bar(stat='identity', fill = "purple") + coord_flip()+labs(title = '2016 - 2021', x = "Weight Class",
y = "Red Fighter Average Take Downs Attempted (per fight)")
R_avg_SIG_STR_att_wc <- ufc_2 %>%
group_by(weight_class) %>%
summarise_at(vars(R_avg_SIG_STR_att), list(name = mean))
R_avg_SIG_STR_att_wc <- as.data.frame(R_avg_SIG_STR_att_wc)
ggplot(data=R_avg_SIG_STR_att_wc, aes(x=weight_class,y=name)) + geom_bar(stat='identity', fill = "purple") + coord_flip()+labs(y = "Red Average Significant Strikes Attempted (per fight)")+labs(title = "2016-2021")+labs(x = "Weight Class")
R_avg_SUB_ATT_wc <- ufc_2 %>%
group_by(weight_class) %>%
summarise_at(vars(R_avg_SUB_ATT), list(name = mean))
R_avg_SUB_ATT_wc <- as.data.frame(R_avg_SUB_ATT_wc)
ggplot(data=R_avg_SUB_ATT_wc, aes(x=weight_class,y=name)) + geom_bar(stat='identity', fill = "purple") + coord_flip()+labs(y = "Red Average Submissions Attempted (per fight)")+labs(title = "2016-2021", x = "Weight Class")
R_avg_win_by_KO_TKO <- ufc_2 %>%
group_by(weight_class) %>%
summarise_at(vars(R_win_by_KO.TKO), list(name = mean))
R_avg_win_by_KO_TKO <- as.data.frame(R_avg_win_by_KO_TKO)
ggplot(data=R_avg_win_by_KO_TKO, aes(x=weight_class,y=name)) + geom_bar(stat='identity', fill = "purple") + coord_flip()+labs(y = "Red Fighter Average Wins by KO/TKO")+labs(title = "2016-2021", x = "Weight Class")
We can conclude that the lower the weight class (meaning the less the fighter weighs in pounds), the more activity fighters display in their fights. The graphs show that Heavyweight fighters not only have the fewest fights since 2016, but also average fewer attempted strikes, takedowns, and submissions than the smaller UFC fighters. They do, however, average more knockouts than any other weight class. For this reason, some independent variables like ‘Blue Average Submissions Attempted’ or ‘Red Average Knock Downs’ are likely to matter more in some weight classes than in others. To better predict fight outcomes, models should likely be built around specific weight classes.
But for our initial model builds, we include all weight classes and use all of the same independent(predictive) variables. At the end of this report specific weight classes are tested.
We will leave the ‘win_dummy’ variable set as Blue = 1 and Red = 0 with Blue being the positive outcome. We know that Red typically wins more often as shown by the results above. We want to create a model that essentially predicts upsets of Blue winning.
We also observe that the win percentages for Red and Blue have moved closer together over time, meaning that Blue fighters now win more often than they used to. This could support the idea that the level of skill and competition among UFC fighters has increased over the years. Therefore we will use data from 2016-2021, as it is more representative of the level of competition that the UFC sees today. Also, women did not compete in the UFC until 2013.
We will now gather inference on which variables are significant influences on the outcome of Red or Blue Corner winning. Our dependent ‘y’ variable will be ‘win_dummy’ (Blue win = 1, Red win = 0). Our independent ‘x’ variables will be chosen based on significance and theoretical need.
ufc_cor <- select_if(ufc_2, is.numeric)
Below we see the correlation of each fight variable with the Winner variable (win_dummy). The closer a value is to -1 or 1, the stronger the correlation. A negative value does not indicate a weak correlation; only the magnitude matters. For example, |-0.75| > |0.65|. Age seems to have the highest correlation at first look.
cor(ufc_cor[135], ufc_cor[])
## Warning in cor(ufc_cor[135], ufc_cor[]): the standard deviation is zero
## B_avg_KD B_avg_opp_KD B_avg_SIG_STR_pct B_avg_opp_SIG_STR_pct
## win_dummy 0.00306321 -0.03899145 0.04204823 -0.03005528
## B_avg_TD_pct B_avg_opp_TD_pct B_avg_SUB_ATT B_avg_opp_SUB_ATT
## win_dummy 0.03714695 -0.05547428 0.03070611 -0.01733953
## B_avg_REV B_avg_opp_REV B_avg_SIG_STR_att B_avg_SIG_STR_landed
## win_dummy 0.01894169 0.03367675 0.01532451 0.02324385
## B_avg_opp_SIG_STR_att B_avg_opp_SIG_STR_landed B_avg_TOTAL_STR_att
## win_dummy -0.02031625 -0.03309539 0.02740114
## B_avg_TOTAL_STR_landed B_avg_opp_TOTAL_STR_att
## win_dummy 0.04181535 -0.01948659
## B_avg_opp_TOTAL_STR_landed B_avg_TD_att B_avg_TD_landed
## win_dummy -0.02774442 0.06196834 0.07321986
## B_avg_opp_TD_att B_avg_opp_TD_landed B_avg_HEAD_att B_avg_HEAD_landed
## win_dummy 0.04465765 0.0309859 0.01384461 0.02408938
## B_avg_opp_HEAD_att B_avg_opp_HEAD_landed B_avg_BODY_att
## win_dummy -0.02071813 -0.03864946 0.01479837
## B_avg_BODY_landed B_avg_opp_BODY_att B_avg_opp_BODY_landed
## win_dummy 0.01697846 -0.02043532 -0.02779698
## B_avg_LEG_att B_avg_LEG_landed B_avg_opp_LEG_att B_avg_opp_LEG_landed
## win_dummy 0.01013503 0.007056198 -0.0008558403 0.004390985
## B_avg_DISTANCE_att B_avg_DISTANCE_landed B_avg_opp_DISTANCE_att
## win_dummy 0.005365871 0.006427694 -0.006663574
## B_avg_opp_DISTANCE_landed B_avg_CLINCH_att B_avg_CLINCH_landed
## win_dummy -0.01041182 0.001296825 0.007211577
## B_avg_opp_CLINCH_att B_avg_opp_CLINCH_landed B_avg_GROUND_att
## win_dummy -0.04026424 -0.03782442 0.06171763
## B_avg_GROUND_landed B_avg_opp_GROUND_att B_avg_opp_GROUND_landed
## win_dummy 0.0587167 -0.05242162 -0.05475416
## B_avg_CTRL_time.seconds. B_avg_opp_CTRL_time.seconds.
## win_dummy 0.05019004 0.00205904
## B_total_time_fought.seconds. B_total_rounds_fought
## win_dummy 0.03791624 -0.03425858
## B_total_title_bouts B_current_win_streak B_current_lose_streak
## win_dummy -0.06098007 -0.007216505 -0.02305906
## B_longest_win_streak B_wins B_losses B_draw
## win_dummy -0.006702128 -0.01874144 -0.04857157 NA
## B_win_by_Decision_Majority B_win_by_Decision_Split
## win_dummy -0.01340058 -0.05034336
## B_win_by_Decision_Unanimous B_win_by_KO.TKO B_win_by_Submission
## win_dummy 0.007953964 -0.03091963 0.009940745
## B_win_by_TKO_Doctor_Stoppage B_Height_cms B_Reach_cms B_Weight_lbs
## win_dummy -0.01618828 0.02728662 0.03034889 0.02776254
## R_avg_KD R_avg_opp_KD R_avg_SIG_STR_pct R_avg_opp_SIG_STR_pct
## win_dummy -0.04096858 0.05978554 -0.04379517 0.09915966
## R_avg_TD_pct R_avg_opp_TD_pct R_avg_SUB_ATT R_avg_opp_SUB_ATT
## win_dummy -0.04921816 0.03230588 -0.05286117 -0.01384936
## R_avg_REV R_avg_opp_REV R_avg_SIG_STR_att R_avg_SIG_STR_landed
## win_dummy 0.01965227 -0.01394109 -0.03095346 -0.04196227
## R_avg_opp_SIG_STR_att R_avg_opp_SIG_STR_landed R_avg_TOTAL_STR_att
## win_dummy 0.0174484 0.04870163 -0.04976161
## R_avg_TOTAL_STR_landed R_avg_opp_TOTAL_STR_att
## win_dummy -0.06546941 0.01820164
## R_avg_opp_TOTAL_STR_landed R_avg_TD_att R_avg_TD_landed
## win_dummy 0.04136129 -0.07558121 -0.07763501
## R_avg_opp_TD_att R_avg_opp_TD_landed R_avg_HEAD_att R_avg_HEAD_landed
## win_dummy -0.006153334 0.02582334 -0.03283888 -0.04850527
## R_avg_opp_HEAD_att R_avg_opp_HEAD_landed R_avg_BODY_att
## win_dummy 0.01539803 0.04801341 0.001764856
## R_avg_BODY_landed R_avg_opp_BODY_att R_avg_opp_BODY_landed
## win_dummy -0.003982463 0.03025601 0.04656319
## R_avg_LEG_att R_avg_LEG_landed R_avg_opp_LEG_att R_avg_opp_LEG_landed
## win_dummy -0.02992376 -0.02717299 0.0004393277 0.01116835
## R_avg_DISTANCE_att R_avg_DISTANCE_landed R_avg_opp_DISTANCE_att
## win_dummy -0.01752002 -0.0245582 0.007482186
## R_avg_opp_DISTANCE_landed R_avg_CLINCH_att R_avg_CLINCH_landed
## win_dummy 0.03178595 0.008217354 0.01308849
## R_avg_opp_CLINCH_att R_avg_opp_CLINCH_landed R_avg_GROUND_att
## win_dummy 0.03657432 0.04085239 -0.08076469
## R_avg_GROUND_landed R_avg_opp_GROUND_att R_avg_opp_GROUND_landed
## win_dummy -0.07660279 0.03227853 0.03998505
## R_avg_CTRL_time.seconds. R_avg_opp_CTRL_time.seconds.
## win_dummy -0.08062083 0.05382894
## R_total_time_fought.seconds. R_total_rounds_fought
## win_dummy -0.02239872 0.0624352
## R_total_title_bouts R_current_win_streak R_current_lose_streak
## win_dummy -0.002108865 -0.0432342 0.04194568
## R_longest_win_streak R_wins R_losses R_draw
## win_dummy -0.02615285 0.03952394 0.1093061 NA
## R_win_by_Decision_Majority R_win_by_Decision_Split
## win_dummy 0.04471593 0.08173435
## R_win_by_Decision_Unanimous R_win_by_KO.TKO R_win_by_Submission
## win_dummy -0.003083416 0.04700467 -0.003545831
## R_win_by_TKO_Doctor_Stoppage R_Height_cms R_Reach_cms R_Weight_lbs
## win_dummy 0.04195498 -0.007027602 -0.02956736 0.01928033
## B_age R_age win_dummy
## win_dummy -0.1335294 0.141457 1
cr <- cor(ufc_cor[135], ufc_cor[c(1:20)])
We can also use a color scale to see which values are closer to one in magnitude. The darker the color, the stronger the correlation, regardless of the color itself.
ggcorrplot(cr)
We can begin to make some inferences. None of the variables is strongly correlated with the outcome, although some correlations are stronger than others. Any variable associated with ‘KD’ has a very low correlation. We can also run this test per weight class to see whether some variables have a higher correlation within a class than when all weight classes are combined.
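The correlation step can also be written so that it selects the target by name and skips constant columns, which are what trigger the "standard deviation is zero" warning and the NA entries for the draw columns. This is a sketch on simulated data; the `toy` table and its column names are assumptions, not columns from the dataset.

```r
set.seed(1)
# Toy numeric table; `flat` mimics a constant column (zero variance).
toy <- data.frame(
  win_dummy = rep(c(0, 1), 25),
  age_diff  = rnorm(50),
  reach     = rnorm(50),
  flat      = 0
)

# Drop zero-variance columns so cor() does not warn or return NA.
keep <- vapply(toy, function(col) sd(col) > 0, logical(1))
toy  <- toy[, keep, drop = FALSE]

# Correlate every remaining predictor with the target, selected by name.
predictors <- setdiff(names(toy), "win_dummy")
r <- cor(toy[predictors], toy$win_dummy)[, 1]

# Rank by absolute value, since only the magnitude matters here.
print(sort(abs(r), decreasing = TRUE))
```

Selecting `win_dummy` by name rather than by position keeps the code working if columns are added or removed earlier in the pipeline.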
Now we build our initial model, a Logistic Regression, and check for multicollinearity among the variables.
Below we split the data into train and test sets. We draw a random sample when forming the training data so that rows are not taken in the order they appear in the dataset. If we did not do this, the training data would simply be the first 70% of rows in file order rather than a random mix of fights since 2016.
set.seed(123)
sample <- sample.int(n = nrow(ufc_2), size = round(.7*nrow(ufc_2)), replace = F)
ufc_train <- ufc_2[sample, ]
ufc_test <- ufc_2[-sample, ]
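The sizes of the two sets follow from the row count. The sketch below repeats the split on row indices only, assuming 1888 rows (the 1068 + 820 fights counted earlier), and checks that the two sets do not overlap.

```r
set.seed(123)
n <- 1888  # rows in ufc_2, from the win counts 1068 + 820

# Draw 70% of the row indices at random, without replacement.
train_idx <- sample.int(n, size = round(0.7 * n))

# Everything not drawn for training goes to the test set.
test_idx <- setdiff(seq_len(n), train_idx)

cat("train:", length(train_idx), "test:", length(test_idx), "\n")
cat("overlap:", length(intersect(train_idx, test_idx)), "\n")
```

With 1888 rows this gives 1322 training fights and 566 test fights. A stratified split that preserves the Red/Blue ratio in both sets would be a reasonable alternative, though that is not what the analysis here does.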
set.seed(123)
m1 = glm(win_dummy ~ B_avg_TD_pct + B_avg_opp_TD_pct + B_avg_REV + B_avg_opp_REV + B_avg_TD_landed + B_avg_opp_TD_landed + B_avg_SIG_STR_landed + B_avg_opp_SIG_STR_landed + B_avg_SIG_STR_pct + B_avg_opp_SIG_STR_pct + B_avg_TOTAL_STR_landed + B_avg_opp_TOTAL_STR_landed + B_avg_CLINCH_landed + B_avg_opp_CLINCH_landed + B_avg_KD + B_avg_opp_KD + B_avg_SUB_ATT + B_avg_opp_SUB_ATT + B_avg_GROUND_landed + B_avg_opp_GROUND_landed + B_win_by_KO.TKO + B_win_by_Submission + B_win_by_Decision_Unanimous + R_avg_TD_pct + R_avg_opp_TD_pct + R_avg_REV + R_avg_opp_REV + R_avg_TD_landed + R_avg_opp_TD_landed + R_avg_SIG_STR_landed + R_avg_opp_SIG_STR_landed + R_avg_SIG_STR_pct + R_avg_opp_SIG_STR_pct + R_avg_TOTAL_STR_landed + R_avg_opp_TOTAL_STR_landed + R_avg_CLINCH_landed + R_avg_opp_CLINCH_landed + R_avg_KD + R_avg_opp_KD + R_avg_opp_SUB_ATT + R_avg_GROUND_landed + R_avg_opp_GROUND_landed + R_win_by_KO.TKO + R_win_by_Submission + R_win_by_Decision_Unanimous +
B_total_rounds_fought + R_total_rounds_fought + B_avg_CTRL_time.seconds. + B_avg_opp_CTRL_time.seconds. + B_current_win_streak + B_losses + R_age + B_age + B_Reach_cms + R_Reach_cms + B_Height_cms + R_Height_cms + weight_class,
data = ufc_train, family = binomial)
summary(m1)
##
## Call:
## glm(formula = win_dummy ~ B_avg_TD_pct + B_avg_opp_TD_pct + B_avg_REV +
## B_avg_opp_REV + B_avg_TD_landed + B_avg_opp_TD_landed + B_avg_SIG_STR_landed +
## B_avg_opp_SIG_STR_landed + B_avg_SIG_STR_pct + B_avg_opp_SIG_STR_pct +
## B_avg_TOTAL_STR_landed + B_avg_opp_TOTAL_STR_landed + B_avg_CLINCH_landed +
## B_avg_opp_CLINCH_landed + B_avg_KD + B_avg_opp_KD + B_avg_SUB_ATT +
## B_avg_opp_SUB_ATT + B_avg_GROUND_landed + B_avg_opp_GROUND_landed +
## B_win_by_KO.TKO + B_win_by_Submission + B_win_by_Decision_Unanimous +
## R_avg_TD_pct + R_avg_opp_TD_pct + R_avg_REV + R_avg_opp_REV +
## R_avg_TD_landed + R_avg_opp_TD_landed + R_avg_SIG_STR_landed +
## R_avg_opp_SIG_STR_landed + R_avg_SIG_STR_pct + R_avg_opp_SIG_STR_pct +
## R_avg_TOTAL_STR_landed + R_avg_opp_TOTAL_STR_landed + R_avg_CLINCH_landed +
## R_avg_opp_CLINCH_landed + R_avg_KD + R_avg_opp_KD + R_avg_opp_SUB_ATT +
## R_avg_GROUND_landed + R_avg_opp_GROUND_landed + R_win_by_KO.TKO +
## R_win_by_Submission + R_win_by_Decision_Unanimous + B_total_rounds_fought +
## R_total_rounds_fought + B_avg_CTRL_time.seconds. + B_avg_opp_CTRL_time.seconds. +
## B_current_win_streak + B_losses + R_age + B_age + B_Reach_cms +
## R_Reach_cms + B_Height_cms + R_Height_cms + weight_class,
## family = binomial, data = ufc_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0882 -1.0031 -0.6387 1.1125 2.2068
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.969e+00 3.533e+00 0.557 0.577305
## B_avg_TD_pct -3.403e-01 3.063e-01 -1.111 0.266568
## B_avg_opp_TD_pct -1.256e+00 3.510e-01 -3.579 0.000344 ***
## B_avg_REV 7.016e-02 2.311e-01 0.304 0.761405
## B_avg_opp_REV 2.931e-01 2.545e-01 1.152 0.249348
## B_avg_TD_landed 1.552e-01 8.753e-02 1.773 0.076239 .
## B_avg_opp_TD_landed 2.576e-01 8.916e-02 2.890 0.003857 **
## B_avg_SIG_STR_landed 1.470e-03 7.852e-03 0.187 0.851530
## B_avg_opp_SIG_STR_landed 8.376e-03 8.709e-03 0.962 0.336170
## B_avg_SIG_STR_pct -7.295e-02 6.438e-01 -0.113 0.909789
## B_avg_opp_SIG_STR_pct 1.963e-01 6.714e-01 0.292 0.769983
## B_avg_TOTAL_STR_landed -1.718e-03 5.349e-03 -0.321 0.748108
## B_avg_opp_TOTAL_STR_landed -5.397e-03 6.489e-03 -0.832 0.405546
## B_avg_CLINCH_landed 2.324e-02 1.683e-02 1.381 0.167368
## B_avg_opp_CLINCH_landed -6.231e-02 2.016e-02 -3.090 0.001999 **
## B_avg_KD -1.244e-01 1.932e-01 -0.644 0.519668
## B_avg_opp_KD -3.776e-01 2.314e-01 -1.632 0.102723
## B_avg_SUB_ATT 1.424e-01 1.201e-01 1.186 0.235770
## B_avg_opp_SUB_ATT 5.151e-02 1.471e-01 0.350 0.726177
## B_avg_GROUND_landed 3.520e-03 1.422e-02 0.248 0.804455
## B_avg_opp_GROUND_landed -3.078e-02 1.424e-02 -2.162 0.030620 *
## B_win_by_KO.TKO 8.345e-02 5.783e-02 1.443 0.148999
## B_win_by_Submission 7.604e-02 6.094e-02 1.248 0.212128
## B_win_by_Decision_Unanimous 2.016e-01 8.549e-02 2.359 0.018346 *
## R_avg_TD_pct 2.632e-01 3.381e-01 0.779 0.436255
## R_avg_opp_TD_pct -2.265e-01 3.594e-01 -0.630 0.528477
## R_avg_REV 4.890e-01 2.534e-01 1.930 0.053616 .
## R_avg_opp_REV -1.855e-01 2.495e-01 -0.743 0.457242
## R_avg_TD_landed -8.705e-02 7.626e-02 -1.142 0.253655
## R_avg_opp_TD_landed 1.058e-01 8.239e-02 1.284 0.199077
## R_avg_SIG_STR_landed -5.838e-03 7.070e-03 -0.826 0.408924
## R_avg_opp_SIG_STR_landed 1.102e-02 7.623e-03 1.445 0.148435
## R_avg_SIG_STR_pct -7.144e-01 6.809e-01 -1.049 0.294076
## R_avg_opp_SIG_STR_pct 1.597e+00 6.956e-01 2.295 0.021711 *
## R_avg_TOTAL_STR_landed -3.023e-03 5.004e-03 -0.604 0.545787
## R_avg_opp_TOTAL_STR_landed -6.070e-03 5.521e-03 -1.100 0.271516
## R_avg_CLINCH_landed 1.454e-02 1.558e-02 0.933 0.350659
## R_avg_opp_CLINCH_landed 1.461e-03 1.766e-02 0.083 0.934079
## R_avg_KD -1.119e-01 1.926e-01 -0.581 0.561221
## R_avg_opp_KD 1.106e-01 2.130e-01 0.519 0.603540
## R_avg_opp_SUB_ATT 2.218e-02 1.419e-01 0.156 0.875805
## R_avg_GROUND_landed 7.288e-04 1.100e-02 0.066 0.947197
## R_avg_opp_GROUND_landed 5.975e-03 1.316e-02 0.454 0.649912
## R_win_by_KO.TKO 1.197e-02 5.046e-02 0.237 0.812527
## R_win_by_Submission -5.709e-02 5.001e-02 -1.141 0.253675
## R_win_by_Decision_Unanimous -5.963e-02 6.132e-02 -0.972 0.330829
## B_total_rounds_fought -2.908e-02 2.203e-02 -1.320 0.186856
## R_total_rounds_fought 5.772e-03 1.038e-02 0.556 0.578223
## B_avg_CTRL_time.seconds. 4.671e-05 1.121e-03 0.042 0.966769
## B_avg_opp_CTRL_time.seconds. 7.642e-04 1.122e-03 0.681 0.495907
## B_current_win_streak -3.363e-02 4.686e-02 -0.718 0.472960
## B_losses 4.200e-02 8.102e-02 0.518 0.604191
## R_age 7.784e-02 1.840e-02 4.230 2.34e-05 ***
## B_age -7.730e-02 1.867e-02 -4.141 3.46e-05 ***
## B_Reach_cms -6.582e-03 1.337e-02 -0.492 0.622538
## R_Reach_cms -2.880e-02 1.352e-02 -2.131 0.033101 *
## B_Height_cms 8.339e-03 1.743e-02 0.478 0.632328
## R_Height_cms 1.391e-02 1.760e-02 0.791 0.429201
## weight_classCatchWeight 5.484e-01 7.320e-01 0.749 0.453762
## weight_classFeatherweight -2.354e-01 2.811e-01 -0.837 0.402499
## weight_classFlyweight -2.680e-01 3.402e-01 -0.788 0.430745
## weight_classHeavyweight 2.469e-01 5.446e-01 0.453 0.650236
## weight_classLightHeavyweight 2.505e-01 4.779e-01 0.524 0.600194
## weight_classLightweight 2.646e-01 2.924e-01 0.905 0.365468
## weight_classMiddleweight -3.183e-02 4.086e-01 -0.078 0.937918
## weight_classWelterweight 5.807e-02 3.446e-01 0.169 0.866178
## weight_classWomenBantamweight 1.368e-01 3.988e-01 0.343 0.731579
## weight_classWomenFeatherweight -2.905e-01 1.025e+00 -0.283 0.776891
## weight_classWomenFlyweight -2.792e-01 3.948e-01 -0.707 0.479482
## weight_classWomenStrawweight -1.985e-01 3.925e-01 -0.506 0.612999
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1811.8 on 1321 degrees of freedom
## Residual deviance: 1641.0 on 1252 degrees of freedom
## AIC: 1781
##
## Number of Fisher Scoring iterations: 4
We get the coefficients and p-values for each variable in the model. A handful of variables are significant at the 5% level, most strongly the two age variables (‘R_age’ and ‘B_age’) and ‘B_avg_opp_TD_pct’.
Now we apply the model to the Test data and see how well it predicts. We’ll use the caret package to print a confusion matrix and show the model’s performance.
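Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to read. Below is a minimal sketch using two estimates copied from the summary above. The vector name `coefs` is illustrative, and reading the result as the odds of `win_dummy = 1` assumes the usual `glm` handling of a 0/1 response.

```r
# Estimates copied from the m1 summary above (log-odds scale)
coefs <- c(R_age = 7.784e-02, B_age = -7.730e-02)

# Exponentiate to get the multiplicative change in the odds of win_dummy = 1
# for a one-year increase in each fighter's age, other predictors held fixed
odds_ratios <- exp(coefs)
print(round(odds_ratios, 3))
```

With the fitted model in memory, `exp(coef(m1))` should give the same quantity for every term.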
set.seed(123)
ufc_test$PredProb = predict.glm(m1, ufc_test, type = 'response')
ufc_test$Prediction = ifelse(ufc_test$PredProb >= 0.5,1,0)
caret::confusionMatrix(as.factor(ufc_test$win_dummy), as.factor(ufc_test$Prediction), positive='1')
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 223 101
## 1 128 114
##
## Accuracy : 0.5954
## 95% CI : (0.5537, 0.6361)
## No Information Rate : 0.6201
## P-Value [Acc > NIR] : 0.89506
##
## Kappa : 0.1616
##
## Mcnemar's Test P-Value : 0.08577
##
## Sensitivity : 0.5302
## Specificity : 0.6353
## Pos Pred Value : 0.4711
## Neg Pred Value : 0.6883
## Prevalence : 0.3799
## Detection Rate : 0.2014
## Detection Prevalence : 0.4276
## Balanced Accuracy : 0.5828
##
## 'Positive' Class : 1
##
The confusion matrix for our initial model shows an accuracy of about 59.5%. Accuracy is the share of fights for which the model correctly predicts the outcome. The reported ‘No Information Rate’ (NIR) is higher, at 62%. The NIR is the accuracy we would get by always predicting the most common class, so a model’s accuracy should be higher than the NIR to be considered effective; here the accuracy of our model (m1) falls short of it. ‘P-Value [Acc > NIR] : 0.895’ tells the same story: we cannot conclude that the accuracy exceeds the NIR. A lower p-value is better, such as 0.05, or for UFC predictions even 0.10. One caveat: caret::confusionMatrix expects the predictions as its first argument and the observed outcomes as its second. The call above passes them the other way around, so the rows of the table are the observed results and the 62% NIR is the share of fights the model predicted as class 0, not the share of observed outcomes. In the observed results (the row totals) the most common outcome accounts for 324 of 566 fights, about 57%.
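The accuracy and No Information Rate can be recomputed by hand from the four counts in the table, which also shows how the NIR depends on which argument caret treats as the reference. This is a sketch in base R using the counts printed above; the object names are illustrative.

```r
# Counts copied from the confusion matrix above.
# Rows come from the first argument passed to confusionMatrix() (here the observed win_dummy),
# columns from the second argument (here the model's predictions).
cm <- matrix(c(223, 128, 101, 114), nrow = 2,
             dimnames = list(observed = c("0", "1"), predicted = c("0", "1")))

n <- sum(cm)
accuracy     <- sum(diag(cm)) / n      # share of fights on the diagonal
nir_reported <- max(colSums(cm)) / n   # largest class share in the columns (what the output reports)
nir_observed <- max(rowSums(cm)) / n   # largest class share among the observed outcomes

print(round(c(accuracy = accuracy,
              nir_reported = nir_reported,
              nir_observed = nir_observed), 4))
```

The accuracy is the same in either orientation; only the NIR, and the statistics that depend on which factor is the reference, change.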
We now check for multicollinearity among the independent variables. Multicollinearity occurs when several independent variables in a model are strongly correlated with one another. Two variables are perfectly collinear if their correlation coefficient is +/- 1.0. Multicollinearity among independent variables results in less reliable statistical inferences.
Multicollinearity can be detected via various methods. In this report, we will focus on a common one: VIF (Variance Inflation Factor).
VIF measures the strength of the correlation between the independent variables. It is computed by taking one variable and regressing it against every other independent variable. The VIF score of an independent variable represents how well that variable is explained by the other independent variables.
The R^2 value of that regression tells us how well an independent variable is described by the other independent variables. A high R^2 means that the variable is highly correlated with the other variables. The VIF captures this as VIF = 1 / (1 - R^2).
As a rule of thumb, a VIF score over 5 indicates a variable that is overly correlated with the others. A score over 10 should be remedied, for example by dropping the variable from the regression model or by creating an index of all the closely related variables.
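To make the definition concrete, the sketch below computes VIF by hand as 1 / (1 - R^2), regressing each predictor on the others. The data are simulated and the predictor names `x1`, `x2`, `x3` are hypothetical; nothing here comes from the UFC data.

```r
set.seed(1)  # simulated data, for illustration only

# Three hypothetical predictors; x2 is built to be strongly correlated with x1
n  <- 200
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.3)
x3 <- rnorm(n)
X  <- data.frame(x1, x2, x3)

# VIF for one predictor: regress it on the others and use 1 / (1 - R^2)
vif_by_hand <- sapply(names(X), function(v) {
  f  <- reformulate(setdiff(names(X), v), response = v)
  r2 <- summary(lm(f, data = X))$r.squared
  1 / (1 - r2)
})
print(round(vif_by_hand, 2))
```

The correlated pair `x1` and `x2` gets a large VIF while the independent `x3` stays close to 1. For models with only numeric predictors, `car::vif` should report the same values.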
vif(m1)
## GVIF Df GVIF^(1/(2*Df))
## B_avg_TD_pct 1.758453 1 1.326067
## B_avg_opp_TD_pct 1.948953 1 1.396049
## B_avg_REV 1.409804 1 1.187352
## B_avg_opp_REV 1.350532 1 1.162124
## B_avg_TD_landed 3.386404 1 1.840218
## B_avg_opp_TD_landed 3.077177 1 1.754189
## B_avg_SIG_STR_landed 9.606699 1 3.099467
## B_avg_opp_SIG_STR_landed 10.814295 1 3.288509
## B_avg_SIG_STR_pct 1.518125 1 1.232122
## B_avg_opp_SIG_STR_pct 1.473767 1 1.213988
## B_avg_TOTAL_STR_landed 7.642021 1 2.764420
## B_avg_opp_TOTAL_STR_landed 9.538876 1 3.088507
## B_avg_CLINCH_landed 2.389040 1 1.545652
## B_avg_opp_CLINCH_landed 2.399104 1 1.548904
## B_avg_KD 1.526855 1 1.235660
## B_avg_opp_KD 1.291885 1 1.136611
## B_avg_SUB_ATT 1.476067 1 1.214935
## B_avg_opp_SUB_ATT 1.304696 1 1.142233
## B_avg_GROUND_landed 2.170574 1 1.473287
## B_avg_opp_GROUND_landed 2.053602 1 1.433039
## B_win_by_KO.TKO 3.554093 1 1.885230
## B_win_by_Submission 2.190286 1 1.479962
## B_win_by_Decision_Unanimous 6.022816 1 2.454143
## R_avg_TD_pct 1.880171 1 1.371193
## R_avg_opp_TD_pct 1.885595 1 1.373170
## R_avg_REV 1.467499 1 1.211404
## R_avg_opp_REV 1.460827 1 1.208647
## R_avg_TD_landed 2.309801 1 1.519803
## R_avg_opp_TD_landed 2.146734 1 1.465174
## R_avg_SIG_STR_landed 8.267906 1 2.875397
## R_avg_opp_SIG_STR_landed 9.061933 1 3.010304
## R_avg_SIG_STR_pct 1.496087 1 1.223147
## R_avg_opp_SIG_STR_pct 1.487193 1 1.219505
## R_avg_TOTAL_STR_landed 7.038128 1 2.652947
## R_avg_opp_TOTAL_STR_landed 7.803845 1 2.793536
## R_avg_CLINCH_landed 2.338427 1 1.529192
## R_avg_opp_CLINCH_landed 2.339588 1 1.529571
## R_avg_KD 1.369624 1 1.170309
## R_avg_opp_KD 1.277758 1 1.130380
## R_avg_opp_SUB_ATT 1.250770 1 1.118378
## R_avg_GROUND_landed 2.002038 1 1.414934
## R_avg_opp_GROUND_landed 2.158113 1 1.469052
## R_win_by_KO.TKO 3.379912 1 1.838454
## R_win_by_Submission 1.970394 1 1.403707
## R_win_by_Decision_Unanimous 4.374091 1 2.091433
## B_total_rounds_fought 25.756155 1 5.075052
## R_total_rounds_fought 8.320626 1 2.884550
## B_avg_CTRL_time.seconds. 5.267908 1 2.295192
## B_avg_opp_CTRL_time.seconds. 4.739391 1 2.177014
## B_current_win_streak 1.794786 1 1.339696
## B_losses 9.787890 1 3.128560
## R_age 1.713216 1 1.308899
## B_age 1.511191 1 1.229305
## B_Reach_cms 6.524453 1 2.554301
## R_Reach_cms 6.901464 1 2.627064
## B_Height_cms 7.593179 1 2.755572
## R_Height_cms 8.017919 1 2.831593
## weight_class 35.601309 12 1.160498
We notice some high VIF scores, for example ‘B_total_rounds_fought’ (25.8) and ‘B_avg_opp_SIG_STR_landed’ (10.8). We should keep in mind that UFC fights always occur within the same weight class, between fighters of the same weight, similar height, etc. Also, some variables are highly correlated because they occur together, like ‘B_total_rounds_fought’ and ‘B_total_seconds_fought’.
After observing the summary of the m1 model, the significance levels of the variables, and the VIF scores, we choose the following variables for our second Logistic Regression model below (m2).
For the initial model build, we will use striking variables that are general rather than specific. For example, we will use Striking Percentage rather than Leg Strikes Attempted. Fighting styles vary among fighters, and we want to focus on their efficiency rather than their style of fighting when predicting. We can experiment with specific strikes later.
Note: All fights happen within a fighter’s weight class, so their compiled stats from previous fights and their opponents’ compiled stats have all occurred within that weight class. Patterns in their fighting are therefore bound to be consistent with their weight class, which could lead to multicollinearity. Height could be an issue as well. We will leave weight class out, as it is more a state of the fighter than a variable that is subject to change and would provide predictive ability.
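Because ‘weight_class’ is a factor with 12 degrees of freedom, the output above reports a generalized VIF (GVIF) together with an adjusted value, GVIF^(1/(2*Df)), that puts terms with different degrees of freedom on a comparable scale. The sketch below reproduces that last column from the first two for two rows of the output; the vector names are illustrative.

```r
# Values copied from the vif(m1) output above
gvif <- c(weight_class = 35.601309, B_total_rounds_fought = 25.756155)
df   <- c(weight_class = 12,        B_total_rounds_fought = 1)

# Adjusted value shown in the last column of the output
adjusted <- gvif^(1 / (2 * df))
print(round(adjusted, 4))
```

For a term with one degree of freedom the adjusted value is simply the square root of the VIF. A commonly suggested approach is to square the adjusted value before comparing it with the usual thresholds of 5 or 10; on that reading the raw GVIF of 35.6 for ‘weight_class’ is less alarming than it first looks.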
set.seed(123)
m2 = glm(win_dummy ~ R_avg_TD_att + R_avg_SIG_STR_pct + R_avg_opp_SIG_STR_pct + R_avg_KD + R_avg_opp_KD +
R_avg_SUB_ATT + R_avg_CTRL_time.seconds. + R_avg_opp_CTRL_time.seconds. + R_avg_CLINCH_landed +
R_age + R_Reach_cms + R_avg_REV + R_avg_opp_GROUND_landed +
B_avg_opp_GROUND_landed +B_avg_TD_att +B_avg_opp_TD_pct +B_avg_SIG_STR_pct + B_avg_opp_SIG_STR_pct +
B_avg_KD + B_avg_opp_KD + B_avg_SUB_ATT +B_avg_CTRL_time.seconds.+B_avg_opp_CTRL_time.seconds.+
B_avg_CLINCH_landed + B_avg_opp_CLINCH_landed +
B_age + B_Reach_cms + B_avg_REV,
data = ufc_train, family = binomial)
summary(m2)
##
## Call:
## glm(formula = win_dummy ~ R_avg_TD_att + R_avg_SIG_STR_pct +
## R_avg_opp_SIG_STR_pct + R_avg_KD + R_avg_opp_KD + R_avg_SUB_ATT +
## R_avg_CTRL_time.seconds. + R_avg_opp_CTRL_time.seconds. +
## R_avg_CLINCH_landed + R_age + R_Reach_cms + R_avg_REV + R_avg_opp_GROUND_landed +
## B_avg_opp_GROUND_landed + B_avg_TD_att + B_avg_opp_TD_pct +
## B_avg_SIG_STR_pct + B_avg_opp_SIG_STR_pct + B_avg_KD + B_avg_opp_KD +
## B_avg_SUB_ATT + B_avg_CTRL_time.seconds. + B_avg_opp_CTRL_time.seconds. +
## B_avg_CLINCH_landed + B_avg_opp_CLINCH_landed + B_age + B_Reach_cms +
## B_avg_REV, family = binomial, data = ufc_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9289 -1.0063 -0.6767 1.1274 2.2714
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.943e-01 1.194e+00 0.330 0.741254
## R_avg_TD_att -6.728e-02 3.107e-02 -2.165 0.030353 *
## R_avg_SIG_STR_pct -7.133e-01 6.121e-01 -1.165 0.243848
## R_avg_opp_SIG_STR_pct 2.211e+00 6.327e-01 3.494 0.000476 ***
## R_avg_KD -1.480e-01 1.809e-01 -0.818 0.413229
## R_avg_opp_KD 1.587e-01 2.055e-01 0.772 0.439846
## R_avg_SUB_ATT -2.999e-01 1.273e-01 -2.357 0.018423 *
## R_avg_CTRL_time.seconds. 2.862e-05 7.068e-04 0.040 0.967701
## R_avg_opp_CTRL_time.seconds. 1.170e-03 7.625e-04 1.534 0.125085
## R_avg_CLINCH_landed -5.096e-03 1.164e-02 -0.438 0.661529
## R_age 8.143e-02 1.534e-02 5.307 1.12e-07 ***
## R_Reach_cms -1.424e-02 7.917e-03 -1.799 0.071971 .
## R_avg_REV 3.796e-01 2.276e-01 1.668 0.095281 .
## R_avg_opp_GROUND_landed -4.215e-03 1.205e-02 -0.350 0.726412
## B_avg_opp_GROUND_landed -3.399e-02 1.247e-02 -2.725 0.006433 **
## B_avg_TD_att 7.750e-02 3.106e-02 2.495 0.012584 *
## B_avg_opp_TD_pct -7.198e-01 2.936e-01 -2.451 0.014236 *
## B_avg_SIG_STR_pct -7.250e-02 5.821e-01 -0.125 0.900874
## B_avg_opp_SIG_STR_pct 4.370e-01 6.365e-01 0.687 0.492385
## B_avg_KD -8.804e-02 1.745e-01 -0.505 0.613798
## B_avg_opp_KD -3.689e-01 2.204e-01 -1.674 0.094176 .
## B_avg_SUB_ATT 6.702e-02 1.083e-01 0.619 0.536102
## B_avg_CTRL_time.seconds. -5.392e-05 7.186e-04 -0.075 0.940184
## B_avg_opp_CTRL_time.seconds. 1.581e-03 7.112e-04 2.222 0.026271 *
## B_avg_CLINCH_landed 2.588e-02 1.393e-02 1.858 0.063142 .
## B_avg_opp_CLINCH_landed -6.428e-02 1.669e-02 -3.852 0.000117 ***
## B_age -7.328e-02 1.587e-02 -4.618 3.87e-06 ***
## B_Reach_cms 6.299e-03 8.249e-03 0.764 0.445084
## B_avg_REV 1.976e-01 2.098e-01 0.942 0.346182
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1811.8 on 1321 degrees of freedom
## Residual deviance: 1666.3 on 1293 degrees of freedom
## AIC: 1724.3
##
## Number of Fisher Scoring iterations: 4
set.seed(123)
ufc_test$PredProb = predict.glm(m2, ufc_test, type = 'response')
ufc_test$Prediction = ifelse(ufc_test$PredProb >= 0.5,1,0)
caret::confusionMatrix(as.factor(ufc_test$win_dummy), as.factor(ufc_test$Prediction), positive='1')
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 233 91
## 1 127 115
##
## Accuracy : 0.6148
## 95% CI : (0.5734, 0.6551)
## No Information Rate : 0.636
## P-Value [Acc > NIR] : 0.86244
##
## Kappa : 0.1981
##
## Mcnemar's Test P-Value : 0.01776
##
## Sensitivity : 0.5583
## Specificity : 0.6472
## Pos Pred Value : 0.4752
## Neg Pred Value : 0.7191
## Prevalence : 0.3640
## Detection Rate : 0.2032
## Detection Prevalence : 0.4276
## Balanced Accuracy : 0.6027
##
## 'Positive' Class : 1
##
Check the multicollinearity scores.
vif(m2)
## R_avg_TD_att R_avg_SIG_STR_pct
## 1.959772 1.230693
## R_avg_opp_SIG_STR_pct R_avg_KD
## 1.244932 1.240002
## R_avg_opp_KD R_avg_SUB_ATT
## 1.200657 1.283920
## R_avg_CTRL_time.seconds. R_avg_opp_CTRL_time.seconds.
## 2.251146 2.019427
## R_avg_CLINCH_landed R_age
## 1.304143 1.206647
## R_Reach_cms R_avg_REV
## 2.424137 1.200405
## R_avg_opp_GROUND_landed B_avg_opp_GROUND_landed
## 1.855908 1.642835
## B_avg_TD_att B_avg_opp_TD_pct
## 2.015960 1.399594
## B_avg_SIG_STR_pct B_avg_opp_SIG_STR_pct
## 1.269691 1.345275
## B_avg_KD B_avg_opp_KD
## 1.268585 1.183851
## B_avg_SUB_ATT B_avg_CTRL_time.seconds.
## 1.219655 2.215003
## B_avg_opp_CTRL_time.seconds. B_avg_CLINCH_landed
## 1.935582 1.668842
## B_avg_opp_CLINCH_landed B_age
## 1.687617 1.120621
## B_Reach_cms B_avg_REV
## 2.541146 1.230209
After running the m2 model, the reported accuracy (61.5%) is still below the reported No Information Rate (63.6%), so by that measure the model is still not a significant predictor. However, the VIF scores are all below 5, meaning the model likely does not contain problematic multicollinearity.
We will now run a random forest model to see if our results improve. For the random forest we will also tune the hyperparameters.
set.seed(123)
train_rf.idx <- sample(1:nrow(ufc_2), size = 1 * nrow(ufc_2))
train_rf_data <- ufc_2[train_rf.idx,]
train_rf_data$win_dummy = as.factor(train_rf_data$win_dummy)
table(train_rf_data$win_dummy)
##
## 0 1
## 1068 820
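The out-of-bag (OOB) method is included. OOB error, also called the out-of-bag estimate, measures the prediction error of random forests, boosted decision trees, and other machine learning models that use bootstrap aggregating (bagging). Bagging draws samples with replacement to create the training sample for each tree, so each tree can be tested on the observations left out of its sample. randomForest reports this estimate by default; ‘oob_score’ is not one of its documented arguments, so it appears to have no effect in the call below.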
set.seed(123)
forest2 <- randomForest(win_dummy ~ R_avg_TD_att + R_avg_SIG_STR_pct + R_avg_opp_SIG_STR_pct + R_avg_KD + R_avg_opp_KD + R_avg_SUB_ATT + R_avg_CTRL_time.seconds. + R_avg_opp_CTRL_time.seconds. + R_avg_CLINCH_landed +
R_age + R_Reach_cms + R_avg_REV + R_avg_opp_GROUND_landed +
B_avg_opp_GROUND_landed + B_avg_TD_att + B_avg_opp_TD_pct +B_avg_SIG_STR_pct + B_avg_opp_SIG_STR_pct +
B_avg_KD + B_avg_opp_KD + B_avg_SUB_ATT + B_avg_CTRL_time.seconds.+ B_avg_opp_CTRL_time.seconds.+
B_avg_CLINCH_landed + B_avg_opp_CLINCH_landed + B_age + B_Reach_cms + B_avg_REV,
data = train_rf_data,
importance = TRUE,
oob_score= TRUE,
ntree = 1000)
forest2
##
## Call:
## randomForest(formula = win_dummy ~ R_avg_TD_att + R_avg_SIG_STR_pct + R_avg_opp_SIG_STR_pct + R_avg_KD + R_avg_opp_KD + R_avg_SUB_ATT + R_avg_CTRL_time.seconds. + R_avg_opp_CTRL_time.seconds. + R_avg_CLINCH_landed + R_age + R_Reach_cms + R_avg_REV + R_avg_opp_GROUND_landed + B_avg_opp_GROUND_landed + B_avg_TD_att + B_avg_opp_TD_pct + B_avg_SIG_STR_pct + B_avg_opp_SIG_STR_pct + B_avg_KD + B_avg_opp_KD + B_avg_SUB_ATT + B_avg_CTRL_time.seconds. + B_avg_opp_CTRL_time.seconds. + B_avg_CLINCH_landed + B_avg_opp_CLINCH_landed + B_age + B_Reach_cms + B_avg_REV, data = train_rf_data, importance = TRUE, oob_score = TRUE, ntree = 1000)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 5
##
## OOB estimate of error rate: 41.21%
## Confusion matrix:
## 0 1 class.error
## 0 824 244 0.2284644
## 1 534 286 0.6512195
The model reports an Out of Bag Error of 41.21%, which corresponds to an Out of Bag accuracy of about 59%.
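The OOB error rate and the per-class errors can be recovered directly from the confusion matrix that randomForest prints. This is a small base R check using the counts above; the object names are illustrative.

```r
# Counts copied from the OOB confusion matrix printed above (rows = observed class)
oob_cm <- matrix(c(824, 534, 244, 286), nrow = 2,
                 dimnames = list(observed = c("0", "1"), predicted = c("0", "1")))

oob_error   <- 1 - sum(diag(oob_cm)) / sum(oob_cm)   # overall OOB error rate
class_error <- 1 - diag(oob_cm) / rowSums(oob_cm)    # per-class error, as in class.error

print(round(c(oob_error = oob_error, class_error), 4))
```

The per-class errors show that most of the mistakes are class 1 fights predicted as class 0.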
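With an OOB accuracy of about 59%, the Random Forest performs roughly on par with the Logistic Regression models (59.5% for m1 and 61.5% for m2 on the Test data) rather than clearly better. It also misclassifies about 65% of the class 1 fights. This could change with further transformation of the data, such as restricting it to one weight class.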
set.seed(123)
# Establish a list of possible values for hyper-parameters
mtry.values <- seq(4,6,1)
nodesize.values <- seq(4,8,2)
ntree.values <- seq(4e3,6e3,1e3)
# Create a data frame containing all combinations
hyper_grid <- expand.grid(mtry = mtry.values, nodesize = nodesize.values, ntree = ntree.values)
# Create an empty vector to store OOB error values
oob_err <- c()
# Write a loop over the rows of hyper_grid to train the grid of models
for (i in 1:nrow(hyper_grid)) {
# Train a Random Forest model
forest2 <- randomForest(win_dummy ~ R_avg_TD_att + R_avg_SIG_STR_pct + R_avg_opp_SIG_STR_pct +
R_avg_KD + R_avg_opp_KD + R_avg_SUB_ATT + R_avg_CTRL_time.seconds. +
R_avg_opp_CTRL_time.seconds. + R_avg_CLINCH_landed +
R_age + R_Reach_cms + R_avg_REV + R_avg_opp_GROUND_landed +
B_avg_opp_GROUND_landed + B_avg_TD_att + B_avg_opp_TD_pct + B_avg_SIG_STR_pct +
B_avg_opp_SIG_STR_pct + B_avg_KD + B_avg_opp_KD +
B_avg_SUB_ATT + B_avg_CTRL_time.seconds.+ B_avg_opp_CTRL_time.seconds.+
B_avg_CLINCH_landed + B_avg_opp_CLINCH_landed + B_age + B_Reach_cms +
B_avg_REV,
data = train_rf_data,
mtry = hyper_grid$mtry[i],
nodesize = hyper_grid$nodesize[i],
ntree = hyper_grid$ntree[i])
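# Store the final error rate for the model.
# Note: err.rate is a matrix (one row per tree; columns OOB, 0, 1), so indexing by length()
# returns the last row of the last column; use err.rate[nrow(forest2$err.rate), "OOB"] for the OOB error.
oob_err[i] <- forest2$err.rate[length(forest2$err.rate)]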
}
# Identify optimal set of hyperparameters based on OOB error
opt_i <- which.min(oob_err)
print(hyper_grid[opt_i,])
## mtry nodesize ntree
## 12 6 4 5000
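When reading the final OOB error out of a fitted forest, it helps to remember that `err.rate` is a matrix with one row per tree and one column each for the OOB error and the per-class errors. The sketch below uses a small hypothetical matrix with that layout (the numbers are made up) to show the difference between indexing by `length()` and selecting the `"OOB"` column of the last row.

```r
# Hypothetical stand-in for forest$err.rate after three trees:
# one row per tree, columns OOB, "0" and "1"
err_rate <- matrix(c(0.45, 0.43, 0.41,    # OOB
                     0.25, 0.24, 0.23,    # class "0"
                     0.70, 0.67, 0.65),   # class "1"
                   ncol = 3, dimnames = list(NULL, c("OOB", "0", "1")))

last_by_length <- err_rate[length(err_rate)]       # last cell of the last column
last_oob       <- err_rate[nrow(err_rate), "OOB"]  # final OOB error

print(c(last_by_length = last_by_length, last_oob = last_oob))
```

In this toy matrix the first expression returns the class "1" error of the last tree, while the second returns the OOB error that the tuning loop is meant to minimize.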
set.seed(123)
forest.hyper <- randomForest(win_dummy ~ R_avg_TD_att + R_avg_SIG_STR_pct + R_avg_opp_SIG_STR_pct + R_avg_KD + R_avg_opp_KD + R_avg_SUB_ATT + R_avg_CTRL_time.seconds. + R_avg_opp_CTRL_time.seconds. + R_avg_CLINCH_landed +
R_age + R_Reach_cms + R_avg_REV + R_avg_opp_GROUND_landed +
B_avg_opp_GROUND_landed + B_avg_TD_att + B_avg_opp_TD_pct + B_avg_SIG_STR_pct + B_avg_opp_SIG_STR_pct +
B_avg_KD + B_avg_opp_KD + B_avg_SUB_ATT + B_avg_CTRL_time.seconds.+ B_avg_opp_CTRL_time.seconds.+
B_avg_CLINCH_landed + B_avg_opp_CLINCH_landed + B_age + B_Reach_cms +
B_avg_REV,
data = train_rf_data,
importance = TRUE,
mtry = 6,
nodesize = 4,
ntree = 5000)
forest.hyper
##
## Call:
## randomForest(formula = win_dummy ~ R_avg_TD_att + R_avg_SIG_STR_pct + R_avg_opp_SIG_STR_pct + R_avg_KD + R_avg_opp_KD + R_avg_SUB_ATT + R_avg_CTRL_time.seconds. + R_avg_opp_CTRL_time.seconds. + R_avg_CLINCH_landed + R_age + R_Reach_cms + R_avg_REV + R_avg_opp_GROUND_landed + B_avg_opp_GROUND_landed + B_avg_TD_att + B_avg_opp_TD_pct + B_avg_SIG_STR_pct + B_avg_opp_SIG_STR_pct + B_avg_KD + B_avg_opp_KD + B_avg_SUB_ATT + B_avg_CTRL_time.seconds. + B_avg_opp_CTRL_time.seconds. + B_avg_CLINCH_landed + B_avg_opp_CLINCH_landed + B_age + B_Reach_cms + B_avg_REV, data = train_rf_data, importance = TRUE, mtry = 6, nodesize = 4, ntree = 5000)
## Type of random forest: classification
## Number of trees: 5000
## No. of variables tried at each split: 6
##
## OOB estimate of error rate: 41.15%
## Confusion matrix:
## 0 1 class.error
## 0 833 235 0.2200375
## 1 542 278 0.6609756
source: https://topepo.github.io/caret/variable-importance.html
Variable importance evaluation functions can be separated into two groups: those that use the model information and those that do not. The advantage of using a model-based approach is that it is more closely tied to the model performance and that it may be able to incorporate the correlation structure between the predictors into the importance calculation. Regardless of how the importance is calculated:
For most classification models, each predictor will have a separate variable importance for each class (the exceptions are classification trees, bagged trees and boosted trees). All measures of importance are scaled to have a maximum value of 100, unless the scale argument of varImp.train is set to FALSE.
Below we see the variable importance of our Random Forest model. Keep in mind this random forest includes all weight classes. Age again shows the highest importance, for both fighters (‘R_age’ and ‘B_age’).
rf_imp <- varImp(forest.hyper)
rf_imp
## 0 1
## R_avg_TD_att 3.85835083 3.85835083
## R_avg_SIG_STR_pct -1.93723582 -1.93723582
## R_avg_opp_SIG_STR_pct 9.73458479 9.73458479
## R_avg_KD -1.66510182 -1.66510182
## R_avg_opp_KD 3.79993995 3.79993995
## R_avg_SUB_ATT 1.93845581 1.93845581
## R_avg_CTRL_time.seconds. 4.16326805 4.16326805
## R_avg_opp_CTRL_time.seconds. 8.01785745 8.01785745
## R_avg_CLINCH_landed -1.26455786 -1.26455786
## R_age 20.77182477 20.77182477
## R_Reach_cms -0.08861086 -0.08861086
## R_avg_REV 0.36628520 0.36628520
## R_avg_opp_GROUND_landed 0.81512365 0.81512365
## B_avg_opp_GROUND_landed -0.39399884 -0.39399884
## B_avg_TD_att 6.44701816 6.44701816
## B_avg_opp_TD_pct 4.83670630 4.83670630
## B_avg_SIG_STR_pct 2.03220016 2.03220016
## B_avg_opp_SIG_STR_pct -0.44450234 -0.44450234
## B_avg_KD -1.80311510 -1.80311510
## B_avg_opp_KD 0.18117799 0.18117799
## B_avg_SUB_ATT 4.95586401 4.95586401
## B_avg_CTRL_time.seconds. 3.39856639 3.39856639
## B_avg_opp_CTRL_time.seconds. -0.19979859 -0.19979859
## B_avg_CLINCH_landed 1.08179709 1.08179709
## B_avg_opp_CLINCH_landed 1.58935508 1.58935508
## B_age 17.22058939 17.22058939
## B_Reach_cms -0.65041339 -0.65041339
## B_avg_REV -0.87504128 -0.87504128
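The importance table is easier to read when sorted. Here is a minimal sketch that ranks a handful of the values copied from the output above; the vector `imp` is illustrative and holds only a subset of the predictors.

```r
# A few importance values copied from the rf_imp output above (unordered on purpose)
imp <- c(B_avg_TD_att = 6.44701816,
         R_avg_opp_SIG_STR_pct = 9.73458479,
         B_age = 17.22058939,
         R_avg_opp_CTRL_time.seconds. = 8.01785745,
         R_age = 20.77182477)

# Rank predictors from most to least important
ranked <- sort(imp, decreasing = TRUE)
print(names(ranked))
```

The two age variables come out on top, well ahead of the striking and control-time statistics.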
Here we fit an SVM model to see whether a different type of model improves predictive performance.
set.seed(123)
svm.idx <- sample(1:nrow(ufc_2), size = 0.70 * nrow(ufc_2))
train_svm_data <- ufc_2[svm.idx,]
test_svm_data <- ufc_2[-svm.idx,]
A repeated cross-validation scheme is set up so that a model can be evaluated on several different splits of the dataset.
# Set up Repeated k-fold Cross Validation
train_control <- trainControl(method="repeatedcv", number=10, repeats=3)
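To illustrate what repeated k-fold cross-validation means, the sketch below builds the fold assignments by hand in base R for 10 folds and 3 repeats. It only shows the partitioning idea and is not caret's internal implementation; the row count `n` is hypothetical.

```r
set.seed(123)

n       <- 100  # hypothetical number of rows
k       <- 10   # folds per repeat, matching number = 10
repeats <- 3    # matching repeats = 3

# For each repeat, shuffle the fold labels 1..k across the rows
fold_ids <- replicate(repeats, sample(rep(seq_len(k), length.out = n)))

# Each column is one repeat; each (repeat, fold) pair defines one held-out set
n_resamples <- repeats * k
print(dim(fold_ids))
print(n_resamples)
print(table(fold_ids[, 1]))
```

Each row is held out exactly once per repeat, so every model configuration would be fitted and scored 30 times.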
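For SVM we scale the variables, because distance-based algorithms such as KNN, K-means and SVM are sensitive to the range of the features. Behind the scenes they use distances between data points to determine similarity, so when two features are on different scales, the feature with the larger magnitude can be given more weight. This hurts the performance of the machine learning algorithm. Note that e1071::svm standardizes the variables by default (scale = TRUE). The ‘trControl’, ‘preProcess’ and ‘tuneGrid’ arguments in the call below belong to caret::train rather than e1071::svm, so they appear to have no effect here, which is consistent with the summary reporting the default cost of 1.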
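As an illustration of what centering and scaling does, the sketch below standardizes two hypothetical features that sit on very different scales. The column names and values are made up for the example.

```r
# Two hypothetical features on very different scales
X <- data.frame(ctrl_time_seconds = c(30, 120, 240, 60),
                sig_str_pct       = c(0.35, 0.50, 0.42, 0.61))

# Center each column to mean 0 and scale it to standard deviation 1
X_scaled <- scale(X, center = TRUE, scale = TRUE)

print(round(colMeans(X_scaled), 10))
print(apply(X_scaled, 2, sd))
```

After scaling, both columns contribute on the same footing to any distance or margin calculation.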
set.seed(123)
svm = svm(formula = win_dummy ~ R_avg_TD_att + R_avg_opp_TD_pct + R_avg_SIG_STR_pct + R_avg_opp_SIG_STR_pct + R_avg_KD + R_avg_opp_KD +
R_avg_SUB_ATT + R_avg_CTRL_time.seconds. +
R_avg_opp_CTRL_time.seconds. + R_avg_CLINCH_landed +
R_age + R_Reach_cms + R_avg_REV + R_avg_opp_GROUND_landed +
B_avg_opp_GROUND_landed + B_avg_TD_att + B_avg_opp_TD_pct + B_avg_SIG_STR_pct +
B_avg_opp_SIG_STR_pct + B_avg_KD + B_avg_opp_KD + B_avg_SUB_ATT +
B_avg_CTRL_time.seconds.+ B_avg_opp_CTRL_time.seconds.+ B_avg_CLINCH_landed +
B_avg_opp_CLINCH_landed + B_age + B_Reach_cms + B_avg_REV,
data = train_svm_data,
trControl = train_control,
type = 'C-classification',
kernel = 'linear',
preProcess = c("center","scale"),
tuneGrid = expand.grid(C = seq(0, 2, length = 20)))
summary(svm)
##
## Call:
## svm(formula = win_dummy ~ R_avg_TD_att + R_avg_opp_TD_pct + R_avg_SIG_STR_pct +
## R_avg_opp_SIG_STR_pct + R_avg_KD + R_avg_opp_KD + R_avg_SUB_ATT +
## R_avg_CTRL_time.seconds. + R_avg_opp_CTRL_time.seconds. + R_avg_CLINCH_landed +
## R_age + R_Reach_cms + R_avg_REV + R_avg_opp_GROUND_landed + B_avg_opp_GROUND_landed +
## B_avg_TD_att + B_avg_opp_TD_pct + B_avg_SIG_STR_pct + B_avg_opp_SIG_STR_pct +
## B_avg_KD + B_avg_opp_KD + B_avg_SUB_ATT + B_avg_CTRL_time.seconds. +
## B_avg_opp_CTRL_time.seconds. + B_avg_CLINCH_landed + B_avg_opp_CLINCH_landed +
## B_age + B_Reach_cms + B_avg_REV, data = train_svm_data, trControl = train_control,
## type = "C-classification", kernel = "linear", preProcess = c("center",
## "scale"), tuneGrid = expand.grid(C = seq(0, 2, length = 20)))
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 1056
##
## ( 529 527 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
set.seed(123)
svm.pred_1 = predict(svm,newdata = test_svm_data)
caret::confusionMatrix(as.factor(svm.pred_1), as.factor(test_svm_data$win_dummy), positive = '1')
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 224 121
## 1 100 122
##
## Accuracy : 0.6102
## 95% CI : (0.5687, 0.6506)
## No Information Rate : 0.5714
## P-Value [Acc > NIR] : 0.03363
##
## Kappa : 0.1955
##
## Mcnemar's Test P-Value : 0.17851
##
## Sensitivity : 0.5021
## Specificity : 0.6914
## Pos Pred Value : 0.5495
## Neg Pred Value : 0.6493
## Prevalence : 0.4286
## Detection Rate : 0.2152
## Detection Prevalence : 0.3915
## Balanced Accuracy : 0.5967
##
## 'Positive' Class : 1
##
The SVM model gives a positive result: its accuracy (61%) is above the No Information Rate (57%), with a p-value of 0.034. We might be able to improve performance by trying other kernels. So far, SVM appears to have the best predictive ability.
Now we will use the same independent variables when building models for a specific weight class. The models below only use rows from the original dataset that match a specific weight class (the Welterweight model only uses Welterweight fights).
Logistic Regression Model Welter
Remember that the Welterweight class has the greatest number of fights between 2015 and 2021.
ufc_welter_weight <- filter(ufc_2, weight_class == 'Welterweight')
set.seed(123)
sample_welter <- sample.int(n = nrow(ufc_welter_weight), size = round(.7*nrow(ufc_welter_weight)), replace = F)
ufc_train_welter <- ufc_welter_weight[sample_welter, ]
ufc_test_welter <- ufc_welter_weight[-sample_welter, ]
set.seed(123)
m1_welter = glm(win_dummy ~ R_avg_TD_att + R_avg_SIG_STR_pct + R_avg_opp_SIG_STR_pct + R_avg_KD + R_avg_opp_KD +
R_avg_SUB_ATT + R_avg_CTRL_time.seconds. +
R_avg_opp_CTRL_time.seconds. + R_avg_CLINCH_landed +
R_age + R_Reach_cms + R_avg_REV + R_avg_opp_GROUND_landed +
B_avg_opp_GROUND_landed +
B_avg_TD_att +
B_avg_opp_TD_pct +
B_avg_SIG_STR_pct +
B_avg_opp_SIG_STR_pct +
B_avg_KD +
B_avg_opp_KD +
B_avg_SUB_ATT +
B_avg_CTRL_time.seconds.+
B_avg_opp_CTRL_time.seconds.+
B_avg_CLINCH_landed +
B_avg_opp_CLINCH_landed +
B_age +
B_Reach_cms +
B_avg_REV
, data = ufc_train_welter, family = binomial)
summary(m1_welter)
##
## Call:
## glm(formula = win_dummy ~ R_avg_TD_att + R_avg_SIG_STR_pct +
## R_avg_opp_SIG_STR_pct + R_avg_KD + R_avg_opp_KD + R_avg_SUB_ATT +
## R_avg_CTRL_time.seconds. + R_avg_opp_CTRL_time.seconds. +
## R_avg_CLINCH_landed + R_age + R_Reach_cms + R_avg_REV + R_avg_opp_GROUND_landed +
## B_avg_opp_GROUND_landed + B_avg_TD_att + B_avg_opp_TD_pct +
## B_avg_SIG_STR_pct + B_avg_opp_SIG_STR_pct + B_avg_KD + B_avg_opp_KD +
## B_avg_SUB_ATT + B_avg_CTRL_time.seconds. + B_avg_opp_CTRL_time.seconds. +
## B_avg_CLINCH_landed + B_avg_opp_CLINCH_landed + B_age + B_Reach_cms +
## B_avg_REV, family = binomial, data = ufc_train_welter)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.970 -1.004 -0.698 1.103 2.190
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 11.1526290 8.3156360 1.341 0.1799
## R_avg_TD_att -0.0688275 0.0792859 -0.868 0.3853
## R_avg_SIG_STR_pct 1.0977688 1.6492100 0.666 0.5056
## R_avg_opp_SIG_STR_pct 1.0027712 1.6346678 0.613 0.5396
## R_avg_KD -0.3043669 0.4172462 -0.729 0.4657
## R_avg_opp_KD -0.5810565 0.5868028 -0.990 0.3221
## R_avg_SUB_ATT -0.1989581 0.4417806 -0.450 0.6525
## R_avg_CTRL_time.seconds. -0.0022394 0.0017245 -1.299 0.1941
## R_avg_opp_CTRL_time.seconds. 0.0020654 0.0021842 0.946 0.3443
## R_avg_CLINCH_landed 0.0080213 0.0298085 0.269 0.7879
## R_age 0.0412604 0.0396676 1.040 0.2983
## R_Reach_cms -0.0400960 0.0305192 -1.314 0.1889
## R_avg_REV -0.0055806 0.7193079 -0.008 0.9938
## R_avg_opp_GROUND_landed -0.0445277 0.0496850 -0.896 0.3701
## B_avg_opp_GROUND_landed -0.0006107 0.0438096 -0.014 0.9889
## B_avg_TD_att 0.1923105 0.1065373 1.805 0.0711 .
## B_avg_opp_TD_pct 0.1011483 0.7924275 0.128 0.8984
## B_avg_SIG_STR_pct 1.3754170 1.7325420 0.794 0.4273
## B_avg_opp_SIG_STR_pct -3.3820734 1.7434786 -1.940 0.0524 .
## B_avg_KD -0.1656506 0.4020598 -0.412 0.6803
## B_avg_opp_KD 0.1458027 0.5011546 0.291 0.7711
## B_avg_SUB_ATT 0.0441990 0.3282362 0.135 0.8929
## B_avg_CTRL_time.seconds. -0.0002550 0.0022599 -0.113 0.9102
## B_avg_opp_CTRL_time.seconds. 0.0017408 0.0021023 0.828 0.4076
## B_avg_CLINCH_landed -0.0030797 0.0331017 -0.093 0.9259
## B_avg_opp_CLINCH_landed -0.0042818 0.0462423 -0.093 0.9262
## B_age -0.0222009 0.0486248 -0.457 0.6480
## B_Reach_cms -0.0243163 0.0305870 -0.795 0.4266
## B_avg_REV -0.0525628 0.8278034 -0.063 0.9494
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 295.80 on 215 degrees of freedom
## Residual deviance: 267.13 on 187 degrees of freedom
## AIC: 325.13
##
## Number of Fisher Scoring iterations: 4
set.seed(123)
ufc_test_welter$PredProb = predict.glm(m1_welter, ufc_test_welter, type = 'response')
ufc_test_welter$Prediction = ifelse(ufc_test_welter$PredProb >= 0.5,1,0)
caret::confusionMatrix(as.factor(ufc_test_welter$win_dummy), as.factor(ufc_test_welter$Prediction), positive='1')
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 32 14
## 1 23 23
##
## Accuracy : 0.5978
## 95% CI : (0.4904, 0.6988)
## No Information Rate : 0.5978
## P-Value [Acc > NIR] : 0.5450
##
## Kappa : 0.1957
##
## Mcnemar's Test P-Value : 0.1884
##
## Sensitivity : 0.6216
## Specificity : 0.5818
## Pos Pred Value : 0.5000
## Neg Pred Value : 0.6957
## Prevalence : 0.4022
## Detection Rate : 0.2500
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.6017
##
## 'Positive' Class : 1
##
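caret::confusionMatrix() treats its first argument as the predicted classes and its second as the reference (observed) classes. The call above passes win_dummy first, so in the printed table the rows labelled "Prediction" hold the observed outcomes and the "Reference" columns hold the model's predictions. Accuracy is unaffected by the order, but sensitivity trades places with positive predictive value, and specificity with negative predictive value. The base-R sketch below shows this on made-up vectors; the toy data and the helper `class_metrics` are illustrative only.

```r
# Toy observed outcomes and predictions (made up for illustration)
observed  <- factor(c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0), levels = c(0, 1))
predicted <- factor(c(1, 1, 1, 0, 1, 1, 0, 0, 0, 0), levels = c(0, 1))

# Metrics from a 2x2 table whose rows are predictions and columns the truth
class_metrics <- function(pred, truth, positive = "1") {
  tab <- table(Prediction = pred, Reference = truth)
  negative <- setdiff(levels(truth), positive)
  tp <- tab[positive, positive]
  tn <- tab[negative, negative]
  fp <- tab[positive, negative]
  fn <- tab[negative, positive]
  c(accuracy    = (tp + tn) / sum(tab),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    ppv         = tp / (tp + fp))
}

correct <- class_metrics(predicted, observed)   # predictions first
swapped <- class_metrics(observed, predicted)   # arguments reversed
print(round(rbind(correct, swapped), 3))
```

In the printed comparison, the accuracy column is identical in both rows, while the sensitivity of one row equals the ppv of the other.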
set.seed(123)
svm.idx_welter <- sample(1:nrow(ufc_2), size = 0.70 * nrow(ufc_2))
train_svm_data_welter <- ufc_2[svm.idx_welter,]
test_svm_data_welter <- ufc_2[-svm.idx_welter,]
# Set up Repeated k-fold Cross Validation
train_control <- trainControl(method="repeatedcv", number=10, repeats=3)
set.seed(123)
svm_welter = svm(formula = win_dummy ~ R_avg_TD_att + R_avg_opp_TD_pct + R_avg_SIG_STR_pct + R_avg_opp_SIG_STR_pct + R_avg_KD + R_avg_opp_KD +
R_avg_SUB_ATT + R_avg_CTRL_time.seconds. +
R_avg_opp_CTRL_time.seconds. + R_avg_CLINCH_landed +
R_age + R_Reach_cms + R_avg_REV + R_avg_opp_GROUND_landed +
B_avg_opp_GROUND_landed +
B_avg_TD_att +
B_avg_opp_TD_pct +
B_avg_SIG_STR_pct +
B_avg_opp_SIG_STR_pct +
B_avg_KD +
B_avg_opp_KD +
B_avg_SUB_ATT +
B_avg_CTRL_time.seconds.+
B_avg_opp_CTRL_time.seconds.+
B_avg_CLINCH_landed +
B_avg_opp_CLINCH_landed +
B_age +
B_Reach_cms +
B_avg_REV,
data = train_svm_data,
trControl = train_control,
type = 'C-classification',
kernel = 'linear',
preProcess = c("center","scale"),
tuneGrid = expand.grid(C = seq(0, 2, length = 20)))
summary(svm_welter)
##
## Call:
## svm(formula = win_dummy ~ R_avg_TD_att + R_avg_opp_TD_pct + R_avg_SIG_STR_pct +
## R_avg_opp_SIG_STR_pct + R_avg_KD + R_avg_opp_KD + R_avg_SUB_ATT +
## R_avg_CTRL_time.seconds. + R_avg_opp_CTRL_time.seconds. + R_avg_CLINCH_landed +
## R_age + R_Reach_cms + R_avg_REV + R_avg_opp_GROUND_landed + B_avg_opp_GROUND_landed +
## B_avg_TD_att + B_avg_opp_TD_pct + B_avg_SIG_STR_pct + B_avg_opp_SIG_STR_pct +
## B_avg_KD + B_avg_opp_KD + B_avg_SUB_ATT + B_avg_CTRL_time.seconds. +
## B_avg_opp_CTRL_time.seconds. + B_avg_CLINCH_landed + B_avg_opp_CLINCH_landed +
## B_age + B_Reach_cms + B_avg_REV, data = train_svm_data, trControl = train_control,
## type = "C-classification", kernel = "linear", preProcess = c("center",
## "scale"), tuneGrid = expand.grid(C = seq(0, 2, length = 20)))
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 1056
##
## ( 529 527 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
set.seed(123)
svm.pred_1_welter = predict(svm_welter,newdata = test_svm_data_welter)
caret::confusionMatrix(as.factor(svm.pred_1_welter), as.factor(test_svm_data_welter$win_dummy), positive = '1')
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 224 121
## 1 100 122
##
## Accuracy : 0.6102
## 95% CI : (0.5687, 0.6506)
## No Information Rate : 0.5714
## P-Value [Acc > NIR] : 0.03363
##
## Kappa : 0.1955
##
## Mcnemar's Test P-Value : 0.17851
##
## Sensitivity : 0.5021
## Specificity : 0.6914
## Pos Pred Value : 0.5495
## Neg Pred Value : 0.6493
## Prevalence : 0.4286
## Detection Rate : 0.2152
## Detection Prevalence : 0.3915
## Balanced Accuracy : 0.5967
##
## 'Positive' Class : 1
##
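The SVM model appears to perform better than the Welterweight Logistic Regression model: its accuracy (0.6102) is significantly above its no-information rate of 0.5714 (p = 0.034), while the logistic model's accuracy (0.5978) only equals its no-information rate (p = 0.545). Note, however, that the SVM was trained on train_svm_data and tested on a 30% split of ufc_2 (567 fights) rather than on the 92 Welterweight test fights, so the two results are not directly comparable.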
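e1071::svm() does not use caret's trControl, preProcess, or tuneGrid arguments. The summary above reports cost: 1, which is svm()'s default, suggesting those settings had no effect. If the goal is to tune the cost by cross-validation, one option within e1071 is tune.svm(). The sketch below runs on simulated data: the data frame `toy`, its columns, and the cost grid are placeholders chosen for illustration, not the UFC variables.

```r
library(e1071)

set.seed(123)
# Simulated two-class data standing in for the fight statistics
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$win_dummy <- factor(ifelse(toy$x1 + 0.5 * toy$x2 + rnorm(n, sd = 0.5) > 0, 1, 0))

# 70/30 split, mirroring the split used in this analysis
idx   <- sample(seq_len(n), size = round(0.7 * n))
train <- toy[idx, ]
test  <- toy[-idx, ]

# 5-fold cross-validation over a small grid of strictly positive costs
tuned <- tune.svm(win_dummy ~ ., data = train,
                  kernel = "linear",
                  cost = c(0.1, 0.5, 1, 2),
                  tunecontrol = tune.control(sampling = "cross", cross = 5))

print(tuned$best.parameters)          # cost with the lowest CV error

# Hold-out accuracy of the best model
pred <- predict(tuned$best.model, newdata = test)
acc  <- mean(pred == test$win_dummy)
print(acc)
```

svm() centers and scales the predictors by default (scale = TRUE), so no separate preprocessing step is needed in this sketch.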
Lightweight Logistic Model
ufc_light_weight <- filter(ufc_2, weight_class == 'Lightweight')
set.seed(123)
sample_light <- sample.int(n = nrow(ufc_light_weight), size = round(.7*nrow(ufc_light_weight)), replace = FALSE)
ufc_train_light <- ufc_light_weight[sample_light, ]
ufc_test_light <- ufc_light_weight[-sample_light, ]
set.seed(123)
m1_light = glm(win_dummy ~ R_avg_TD_att + R_avg_SIG_STR_pct + R_avg_opp_SIG_STR_pct + R_avg_KD + R_avg_opp_KD +
R_avg_SUB_ATT + R_avg_CTRL_time.seconds. +
R_avg_opp_CTRL_time.seconds. + R_avg_CLINCH_landed +
R_age + R_Reach_cms + R_avg_REV + R_avg_opp_GROUND_landed +
B_avg_opp_GROUND_landed +
B_avg_TD_att +
B_avg_opp_TD_pct +
B_avg_SIG_STR_pct +
B_avg_opp_SIG_STR_pct +
B_avg_KD +
B_avg_opp_KD +
B_avg_SUB_ATT +
B_avg_CTRL_time.seconds.+
B_avg_opp_CTRL_time.seconds.+
B_avg_CLINCH_landed +
B_avg_opp_CLINCH_landed +
B_age +
B_Reach_cms +
B_avg_REV
, data = ufc_train_light, family = binomial)
summary(m1_light)
##
## Call:
## glm(formula = win_dummy ~ R_avg_TD_att + R_avg_SIG_STR_pct +
## R_avg_opp_SIG_STR_pct + R_avg_KD + R_avg_opp_KD + R_avg_SUB_ATT +
## R_avg_CTRL_time.seconds. + R_avg_opp_CTRL_time.seconds. +
## R_avg_CLINCH_landed + R_age + R_Reach_cms + R_avg_REV + R_avg_opp_GROUND_landed +
## B_avg_opp_GROUND_landed + B_avg_TD_att + B_avg_opp_TD_pct +
## B_avg_SIG_STR_pct + B_avg_opp_SIG_STR_pct + B_avg_KD + B_avg_opp_KD +
## B_avg_SUB_ATT + B_avg_CTRL_time.seconds. + B_avg_opp_CTRL_time.seconds. +
## B_avg_CLINCH_landed + B_avg_opp_CLINCH_landed + B_age + B_Reach_cms +
## B_avg_REV, family = binomial, data = ufc_train_light)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2711 -0.8706 -0.2447 0.8901 2.2942
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.333e+01 1.023e+01 -1.303 0.19274
## R_avg_TD_att -3.242e-02 9.705e-02 -0.334 0.73836
## R_avg_SIG_STR_pct -5.380e-01 2.088e+00 -0.258 0.79667
## R_avg_opp_SIG_STR_pct 6.156e+00 2.209e+00 2.787 0.00532 **
## R_avg_KD -1.075e+00 7.169e-01 -1.499 0.13384
## R_avg_opp_KD -2.511e-01 7.068e-01 -0.355 0.72239
## R_avg_SUB_ATT -8.658e-01 4.294e-01 -2.016 0.04377 *
## R_avg_CTRL_time.seconds. -2.404e-03 2.319e-03 -1.037 0.29982
## R_avg_opp_CTRL_time.seconds. 3.240e-03 2.756e-03 1.175 0.23980
## R_avg_CLINCH_landed 3.786e-02 4.835e-02 0.783 0.43354
## R_age 1.447e-01 5.250e-02 2.755 0.00586 **
## R_Reach_cms 8.371e-04 3.376e-02 0.025 0.98022
## R_avg_REV -8.152e-01 8.426e-01 -0.967 0.33331
## R_avg_opp_GROUND_landed -4.754e-02 3.807e-02 -1.249 0.21169
## B_avg_opp_GROUND_landed -3.628e-02 3.250e-02 -1.116 0.26426
## B_avg_TD_att 7.334e-02 9.808e-02 0.748 0.45463
## B_avg_opp_TD_pct -2.348e+00 1.016e+00 -2.310 0.02087 *
## B_avg_SIG_STR_pct 4.350e-02 1.774e+00 0.025 0.98043
## B_avg_opp_SIG_STR_pct 2.050e-01 1.815e+00 0.113 0.91007
## B_avg_KD -5.188e-01 5.698e-01 -0.910 0.36258
## B_avg_opp_KD -6.366e-01 6.055e-01 -1.051 0.29304
## B_avg_SUB_ATT 3.843e-01 4.579e-01 0.839 0.40124
## B_avg_CTRL_time.seconds. 1.851e-03 2.244e-03 0.825 0.40953
## B_avg_opp_CTRL_time.seconds. -3.875e-04 2.116e-03 -0.183 0.85470
## B_avg_CLINCH_landed -4.548e-02 4.401e-02 -1.033 0.30142
## B_avg_opp_CLINCH_landed 3.478e-02 4.696e-02 0.741 0.45891
## B_age -9.593e-02 5.655e-02 -1.696 0.08981 .
## B_Reach_cms 5.741e-02 3.651e-02 1.572 0.11589
## B_avg_REV -3.998e-01 8.045e-01 -0.497 0.61924
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 271.69 on 195 degrees of freedom
## Residual deviance: 212.11 on 167 degrees of freedom
## AIC: 270.11
##
## Number of Fisher Scoring iterations: 4
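Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. Below is a short sketch using three estimates copied from the Lightweight summary above; each odds ratio assumes the other predictors are held fixed.

```r
# Estimates copied from the Lightweight summary (log-odds scale)
est <- c(R_age            = 1.447e-01,
         R_avg_SUB_ATT    = -8.658e-01,
         B_avg_opp_TD_pct = -2.348e+00)

# Odds ratio for a one-unit increase in each predictor
odds_ratio <- exp(est)
print(round(odds_ratio, 3))

# Percentage change in the odds of win_dummy = 1 per one-unit increase
print(round(100 * (odds_ratio - 1), 1))
```

For example, each additional year of R_age multiplies the odds of win_dummy = 1 by about 1.16. With the fitted object available, `exp(coef(m1_light))` gives the full set.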
set.seed(123)
ufc_test_light$PredProb = predict.glm(m1_light, ufc_test_light, type = 'response')
ufc_test_light$Prediction = ifelse(ufc_test_light$PredProb >= 0.5,1,0)
caret::confusionMatrix(as.factor(ufc_test_light$win_dummy), as.factor(ufc_test_light$Prediction), positive='1')
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 27 22
## 1 16 19
##
## Accuracy : 0.5476
## 95% CI : (0.4352, 0.6566)
## No Information Rate : 0.5119
## P-Value [Acc > NIR] : 0.2930
##
## Kappa : 0.0916
##
## Mcnemar's Test P-Value : 0.4173
##
## Sensitivity : 0.4634
## Specificity : 0.6279
## Pos Pred Value : 0.5429
## Neg Pred Value : 0.5510
## Prevalence : 0.4881
## Detection Rate : 0.2262
## Detection Prevalence : 0.4167
## Balanced Accuracy : 0.5457
##
## 'Positive' Class : 1
##
Heavyweight Logistic Model
ufc_heavy_weight <- filter(ufc_2, weight_class == 'Heavyweight')
set.seed(123)
sample_heavy <- sample.int(n = nrow(ufc_heavy_weight), size = round(.7*nrow(ufc_heavy_weight)), replace = FALSE)
ufc_train_heavy <- ufc_heavy_weight[sample_heavy, ]
ufc_test_heavy <- ufc_heavy_weight[-sample_heavy, ]
set.seed(123)
m1_heavy = glm(win_dummy ~ R_avg_TD_att + R_avg_SIG_STR_pct + R_avg_opp_SIG_STR_pct + R_avg_KD + R_avg_opp_KD +
R_avg_SUB_ATT + R_avg_CTRL_time.seconds. +
R_avg_opp_CTRL_time.seconds. + R_avg_CLINCH_landed +
R_age + R_Reach_cms + R_avg_REV + R_avg_opp_GROUND_landed +
B_avg_opp_GROUND_landed +
B_avg_TD_att +
B_avg_opp_TD_pct +
B_avg_SIG_STR_pct +
B_avg_opp_SIG_STR_pct +
B_avg_KD +
B_avg_opp_KD +
B_avg_SUB_ATT +
B_avg_CTRL_time.seconds.+
B_avg_opp_CTRL_time.seconds.+
B_avg_CLINCH_landed +
B_avg_opp_CLINCH_landed +
B_age +
B_Reach_cms +
B_avg_REV
, data = ufc_train_heavy, family = binomial)
summary(m1_heavy)
##
## Call:
## glm(formula = win_dummy ~ R_avg_TD_att + R_avg_SIG_STR_pct +
## R_avg_opp_SIG_STR_pct + R_avg_KD + R_avg_opp_KD + R_avg_SUB_ATT +
## R_avg_CTRL_time.seconds. + R_avg_opp_CTRL_time.seconds. +
## R_avg_CLINCH_landed + R_age + R_Reach_cms + R_avg_REV + R_avg_opp_GROUND_landed +
## B_avg_opp_GROUND_landed + B_avg_TD_att + B_avg_opp_TD_pct +
## B_avg_SIG_STR_pct + B_avg_opp_SIG_STR_pct + B_avg_KD + B_avg_opp_KD +
## B_avg_SUB_ATT + B_avg_CTRL_time.seconds. + B_avg_opp_CTRL_time.seconds. +
## B_avg_CLINCH_landed + B_avg_opp_CLINCH_landed + B_age + B_Reach_cms +
## B_avg_REV, family = binomial, data = ufc_train_heavy)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8910 -0.8163 -0.2096 0.8473 2.5762
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.233829 13.715522 0.090 0.92832
## R_avg_TD_att -0.042351 0.208952 -0.203 0.83938
## R_avg_SIG_STR_pct -3.670197 2.939163 -1.249 0.21177
## R_avg_opp_SIG_STR_pct 1.327193 2.771365 0.479 0.63201
## R_avg_KD -2.208979 1.303984 -1.694 0.09026 .
## R_avg_opp_KD 1.523529 1.287393 1.183 0.23664
## R_avg_SUB_ATT 1.090745 0.887356 1.229 0.21899
## R_avg_CTRL_time.seconds. -0.000126 0.004202 -0.030 0.97608
## R_avg_opp_CTRL_time.seconds. -0.002226 0.003528 -0.631 0.52807
## R_avg_CLINCH_landed 0.024421 0.048826 0.500 0.61696
## R_age 0.032069 0.084399 0.380 0.70397
## R_Reach_cms -0.064619 0.046894 -1.378 0.16821
## R_avg_REV 4.155635 3.451160 1.204 0.22854
## R_avg_opp_GROUND_landed 0.035214 0.079738 0.442 0.65876
## B_avg_opp_GROUND_landed -0.096923 0.068530 -1.414 0.15727
## B_avg_TD_att 0.702107 0.280481 2.503 0.01231 *
## B_avg_opp_TD_pct 0.856574 1.495334 0.573 0.56676
## B_avg_SIG_STR_pct 9.883988 4.132554 2.392 0.01677 *
## B_avg_opp_SIG_STR_pct 9.992761 3.904796 2.559 0.01049 *
## B_avg_KD -0.789811 0.991972 -0.796 0.42591
## B_avg_opp_KD -1.566898 1.345544 -1.165 0.24422
## B_avg_SUB_ATT 0.383411 1.304997 0.294 0.76891
## B_avg_CTRL_time.seconds. -0.020603 0.007057 -2.919 0.00351 **
## B_avg_opp_CTRL_time.seconds. 0.007813 0.005055 1.546 0.12222
## B_avg_CLINCH_landed 0.224601 0.100063 2.245 0.02479 *
## B_avg_opp_CLINCH_landed -0.242008 0.099980 -2.421 0.01550 *
## B_age -0.036376 0.072053 -0.505 0.61367
## B_Reach_cms 0.019278 0.045433 0.424 0.67133
## B_avg_REV -1.361957 1.885133 -0.722 0.47000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 141.049 on 101 degrees of freedom
## Residual deviance: 98.066 on 73 degrees of freedom
## AIC: 156.07
##
## Number of Fisher Scoring iterations: 6
set.seed(123)
ufc_test_heavy$PredProb = predict.glm(m1_heavy, ufc_test_heavy, type = 'response')
ufc_test_heavy$Prediction = ifelse(ufc_test_heavy$PredProb >= 0.5,1,0)
caret::confusionMatrix(as.factor(ufc_test_heavy$win_dummy), as.factor(ufc_test_heavy$Prediction), positive='1')
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 11 12
## 1 13 8
##
## Accuracy : 0.4318
## 95% CI : (0.2835, 0.5897)
## No Information Rate : 0.5455
## P-Value [Acc > NIR] : 0.9518
##
## Kappa : -0.1411
##
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.4000
## Specificity : 0.4583
## Pos Pred Value : 0.3810
## Neg Pred Value : 0.4783
## Prevalence : 0.4545
## Detection Rate : 0.1818
## Detection Prevalence : 0.4773
## Balanced Accuracy : 0.4292
##
## 'Positive' Class : 1
##
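Our Heavyweight model performs poorly: its test accuracy (0.4318) is below the no-information rate (0.5455) and its Kappa is negative (-0.1411). The model was fit to only 102 training fights with 28 predictors, so it may be overfitting. Refitting it with only the predictors that correlate most strongly with the outcome in this weight class could show whether performance improves.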
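One possible way to narrow the predictor set is backward stepwise selection by AIC with stats::step(). The sketch below runs on simulated data (the data frame `sim` and its columns are placeholders); on the real data the starting model would be m1_heavy. Any selected model should still be checked on held-out fights.

```r
set.seed(123)
# Simulated data: only x1 and x2 actually drive the outcome
n <- 150
sim <- data.frame(matrix(rnorm(n * 6), ncol = 6))
names(sim) <- paste0("x", 1:6)
sim$win_dummy <- rbinom(n, 1, plogis(1.2 * sim$x1 - 1.0 * sim$x2))

# Full logistic model with all six candidate predictors
full <- glm(win_dummy ~ ., data = sim, family = binomial)

# Backward elimination by AIC; trace = 0 suppresses the step-by-step log
reduced <- step(full, direction = "backward", trace = 0)

print(formula(reduced))
print(c(full = AIC(full), reduced = AIC(reduced)))
```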
Conclusion - We’ve shown that it may be possible to build an effective model to predict the winner of a UFC fight. The models could be improved further by fitting a separate model for each weight class and choosing independent variables that correlate strongly with the outcome in that class. Predicting the winner of a UFC fight is likely very difficult because of the unpredictability of fighters’ decision making, differences in fight preparation, and the nature of the sport in general. Still, we’ve shown the potential to make better-informed predictions of UFC fight outcomes.
We recommend continuing to experiment with variable selection for each weight-class model and trying different kernels for the SVM models. Another suggestion is to create new variables that grade fighters’ abilities.
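As a starting point for the kernel comparison, e1071::svm() can report k-fold cross-validated accuracy through its cross argument. The sketch below is minimal and runs on simulated data: the data frame `toy` is a placeholder for a weight-class subset, and the kernels and fold count are illustrative choices.

```r
library(e1071)

set.seed(123)
# Simulated two-class data with a non-linear boundary
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$win_dummy <- factor(ifelse(toy$x1^2 + toy$x2^2 > 1.5, 1, 0))

kernels <- c("linear", "polynomial", "radial")

# 5-fold cross-validated accuracy (in percent) for each kernel
cv_accuracy <- sapply(kernels, function(k) {
  fit <- svm(win_dummy ~ ., data = toy,
             type = "C-classification", kernel = k, cross = 5)
  fit$tot.accuracy
})

print(round(cv_accuracy, 1))
```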