Import the necessary packages. This analysis includes Logistic Regression, Random Forest, and SVM modeling.

library(tidyverse)
library(caret)      # confusion matrix
library(gam)        # generalized additive models; used for our logistic regression
library(lubridate)  # format and filter dates
library(randomForest)
library(e1071)
library(ggcorrplot)
library(car)

Import the data.

ufc <- read.csv("data.csv")

Project Overview

For this project, I build Logistic Regression, Random Forest, and SVM classification models to predict the winner of Red vs. Blue corner fights in the UFC using various fight statistics. In the UFC, the Red Corner fighter is typically the higher-ranked fighter on the roster: they usually have more UFC fights and hold a higher rank because of their success. The Blue Corner fighter is typically the challenger, often has fewer fights than the Red Corner fighter, and is considered the underdog. Matchmaking is decided by UFC executives and athletic commissions, and fights are often made based on a fighter's popularity as well. In this article, UFC president Dana White comments on the matchmaking process: (https://mmajunkie.usatoday.com/2021/01/ufc-news-dana-white-discusses-matchmaking-strategy). For these reasons, matchmaking in the UFC can be biased. Data can still help us discern which skills make a fighter likely to win, and which fighters will win based on their abilities and demographics alone. At the end of this analysis we conclude, based on the results, that it is possible to build an effective predictive model.

Overview of Data Source

This dataset (data.csv) comes from the Kaggle user RAJEEV WARRIER. Link here: https://www.kaggle.com/datasets/rajeevw/ufcdata

This dataset covers every UFC fight in the history of the organisation up to March 2021. Every row contains information about both fighters, the fight details, and the winner. The data was scraped from the UFC Stats website.

Each row is a compilation of both fighters' stats. Fighters are represented by ‘red’ and ‘blue’ (for the red and blue corners). For instance, the red fighter's columns hold the compiled average stats of all their fights except the current one. The stats include the damage done by the red fighter to their opponents and the damage done by opponents to the red fighter (represented by ‘opp’ in the column names) across all of that fighter's previous fights, excluding the current one, which has not yet occurred from the data's point of view. The same information exists for the blue fighter. The target variable is ‘Winner’, the only column that tells you what happened in that fight.

As an example, we can look at UFC fighter Max Holloway’s fight records. Max Holloway is known for his boxing. If we look at his Average Significant Strikes Landed (the R_ or B_ column, depending on his corner color), his compiled stats increase throughout most of his career. Going into his fight against Brian Ortega, Max had an avg_SIG_STR_landed of 134.28. During that fight, Max landed a high 290 significant strikes against Ortega. Those 290 strikes do not appear in the Ortega row, but they were folded into Max’s average, which is why it rose to 212.14 going into his fight against Dustin Poirier.
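The jump from 134.28 to 212.14 can be reproduced with a short calculation. The update rule used here (new average = mean of the previous average and the latest fight's total) is an assumption inferred from these two numbers only; the source does not document how the averages were compiled.

```r
# Values taken from the text and from the Ortega row of the table
avg_before_ortega <- 134.28591  # R_avg_SIG_STR_landed going into the Ortega fight
landed_vs_ortega  <- 290        # significant strikes landed in that fight

# Assumed update rule (inferred, not documented in the source):
# the new average is the mean of the previous average and the latest fight
avg_after_ortega <- (avg_before_ortega + landed_vs_ortega) / 2

print(round(avg_after_ortega, 2))
```

Under this assumption the result agrees with the 212.14 shown for the Poirier fight, which would mean recent fights weigh more heavily than older ones in the compiled averages.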

ufc_max_holloway <- ufc %>% filter_all(any_vars(grepl("Max Holloway", .)))

max_holloway <- ufc_max_holloway[, c('R_fighter', 'B_fighter', 'R_avg_SIG_STR_landed', 'B_avg_SIG_STR_landed', 'Winner')]

max_holloway

##                R_fighter             B_fighter R_avg_SIG_STR_landed
## 1           Max Holloway         Calvin Kattar            125.19643
## 2  Alexander Volkanovski          Max Holloway            119.66406
## 3           Max Holloway Alexander Volkanovski            162.78574
## 4           Max Holloway         Frankie Edgar            196.57148
## 5           Max Holloway        Dustin Poirier            212.14296
## 6           Max Holloway          Brian Ortega            134.28591
## 7           Max Holloway             Jose Aldo             94.57182
## 8              Jose Aldo          Max Holloway             59.71484
## 9           Max Holloway        Anthony Pettis             76.28729
## 10          Max Holloway         Ricardo Lamas             49.57458
## 11          Max Holloway       Jeremy Stephens             42.14917
## 12          Max Holloway      Charles Oliveira             70.29834
## 13           Cub Swanson          Max Holloway             47.70312
## 14          Max Holloway           Cole Miller             60.19336
## 15       Akira Corassani          Max Holloway             23.87500
## 16          Max Holloway          Clay Collard             54.77344
## 17          Max Holloway            Andre Fili             62.54688
## 18          Max Holloway            Will Chope             51.09375
## 19        Conor McGregor          Max Holloway             21.00000
## 20       Dennis Bermudez          Max Holloway             71.75000
## 21        Leonard Garcia          Max Holloway             32.89062
## 22       Justin Lawrence          Max Holloway             54.00000
## 23          Max Holloway         Pat Schilling             11.00000
## 24        Dustin Poirier          Max Holloway             47.00000
##    B_avg_SIG_STR_landed Winner
## 1              79.51562    Red
## 2             148.39287    Red
## 3              82.32812   Blue
## 4              51.25104    Red
## 5              85.47866   Blue
## 6              32.09375    Red
## 7              57.35742    Red
## 8              85.14365   Blue
## 9              50.19727    Red
## 10             36.47266    Red
## 11             42.82710    Red
## 12             48.88184    Red
## 13             58.59668   Blue
## 14             34.56369    Red
## 15             89.38672   Blue
## 16                   NA    Red
## 17             55.00000    Red
## 18                   NA    Red
## 19             79.18750    Red
## 20             83.37500    Red
## 21             46.75000   Blue
## 22             64.50000   Blue
## 23              4.00000    Red
## 24                   NA    Red

Data Dictionary

This dataset (data.csv) is the partially processed file. All feature engineering has been included and every row is a compilation of info about each fighter up until that fight. The data has not been one hot encoded or processed for missing data. Further processing can be done.

Note: Percentage columns have already been converted to decimals.

Column definitions:

R_ and B_ prefix - signifies red and blue corner fighter stats respectively

opp prefix containing columns is the average damage done by the opponent on the fighter

KD - is number of knockdowns

SIG_STR is no. of significant strikes ‘landed of attempted’

SIG_STR_pct is significant strikes percentage

TOTAL_STR is total strikes ‘landed of attempted’

TD is no. of take downs

TD_pct is take down percentages

SUB_ATT is no. of submission attempts

PASS is no. of times the guard was passed

REV is the no. of Reversals landed

HEAD is no. of significant strikes to the head ‘landed of attempted’

BODY is no. of significant strikes to the body ‘landed of attempted’

CLINCH is no. of significant strikes in the clinch ‘landed of attempted’

GROUND is no. of significant strikes on the ground ‘landed of attempted’

win_by is method of win

last_round is last round of the fight (ex. if it was a KO in 1st, then this will be 1)

last_round_time is when the fight ended in the last round

Format is the format of the fight (3 rounds, 5 rounds etc.)

Referee is the name of the Ref

date is the date of the fight

location is the location in which the event took place

Fight_type is which weight class and whether it’s a title bout or not

Winner is the winner of the fight

Stance is the stance of the fighter (orthodox, southpaw, etc.)

Height_cms is the height in centimeter

Reach_cms is the reach of the fighter (arm span) in centimeter

Weight_lbs is the weight of the fighter in pounds (lbs)

age is the age of the fighter

title_bout Boolean value of whether it is title fight or not

weight_class is which weight class the fight is in (Bantamweight, heavyweight, Women’s flyweight, etc.)

no_of_rounds is the number of rounds the fight was scheduled for

current_lose_streak is the fighter’s current number of consecutive losses

current_win_streak is the fighter’s current number of consecutive wins

draw is the number of draws in the fighter’s ufc career

wins is the number of wins in the fighter’s ufc career

losses is the number of losses in the fighter’s ufc career

total_rounds_fought is the total number of rounds fought by the fighter

total_time_fought(seconds) is the count of total time spent fighting in seconds

total_title_bouts is the total number of title bouts taken part in by the fighter

win_by_Decision_Majority is the number of wins by majority judges decision in the fighter’s ufc career

win_by_Decision_Split is the number of wins by split judges decision in the fighter’s ufc career

win_by_Decision_Unanimous is the number of wins by unanimous judges decision in the fighter’s ufc career

win_by_KO/TKO is the number of wins by knockout in the fighter’s ufc career

win_by_Submission is the number of wins by submission in the fighter’s ufc career

win_by_TKO_Doctor_Stoppage is the number of wins by doctor stoppage in the fighter’s ufc career

Weight class upper limits (lb and kg measurements)

Heavyweight: 265 lb (120.2 kg)

Light Heavyweight: 205 lb (93.0 kg)

Middleweight: 185 lb (83.9 kg)

Welterweight: 170 lb (77.1 kg)

Lightweight: 155 lb (70.3 kg)

Featherweight: 145 lb (65.8 kg)

Bantamweight: 135 lb (61.2 kg)

Flyweight: 125 lb (56.7 kg)

Strawweight: 115 lb (52.2 kg)
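The kilogram figures follow from the pound limits by a unit conversion. A minimal sketch, assuming the standard factor of 0.45359237 kg per pound; the helper name lb_to_kg is made up for illustration.

```r
# Hypothetical helper: convert pounds to kilograms, rounded to one decimal
lb_to_kg <- function(lb) round(lb * 0.45359237, 1)

# Upper limits in pounds, as listed in this section
limits_lb <- c(
  Heavyweight = 265, LightHeavyweight = 205, Middleweight = 185,
  Welterweight = 170, Lightweight = 155, Featherweight = 145,
  Bantamweight = 135, Flyweight = 125, Strawweight = 115
)

limits_kg <- lb_to_kg(limits_lb)
print(data.frame(lb = limits_lb, kg = limits_kg))
```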

Weigh in policy before fights

Neither fighter may weigh more than the upper limit of their division at the weigh-ins, which take place the day before the fight. Note: fighters cut weight to make the division they wish to fight in, so both opponents weigh in at or below the same limit.

Exploratory Data Analysis

Here we observe the names and data types of our columns.

str(ufc)

## 'data.frame':    6012 obs. of  144 variables:
##  $ R_fighter                   : chr  "Adrian Yanez" "Trevin Giles" "Tai Tuivasa" "Cheyanne Buys" ...
##  $ B_fighter                   : chr  "Gustavo Lopez" "Roman Dolidze" "Harry Hunsucker" "Montserrat Conejo" ...
##  $ Referee                     : chr  "Chris Tognoni" "Herb Dean" "Herb Dean" "Mark Smith" ...
##  $ date                        : chr  "2021-03-20" "2021-03-20" "2021-03-20" "2021-03-20" ...
##  $ location                    : chr  "Las Vegas, Nevada, USA" "Las Vegas, Nevada, USA" "Las Vegas, Nevada, USA" "Las Vegas, Nevada, USA" ...
##  $ Winner                      : chr  "Red" "Red" "Red" "Blue" ...
##  $ title_bout                  : chr  "False" "False" "False" "False" ...
##  $ weight_class                : chr  "Bantamweight" "Middleweight" "Heavyweight" "WomenStrawweight" ...
##  $ B_avg_KD                    : num  0 0.5 NA NA 0.125 ...
##  $ B_avg_opp_KD                : num  0 0 NA NA 0 0 0.125 0 NA NA ...
##  $ B_avg_SIG_STR_pct           : num  0.42 0.66 NA NA 0.536 ...
##  $ B_avg_opp_SIG_STR_pct       : num  0.495 0.305 NA NA 0.579 ...
##  $ B_avg_TD_pct                : num  0.33 0.3 NA NA 0.185 ...
##  $ B_avg_opp_TD_pct            : num  0.36 0.5 NA NA 0.166 ...
##  $ B_avg_SUB_ATT               : num  0.5 1.5 NA NA 0.125 ...
##  $ B_avg_opp_SUB_ATT           : num  1 0 NA NA 0.188 ...
##  $ B_avg_REV                   : num  0 0 NA NA 0.25 ...
##  $ B_avg_opp_REV               : num  0 0 NA NA 0 ...
##  $ B_avg_SIG_STR_att           : num  50 65.5 NA NA 109.2 ...
##  $ B_avg_SIG_STR_landed        : num  20 35 NA NA 57.9 ...
##  $ B_avg_opp_SIG_STR_att       : num  84 50 NA NA 50.6 ...
##  $ B_avg_opp_SIG_STR_landed    : num  45 16.5 NA NA 28.4 ...
##  $ B_avg_TOTAL_STR_att         : num  76.5 113.5 NA NA 170.4 ...
##  $ B_avg_TOTAL_STR_landed      : num  41 68.5 NA NA 105.6 ...
##  $ B_avg_opp_TOTAL_STR_att     : num  114 68.5 NA NA 74.4 ...
##  $ B_avg_opp_TOTAL_STR_landed  : num  64 29 NA NA 44.2 ...
##  $ B_avg_TD_att                : num  1.5 2.5 NA NA 5.38 ...
##  $ B_avg_TD_landed             : num  1 1.5 NA NA 1.5 ...
##  $ B_avg_opp_TD_att            : num  9 0.5 NA NA 2 ...
##  $ B_avg_opp_TD_landed         : num  6.5 0.5 NA NA 0.625 ...
##  $ B_avg_HEAD_att              : num  39.5 46 NA NA 77.4 ...
##  $ B_avg_HEAD_landed           : num  11 20 NA NA 31.4 ...
##  $ B_avg_opp_HEAD_att          : num  63 36 NA NA 41.6 ...
##  $ B_avg_opp_HEAD_landed       : num  27.5 7.5 NA NA 22.6 ...
##  $ B_avg_BODY_att              : num  7.5 12 NA NA 31.2 ...
##  $ B_avg_BODY_landed           : num  7 8 NA NA 26.2 ...
##  $ B_avg_opp_BODY_att          : num  12 8 NA NA 7.69 ...
##  $ B_avg_opp_BODY_landed       : num  9 3 NA NA 4.94 ...
##  $ B_avg_LEG_att               : num  3 7.5 NA NA 0.625 ...
##  $ B_avg_LEG_landed            : num  2 7 NA NA 0.375 ...
##  $ B_avg_opp_LEG_att           : num  9 6 NA NA 1.38 ...
##  $ B_avg_opp_LEG_landed        : num  8.5 6 NA NA 0.875 ...
##  $ B_avg_DISTANCE_att          : num  35 58 NA NA 33.6 ...
##  $ B_avg_DISTANCE_landed       : num  12.5 30 NA NA 11 ...
##  $ B_avg_opp_DISTANCE_att      : num  43.5 48 NA NA 32.1 ...
##  $ B_avg_opp_DISTANCE_landed   : num  17.5 15.5 NA NA 13.9 ...
##  $ B_avg_CLINCH_att            : num  10.5 0.5 NA NA 39.1 ...
##  $ B_avg_CLINCH_landed         : num  4.5 0.5 NA NA 28.8 ...
##  $ B_avg_opp_CLINCH_att        : num  4 0.5 NA NA 13.3 ...
##  $ B_avg_opp_CLINCH_landed     : num  3 0.5 NA NA 10.8 ...
##  $ B_avg_GROUND_att            : num  4.5 7 NA NA 36.6 ...
##  $ B_avg_GROUND_landed         : num  3 4.5 NA NA 18.1 ...
##  $ B_avg_opp_GROUND_att        : num  36.5 1.5 NA NA 5.25 ...
##  $ B_avg_opp_GROUND_landed     : num  24.5 0.5 NA NA 3.75 ...
##  $ B_avg_CTRL_time.seconds.    : num  34 220 NA NA 390 ...
##  $ B_avg_opp_CTRL_time.seconds.: num  277.5 24.5 NA NA 156.3 ...
##  $ B_total_time_fought.seconds.: num  532 578 NA NA 764 ...
##  $ B_total_rounds_fought       : int  4 4 0 0 11 10 28 23 0 0 ...
##  $ B_total_title_bouts         : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ B_current_win_streak        : int  0 2 0 0 3 4 0 0 0 0 ...
##  $ B_current_lose_streak       : int  1 0 0 0 0 0 1 1 0 0 ...
##  $ B_longest_win_streak        : int  1 2 0 0 3 4 1 5 0 0 ...
##  $ B_wins                      : int  1 2 0 0 4 4 4 8 0 0 ...
##  $ B_losses                    : int  1 0 0 0 1 0 6 2 0 0 ...
##  $ B_draw                      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ B_win_by_Decision_Majority  : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ B_win_by_Decision_Split     : int  0 1 0 0 0 0 0 2 0 0 ...
##  $ B_win_by_Decision_Unanimous : int  0 0 0 0 1 2 1 1 0 0 ...
##  $ B_win_by_KO.TKO             : int  0 1 0 0 2 0 2 3 0 0 ...
##  $ B_win_by_Submission         : int  1 0 0 0 1 2 0 2 0 0 ...
##  $ B_win_by_TKO_Doctor_Stoppage: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ B_Stance                    : chr  "Orthodox" "Orthodox" "Orthodox" "Southpaw" ...
##  $ B_Height_cms                : num  165 188 188 152 180 ...
##  $ B_Reach_cms                 : num  170 193 190 155 183 ...
##  $ B_Weight_lbs                : num  135 205 241 115 135 145 170 185 135 125 ...
##  $ R_avg_KD                    : num  1 1.031 0.547 NA 0 ...
##  $ R_avg_opp_KD                : num  0 0.0625 0.1875 NA 0.000977 ...
##  $ R_avg_SIG_STR_pct           : num  0.5 0.577 0.539 NA 0.403 ...
##  $ R_avg_opp_SIG_STR_pct       : num  0.46 0.381 0.599 NA 0.555 ...
##  $ R_avg_TD_pct                : num  0 0.406 0 NA 0.512 ...
##  $ R_avg_opp_TD_pct            : num  0 0.116 0.312 NA 0.629 ...
##  $ R_avg_SUB_ATT               : num  0 0.25 0 NA 0.231 ...
##  $ R_avg_opp_SUB_ATT           : num  0 1.1875 0.25 NA 0.0312 ...
##  $ R_avg_REV                   : num  0 0.375 0 NA 0.0312 ...
##  $ R_avg_opp_REV               : num  0 0.25 0 NA 0.5 0.5 0.25 0 0 0.5 ...
##  $ R_avg_SIG_STR_att           : num  34 77.6 59.2 NA 109.3 ...
##  $ R_avg_SIG_STR_landed        : num  17 43.2 30.4 NA 44.4 ...
##  $ R_avg_opp_SIG_STR_att       : num  13 69.2 43.8 NA 148.8 ...
##  $ R_avg_opp_SIG_STR_landed    : num  6 27.6 24.8 NA 84.6 ...
##  $ R_avg_TOTAL_STR_att         : num  35 93.1 70.5 NA 137.2 ...
##  $ R_avg_TOTAL_STR_landed      : num  18 57.2 41.4 NA 70.2 ...
##  $ R_avg_opp_TOTAL_STR_att     : num  16 98.3 50.2 NA 172.5 ...
##  $ R_avg_opp_TOTAL_STR_landed  : num  9 52.5 30.9 NA 106.7 ...
##  $ R_avg_TD_att                : num  0 1.2812 0.0312 NA 2.2617 ...
##  $ R_avg_TD_landed             : num  0 0.781 0 NA 1.262 ...
##  $ R_avg_opp_TD_att            : num  3 4.69 2.84 NA 3.14 ...
##  $ R_avg_opp_TD_landed         : num  0 0.438 1.75 NA 1.771 ...
##  $ R_avg_HEAD_att              : num  32 71.1 42.5 NA 86.4 ...
##  $ R_avg_HEAD_landed           : num  15 38.1 16.8 NA 26 ...
##   [list output truncated]

After observing the data types, we see that the majority of the features are numeric or integer, with a few character variables. Because we are using classification models, we will transform the ‘Winner’ column to numeric data later in this report.

First we remove all rows that contain missing (NA) values. First-time fighters have NA for their compiled fight stats, as they have not yet fought in the UFC. Removing every row with an NA therefore removes all fights involving a debutant, and our models will not predict winners for fights in which a fighter has never fought in the UFC. Note that na.omit() drops a row if any column is NA, not only the compiled fight stats.

ufc <- na.omit(ufc)
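To see what this step removes, here is a minimal sketch on a made-up three-row data frame (the fighters and values are hypothetical). It illustrates that na.omit() drops a row when any column is missing, so it can help to count NAs per column before dropping anything.

```r
# Made-up rows: fighter B is a debutant (no prior stats),
# fighter C simply has no recorded reach
toy <- data.frame(
  R_fighter   = c("A", "B", "C"),
  R_avg_KD    = c(0.5, NA, 1.0),
  R_Reach_cms = c(180, 175, NA)
)

# Count missing values per column before removing anything
print(colSums(is.na(toy)))

# na.omit() keeps only rows with no NA in any column
toy_clean <- na.omit(toy)
rows_dropped <- nrow(toy) - nrow(toy_clean)
print(rows_dropped)
```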

Use substring to extract the year from date and store the values as ‘year’.

ufc$year <- substr(ufc$date, 1, 4)

Eliminate all fights that end in a ‘Draw’.

ufc <- filter(ufc, Winner != 'Draw')

Binary Encode Dependent Variable

Here is where we create our dependent variable ‘win_dummy’. We will set Blue = 1 and Red = 0. This will be used in our classification models. This will also show us the proportion of Red vs Blue wins throughout the dataset. Below are a few insights into the distribution of win_dummy.

ufc$win_dummy <- ifelse(ufc$Winner == "Blue", 1, 0)
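As a small illustration of how the dummy variable exposes the Red/Blue proportions, the sketch below applies the same encoding to a made-up vector of five results. Because Blue is coded as 1, the mean of the dummy equals the share of Blue wins.

```r
# Made-up results for illustration only
winner <- c("Red", "Blue", "Red", "Red", "Blue")

# Same encoding as in the report: Blue = 1, Red = 0
win_dummy <- ifelse(winner == "Blue", 1, 0)

print(table(win_dummy))              # counts per class
print(prop.table(table(win_dummy)))  # proportions per class
blue_share <- mean(win_dummy)        # share of Blue wins
print(blue_share)
```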

The line graph below shows that Blue wins were not recorded until 2010. Upon further research, there were no red or blue corners before 2010: on the UFC Stats website (http://ufcstats.com/statistics/events/completed), Red is the default value for whoever won a fight during that period. When predicting whether Blue or Red wins, we must therefore filter out all fights before 2010 to get a more balanced and accurate dataset that reflects the UFC today. We also see that Red has won more often than Blue since 2010.

# Multiple line plot: number of Red and Blue wins per year
ggplot(ufc, aes(x = year, group = Winner)) +
  geom_line(stat = "count", aes(color = Winner), size = 1) +
  scale_color_manual(values = c("Blue", "Red")) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

Convert the data type of ‘date’ from Character to Date.

class(ufc$date)

## [1] "character"

ufc <- ufc %>%  mutate(date = as.Date(date, format = "%Y-%m-%d"))

class(ufc$date)

## [1] "Date"

Blue vs Red Wins over time

Below we observe the gap between Red vs Blue wins since 2010, expressed as percentage of wins for Red and Blue Corners, by year.

ufc_2010 <- ufc %>% filter(date >= as.Date("2010-01-01"))

g <- ufc_2010 %>%
  group_by(win_dummy, year) %>%
  summarise(cnt = n()) %>%
  group_by(year) %>%
  mutate(freq = round(cnt / sum(cnt), 3))

## `summarise()` regrouping output by 'win_dummy' (override with `.groups` argument)

g <- as.data.frame(g)

g$win_dummy <- as.character(g$win_dummy)
g$freq <- as.numeric(g$freq)

g$year <- as.numeric(as.vector(g$year))
gg1 <- ggplot(data = g, aes(x = year, y = freq, group = win_dummy, fill = win_dummy)) +
  geom_bar(stat = "identity", width = 0.5, position = position_dodge()) +
  scale_fill_manual(values = c("red", "darkblue")) +  # "0" = Red, "1" = Blue
  theme(
    axis.line = element_line(linetype = "solid"),
    panel.grid.minor = element_line(linetype = "blank"),
    axis.text = element_text(colour = "gray18"),
    panel.background = element_rect(fill = "gray64"),
    plot.background = element_rect(fill = "aliceblue")
  ) +
  labs(title = "Red vs Blue Win Percentage (Grouped by Year)",
       x = "Year", y = "Percentage", fill = "Red/Blue")

gg1 + scale_x_continuous(breaks = 2010:2021)
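The per-year win shares plotted above can be cross-checked with base R. This is a minimal sketch on made-up data, not the UFC data: prop.table() with margin = 1 divides each row (year) by its own total, so every year sums to 1.

```r
# Made-up fights: three in 2010 and two in 2011 (0 = Red win, 1 = Blue win)
toy <- data.frame(
  year      = c(2010, 2010, 2010, 2011, 2011),
  win_dummy = c(0, 0, 1, 0, 1)
)

# Rows are years, columns are outcomes; margin = 1 normalises within each year
shares <- prop.table(table(toy$year, toy$win_dummy), margin = 1)
print(round(shares, 3))
```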

Red Control Time on the Ground Over Time

Control time is the number of seconds a fighter controls their opponent on the ground, not allowing them to stand up.

From the graph below, we see that the Red fighter’s average control time on the ground (in seconds) has decreased steadily, even after 2010. We also see that control time was not recorded until 2000, although the original dataset includes fights back to 1997 and UFC fights go back to 1993. The decline suggests that fighters’ ground defense and wrestling have steadily improved over the years. We will use 2016-2021 data from here on.

R_avg_CTRL_time.seconds_history <- ufc %>%
  group_by(year) %>%
  summarise_at(vars(R_avg_CTRL_time.seconds.), list(name = mean))

R_avg_CTRL_time.seconds_history <- as.data.frame(R_avg_CTRL_time.seconds_history)

R_avg_CTRL_time.seconds_history$year <- as.numeric(R_avg_CTRL_time.seconds_history$year)

r_ctrl <- ggplot(R_avg_CTRL_time.seconds_history, aes(year, name, group = year)) +
  geom_bar(stat = "identity", fill = "red") +
  labs(y = "Red Fighter Average Control Time on Ground (seconds)")

r_ctrl +
  scale_x_continuous(breaks = 1997:2021) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

Now we filter the dataset to fights from 2016 onward.

ufc_2 <- ufc %>% filter(date >= as.Date("2016-01-01"))

Win counts for 2016-2021 (0 = Red, 1 = Blue):

table(ufc_2$win_dummy)

## 
##    0    1 
## 1068  820
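These counts give a useful reference point for the models that follow. As a sketch, the accuracy of a naive rule that always predicts the majority class (Red) can be computed directly from the table; a model needs to beat this figure to add value. The variable names are illustrative.

```r
# Counts copied from the table above
red_wins  <- 1068
blue_wins <- 820

total_fights <- red_wins + blue_wins

# Accuracy of always predicting the more frequent class
baseline_accuracy <- max(red_wins, blue_wins) / total_fights

print(total_fights)
print(round(baseline_accuracy, 3))
```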

Graph showing the number of fights per weight class

unique(ufc_2$weight_class)

##  [1] "Bantamweight"       "Middleweight"       "WomenBantamweight" 
##  [4] "Lightweight"        "Welterweight"       "Flyweight"         
##  [7] "LightHeavyweight"   "WomenStrawweight"   "Featherweight"     
## [10] "WomenFlyweight"     "WomenFeatherweight" "Heavyweight"       
## [13] "CatchWeight"

Number of Fights per Weight Class since 2016

gg2 <- ggplot(data=ufc_2, aes(weight_class)) + geom_bar(fill = "purple")+labs(x = "Weight Class", y = "Number of Fights (2016-2021)")
gg2 + coord_flip()

Level of Activity among weight classes

R_avg_TD_att_wc <- ufc_2 %>%
  group_by(weight_class) %>%
  summarise_at(vars(R_avg_TD_att), list(name = mean))

R_avg_TD_att_wc <- as.data.frame(R_avg_TD_att_wc)

ggplot(data = R_avg_TD_att_wc, aes(x = weight_class, y = name)) +
  geom_bar(stat = "identity", fill = "purple") +
  coord_flip() +
  labs(title = "2016 - 2021", x = "Weight Class",
       y = "Red Fighter Average Take Downs Attempted (per fight)")

R_avg_SIG_STR_att_wc <- ufc_2 %>%
  group_by(weight_class) %>%
  summarise_at(vars(R_avg_SIG_STR_att), list(name = mean))

R_avg_SIG_STR_att_wc <- as.data.frame(R_avg_SIG_STR_att_wc)

ggplot(data = R_avg_SIG_STR_att_wc, aes(x = weight_class, y = name)) +
  geom_bar(stat = "identity", fill = "purple") +
  coord_flip() +
  labs(title = "2016-2021", x = "Weight Class",
       y = "Red Average Significant Strikes Attempted (per fight)")

R_avg_SUB_ATT_wc <- ufc_2 %>%
  group_by(weight_class) %>%
  summarise_at(vars(R_avg_SUB_ATT), list(name = mean))

R_avg_SUB_ATT_wc <- as.data.frame(R_avg_SUB_ATT_wc)

ggplot(data = R_avg_SUB_ATT_wc, aes(x = weight_class, y = name)) +
  geom_bar(stat = "identity", fill = "purple") +
  coord_flip() +
  labs(title = "2016-2021", x = "Weight Class",
       y = "Red Average Submissions Attempted (per fight)")

R_avg_win_by_KO_TKO <- ufc_2 %>%
  group_by(weight_class) %>%
  summarise_at(vars(R_win_by_KO.TKO), list(name = mean))

R_avg_win_by_KO_TKO <- as.data.frame(R_avg_win_by_KO_TKO)

ggplot(data = R_avg_win_by_KO_TKO, aes(x = weight_class, y = name)) +
  geom_bar(stat = "identity", fill = "purple") +
  coord_flip() +
  labs(title = "2016-2021", x = "Weight Class",
       y = "Red Fighter Average Wins by KO/TKO")

Data visual Conclusions

We can conclude that the lower the weight class (meaning the less the fighter weighs in pounds), the more activity fighters display in their fights. The graphs show that Heavyweight fighters not only have had the fewest fights since 2016, but also average fewer strike, takedown, and submission attempts than the smaller UFC fighters. They do, however, average more knockouts than any other weight class. For this reason, it is likely that some independent variables, such as ‘Blue Average Submissions Attempted’ or ‘Red Average Knock Downs’, carry more importance in some weight classes than in others. To better predict fight outcomes, models should probably be built around specific weight classes.

For our initial model builds, however, we include all weight classes and use the same independent (predictive) variables throughout. Specific weight classes are tested at the end of this report.

We will leave the ‘win_dummy’ variable set as Blue = 1 and Red = 0 with Blue being the positive outcome. We know that Red typically wins more often as shown by the results above. We want to create a model that essentially predicts upsets of Blue winning.

We also observe that the win percentages of Red and Blue have moved closer together over time, meaning that Blue fighters now win more often than they used to. This could support the idea that the level of skill and competition among UFC fighters has increased over the years. Therefore we will use data from 2016-2021, as it is more representative of the level of competition that the UFC sees today. Also, women did not compete in the UFC until 2013.

Variable Inference

We will now gather inference on which variables are significant influences on the outcome of Red or Blue Corner winning. Our dependent ‘y’ variable will be ‘win_dummy’ (Blue win = 1, Red win = 0). Our independent ‘x’ variables will be chosen based on significance and theoretical need.

ufc_cor <- select_if(ufc_2, is.numeric)

Below we see the correlation of each fight variable with the Winner variable (win_dummy). The closer a value is to -1 or 1, the stronger the linear relationship. A negative value does not indicate a weak correlation; only the magnitude matters. For example, -0.75 is a stronger correlation than 0.65, since |-0.75| > |0.65|. At first look, age has the highest correlation.
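Because only the magnitude matters, it can help to rank predictors by absolute correlation. The following is a minimal sketch on simulated data; the column names mimic the dataset, but the values and the way the outcome is generated are made up for illustration.

```r
set.seed(1)
n <- 200

# Simulated predictors (values are not from the UFC data)
toy <- data.frame(
  R_age    = rnorm(n, mean = 30, sd = 4),
  B_age    = rnorm(n, mean = 29, sd = 4),
  R_avg_KD = runif(n)
)

# Hypothetical outcome: Blue tends to win when Red is the older fighter
toy$win_dummy <- as.integer(toy$R_age - toy$B_age + rnorm(n, sd = 4) > 0)

# Correlation of every predictor with the outcome, as a named vector
predictors <- setdiff(names(toy), "win_dummy")
cors <- cor(toy[predictors], toy$win_dummy)[, 1]

# Rank by magnitude, ignoring the sign
ranked <- cors[order(abs(cors), decreasing = TRUE)]
print(round(ranked, 3))
```

Sorting on abs() keeps the sign visible in the output while ordering by strength, so a strongly negative predictor is not pushed to the bottom of the list.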

cor(ufc_cor[135], ufc_cor[])

## Warning in cor(ufc_cor[135], ufc_cor[]): the standard deviation is zero

##             B_avg_KD B_avg_opp_KD B_avg_SIG_STR_pct B_avg_opp_SIG_STR_pct
## win_dummy 0.00306321  -0.03899145        0.04204823           -0.03005528
##           B_avg_TD_pct B_avg_opp_TD_pct B_avg_SUB_ATT B_avg_opp_SUB_ATT
## win_dummy   0.03714695      -0.05547428    0.03070611       -0.01733953
##            B_avg_REV B_avg_opp_REV B_avg_SIG_STR_att B_avg_SIG_STR_landed
## win_dummy 0.01894169    0.03367675        0.01532451           0.02324385
##           B_avg_opp_SIG_STR_att B_avg_opp_SIG_STR_landed B_avg_TOTAL_STR_att
## win_dummy           -0.02031625              -0.03309539          0.02740114
##           B_avg_TOTAL_STR_landed B_avg_opp_TOTAL_STR_att
## win_dummy             0.04181535             -0.01948659
##           B_avg_opp_TOTAL_STR_landed B_avg_TD_att B_avg_TD_landed
## win_dummy                -0.02774442   0.06196834      0.07321986
##           B_avg_opp_TD_att B_avg_opp_TD_landed B_avg_HEAD_att B_avg_HEAD_landed
## win_dummy       0.04465765           0.0309859     0.01384461        0.02408938
##           B_avg_opp_HEAD_att B_avg_opp_HEAD_landed B_avg_BODY_att
## win_dummy        -0.02071813           -0.03864946     0.01479837
##           B_avg_BODY_landed B_avg_opp_BODY_att B_avg_opp_BODY_landed
## win_dummy        0.01697846        -0.02043532           -0.02779698
##           B_avg_LEG_att B_avg_LEG_landed B_avg_opp_LEG_att B_avg_opp_LEG_landed
## win_dummy    0.01013503      0.007056198     -0.0008558403          0.004390985
##           B_avg_DISTANCE_att B_avg_DISTANCE_landed B_avg_opp_DISTANCE_att
## win_dummy        0.005365871           0.006427694           -0.006663574
##           B_avg_opp_DISTANCE_landed B_avg_CLINCH_att B_avg_CLINCH_landed
## win_dummy               -0.01041182      0.001296825         0.007211577
##           B_avg_opp_CLINCH_att B_avg_opp_CLINCH_landed B_avg_GROUND_att
## win_dummy          -0.04026424             -0.03782442       0.06171763
##           B_avg_GROUND_landed B_avg_opp_GROUND_att B_avg_opp_GROUND_landed
## win_dummy           0.0587167          -0.05242162             -0.05475416
##           B_avg_CTRL_time.seconds. B_avg_opp_CTRL_time.seconds.
## win_dummy               0.05019004                   0.00205904
##           B_total_time_fought.seconds. B_total_rounds_fought
## win_dummy                   0.03791624           -0.03425858
##           B_total_title_bouts B_current_win_streak B_current_lose_streak
## win_dummy         -0.06098007         -0.007216505           -0.02305906
##           B_longest_win_streak      B_wins    B_losses B_draw
## win_dummy         -0.006702128 -0.01874144 -0.04857157     NA
##           B_win_by_Decision_Majority B_win_by_Decision_Split
## win_dummy                -0.01340058             -0.05034336
##           B_win_by_Decision_Unanimous B_win_by_KO.TKO B_win_by_Submission
## win_dummy                 0.007953964     -0.03091963         0.009940745
##           B_win_by_TKO_Doctor_Stoppage B_Height_cms B_Reach_cms B_Weight_lbs
## win_dummy                  -0.01618828   0.02728662  0.03034889   0.02776254
##              R_avg_KD R_avg_opp_KD R_avg_SIG_STR_pct R_avg_opp_SIG_STR_pct
## win_dummy -0.04096858   0.05978554       -0.04379517            0.09915966
##           R_avg_TD_pct R_avg_opp_TD_pct R_avg_SUB_ATT R_avg_opp_SUB_ATT
## win_dummy  -0.04921816       0.03230588   -0.05286117       -0.01384936
##            R_avg_REV R_avg_opp_REV R_avg_SIG_STR_att R_avg_SIG_STR_landed
## win_dummy 0.01965227   -0.01394109       -0.03095346          -0.04196227
##           R_avg_opp_SIG_STR_att R_avg_opp_SIG_STR_landed R_avg_TOTAL_STR_att
## win_dummy             0.0174484               0.04870163         -0.04976161
##           R_avg_TOTAL_STR_landed R_avg_opp_TOTAL_STR_att
## win_dummy            -0.06546941              0.01820164
##           R_avg_opp_TOTAL_STR_landed R_avg_TD_att R_avg_TD_landed
## win_dummy                 0.04136129  -0.07558121     -0.07763501
##           R_avg_opp_TD_att R_avg_opp_TD_landed R_avg_HEAD_att R_avg_HEAD_landed
## win_dummy     -0.006153334          0.02582334    -0.03283888       -0.04850527
##           R_avg_opp_HEAD_att R_avg_opp_HEAD_landed R_avg_BODY_att
## win_dummy         0.01539803            0.04801341    0.001764856
##           R_avg_BODY_landed R_avg_opp_BODY_att R_avg_opp_BODY_landed
## win_dummy      -0.003982463         0.03025601            0.04656319
##           R_avg_LEG_att R_avg_LEG_landed R_avg_opp_LEG_att R_avg_opp_LEG_landed
## win_dummy   -0.02992376      -0.02717299      0.0004393277           0.01116835
##           R_avg_DISTANCE_att R_avg_DISTANCE_landed R_avg_opp_DISTANCE_att
## win_dummy        -0.01752002            -0.0245582            0.007482186
##           R_avg_opp_DISTANCE_landed R_avg_CLINCH_att R_avg_CLINCH_landed
## win_dummy                0.03178595      0.008217354          0.01308849
##           R_avg_opp_CLINCH_att R_avg_opp_CLINCH_landed R_avg_GROUND_att
## win_dummy           0.03657432              0.04085239      -0.08076469
##           R_avg_GROUND_landed R_avg_opp_GROUND_att R_avg_opp_GROUND_landed
## win_dummy         -0.07660279           0.03227853              0.03998505
##           R_avg_CTRL_time.seconds. R_avg_opp_CTRL_time.seconds.
## win_dummy              -0.08062083                   0.05382894
##           R_total_time_fought.seconds. R_total_rounds_fought
## win_dummy                  -0.02239872             0.0624352
##           R_total_title_bouts R_current_win_streak R_current_lose_streak
## win_dummy        -0.002108865           -0.0432342            0.04194568
##           R_longest_win_streak     R_wins  R_losses R_draw
## win_dummy          -0.02615285 0.03952394 0.1093061     NA
##           R_win_by_Decision_Majority R_win_by_Decision_Split
## win_dummy                 0.04471593              0.08173435
##           R_win_by_Decision_Unanimous R_win_by_KO.TKO R_win_by_Submission
## win_dummy                -0.003083416      0.04700467        -0.003545831
##           R_win_by_TKO_Doctor_Stoppage R_Height_cms R_Reach_cms R_Weight_lbs
## win_dummy                   0.04195498 -0.007027602 -0.02956736   0.01928033
##                B_age    R_age win_dummy
## win_dummy -0.1335294 0.141457         1

cr <- cor(ufc_cor[135], ufc_cor[c(1:20)])

We can also use a color scale to see which correlations are closer to one in absolute value. The darker the color, the stronger the correlation, regardless of the color itself (the hue only shows the sign).

ggcorrplot(cr)

We can begin to make some inferences. None of the variables is strongly correlated with the outcome, although some are stronger than others; for example, R_age (0.14) and B_age (-0.13) are among the strongest in the output above. Any variable associated with ‘KD’ has a very low correlation. We can also run this test per weight class to see whether some variables have a higher correlation within a class than when the data for all weight classes is combined.
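The per-weight-class check could be sketched roughly as follows. This is a minimal illustration on simulated data, not the UFC dataset: the data frame `toy` and all of its values are made up, and only the column names `weight_class`, `R_age`, and `win_dummy` are borrowed from the analysis.

```r
# Simulated stand-in for the fight data (values are made up for illustration)
set.seed(1)
toy <- data.frame(
  weight_class = rep(c("Lightweight", "Welterweight", "Heavyweight"), each = 50),
  R_age        = round(rnorm(150, mean = 30, sd = 4)),
  win_dummy    = rbinom(150, size = 1, prob = 0.45)
)

# Split the rows by weight class, then correlate R_age with the outcome in each group
cor_by_class <- sapply(
  split(toy, toy$weight_class),
  function(d) cor(d$R_age, d$win_dummy)
)

# One correlation per weight class, sorted by absolute size
cor_by_class[order(abs(cor_by_class), decreasing = TRUE)]
```

The same pattern should extend to several predictors by returning a vector of correlations from the inner function instead of a single value.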

Predictive Models

Logistic Regression

Now we build our initial model, a Logistic Regression, and check for multicollinearity among the variables.

Below we split the data into train and test sets. We draw a random sample when forming the training data so that rows are not assigned in the order in which they appear in the dataset. If we did not do this, the training data would simply be the first 70% of the fights since 2016 in file order.

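set.seed(123)
# Randomly pick 70% of the row indices for training; the remaining rows form the test set
train_idx <- sample.int(n = nrow(ufc_2), size = round(.7*nrow(ufc_2)), replace = FALSE)
ufc_train <- ufc_2[train_idx, ]
ufc_test  <- ufc_2[-train_idx, ]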

set.seed(123)
m1 = glm(win_dummy ~ B_avg_TD_pct + B_avg_opp_TD_pct + B_avg_REV + B_avg_opp_REV + B_avg_TD_landed + B_avg_opp_TD_landed + B_avg_SIG_STR_landed + B_avg_opp_SIG_STR_landed + B_avg_SIG_STR_pct + B_avg_opp_SIG_STR_pct + B_avg_TOTAL_STR_landed + B_avg_opp_TOTAL_STR_landed + B_avg_CLINCH_landed + B_avg_opp_CLINCH_landed + B_avg_KD + B_avg_opp_KD + B_avg_SUB_ATT + B_avg_opp_SUB_ATT + B_avg_GROUND_landed + B_avg_opp_GROUND_landed + B_win_by_KO.TKO + B_win_by_Submission + B_win_by_Decision_Unanimous + R_avg_TD_pct + R_avg_opp_TD_pct + R_avg_REV + R_avg_opp_REV + R_avg_TD_landed + R_avg_opp_TD_landed + R_avg_SIG_STR_landed + R_avg_opp_SIG_STR_landed + R_avg_SIG_STR_pct + R_avg_opp_SIG_STR_pct + R_avg_TOTAL_STR_landed + R_avg_opp_TOTAL_STR_landed + R_avg_CLINCH_landed + R_avg_opp_CLINCH_landed + R_avg_KD + R_avg_opp_KD +  R_avg_opp_SUB_ATT + R_avg_GROUND_landed + R_avg_opp_GROUND_landed + R_win_by_KO.TKO + R_win_by_Submission + R_win_by_Decision_Unanimous + 
B_total_rounds_fought + R_total_rounds_fought + B_avg_CTRL_time.seconds. + B_avg_opp_CTRL_time.seconds. + B_current_win_streak + B_losses + R_age + B_age + B_Reach_cms + R_Reach_cms + B_Height_cms + R_Height_cms + weight_class, 
data = ufc_train, family = binomial)

summary(m1)

## 
## Call:
## glm(formula = win_dummy ~ B_avg_TD_pct + B_avg_opp_TD_pct + B_avg_REV + 
##     B_avg_opp_REV + B_avg_TD_landed + B_avg_opp_TD_landed + B_avg_SIG_STR_landed + 
##     B_avg_opp_SIG_STR_landed + B_avg_SIG_STR_pct + B_avg_opp_SIG_STR_pct + 
##     B_avg_TOTAL_STR_landed + B_avg_opp_TOTAL_STR_landed + B_avg_CLINCH_landed + 
##     B_avg_opp_CLINCH_landed + B_avg_KD + B_avg_opp_KD + B_avg_SUB_ATT + 
##     B_avg_opp_SUB_ATT + B_avg_GROUND_landed + B_avg_opp_GROUND_landed + 
##     B_win_by_KO.TKO + B_win_by_Submission + B_win_by_Decision_Unanimous + 
##     R_avg_TD_pct + R_avg_opp_TD_pct + R_avg_REV + R_avg_opp_REV + 
##     R_avg_TD_landed + R_avg_opp_TD_landed + R_avg_SIG_STR_landed + 
##     R_avg_opp_SIG_STR_landed + R_avg_SIG_STR_pct + R_avg_opp_SIG_STR_pct + 
##     R_avg_TOTAL_STR_landed + R_avg_opp_TOTAL_STR_landed + R_avg_CLINCH_landed + 
##     R_avg_opp_CLINCH_landed + R_avg_KD + R_avg_opp_KD + R_avg_opp_SUB_ATT + 
##     R_avg_GROUND_landed + R_avg_opp_GROUND_landed + R_win_by_KO.TKO + 
##     R_win_by_Submission + R_win_by_Decision_Unanimous + B_total_rounds_fought + 
##     R_total_rounds_fought + B_avg_CTRL_time.seconds. + B_avg_opp_CTRL_time.seconds. + 
##     B_current_win_streak + B_losses + R_age + B_age + B_Reach_cms + 
##     R_Reach_cms + B_Height_cms + R_Height_cms + weight_class, 
##     family = binomial, data = ufc_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0882  -1.0031  -0.6387   1.1125   2.2068  
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                     1.969e+00  3.533e+00   0.557 0.577305    
## B_avg_TD_pct                   -3.403e-01  3.063e-01  -1.111 0.266568    
## B_avg_opp_TD_pct               -1.256e+00  3.510e-01  -3.579 0.000344 ***
## B_avg_REV                       7.016e-02  2.311e-01   0.304 0.761405    
## B_avg_opp_REV                   2.931e-01  2.545e-01   1.152 0.249348    
## B_avg_TD_landed                 1.552e-01  8.753e-02   1.773 0.076239 .  
## B_avg_opp_TD_landed             2.576e-01  8.916e-02   2.890 0.003857 ** 
## B_avg_SIG_STR_landed            1.470e-03  7.852e-03   0.187 0.851530    
## B_avg_opp_SIG_STR_landed        8.376e-03  8.709e-03   0.962 0.336170    
## B_avg_SIG_STR_pct              -7.295e-02  6.438e-01  -0.113 0.909789    
## B_avg_opp_SIG_STR_pct           1.963e-01  6.714e-01   0.292 0.769983    
## B_avg_TOTAL_STR_landed         -1.718e-03  5.349e-03  -0.321 0.748108    
## B_avg_opp_TOTAL_STR_landed     -5.397e-03  6.489e-03  -0.832 0.405546    
## B_avg_CLINCH_landed             2.324e-02  1.683e-02   1.381 0.167368    
## B_avg_opp_CLINCH_landed        -6.231e-02  2.016e-02  -3.090 0.001999 ** 
## B_avg_KD                       -1.244e-01  1.932e-01  -0.644 0.519668    
## B_avg_opp_KD                   -3.776e-01  2.314e-01  -1.632 0.102723    
## B_avg_SUB_ATT                   1.424e-01  1.201e-01   1.186 0.235770    
## B_avg_opp_SUB_ATT               5.151e-02  1.471e-01   0.350 0.726177    
## B_avg_GROUND_landed             3.520e-03  1.422e-02   0.248 0.804455    
## B_avg_opp_GROUND_landed        -3.078e-02  1.424e-02  -2.162 0.030620 *  
## B_win_by_KO.TKO                 8.345e-02  5.783e-02   1.443 0.148999    
## B_win_by_Submission             7.604e-02  6.094e-02   1.248 0.212128    
## B_win_by_Decision_Unanimous     2.016e-01  8.549e-02   2.359 0.018346 *  
## R_avg_TD_pct                    2.632e-01  3.381e-01   0.779 0.436255    
## R_avg_opp_TD_pct               -2.265e-01  3.594e-01  -0.630 0.528477    
## R_avg_REV                       4.890e-01  2.534e-01   1.930 0.053616 .  
## R_avg_opp_REV                  -1.855e-01  2.495e-01  -0.743 0.457242    
## R_avg_TD_landed                -8.705e-02  7.626e-02  -1.142 0.253655    
## R_avg_opp_TD_landed             1.058e-01  8.239e-02   1.284 0.199077    
## R_avg_SIG_STR_landed           -5.838e-03  7.070e-03  -0.826 0.408924    
## R_avg_opp_SIG_STR_landed        1.102e-02  7.623e-03   1.445 0.148435    
## R_avg_SIG_STR_pct              -7.144e-01  6.809e-01  -1.049 0.294076    
## R_avg_opp_SIG_STR_pct           1.597e+00  6.956e-01   2.295 0.021711 *  
## R_avg_TOTAL_STR_landed         -3.023e-03  5.004e-03  -0.604 0.545787    
## R_avg_opp_TOTAL_STR_landed     -6.070e-03  5.521e-03  -1.100 0.271516    
## R_avg_CLINCH_landed             1.454e-02  1.558e-02   0.933 0.350659    
## R_avg_opp_CLINCH_landed         1.461e-03  1.766e-02   0.083 0.934079    
## R_avg_KD                       -1.119e-01  1.926e-01  -0.581 0.561221    
## R_avg_opp_KD                    1.106e-01  2.130e-01   0.519 0.603540    
## R_avg_opp_SUB_ATT               2.218e-02  1.419e-01   0.156 0.875805    
## R_avg_GROUND_landed             7.288e-04  1.100e-02   0.066 0.947197    
## R_avg_opp_GROUND_landed         5.975e-03  1.316e-02   0.454 0.649912    
## R_win_by_KO.TKO                 1.197e-02  5.046e-02   0.237 0.812527    
## R_win_by_Submission            -5.709e-02  5.001e-02  -1.141 0.253675    
## R_win_by_Decision_Unanimous    -5.963e-02  6.132e-02  -0.972 0.330829    
## B_total_rounds_fought          -2.908e-02  2.203e-02  -1.320 0.186856    
## R_total_rounds_fought           5.772e-03  1.038e-02   0.556 0.578223    
## B_avg_CTRL_time.seconds.        4.671e-05  1.121e-03   0.042 0.966769    
## B_avg_opp_CTRL_time.seconds.    7.642e-04  1.122e-03   0.681 0.495907    
## B_current_win_streak           -3.363e-02  4.686e-02  -0.718 0.472960    
## B_losses                        4.200e-02  8.102e-02   0.518 0.604191    
## R_age                           7.784e-02  1.840e-02   4.230 2.34e-05 ***
## B_age                          -7.730e-02  1.867e-02  -4.141 3.46e-05 ***
## B_Reach_cms                    -6.582e-03  1.337e-02  -0.492 0.622538    
## R_Reach_cms                    -2.880e-02  1.352e-02  -2.131 0.033101 *  
## B_Height_cms                    8.339e-03  1.743e-02   0.478 0.632328    
## R_Height_cms                    1.391e-02  1.760e-02   0.791 0.429201    
## weight_classCatchWeight         5.484e-01  7.320e-01   0.749 0.453762    
## weight_classFeatherweight      -2.354e-01  2.811e-01  -0.837 0.402499    
## weight_classFlyweight          -2.680e-01  3.402e-01  -0.788 0.430745    
## weight_classHeavyweight         2.469e-01  5.446e-01   0.453 0.650236    
## weight_classLightHeavyweight    2.505e-01  4.779e-01   0.524 0.600194    
## weight_classLightweight         2.646e-01  2.924e-01   0.905 0.365468    
## weight_classMiddleweight       -3.183e-02  4.086e-01  -0.078 0.937918    
## weight_classWelterweight        5.807e-02  3.446e-01   0.169 0.866178    
## weight_classWomenBantamweight   1.368e-01  3.988e-01   0.343 0.731579    
## weight_classWomenFeatherweight -2.905e-01  1.025e+00  -0.283 0.776891    
## weight_classWomenFlyweight     -2.792e-01  3.948e-01  -0.707 0.479482    
## weight_classWomenStrawweight   -1.985e-01  3.925e-01  -0.506 0.612999    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1811.8  on 1321  degrees of freedom
## Residual deviance: 1641.0  on 1252  degrees of freedom
## AIC: 1781
## 
## Number of Fisher Scoring iterations: 4

We get the coefficients and p-values for the model. Some variables show significance, mainly Age: R_age and B_age both have p-values below 0.001, with coefficients of opposite sign.
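Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. The sketch below does this for the two age coefficients, with the values copied by hand from the summary(m1) output; the vector name `coefs` is only illustrative.

```r
# Coefficients copied from the summary(m1) output above (log-odds scale)
coefs <- c(R_age = 0.07784, B_age = -0.07730)

# exp() turns a log-odds coefficient into an odds ratio per one-unit increase
odds_ratios <- exp(coefs)
round(odds_ratios, 3)
```

An odds ratio of about 1.08 for R_age means that, holding the other predictors fixed, each additional year of age for the Red fighter multiplies the odds of win_dummy = 1 by roughly 1.08. With the fitted model in memory, exp(coef(m1)) would give the same quantity for every term.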

Now we apply the model to the Test data and see how well it predicts. We’ll use the caret package to print a confusion matrix and show the model’s performance.

set.seed(123)
ufc_test$PredProb = predict.glm(m1, ufc_test, type = 'response')

ufc_test$Prediction = ifelse(ufc_test$PredProb >= 0.5,1,0)

caret::confusionMatrix(as.factor(ufc_test$win_dummy), as.factor(ufc_test$Prediction), positive='1')

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 223 101
##          1 128 114
##                                           
##                Accuracy : 0.5954          
##                  95% CI : (0.5537, 0.6361)
##     No Information Rate : 0.6201          
##     P-Value [Acc > NIR] : 0.89506         
##                                           
##                   Kappa : 0.1616          
##                                           
##  Mcnemar's Test P-Value : 0.08577         
##                                           
##             Sensitivity : 0.5302          
##             Specificity : 0.6353          
##          Pos Pred Value : 0.4711          
##          Neg Pred Value : 0.6883          
##              Prevalence : 0.3799          
##          Detection Rate : 0.2014          
##    Detection Prevalence : 0.4276          
##       Balanced Accuracy : 0.5828          
##                                           
##        'Positive' Class : 1               
##

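The confusion matrix for our initial model shows an accuracy of about 59.5%. Accuracy is the share of test fights for which the model predicts the outcome correctly. caret also reports a ‘No Information Rate’ (NIR), the share of the largest class in the reference data, which is the accuracy we would get by always predicting that class. Accuracy should be higher than the NIR for the model to be considered effective, and ‘P-Value [Acc > NIR]’ tests exactly that. A lower p-value is better, such as 0.05, or for UFC predictions even 0.10. Here the reported NIR is 62% and the p-value is 0.895, so by this output m1 is not a significant improvement over always predicting the majority class.

One caveat applies to this output. confusionMatrix() expects the predictions as its first argument and the observed outcomes as its second, and the call above passes them in the opposite order. The rows labelled ‘Prediction’ therefore hold the observed outcomes (324 fights with win_dummy = 0 and 242 with win_dummy = 1), and the 62% NIR is the share of fights the model predicted as 0, not the share of fights actually won by Red. The observed majority share in ‘ufc_test’ is 324/566, or about 57%, so the 59.5% accuracy is slightly above the naive baseline, although the margin is small. Accuracy itself is unaffected by the argument order; sensitivity, specificity, and the predictive values are computed with the roles swapped.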
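The accuracy and the naive baseline can be checked by hand from the four cell counts. The sketch below rebuilds the printed table in base R, reading the rows as observed win_dummy values and the columns as model predictions; that reading is an assumption based on the order of the arguments in the confusionMatrix() call, whose first argument is documented as the predicted classes.

```r
# Cell counts copied from the printed table; rows are read as observed outcomes,
# columns as predictions (an assumption based on the argument order in the call)
counts <- matrix(c(223, 128, 101, 114), nrow = 2,
                 dimnames = list(observed = c("0", "1"), predicted = c("0", "1")))

n <- sum(counts)

# Share of fights on the diagonal, i.e. predicted correctly
accuracy <- sum(diag(counts)) / n

# Accuracy of always guessing the larger observed class
majority_share <- max(rowSums(counts)) / n

round(c(accuracy = accuracy, majority_share = majority_share), 4)
```

With the model and test set in memory, a call of the form caret::confusionMatrix(data = as.factor(ufc_test$Prediction), reference = as.factor(ufc_test$win_dummy), positive = '1') would put the predictions first and should report the No Information Rate against the observed outcomes.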

We now check for multicollinearity among the independent variables. Multicollinearity is a statistical concept where several independent variables in a model are overly correlated. Two variables are considered to be perfectly collinear if their correlation coefficient is +/- 1.0. Multicollinearity among independent variables will result in less reliable statistical inferences.

Multicollinearity can be detected via various methods. In this report, we will focus on a common one: the VIF (Variance Inflation Factor).

VIF measures how strongly each independent variable is correlated with the others. It is computed by taking one independent variable and regressing it against all the other independent variables. The VIF score of an independent variable represents how well that variable is explained by the other independent variables.

The R^2 value of that regression shows how well an independent variable is described by the other independent variables. A high R^2 means that the variable is highly correlated with the other variables. This is captured by the VIF, which equals 1 / (1 - R^2).

As a rule of thumb, a VIF score over 5 is regarded as overly correlated. A score over 10 should be remedied, for example by dropping the variable from the regression model or by creating an index of all the closely related variables.
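To make the definition concrete, the sketch below computes a VIF by hand for one predictor. The data is simulated purely for illustration: `x1`, `x2`, and `x3` are hypothetical predictors, with `x2` constructed to be strongly related to `x1`.

```r
# Simulated predictors (made up): x2 is built to be strongly related to x1
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.3)
x3 <- rnorm(n)

# Regress one predictor on the others and take the R^2 of that auxiliary fit
r2_x1 <- summary(lm(x1 ~ x2 + x3))$r.squared

# VIF = 1 / (1 - R^2)
vif_x1 <- 1 / (1 - r2_x1)
vif_x1
```

For a numeric predictor this is the quantity a VIF function reports. When a model contains a multi-level factor such as weight_class, car::vif() prints a generalized VIF instead, and the GVIF^(1/(2*Df)) column is the one intended for comparison across terms.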

vif(m1)

##                                   GVIF Df GVIF^(1/(2*Df))
## B_avg_TD_pct                  1.758453  1        1.326067
## B_avg_opp_TD_pct              1.948953  1        1.396049
## B_avg_REV                     1.409804  1        1.187352
## B_avg_opp_REV                 1.350532  1        1.162124
## B_avg_TD_landed               3.386404  1        1.840218
## B_avg_opp_TD_landed           3.077177  1        1.754189
## B_avg_SIG_STR_landed          9.606699  1        3.099467
## B_avg_opp_SIG_STR_landed     10.814295  1        3.288509
## B_avg_SIG_STR_pct             1.518125  1        1.232122
## B_avg_opp_SIG_STR_pct         1.473767  1        1.213988
## B_avg_TOTAL_STR_landed        7.642021  1        2.764420
## B_avg_opp_TOTAL_STR_landed    9.538876  1        3.088507
## B_avg_CLINCH_landed           2.389040  1        1.545652
## B_avg_opp_CLINCH_landed       2.399104  1        1.548904
## B_avg_KD                      1.526855  1        1.235660
## B_avg_opp_KD                  1.291885  1        1.136611
## B_avg_SUB_ATT                 1.476067  1        1.214935
## B_avg_opp_SUB_ATT             1.304696  1        1.142233
## B_avg_GROUND_landed           2.170574  1        1.473287
## B_avg_opp_GROUND_landed       2.053602  1        1.433039
## B_win_by_KO.TKO               3.554093  1        1.885230
## B_win_by_Submission           2.190286  1        1.479962
## B_win_by_Decision_Unanimous   6.022816  1        2.454143
## R_avg_TD_pct                  1.880171  1        1.371193
## R_avg_opp_TD_pct              1.885595  1        1.373170
## R_avg_REV                     1.467499  1        1.211404
## R_avg_opp_REV                 1.460827  1        1.208647
## R_avg_TD_landed               2.309801  1        1.519803
## R_avg_opp_TD_landed           2.146734  1        1.465174
## R_avg_SIG_STR_landed          8.267906  1        2.875397
## R_avg_opp_SIG_STR_landed      9.061933  1        3.010304
## R_avg_SIG_STR_pct             1.496087  1        1.223147
## R_avg_opp_SIG_STR_pct         1.487193  1        1.219505
## R_avg_TOTAL_STR_landed        7.038128  1        2.652947
## R_avg_opp_TOTAL_STR_landed    7.803845  1        2.793536
## R_avg_CLINCH_landed           2.338427  1        1.529192
## R_avg_opp_CLINCH_landed       2.339588  1        1.529571
## R_avg_KD                      1.369624  1        1.170309
## R_avg_opp_KD                  1.277758  1        1.130380
## R_avg_opp_SUB_ATT             1.250770  1        1.118378
## R_avg_GROUND_landed           2.002038  1        1.414934
## R_avg_opp_GROUND_landed       2.158113  1        1.469052
## R_win_by_KO.TKO               3.379912  1        1.838454
## R_win_by_Submission           1.970394  1        1.403707
## R_win_by_Decision_Unanimous   4.374091  1        2.091433
## B_total_rounds_fought        25.756155  1        5.075052
## R_total_rounds_fought         8.320626  1        2.884550
## B_avg_CTRL_time.seconds.      5.267908  1        2.295192
## B_avg_opp_CTRL_time.seconds.  4.739391  1        2.177014
## B_current_win_streak          1.794786  1        1.339696
## B_losses                      9.787890  1        3.128560
## R_age                         1.713216  1        1.308899
## B_age                         1.511191  1        1.229305
## B_Reach_cms                   6.524453  1        2.554301
## R_Reach_cms                   6.901464  1        2.627064
## B_Height_cms                  7.593179  1        2.755572
## R_Height_cms                  8.017919  1        2.831593
## weight_class                 35.601309 12        1.160498

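We notice some high VIF scores: B_total_rounds_fought (25.8) and B_avg_opp_SIG_STR_landed (10.8) exceed 10, and several striking, experience, reach, and height variables fall between 5 and 10. The raw GVIF for weight_class (35.6) is spread over 12 degrees of freedom, and its adjusted value GVIF^(1/(2*Df)) is only 1.16. We should keep in mind that UFC fights always occur within the same weight class, so both fighters have the same weight limit, similar height, etc. Also, some variables are highly correlated because they accumulate together, like ‘B_total_rounds_fought’ and ‘B_total_time_fought.seconds.’.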

After observing the summary of the m1 model, the significance levels of the variables, and the VIF, we choose the following variables for our second Logistic Regression model below (m2).

For the initial model build, we will use striking variables that are more general than specific. For example, we will use Striking Percentage rather than Leg Strikes Attempted. The reason is that fighting styles vary among fighters, and we want to focus on their efficiency rather than their style of fighting when predicting. We can experiment with specific strikes later.

Note: All fights happen within a fighter’s weight class, so their compiled stats from previous fights and their opponents’ compiled stats have all occurred within that weight class. Patterns in their fighting are therefore bound to be consistent with their weight class, which could lead to multicollinearity. Height could be an issue as well. We will leave weight class out, as it is more a fixed state of the fighter than a variable that changes from fight to fight and could provide predictive ability.

set.seed(123)
m2 = glm(win_dummy ~  R_avg_TD_att + R_avg_SIG_STR_pct + R_avg_opp_SIG_STR_pct + R_avg_KD + R_avg_opp_KD + 
R_avg_SUB_ATT +  R_avg_CTRL_time.seconds. + R_avg_opp_CTRL_time.seconds. + R_avg_CLINCH_landed +
R_age + R_Reach_cms + R_avg_REV + R_avg_opp_GROUND_landed +

B_avg_opp_GROUND_landed +B_avg_TD_att +B_avg_opp_TD_pct +B_avg_SIG_STR_pct + B_avg_opp_SIG_STR_pct +
B_avg_KD + B_avg_opp_KD + B_avg_SUB_ATT +B_avg_CTRL_time.seconds.+B_avg_opp_CTRL_time.seconds.+
B_avg_CLINCH_landed + B_avg_opp_CLINCH_landed +
B_age + B_Reach_cms + B_avg_REV,
data = ufc_train, family = binomial)

summary(m2)

## 
## Call:
## glm(formula = win_dummy ~ R_avg_TD_att + R_avg_SIG_STR_pct + 
##     R_avg_opp_SIG_STR_pct + R_avg_KD + R_avg_opp_KD + R_avg_SUB_ATT + 
##     R_avg_CTRL_time.seconds. + R_avg_opp_CTRL_time.seconds. + 
##     R_avg_CLINCH_landed + R_age + R_Reach_cms + R_avg_REV + R_avg_opp_GROUND_landed + 
##     B_avg_opp_GROUND_landed + B_avg_TD_att + B_avg_opp_TD_pct + 
##     B_avg_SIG_STR_pct + B_avg_opp_SIG_STR_pct + B_avg_KD + B_avg_opp_KD + 
##     B_avg_SUB_ATT + B_avg_CTRL_time.seconds. + B_avg_opp_CTRL_time.seconds. + 
##     B_avg_CLINCH_landed + B_avg_opp_CLINCH_landed + B_age + B_Reach_cms + 
##     B_avg_REV, family = binomial, data = ufc_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9289  -1.0063  -0.6767   1.1274   2.2714  
## 
## Coefficients:
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   3.943e-01  1.194e+00   0.330 0.741254    
## R_avg_TD_att                 -6.728e-02  3.107e-02  -2.165 0.030353 *  
## R_avg_SIG_STR_pct            -7.133e-01  6.121e-01  -1.165 0.243848    
## R_avg_opp_SIG_STR_pct         2.211e+00  6.327e-01   3.494 0.000476 ***
## R_avg_KD                     -1.480e-01  1.809e-01  -0.818 0.413229    
## R_avg_opp_KD                  1.587e-01  2.055e-01   0.772 0.439846    
## R_avg_SUB_ATT                -2.999e-01  1.273e-01  -2.357 0.018423 *  
## R_avg_CTRL_time.seconds.      2.862e-05  7.068e-04   0.040 0.967701    
## R_avg_opp_CTRL_time.seconds.  1.170e-03  7.625e-04   1.534 0.125085    
## R_avg_CLINCH_landed          -5.096e-03  1.164e-02  -0.438 0.661529    
## R_age                         8.143e-02  1.534e-02   5.307 1.12e-07 ***
## R_Reach_cms                  -1.424e-02  7.917e-03  -1.799 0.071971 .  
## R_avg_REV                     3.796e-01  2.276e-01   1.668 0.095281 .  
## R_avg_opp_GROUND_landed      -4.215e-03  1.205e-02  -0.350 0.726412    
## B_avg_opp_GROUND_landed      -3.399e-02  1.247e-02  -2.725 0.006433 ** 
## B_avg_TD_att                  7.750e-02  3.106e-02   2.495 0.012584 *  
## B_avg_opp_TD_pct             -7.198e-01  2.936e-01  -2.451 0.014236 *  
## B_avg_SIG_STR_pct            -7.250e-02  5.821e-01  -0.125 0.900874    
## B_avg_opp_SIG_STR_pct         4.370e-01  6.365e-01   0.687 0.492385    
## B_avg_KD                     -8.804e-02  1.745e-01  -0.505 0.613798    
## B_avg_opp_KD                 -3.689e-01  2.204e-01  -1.674 0.094176 .  
## B_avg_SUB_ATT                 6.702e-02  1.083e-01   0.619 0.536102    
## B_avg_CTRL_time.seconds.     -5.392e-05  7.186e-04  -0.075 0.940184    
## B_avg_opp_CTRL_time.seconds.  1.581e-03  7.112e-04   2.222 0.026271 *  
## B_avg_CLINCH_landed           2.588e-02  1.393e-02   1.858 0.063142 .  
## B_avg_opp_CLINCH_landed      -6.428e-02  1.669e-02  -3.852 0.000117 ***
## B_age                        -7.328e-02  1.587e-02  -4.618 3.87e-06 ***
## B_Reach_cms                   6.299e-03  8.249e-03   0.764 0.445084    
## B_avg_REV                     1.976e-01  2.098e-01   0.942 0.346182    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1811.8  on 1321  degrees of freedom
## Residual deviance: 1666.3  on 1293  degrees of freedom
## AIC: 1724.3
## 
## Number of Fisher Scoring iterations: 4

set.seed(123)
ufc_test$PredProb = predict.glm(m2, ufc_test, type = 'response')

ufc_test$Prediction = ifelse(ufc_test$PredProb >= 0.5,1,0)

caret::confusionMatrix(as.factor(ufc_test$win_dummy), as.factor(ufc_test$Prediction), positive='1')

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 233  91
##          1 127 115
##                                           
##                Accuracy : 0.6148          
##                  95% CI : (0.5734, 0.6551)
##     No Information Rate : 0.636           
##     P-Value [Acc > NIR] : 0.86244         
##                                           
##                   Kappa : 0.1981          
##                                           
##  Mcnemar's Test P-Value : 0.01776         
##                                           
##             Sensitivity : 0.5583          
##             Specificity : 0.6472          
##          Pos Pred Value : 0.4752          
##          Neg Pred Value : 0.7191          
##              Prevalence : 0.3640          
##          Detection Rate : 0.2032          
##    Detection Prevalence : 0.4276          
##       Balanced Accuracy : 0.6027          
##                                           
##        'Positive' Class : 1               
##

Check the multicollinearity scores.

vif(m2)

##                 R_avg_TD_att            R_avg_SIG_STR_pct 
##                     1.959772                     1.230693 
##        R_avg_opp_SIG_STR_pct                     R_avg_KD 
##                     1.244932                     1.240002 
##                 R_avg_opp_KD                R_avg_SUB_ATT 
##                     1.200657                     1.283920 
##     R_avg_CTRL_time.seconds. R_avg_opp_CTRL_time.seconds. 
##                     2.251146                     2.019427 
##          R_avg_CLINCH_landed                        R_age 
##                     1.304143                     1.206647 
##                  R_Reach_cms                    R_avg_REV 
##                     2.424137                     1.200405 
##      R_avg_opp_GROUND_landed      B_avg_opp_GROUND_landed 
##                     1.855908                     1.642835 
##                 B_avg_TD_att             B_avg_opp_TD_pct 
##                     2.015960                     1.399594 
##            B_avg_SIG_STR_pct        B_avg_opp_SIG_STR_pct 
##                     1.269691                     1.345275 
##                     B_avg_KD                 B_avg_opp_KD 
##                     1.268585                     1.183851 
##                B_avg_SUB_ATT     B_avg_CTRL_time.seconds. 
##                     1.219655                     2.215003 
## B_avg_opp_CTRL_time.seconds.          B_avg_CLINCH_landed 
##                     1.935582                     1.668842 
##      B_avg_opp_CLINCH_landed                        B_age 
##                     1.687617                     1.120621 
##                  B_Reach_cms                    B_avg_REV 
##                     2.541146                     1.230209

After running the m2 model, we see that the model is still not a significant predictor (P-Value [Acc > NIR] = 0.86). However, the VIF scores are all below 5, meaning the model likely does not contain problematic multicollinearity.

We will now run a Random Forest model to see if our results improve. In the Random Forest we will also tune the hyperparameters.

Random Forest

set.seed(123)
train_rf.idx <- sample(1:nrow(ufc_2), size = 1 * nrow(ufc_2))
train_rf_data <- ufc_2[train_rf.idx,]

train_rf_data$win_dummy = as.factor(train_rf_data$win_dummy)

table(train_rf_data$win_dummy)

## 
##    0    1 
## 1068  820

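The out-of-bag (OOB) method is included. Out-of-bag error, also called the out-of-bag estimate, is a method of measuring the prediction error of random forests, boosted decision trees, and other machine learning models that use bootstrap aggregating (bagging). Bagging draws samples with replacement to create the training sample for each tree; the rows left out of a tree’s sample are then used to test that tree. randomForest() reports the OOB estimate by default. The oob_score argument in the call below is a scikit-learn name and is not among the arguments documented for randomForest().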
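The share of rows that end up out of bag for a single tree can be illustrated with one bootstrap draw. This is a small simulation of the sampling step only, not of the forest itself; the row count of 1888 is taken from the class table above (1068 + 820), and the variable names are illustrative.

```r
set.seed(123)
n_rows <- 1888   # number of rows in train_rf_data, per the class table above

# One bootstrap sample: draw n_rows row indices with replacement
boot_idx <- sample.int(n_rows, size = n_rows, replace = TRUE)

# Rows never drawn are "out of bag" for this tree
oob_idx <- setdiff(seq_len(n_rows), boot_idx)

# Fraction of rows left out; in expectation close to 1/e, about 0.37
oob_fraction <- length(oob_idx) / n_rows
oob_fraction
```

Because roughly a third of the rows are unseen by any given tree, every row can be scored by the trees that did not train on it, which is what makes the OOB error usable as a built-in validation estimate.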

set.seed(123)
forest2 <- randomForest(win_dummy ~ R_avg_TD_att + R_avg_SIG_STR_pct + R_avg_opp_SIG_STR_pct + R_avg_KD + R_avg_opp_KD + R_avg_SUB_ATT +  R_avg_CTRL_time.seconds. + R_avg_opp_CTRL_time.seconds. + R_avg_CLINCH_landed +
R_age + R_Reach_cms + R_avg_REV + R_avg_opp_GROUND_landed +

B_avg_opp_GROUND_landed + B_avg_TD_att + B_avg_opp_TD_pct +B_avg_SIG_STR_pct + B_avg_opp_SIG_STR_pct +
B_avg_KD + B_avg_opp_KD + B_avg_SUB_ATT + B_avg_CTRL_time.seconds.+ B_avg_opp_CTRL_time.seconds.+
B_avg_CLINCH_landed + B_avg_opp_CLINCH_landed + B_age +  B_Reach_cms + B_avg_REV,

data = train_rf_data, 
importance = TRUE, 
oob_score= TRUE,
ntree = 1000)

forest2

## 
## Call:
##  randomForest(formula = win_dummy ~ R_avg_TD_att + R_avg_SIG_STR_pct +      R_avg_opp_SIG_STR_pct + R_avg_KD + R_avg_opp_KD + R_avg_SUB_ATT +      R_avg_CTRL_time.seconds. + R_avg_opp_CTRL_time.seconds. +      R_avg_CLINCH_landed + R_age + R_Reach_cms + R_avg_REV + R_avg_opp_GROUND_landed +      B_avg_opp_GROUND_landed + B_avg_TD_att + B_avg_opp_TD_pct +      B_avg_SIG_STR_pct + B_avg_opp_SIG_STR_pct + B_avg_KD + B_avg_opp_KD +      B_avg_SUB_ATT + B_avg_CTRL_time.seconds. + B_avg_opp_CTRL_time.seconds. +      B_avg_CLINCH_landed + B_avg_opp_CLINCH_landed + B_age + B_Reach_cms +      B_avg_REV, data = train_rf_data, importance = TRUE, oob_score = TRUE,      ntree = 1000) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 5
## 
##         OOB estimate of  error rate: 41.21%
## Confusion matrix:
##     0   1 class.error
## 0 824 244   0.2284644
## 1 534 286   0.6512195

The out-of-bag error is 41.21%, which gives an out-of-bag score (accuracy) of about 59%.
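The OOB score can be recomputed from the confusion matrix in the printout, which also makes it easy to compare against the share of the larger class. The counts below are copied by hand from the forest2 output, with rows read as the observed class, as in the randomForest printout where class.error is computed per row.

```r
# Confusion matrix counts copied from the forest2 printout (rows = observed class)
oob_counts <- matrix(c(824, 534, 244, 286), nrow = 2,
                     dimnames = list(observed = c("0", "1"), predicted = c("0", "1")))

n_total <- sum(oob_counts)

# OOB accuracy is the share of rows on the diagonal
oob_accuracy <- sum(diag(oob_counts)) / n_total

# Share of the larger observed class, i.e. the accuracy of always guessing it
majority_share <- max(rowSums(oob_counts)) / n_total

round(c(oob_accuracy = oob_accuracy, majority_share = majority_share), 4)
```

The two numbers come out at about 0.588 and 0.566, so the forest's OOB accuracy sits only a couple of percentage points above the majority-class share.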

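An OOB accuracy of about 59% is in the same range as the test accuracy of the Logistic Regression models (59.5% for m1 and 61.5% for m2), so the Random Forest does not clearly do a better job at predicting in this instance. The forest is also much weaker on class 1 (class error of about 65%) than on class 0 (about 23%). This could change with further transformation of the data, such as restricting the analysis to one weight class.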

Random Forest Hyper Parameters

set.seed(123)
# Establish a list of possible values for hyper-parameters
mtry.values <- seq(4,6,1)
nodesize.values <- seq(4,8,2)
ntree.values <- seq(4e3,6e3,1e3)
# Create a data frame containing all combinations
hyper_grid <- expand.grid(mtry = mtry.values, nodesize = nodesize.values, ntree = ntree.values)
# Create an empty vector to store OOB error values
oob_err <- c()
# Write a loop over the rows of hyper_grid to train the grid of models
for (i in 1:nrow(hyper_grid)) {
   # Train a Random Forest model
  forest2 <- randomForest(win_dummy ~ R_avg_TD_att + R_avg_SIG_STR_pct + R_avg_opp_SIG_STR_pct + 
              R_avg_KD +   R_avg_opp_KD + R_avg_SUB_ATT +  R_avg_CTRL_time.seconds. + 
              R_avg_opp_CTRL_time.seconds. + R_avg_CLINCH_landed +
              R_age + R_Reach_cms + R_avg_REV + R_avg_opp_GROUND_landed +

              B_avg_opp_GROUND_landed + B_avg_TD_att + B_avg_opp_TD_pct + B_avg_SIG_STR_pct + 
              B_avg_opp_SIG_STR_pct + B_avg_KD + B_avg_opp_KD + 
              B_avg_SUB_ATT + B_avg_CTRL_time.seconds.+ B_avg_opp_CTRL_time.seconds.+
              B_avg_CLINCH_landed + B_avg_opp_CLINCH_landed + B_age + B_Reach_cms +
              B_avg_REV,
              
                           data = train_rf_data, 
                           mtry = hyper_grid$mtry[i],
                           nodesize = hyper_grid$nodesize[i],
                           ntree = hyper_grid$ntree[i]) 
                         
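                # Store the final error value for the model.
                # Note: for classification err.rate is a matrix (columns OOB, 0, 1), so the
                # last element is the class "1" error after the final tree;
                # forest2$err.rate[nrow(forest2$err.rate), "OOB"] selects the overall OOB error.
                oob_err[i] <- forest2$err.rate[length(forest2$err.rate)]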
}
# Identify optimal set of hyperparameters based on OOB error
opt_i <- which.min(oob_err)
print(hyper_grid[opt_i,])

##    mtry nodesize ntree
## 12    6        4  5000
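The loop above reads the error with forest2$err.rate[length(forest2$err.rate)]. For a classification forest, err.rate is in practice a matrix with one row per tree and columns for the OOB error and for each class, so the last cell in storage order belongs to the last class column rather than to OOB. The sketch below uses a small hand-made matrix with that layout; the numbers are mock values, not real model output, and the layout itself is an assumption worth confirming with colnames(forest2$err.rate).

```r
# Mock err.rate with the assumed layout: 3 trees, columns OOB / "0" / "1"
err_rate <- matrix(
  c(0.45, 0.43, 0.41,    # OOB error after 1, 2, 3 trees
    0.25, 0.24, 0.23,    # class "0" error
    0.68, 0.66, 0.65),   # class "1" error
  nrow = 3,
  dimnames = list(NULL, c("OOB", "0", "1"))
)

# Linear indexing walks down columns, so the last element is the class "1" error
last_cell <- err_rate[length(err_rate)]

# Naming the row and column picks the overall OOB error after the final tree
final_oob <- err_rate[nrow(err_rate), "OOB"]

c(last_cell = last_cell, final_oob = final_oob)
```

If the grid search is meant to minimise the overall OOB error, the second form is the one to store; re-running the search that way may select a different row of hyper_grid than the one printed above.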

Redo Random Forest with new Hyperparameters

set.seed(123)
forest.hyper <- randomForest(win_dummy ~ R_avg_TD_att + R_avg_SIG_STR_pct + R_avg_opp_SIG_STR_pct + R_avg_KD + R_avg_opp_KD + R_avg_SUB_ATT +  R_avg_CTRL_time.seconds. + R_avg_opp_CTRL_time.seconds. + R_avg_CLINCH_landed +
R_age + R_Reach_cms + R_avg_REV + R_avg_opp_GROUND_landed +

B_avg_opp_GROUND_landed + B_avg_TD_att + B_avg_opp_TD_pct + B_avg_SIG_STR_pct + B_avg_opp_SIG_STR_pct +
B_avg_KD + B_avg_opp_KD +  B_avg_SUB_ATT + B_avg_CTRL_time.seconds.+ B_avg_opp_CTRL_time.seconds.+
B_avg_CLINCH_landed + B_avg_opp_CLINCH_landed + B_age + B_Reach_cms +
B_avg_REV,
                             data = train_rf_data,
                             importance = TRUE,
                             mtry = 6,
                             nodesize = 4,
                             ntree = 5000)

forest.hyper

## 
## Call:
##  randomForest(formula = win_dummy ~ R_avg_TD_att + R_avg_SIG_STR_pct +      R_avg_opp_SIG_STR_pct + R_avg_KD + R_avg_opp_KD + R_avg_SUB_ATT +      R_avg_CTRL_time.seconds. + R_avg_opp_CTRL_time.seconds. +      R_avg_CLINCH_landed + R_age + R_Reach_cms + R_avg_REV + R_avg_opp_GROUND_landed +      B_avg_opp_GROUND_landed + B_avg_TD_att + B_avg_opp_TD_pct +      B_avg_SIG_STR_pct + B_avg_opp_SIG_STR_pct + B_avg_KD + B_avg_opp_KD +      B_avg_SUB_ATT + B_avg_CTRL_time.seconds. + B_avg_opp_CTRL_time.seconds. +      B_avg_CLINCH_landed + B_avg_opp_CLINCH_landed + B_age + B_Reach_cms +      B_avg_REV, data = train_rf_data, importance = TRUE, mtry = 6,      nodesize = 4, ntree = 5000) 
##                Type of random forest: classification
##                      Number of trees: 5000
## No. of variables tried at each split: 6
## 
##         OOB estimate of  error rate: 41.15%
## Confusion matrix:
##     0   1 class.error
## 0 833 235   0.2200375
## 1 542 278   0.6609756

Random Forest Variable Importance

source: https://topepo.github.io/caret/variable-importance.html

Variable importance evaluation functions can be separated into two groups: those that use the model information and those that do not. The advantage of using a model-based approach is that it is more closely tied to the model performance and that it may be able to incorporate the correlation structure between the predictors into the importance calculation. Regardless of how the importance is calculated:

For most classification models, each predictor will have a separate variable importance for each class (the exceptions are classification trees, bagged trees and boosted trees). All measures of importance are scaled to have a maximum value of 100, unless the scale argument of varImp.train is set to FALSE.

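Below we see the variable importance of our tuned Random Forest model. Keep in mind that this forest includes all weight classes. Age again stands out: R_age (20.77) and B_age (17.22) have the highest importance scores, followed by R_avg_opp_SIG_STR_pct (9.73) and R_avg_opp_CTRL_time.seconds. (8.02). The scores shown here are not rescaled to a maximum of 100, and the two class columns are identical. Scores near or below zero suggest that a variable contributes little to the forest's accuracy.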

rf_imp <- varImp(forest.hyper)

rf_imp

##                                        0           1
## R_avg_TD_att                  3.85835083  3.85835083
## R_avg_SIG_STR_pct            -1.93723582 -1.93723582
## R_avg_opp_SIG_STR_pct         9.73458479  9.73458479
## R_avg_KD                     -1.66510182 -1.66510182
## R_avg_opp_KD                  3.79993995  3.79993995
## R_avg_SUB_ATT                 1.93845581  1.93845581
## R_avg_CTRL_time.seconds.      4.16326805  4.16326805
## R_avg_opp_CTRL_time.seconds.  8.01785745  8.01785745
## R_avg_CLINCH_landed          -1.26455786 -1.26455786
## R_age                        20.77182477 20.77182477
## R_Reach_cms                  -0.08861086 -0.08861086
## R_avg_REV                     0.36628520  0.36628520
## R_avg_opp_GROUND_landed       0.81512365  0.81512365
## B_avg_opp_GROUND_landed      -0.39399884 -0.39399884
## B_avg_TD_att                  6.44701816  6.44701816
## B_avg_opp_TD_pct              4.83670630  4.83670630
## B_avg_SIG_STR_pct             2.03220016  2.03220016
## B_avg_opp_SIG_STR_pct        -0.44450234 -0.44450234
## B_avg_KD                     -1.80311510 -1.80311510
## B_avg_opp_KD                  0.18117799  0.18117799
## B_avg_SUB_ATT                 4.95586401  4.95586401
## B_avg_CTRL_time.seconds.      3.39856639  3.39856639
## B_avg_opp_CTRL_time.seconds. -0.19979859 -0.19979859
## B_avg_CLINCH_landed           1.08179709  1.08179709
## B_avg_opp_CLINCH_landed       1.58935508  1.58935508
## B_age                        17.22058939 17.22058939
## B_Reach_cms                  -0.65041339 -0.65041339
## B_avg_REV                    -0.87504128 -0.87504128
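The table is easier to read when sorted. Here is a minimal sketch in base R, using a handful of rows copied from the output above as a stand-in for `rf_imp` (the names `imp_demo` and `imp_sorted` are illustrative, not part of the original analysis):

```r
# A few rows copied from the varImp() output above, as a stand-in for rf_imp
imp_demo <- data.frame(
  `0` = c(3.85835083, 9.73458479, 20.77182477, 17.22058939, -1.93723582),
  `1` = c(3.85835083, 9.73458479, 20.77182477, 17.22058939, -1.93723582),
  row.names = c("R_avg_TD_att", "R_avg_opp_SIG_STR_pct", "R_age",
                "B_age", "R_avg_SIG_STR_pct"),
  check.names = FALSE
)

# Sort from most to least important using the first class column
imp_sorted <- imp_demo[order(imp_demo[["0"]], decreasing = TRUE), , drop = FALSE]
imp_sorted

# Names of the top three predictors in this subset
head(rownames(imp_sorted), 3)
```

The same `order()` call should work on the full `rf_imp` data frame, assuming it keeps the `0` and `1` column names shown in the output.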

SVM Model

Next we fit an SVM model to see whether a different type of model improves predictive performance.

set.seed(123)
svm.idx <- sample(1:nrow(ufc_2), size = 0.70 * nrow(ufc_2))
train_svm_data <- ufc_2[svm.idx,]
test_svm_data <- ufc_2[-svm.idx,]

Repeated k-fold cross-validation (10 folds, repeated 3 times) is set up with caret's trainControl() so that performance can be estimated on several resamples of the training data.

# Set up Repeated k-fold Cross Validation
train_control <- trainControl(method="repeatedcv", number=10, repeats=3)

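For SVM, we scale the variables, because distance-based algorithms such as KNN, K-means, and SVM are sensitive to the range of each feature. Behind the scenes they use distances between data points to determine similarity, so when two features are on different scales, the feature with the larger magnitude can dominate. This hurts the performance of the algorithm. Note that the svm() function called below comes from e1071, not caret: trControl, preProcess, and tuneGrid are caret::train() arguments that e1071::svm() silently ignores, so the model is fit once with the default cost of 1 (as the summary shows). The predictors are still scaled, because e1071::svm() scales them by default (scale = TRUE).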
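To make the effect of centering and scaling concrete, here is a minimal sketch in base R. The toy values and the column names `age` and `ctrl_time` are assumptions chosen for illustration. The scaling parameters are learned from the training rows and then reused on the test rows, so that both sets are on the same scale.

```r
# Toy data (assumption): two features on very different scales
train_x <- data.frame(age = c(25, 30, 35, 28), ctrl_time = c(10, 250, 120, 400))
test_x  <- data.frame(age = c(27, 33),         ctrl_time = c(60, 300))

# Learn centering/scaling parameters from the training set only
train_scaled <- scale(train_x)
ctr <- attr(train_scaled, "scaled:center")
sds <- attr(train_scaled, "scaled:scale")

# Apply the training parameters to the test set
test_scaled <- scale(test_x, center = ctr, scale = sds)

round(colMeans(train_scaled), 10)  # each training column is centred at 0
apply(train_scaled, 2, sd)         # each training column has standard deviation 1
```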

set.seed(123)
svm = svm(formula = win_dummy ~ R_avg_TD_att + R_avg_opp_TD_pct + R_avg_SIG_STR_pct + R_avg_opp_SIG_STR_pct + R_avg_KD + R_avg_opp_KD + 
R_avg_SUB_ATT +  R_avg_CTRL_time.seconds. + 
R_avg_opp_CTRL_time.seconds. + R_avg_CLINCH_landed +
R_age + R_Reach_cms + R_avg_REV + R_avg_opp_GROUND_landed +

B_avg_opp_GROUND_landed + B_avg_TD_att + B_avg_opp_TD_pct + B_avg_SIG_STR_pct +
B_avg_opp_SIG_STR_pct + B_avg_KD + B_avg_opp_KD + B_avg_SUB_ATT +
B_avg_CTRL_time.seconds.+ B_avg_opp_CTRL_time.seconds.+ B_avg_CLINCH_landed +
B_avg_opp_CLINCH_landed + B_age + B_Reach_cms + B_avg_REV,
                  
data = train_svm_data,
trControl = train_control,
type = 'C-classification',
                
kernel = 'linear',
preProcess = c("center","scale"),
tuneGrid = expand.grid(C = seq(0, 2, length = 20)))

summary(svm)

## 
## Call:
## svm(formula = win_dummy ~ R_avg_TD_att + R_avg_opp_TD_pct + R_avg_SIG_STR_pct + 
##     R_avg_opp_SIG_STR_pct + R_avg_KD + R_avg_opp_KD + R_avg_SUB_ATT + 
##     R_avg_CTRL_time.seconds. + R_avg_opp_CTRL_time.seconds. + R_avg_CLINCH_landed + 
##     R_age + R_Reach_cms + R_avg_REV + R_avg_opp_GROUND_landed + B_avg_opp_GROUND_landed + 
##     B_avg_TD_att + B_avg_opp_TD_pct + B_avg_SIG_STR_pct + B_avg_opp_SIG_STR_pct + 
##     B_avg_KD + B_avg_opp_KD + B_avg_SUB_ATT + B_avg_CTRL_time.seconds. + 
##     B_avg_opp_CTRL_time.seconds. + B_avg_CLINCH_landed + B_avg_opp_CLINCH_landed + 
##     B_age + B_Reach_cms + B_avg_REV, data = train_svm_data, trControl = train_control, 
##     type = "C-classification", kernel = "linear", preProcess = c("center", 
##         "scale"), tuneGrid = expand.grid(C = seq(0, 2, length = 20)))
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
## 
## Number of Support Vectors:  1056
## 
##  ( 529 527 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1

set.seed(123)
svm.pred_1 = predict(svm,newdata = test_svm_data)

caret::confusionMatrix(as.factor(svm.pred_1), as.factor(test_svm_data$win_dummy), positive = '1')

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 224 121
##          1 100 122
##                                           
##                Accuracy : 0.6102          
##                  95% CI : (0.5687, 0.6506)
##     No Information Rate : 0.5714          
##     P-Value [Acc > NIR] : 0.03363         
##                                           
##                   Kappa : 0.1955          
##                                           
##  Mcnemar's Test P-Value : 0.17851         
##                                           
##             Sensitivity : 0.5021          
##             Specificity : 0.6914          
##          Pos Pred Value : 0.5495          
##          Neg Pred Value : 0.6493          
##              Prevalence : 0.4286          
##          Detection Rate : 0.2152          
##    Detection Prevalence : 0.3915          
##       Balanced Accuracy : 0.5967          
##                                           
##        'Positive' Class : 1               
##
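The headline statistics can be reproduced directly from the four counts in the table. This is a minimal sketch in base R; the counts are copied from the output above, and the variable names are illustrative.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference)
cm <- matrix(c(224, 100, 121, 122), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)        # (224 + 122) / 567
sensitivity <- cm["1", "1"] / sum(cm[, "1"])  # true positives / actual positives
specificity <- cm["0", "0"] / sum(cm[, "0"])  # true negatives / actual negatives
nir         <- max(colSums(cm)) / sum(cm)     # always predicting the majority class

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, no_information_rate = nir), 4)
```

Comparing `accuracy` with `nir` shows how much the model adds over always predicting the more common outcome.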

The SVM model performs modestly better than the baseline: test accuracy is 61.0%, compared with a no-information rate of 57.1% (p = 0.034). We might be able to improve performance by trying other kernels. So far, the SVM looks to have the best predictive ability.

Welterweight Model.

Now we will use the same independent variables to build models for specific weight classes. The models below use only the rows of the dataset that match a given weight class (for example, the Welterweight model uses only Welterweight fights).

Logistic Regression and SVM with Welterweight data only.

Logistic Regression Model Welter

Remember that the Welterweight class has the greatest number of fights between 2015 and 2021.

ufc_welter_weight <- filter(ufc_2, weight_class == 'Welterweight')

set.seed(123)
sample_welter <- sample.int(n = nrow(ufc_welter_weight), size = round(.7*nrow(ufc_welter_weight)), replace = F)
ufc_train_welter <- ufc_welter_weight[sample_welter, ]
ufc_test_welter  <- ufc_welter_weight[-sample_welter, ]
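The filter-then-split pattern is repeated for every weight class, so it could be wrapped in a small helper. The function `split_weight_class()` below is hypothetical (it is not part of the original analysis), and the toy data frame only borrows the `weight_class` and `win_dummy` column names.

```r
# Hypothetical helper (not in the original analysis): filter one weight class
# and return a reproducible train/test split.
split_weight_class <- function(data, class_name, train_frac = 0.7, seed = 123) {
  subset_df <- data[data$weight_class == class_name, , drop = FALSE]
  set.seed(seed)
  idx <- sample.int(n = nrow(subset_df), size = round(train_frac * nrow(subset_df)))
  list(train = subset_df[idx, , drop = FALSE],
       test  = subset_df[-idx, , drop = FALSE])
}

# Toy data (assumption) with the same weight_class column name as ufc_2
toy <- data.frame(
  weight_class = rep(c("Welterweight", "Lightweight"), times = c(20, 10)),
  win_dummy    = rep(c(0, 1), 15)
)

welter <- split_weight_class(toy, "Welterweight")
nrow(welter$train)  # 14
nrow(welter$test)   # 6
```

Using one helper for every weight class also makes it harder to split the wrong data frame by accident.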

set.seed(123)
m1_welter = glm(win_dummy ~  R_avg_TD_att  + R_avg_SIG_STR_pct + R_avg_opp_SIG_STR_pct + R_avg_KD + R_avg_opp_KD + 
R_avg_SUB_ATT +  R_avg_CTRL_time.seconds. + 
R_avg_opp_CTRL_time.seconds. + R_avg_CLINCH_landed +
R_age + R_Reach_cms + R_avg_REV + R_avg_opp_GROUND_landed +

B_avg_opp_GROUND_landed +
B_avg_TD_att +
B_avg_opp_TD_pct +
B_avg_SIG_STR_pct +
B_avg_opp_SIG_STR_pct +
B_avg_KD +
B_avg_opp_KD + 
B_avg_SUB_ATT +
B_avg_CTRL_time.seconds.+
B_avg_opp_CTRL_time.seconds.+



B_avg_CLINCH_landed +
B_avg_opp_CLINCH_landed +
B_age + 
B_Reach_cms +
B_avg_REV
               
               , data = ufc_train_welter, family = binomial)

summary(m1_welter)

## 
## Call:
## glm(formula = win_dummy ~ R_avg_TD_att + R_avg_SIG_STR_pct + 
##     R_avg_opp_SIG_STR_pct + R_avg_KD + R_avg_opp_KD + R_avg_SUB_ATT + 
##     R_avg_CTRL_time.seconds. + R_avg_opp_CTRL_time.seconds. + 
##     R_avg_CLINCH_landed + R_age + R_Reach_cms + R_avg_REV + R_avg_opp_GROUND_landed + 
##     B_avg_opp_GROUND_landed + B_avg_TD_att + B_avg_opp_TD_pct + 
##     B_avg_SIG_STR_pct + B_avg_opp_SIG_STR_pct + B_avg_KD + B_avg_opp_KD + 
##     B_avg_SUB_ATT + B_avg_CTRL_time.seconds. + B_avg_opp_CTRL_time.seconds. + 
##     B_avg_CLINCH_landed + B_avg_opp_CLINCH_landed + B_age + B_Reach_cms + 
##     B_avg_REV, family = binomial, data = ufc_train_welter)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.970  -1.004  -0.698   1.103   2.190  
## 
## Coefficients:
##                                Estimate Std. Error z value Pr(>|z|)  
## (Intercept)                  11.1526290  8.3156360   1.341   0.1799  
## R_avg_TD_att                 -0.0688275  0.0792859  -0.868   0.3853  
## R_avg_SIG_STR_pct             1.0977688  1.6492100   0.666   0.5056  
## R_avg_opp_SIG_STR_pct         1.0027712  1.6346678   0.613   0.5396  
## R_avg_KD                     -0.3043669  0.4172462  -0.729   0.4657  
## R_avg_opp_KD                 -0.5810565  0.5868028  -0.990   0.3221  
## R_avg_SUB_ATT                -0.1989581  0.4417806  -0.450   0.6525  
## R_avg_CTRL_time.seconds.     -0.0022394  0.0017245  -1.299   0.1941  
## R_avg_opp_CTRL_time.seconds.  0.0020654  0.0021842   0.946   0.3443  
## R_avg_CLINCH_landed           0.0080213  0.0298085   0.269   0.7879  
## R_age                         0.0412604  0.0396676   1.040   0.2983  
## R_Reach_cms                  -0.0400960  0.0305192  -1.314   0.1889  
## R_avg_REV                    -0.0055806  0.7193079  -0.008   0.9938  
## R_avg_opp_GROUND_landed      -0.0445277  0.0496850  -0.896   0.3701  
## B_avg_opp_GROUND_landed      -0.0006107  0.0438096  -0.014   0.9889  
## B_avg_TD_att                  0.1923105  0.1065373   1.805   0.0711 .
## B_avg_opp_TD_pct              0.1011483  0.7924275   0.128   0.8984  
## B_avg_SIG_STR_pct             1.3754170  1.7325420   0.794   0.4273  
## B_avg_opp_SIG_STR_pct        -3.3820734  1.7434786  -1.940   0.0524 .
## B_avg_KD                     -0.1656506  0.4020598  -0.412   0.6803  
## B_avg_opp_KD                  0.1458027  0.5011546   0.291   0.7711  
## B_avg_SUB_ATT                 0.0441990  0.3282362   0.135   0.8929  
## B_avg_CTRL_time.seconds.     -0.0002550  0.0022599  -0.113   0.9102  
## B_avg_opp_CTRL_time.seconds.  0.0017408  0.0021023   0.828   0.4076  
## B_avg_CLINCH_landed          -0.0030797  0.0331017  -0.093   0.9259  
## B_avg_opp_CLINCH_landed      -0.0042818  0.0462423  -0.093   0.9262  
## B_age                        -0.0222009  0.0486248  -0.457   0.6480  
## B_Reach_cms                  -0.0243163  0.0305870  -0.795   0.4266  
## B_avg_REV                    -0.0525628  0.8278034  -0.063   0.9494  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 295.80  on 215  degrees of freedom
## Residual deviance: 267.13  on 187  degrees of freedom
## AIC: 325.13
## 
## Number of Fisher Scoring iterations: 4

set.seed(123)
ufc_test_welter$PredProb = predict.glm(m1_welter, ufc_test_welter, type = 'response')

ufc_test_welter$Prediction = ifelse(ufc_test_welter$PredProb >= 0.5,1,0)

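# Note: confusionMatrix() expects (data = predictions, reference = actuals).
# The arguments here are in the opposite order, so in the output below the rows
# are the actual outcomes and the columns are the predictions. Accuracy is
# unaffected, but Sensitivity/Pos Pred Value and Specificity/Neg Pred Value trade places.
caret::confusionMatrix(as.factor(ufc_test_welter$win_dummy), as.factor(ufc_test_welter$Prediction), positive='1')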

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 32 14
##          1 23 23
##                                           
##                Accuracy : 0.5978          
##                  95% CI : (0.4904, 0.6988)
##     No Information Rate : 0.5978          
##     P-Value [Acc > NIR] : 0.5450          
##                                           
##                   Kappa : 0.1957          
##                                           
##  Mcnemar's Test P-Value : 0.1884          
##                                           
##             Sensitivity : 0.6216          
##             Specificity : 0.5818          
##          Pos Pred Value : 0.5000          
##          Neg Pred Value : 0.6957          
##              Prevalence : 0.4022          
##          Detection Rate : 0.2500          
##    Detection Prevalence : 0.5000          
##       Balanced Accuracy : 0.6017          
##                                           
##        'Positive' Class : 1               
##
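The call above passes the actual outcomes first and the predictions second, which is the reverse of what `confusionMatrix()` expects (predictions, then reference). The sketch below, in base R, re-orients the printed counts and recomputes the per-class statistics; the counts are copied from the output above and the variable names are illustrative.

```r
# Counts copied from the output above, where the call passed (actual, predicted):
# rows are therefore actual outcomes and columns are predictions.
as_printed <- matrix(c(32, 23, 14, 23), nrow = 2,
                     dimnames = list(Actual = c("0", "1"), Predicted = c("0", "1")))

# Transpose so rows are predictions and columns are the reference (actual) values
cm <- t(as_printed)

sensitivity <- cm["1", "1"] / sum(cm[, "1"])  # 23 / 46
specificity <- cm["0", "0"] / sum(cm[, "0"])  # 32 / 46
precision   <- cm["1", "1"] / sum(cm["1", ])  # 23 / 37
accuracy    <- sum(diag(cm)) / sum(cm)        # 55 / 92, unchanged by the transpose

round(c(sensitivity = sensitivity, specificity = specificity,
        precision = precision, accuracy = accuracy), 4)
```

With the corrected orientation, the value printed above as Sensitivity (0.6216) is the precision, and the value printed as Pos Pred Value (0.5000) is the sensitivity.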

SVM Model Welter

set.seed(123)
svm.idx_welter <- sample(1:nrow(ufc_2), size = 0.70 * nrow(ufc_2))
train_svm_data_welter <- ufc_2[svm.idx_welter,]
test_svm_data_welter <- ufc_2[-svm.idx_welter,]

# Set up Repeated k-fold Cross Validation
train_control <- trainControl(method="repeatedcv", number=10, repeats=3)

set.seed(123)
svm_welter = svm(formula = win_dummy ~ R_avg_TD_att + R_avg_opp_TD_pct + R_avg_SIG_STR_pct + R_avg_opp_SIG_STR_pct + R_avg_KD + R_avg_opp_KD + 
R_avg_SUB_ATT +  R_avg_CTRL_time.seconds. + 
R_avg_opp_CTRL_time.seconds. + R_avg_CLINCH_landed +
R_age + R_Reach_cms + R_avg_REV + R_avg_opp_GROUND_landed +

B_avg_opp_GROUND_landed +
B_avg_TD_att +
B_avg_opp_TD_pct +
B_avg_SIG_STR_pct +
B_avg_opp_SIG_STR_pct +
B_avg_KD +
B_avg_opp_KD + 
B_avg_SUB_ATT +
B_avg_CTRL_time.seconds.+
B_avg_opp_CTRL_time.seconds.+



B_avg_CLINCH_landed +
B_avg_opp_CLINCH_landed +
B_age + 
B_Reach_cms +
B_avg_REV,
                             data = train_svm_data,
                  trControl = train_control,
                 type = 'C-classification',
                
                 kernel = 'linear',
                  preProcess = c("center","scale"),
                    tuneGrid = expand.grid(C = seq(0, 2, length = 20)))

summary(svm_welter)

## 
## Call:
## svm(formula = win_dummy ~ R_avg_TD_att + R_avg_opp_TD_pct + R_avg_SIG_STR_pct + 
##     R_avg_opp_SIG_STR_pct + R_avg_KD + R_avg_opp_KD + R_avg_SUB_ATT + 
##     R_avg_CTRL_time.seconds. + R_avg_opp_CTRL_time.seconds. + R_avg_CLINCH_landed + 
##     R_age + R_Reach_cms + R_avg_REV + R_avg_opp_GROUND_landed + B_avg_opp_GROUND_landed + 
##     B_avg_TD_att + B_avg_opp_TD_pct + B_avg_SIG_STR_pct + B_avg_opp_SIG_STR_pct + 
##     B_avg_KD + B_avg_opp_KD + B_avg_SUB_ATT + B_avg_CTRL_time.seconds. + 
##     B_avg_opp_CTRL_time.seconds. + B_avg_CLINCH_landed + B_avg_opp_CLINCH_landed + 
##     B_age + B_Reach_cms + B_avg_REV, data = train_svm_data, trControl = train_control, 
##     type = "C-classification", kernel = "linear", preProcess = c("center", 
##         "scale"), tuneGrid = expand.grid(C = seq(0, 2, length = 20)))
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
## 
## Number of Support Vectors:  1056
## 
##  ( 529 527 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1

set.seed(123)
svm.pred_1_welter = predict(svm_welter,newdata = test_svm_data_welter)

caret::confusionMatrix(as.factor(svm.pred_1_welter), as.factor(test_svm_data_welter$win_dummy), positive = '1')

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 224 121
##          1 100 122
##                                           
##                Accuracy : 0.6102          
##                  95% CI : (0.5687, 0.6506)
##     No Information Rate : 0.5714          
##     P-Value [Acc > NIR] : 0.03363         
##                                           
##                   Kappa : 0.1955          
##                                           
##  Mcnemar's Test P-Value : 0.17851         
##                                           
##             Sensitivity : 0.5021          
##             Specificity : 0.6914          
##          Pos Pred Value : 0.5495          
##          Neg Pred Value : 0.6493          
##              Prevalence : 0.4286          
##          Detection Rate : 0.2152          
##    Detection Prevalence : 0.3915          
##       Balanced Accuracy : 0.5967          
##                                           
##        'Positive' Class : 1               
##

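The SVM output for Welterweight is identical to the all-weight-class SVM above (1056 support vectors, 61.02% accuracy), because the code samples from ufc_2 and trains on train_svm_data rather than on the Welterweight subset. It therefore cannot be compared with the Welterweight Logistic Regression model (59.78% accuracy) as a weight-class-specific result.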
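A Welterweight-only SVM would need to split the filtered data and train on that split. Here is a minimal sketch on simulated data (an assumption; only the column names `weight_class`, `R_age`, `B_age`, and `win_dummy` are borrowed from the analysis, and the object names are illustrative):

```r
library(e1071)

# Simulated data (assumption): stand-in for ufc_2 with two weight classes
set.seed(123)
n <- 300
sim <- data.frame(
  weight_class = sample(c("Welterweight", "Lightweight"), n, replace = TRUE),
  R_age = rnorm(n, 30, 4),
  B_age = rnorm(n, 29, 4)
)
sim$win_dummy <- factor(ifelse(sim$B_age - sim$R_age + rnorm(n, sd = 4) > 0, 1, 0))

# Filter first, then split the filtered rows
welter <- sim[sim$weight_class == "Welterweight", ]
idx <- sample(seq_len(nrow(welter)), size = round(0.7 * nrow(welter)))
train_welter <- welter[idx, ]
test_welter  <- welter[-idx, ]

# Train on the Welterweight training rows only
svm_welter_demo <- svm(win_dummy ~ R_age + B_age, data = train_welter,
                       type = "C-classification", kernel = "linear")

pred <- predict(svm_welter_demo, newdata = test_welter)
table(Prediction = pred, Reference = test_welter$win_dummy)
```

In the real analysis the equivalent change would be to sample from `ufc_welter_weight` and pass the resulting training split as `data`.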

Lightweight Model

Logistic Regression with Lightweight data only

Lightweight Logistic Model

ufc_light_weight <- filter(ufc_2, weight_class == 'Lightweight')

set.seed(123)
sample_light <- sample.int(n = nrow(ufc_light_weight), size = round(.7*nrow(ufc_light_weight)), replace = F)
ufc_train_light <- ufc_light_weight[sample_light, ]
ufc_test_light  <- ufc_light_weight[-sample_light, ]

set.seed(123)
m1_light = glm(win_dummy ~  R_avg_TD_att  + R_avg_SIG_STR_pct + R_avg_opp_SIG_STR_pct + R_avg_KD + R_avg_opp_KD + 
R_avg_SUB_ATT +  R_avg_CTRL_time.seconds. + 
R_avg_opp_CTRL_time.seconds. + R_avg_CLINCH_landed +
R_age + R_Reach_cms + R_avg_REV + R_avg_opp_GROUND_landed +

B_avg_opp_GROUND_landed +
B_avg_TD_att +
B_avg_opp_TD_pct +
B_avg_SIG_STR_pct +
B_avg_opp_SIG_STR_pct +
B_avg_KD +
B_avg_opp_KD + 
B_avg_SUB_ATT +
B_avg_CTRL_time.seconds.+
B_avg_opp_CTRL_time.seconds.+



B_avg_CLINCH_landed +
B_avg_opp_CLINCH_landed +
B_age + 
B_Reach_cms +
B_avg_REV
               
               , data = ufc_train_light, family = binomial)

summary(m1_light)

## 
## Call:
## glm(formula = win_dummy ~ R_avg_TD_att + R_avg_SIG_STR_pct + 
##     R_avg_opp_SIG_STR_pct + R_avg_KD + R_avg_opp_KD + R_avg_SUB_ATT + 
##     R_avg_CTRL_time.seconds. + R_avg_opp_CTRL_time.seconds. + 
##     R_avg_CLINCH_landed + R_age + R_Reach_cms + R_avg_REV + R_avg_opp_GROUND_landed + 
##     B_avg_opp_GROUND_landed + B_avg_TD_att + B_avg_opp_TD_pct + 
##     B_avg_SIG_STR_pct + B_avg_opp_SIG_STR_pct + B_avg_KD + B_avg_opp_KD + 
##     B_avg_SUB_ATT + B_avg_CTRL_time.seconds. + B_avg_opp_CTRL_time.seconds. + 
##     B_avg_CLINCH_landed + B_avg_opp_CLINCH_landed + B_age + B_Reach_cms + 
##     B_avg_REV, family = binomial, data = ufc_train_light)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2711  -0.8706  -0.2447   0.8901   2.2942  
## 
## Coefficients:
##                                Estimate Std. Error z value Pr(>|z|)   
## (Intercept)                  -1.333e+01  1.023e+01  -1.303  0.19274   
## R_avg_TD_att                 -3.242e-02  9.705e-02  -0.334  0.73836   
## R_avg_SIG_STR_pct            -5.380e-01  2.088e+00  -0.258  0.79667   
## R_avg_opp_SIG_STR_pct         6.156e+00  2.209e+00   2.787  0.00532 **
## R_avg_KD                     -1.075e+00  7.169e-01  -1.499  0.13384   
## R_avg_opp_KD                 -2.511e-01  7.068e-01  -0.355  0.72239   
## R_avg_SUB_ATT                -8.658e-01  4.294e-01  -2.016  0.04377 * 
## R_avg_CTRL_time.seconds.     -2.404e-03  2.319e-03  -1.037  0.29982   
## R_avg_opp_CTRL_time.seconds.  3.240e-03  2.756e-03   1.175  0.23980   
## R_avg_CLINCH_landed           3.786e-02  4.835e-02   0.783  0.43354   
## R_age                         1.447e-01  5.250e-02   2.755  0.00586 **
## R_Reach_cms                   8.371e-04  3.376e-02   0.025  0.98022   
## R_avg_REV                    -8.152e-01  8.426e-01  -0.967  0.33331   
## R_avg_opp_GROUND_landed      -4.754e-02  3.807e-02  -1.249  0.21169   
## B_avg_opp_GROUND_landed      -3.628e-02  3.250e-02  -1.116  0.26426   
## B_avg_TD_att                  7.334e-02  9.808e-02   0.748  0.45463   
## B_avg_opp_TD_pct             -2.348e+00  1.016e+00  -2.310  0.02087 * 
## B_avg_SIG_STR_pct             4.350e-02  1.774e+00   0.025  0.98043   
## B_avg_opp_SIG_STR_pct         2.050e-01  1.815e+00   0.113  0.91007   
## B_avg_KD                     -5.188e-01  5.698e-01  -0.910  0.36258   
## B_avg_opp_KD                 -6.366e-01  6.055e-01  -1.051  0.29304   
## B_avg_SUB_ATT                 3.843e-01  4.579e-01   0.839  0.40124   
## B_avg_CTRL_time.seconds.      1.851e-03  2.244e-03   0.825  0.40953   
## B_avg_opp_CTRL_time.seconds. -3.875e-04  2.116e-03  -0.183  0.85470   
## B_avg_CLINCH_landed          -4.548e-02  4.401e-02  -1.033  0.30142   
## B_avg_opp_CLINCH_landed       3.478e-02  4.696e-02   0.741  0.45891   
## B_age                        -9.593e-02  5.655e-02  -1.696  0.08981 . 
## B_Reach_cms                   5.741e-02  3.651e-02   1.572  0.11589   
## B_avg_REV                    -3.998e-01  8.045e-01  -0.497  0.61924   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 271.69  on 195  degrees of freedom
## Residual deviance: 212.11  on 167  degrees of freedom
## AIC: 270.11
## 
## Number of Fisher Scoring iterations: 4

set.seed(123)
ufc_test_light$PredProb = predict.glm(m1_light, ufc_test_light, type = 'response')

ufc_test_light$Prediction = ifelse(ufc_test_light$PredProb >= 0.5,1,0)

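# Note: arguments are passed as (actual, predicted); confusionMatrix() expects
# (predicted, actual), so rows below are actual outcomes and columns are predictions.
caret::confusionMatrix(as.factor(ufc_test_light$win_dummy), as.factor(ufc_test_light$Prediction), positive='1')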

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 27 22
##          1 16 19
##                                           
##                Accuracy : 0.5476          
##                  95% CI : (0.4352, 0.6566)
##     No Information Rate : 0.5119          
##     P-Value [Acc > NIR] : 0.2930          
##                                           
##                   Kappa : 0.0916          
##                                           
##  Mcnemar's Test P-Value : 0.4173          
##                                           
##             Sensitivity : 0.4634          
##             Specificity : 0.6279          
##          Pos Pred Value : 0.5429          
##          Neg Pred Value : 0.5510          
##              Prevalence : 0.4881          
##          Detection Rate : 0.2262          
##    Detection Prevalence : 0.4167          
##       Balanced Accuracy : 0.5457          
##                                           
##        'Positive' Class : 1               
##

Heavyweight Model.

Logistic Regression with Heavyweight data only.

ufc_heavy_weight <- filter(ufc_2, weight_class == 'Heavyweight')

set.seed(123)
sample_heavy <- sample.int(n = nrow(ufc_heavy_weight), size = round(.7*nrow(ufc_heavy_weight)), replace = F)
ufc_train_heavy <- ufc_heavy_weight[sample_heavy, ]
ufc_test_heavy  <- ufc_heavy_weight[-sample_heavy, ]

set.seed(123)
m1_heavy = glm(win_dummy ~  R_avg_TD_att  + R_avg_SIG_STR_pct + R_avg_opp_SIG_STR_pct + R_avg_KD + R_avg_opp_KD + 
R_avg_SUB_ATT +  R_avg_CTRL_time.seconds. + 
R_avg_opp_CTRL_time.seconds. + R_avg_CLINCH_landed +
R_age + R_Reach_cms + R_avg_REV + R_avg_opp_GROUND_landed +

B_avg_opp_GROUND_landed +
B_avg_TD_att +
B_avg_opp_TD_pct +
B_avg_SIG_STR_pct +
B_avg_opp_SIG_STR_pct +
B_avg_KD +
B_avg_opp_KD + 
B_avg_SUB_ATT +
B_avg_CTRL_time.seconds.+
B_avg_opp_CTRL_time.seconds.+



B_avg_CLINCH_landed +
B_avg_opp_CLINCH_landed +
B_age + 
B_Reach_cms +
B_avg_REV
               
               , data = ufc_train_heavy, family = binomial)

summary(m1_heavy)

## 
## Call:
## glm(formula = win_dummy ~ R_avg_TD_att + R_avg_SIG_STR_pct + 
##     R_avg_opp_SIG_STR_pct + R_avg_KD + R_avg_opp_KD + R_avg_SUB_ATT + 
##     R_avg_CTRL_time.seconds. + R_avg_opp_CTRL_time.seconds. + 
##     R_avg_CLINCH_landed + R_age + R_Reach_cms + R_avg_REV + R_avg_opp_GROUND_landed + 
##     B_avg_opp_GROUND_landed + B_avg_TD_att + B_avg_opp_TD_pct + 
##     B_avg_SIG_STR_pct + B_avg_opp_SIG_STR_pct + B_avg_KD + B_avg_opp_KD + 
##     B_avg_SUB_ATT + B_avg_CTRL_time.seconds. + B_avg_opp_CTRL_time.seconds. + 
##     B_avg_CLINCH_landed + B_avg_opp_CLINCH_landed + B_age + B_Reach_cms + 
##     B_avg_REV, family = binomial, data = ufc_train_heavy)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8910  -0.8163  -0.2096   0.8473   2.5762  
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)   
## (Intercept)                   1.233829  13.715522   0.090  0.92832   
## R_avg_TD_att                 -0.042351   0.208952  -0.203  0.83938   
## R_avg_SIG_STR_pct            -3.670197   2.939163  -1.249  0.21177   
## R_avg_opp_SIG_STR_pct         1.327193   2.771365   0.479  0.63201   
## R_avg_KD                     -2.208979   1.303984  -1.694  0.09026 . 
## R_avg_opp_KD                  1.523529   1.287393   1.183  0.23664   
## R_avg_SUB_ATT                 1.090745   0.887356   1.229  0.21899   
## R_avg_CTRL_time.seconds.     -0.000126   0.004202  -0.030  0.97608   
## R_avg_opp_CTRL_time.seconds. -0.002226   0.003528  -0.631  0.52807   
## R_avg_CLINCH_landed           0.024421   0.048826   0.500  0.61696   
## R_age                         0.032069   0.084399   0.380  0.70397   
## R_Reach_cms                  -0.064619   0.046894  -1.378  0.16821   
## R_avg_REV                     4.155635   3.451160   1.204  0.22854   
## R_avg_opp_GROUND_landed       0.035214   0.079738   0.442  0.65876   
## B_avg_opp_GROUND_landed      -0.096923   0.068530  -1.414  0.15727   
## B_avg_TD_att                  0.702107   0.280481   2.503  0.01231 * 
## B_avg_opp_TD_pct              0.856574   1.495334   0.573  0.56676   
## B_avg_SIG_STR_pct             9.883988   4.132554   2.392  0.01677 * 
## B_avg_opp_SIG_STR_pct         9.992761   3.904796   2.559  0.01049 * 
## B_avg_KD                     -0.789811   0.991972  -0.796  0.42591   
## B_avg_opp_KD                 -1.566898   1.345544  -1.165  0.24422   
## B_avg_SUB_ATT                 0.383411   1.304997   0.294  0.76891   
## B_avg_CTRL_time.seconds.     -0.020603   0.007057  -2.919  0.00351 **
## B_avg_opp_CTRL_time.seconds.  0.007813   0.005055   1.546  0.12222   
## B_avg_CLINCH_landed           0.224601   0.100063   2.245  0.02479 * 
## B_avg_opp_CLINCH_landed      -0.242008   0.099980  -2.421  0.01550 * 
## B_age                        -0.036376   0.072053  -0.505  0.61367   
## B_Reach_cms                   0.019278   0.045433   0.424  0.67133   
## B_avg_REV                    -1.361957   1.885133  -0.722  0.47000   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 141.049  on 101  degrees of freedom
## Residual deviance:  98.066  on  73  degrees of freedom
## AIC: 156.07
## 
## Number of Fisher Scoring iterations: 6

set.seed(123)
ufc_test_heavy$PredProb = predict.glm(m1_heavy, ufc_test_heavy, type = 'response')

ufc_test_heavy$Prediction = ifelse(ufc_test_heavy$PredProb >= 0.5,1,0)

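# Note: arguments are passed as (actual, predicted); confusionMatrix() expects
# (predicted, actual), so rows below are actual outcomes and columns are predictions.
caret::confusionMatrix(as.factor(ufc_test_heavy$win_dummy), as.factor(ufc_test_heavy$Prediction), positive='1')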

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 11 12
##          1 13  8
##                                           
##                Accuracy : 0.4318          
##                  95% CI : (0.2835, 0.5897)
##     No Information Rate : 0.5455          
##     P-Value [Acc > NIR] : 0.9518          
##                                           
##                   Kappa : -0.1411         
##                                           
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.4000          
##             Specificity : 0.4583          
##          Pos Pred Value : 0.3810          
##          Neg Pred Value : 0.4783          
##              Prevalence : 0.4545          
##          Detection Rate : 0.1818          
##    Detection Prevalence : 0.4773          
##       Balanced Accuracy : 0.4292          
##                                           
##        'Positive' Class : 1               
##

Our Heavyweight model has poor accuracy (43.2%, below the no-information rate of 54.5%). With only 102 training fights and 28 predictors, the model is likely overfit. Selecting predictors that are more strongly associated with outcomes in the Heavyweight class might improve it.
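One way to explore a smaller set of predictors is backward elimination by AIC with `step()` from base R. The sketch below uses simulated data (an assumption: 100 fights and six candidate predictors named `x1` to `x6`, two of which are informative) rather than the Heavyweight data, and stepwise selection is only one of several possible approaches.

```r
# Simulated data (assumption): 100 fights, 6 candidate predictors, 2 informative
set.seed(123)
n <- 100
sim <- as.data.frame(matrix(rnorm(n * 6), ncol = 6))
names(sim) <- paste0("x", 1:6)
sim$win_dummy <- rbinom(n, 1, plogis(1.2 * sim$x1 - 1.0 * sim$x2))

full_model <- glm(win_dummy ~ ., data = sim, family = binomial)

# Backward elimination by AIC; trace = 0 suppresses the step-by-step log
reduced_model <- step(full_model, direction = "backward", trace = 0)

formula(reduced_model)
c(full = AIC(full_model), reduced = AIC(reduced_model))
```

Any reduced model should still be judged on held-out fights, since a lower AIC on the training data does not guarantee better test accuracy.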

Conclusion - We've shown that it may be possible to build an effective model to predict the winner of a UFC fight. The models could be improved further by building weight-class-specific models and choosing independent variables that are strongly associated with outcomes in each weight class. Predicting the winner of a UFC fight is likely very difficult because of the unpredictability of fighters' decision-making, the amount of preparation for a fight, and the sport in general. But we've shown the potential to make better-educated predictions of UFC fight outcomes.

Recommendations are to continue experimenting with variable selection for each weight-class model and to try different kernels for the SVM models. Another suggestion is to create new variables that grade fighters' abilities.
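As a starting point for the kernel comparison, e1071's `svm()` can report k-fold cross-validated accuracy through its `cross` argument. The sketch below compares three kernels on simulated data with a non-linear class boundary (an assumption, not the UFC data); `cv_accuracy` is an illustrative name.

```r
library(e1071)

# Simulated data (assumption): classes separated by a roughly circular boundary
set.seed(123)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- factor(ifelse(sim$x1^2 + sim$x2^2 + rnorm(n, sd = 0.5) > 1.5, 1, 0))

kernels <- c("linear", "polynomial", "radial")

# tot.accuracy is the overall k-fold cross-validated accuracy (in percent)
cv_accuracy <- sapply(kernels, function(k) {
  fit <- svm(y ~ ., data = sim, type = "C-classification", kernel = k, cross = 5)
  fit$tot.accuracy
})
cv_accuracy
```

On the UFC data the same loop could be run per weight class, with the kernel chosen on cross-validated accuracy before the test set is touched.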

UFC EDA and Model Predictions

Logan Henslee

8/2/2022