My final project attempts to predict the outcome of fights in the Ultimate Fighting Championship (UFC) using historical data.
The UFC is an individual sport, which makes it hard to predict for several reasons:
Time spent fighting. Very active fighters will fight three times a year. Championship fights last at most five 5-minute rounds, so even if all three of those fights are championship fights, that's only 75 minutes of fighting a year, and most fighters fight well under that. For comparison, just last season NBA star James Harden played 2,867 minutes and NFL star quarterback Andrew Luck played 1,120 snaps. The minimal time spent fighting, and therefore accumulating data, gives us less to base our predictions on.
Puncher's chance. MMA is unlike boxing in that boxers wear large padded gloves ranging from 8 to 16 ounces, while MMA gloves are only 4 ounces and there is no padding on the elbows, knees, or legs, all of which can also be used as weapons. A seemingly superior fighter could be winning a fight for 24 minutes, then make one small mistake in the very last minute of the very last round and lose the whole fight.
Originally, I planned on scraping several websites to gather the data I needed, but in my research I stumbled upon someone who had attempted a similar project in Python. Jason Chan Jin An created his ‘UFC MMA Predictor Webapp’ at https://ufcmmapredictor.herokuapp.com/
Using Python, he was able to predict the outcome of fights with 70.4% accuracy. Using the same data he scraped, I will give a brief run-down of the data and the steps Jason took to clean and compile it before I begin my effort to beat his 70.4% accuracy.
Jason scraped historical fight data from http://www.fightmetric.com/statistics/events/completed
He scraped the betting odds from http://www.betmma.tips/mma_betting_favorites_vs_underdogs.php
Jason then merged the two datasets into a single dataframe that records whether the favourite or the underdog won each fight and what the betting odds were before the fight started.
# The final dataset included the following columns:
#
# Label - This is the response variable. Either Favourite or Underdog will win
# REACH - Fighter's reach (probably the least important feature)
# SLPM - Significant Strikes Landed per Minute
# STRA. - Significant Striking Accuracy
# SAPM - Significant Strikes Absorbed per Minute
# STRD - Significant Strike Defence (the % of opponents strikes that did not land)
# TD - Average Takedowns Landed per 15 minutes
# TDA - Takedown Accuracy
# TDD - Takedown Defense (the % of opponents TD attempts that did not land)
# SUBA - Average Submissions Attempted per 15 minutes
# Odds - Fighter's decimal odds spread for that specific matchup
The major point to take away from how the data is structured is that a positive delta means the favourite has the advantage in that statistic, while a negative delta means the underdog has the advantage.
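To make this convention concrete, here is a minimal sketch of how such deltas could be computed from per-fighter statistics. This is an illustration only, not Jason's actual merging code; the favourite_stats and underdog_stats dataframes and their columns are hypothetical.
library(dplyr)
# Hypothetical per-fighter stat tables, joined per fight and then differenced
# as favourite minus underdog to produce the *_delta columns
fight_deltas <- favourite_stats %>%
  inner_join(underdog_stats, by = "Events", suffix = c("_fav", "_dog")) %>%
  mutate(REACH_delta = REACH_fav - REACH_dog,
         SLPM_delta  = SLPM_fav  - SLPM_dog)
  # ...and likewise for the remaining statistics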
Jason does not cover all the data cleansing he performed (if any), but I will summarize the data and perform any additional cleaning before conducting exploratory research and then, finally, attempting to predict UFC fights at a rate of 71% or higher.
library(mlbench)
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
library(caret)
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(corrplot)
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
library(Metrics) #For ce() function
##
## Attaching package: 'Metrics'
## The following object is masked from 'package:MicrosoftML':
##
## logLoss
library(rpart)
library(rpart.plot)
library(dplyr)
library(hrbrthemes)
library(knitr)
library(rmarkdown)
library(readxl)
UFC <- read_excel("C:/Users/Trevor/OneDrive/School/Data Analysis 530/Datasets/UFC.xlsx")
summary(UFC)
## Events Favourite Underdog
## Length:1315 Length:1315 Length:1315
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## Label REACH_delta SLPM_delta SAPM_delta
## Length:1315 Min. :-10.000 Min. :-6.0200 Min. :-10.5600
## Class :character 1st Qu.: -2.000 1st Qu.:-0.6200 1st Qu.: -1.1100
## Mode :character Median : 0.000 Median : 0.3000 Median : -0.2000
## Mean : 0.219 Mean : 0.2642 Mean : -0.3021
## 3rd Qu.: 2.000 3rd Qu.: 1.1750 3rd Qu.: 0.6050
## Max. : 12.000 Max. : 7.4800 Max. : 7.9100
## STRA_delta STRD_delta TD_delta
## Min. :-0.49000 Min. :-0.35000 Min. :-10.7500
## 1st Qu.:-0.06000 1st Qu.:-0.05000 1st Qu.: -0.9000
## Median : 0.01000 Median : 0.02000 Median : 0.2400
## Mean : 0.01334 Mean : 0.01789 Mean : 0.2809
## 3rd Qu.: 0.08000 3rd Qu.: 0.09000 3rd Qu.: 1.3650
## Max. : 0.57000 Max. : 0.42000 Max. : 9.7800
## TDA_delta TDD_delta SUBA_delta
## Min. :-1.00000 Min. :-1.00000 Min. :-12.1000
## 1st Qu.:-0.12000 1st Qu.:-0.12000 1st Qu.: -0.5000
## Median : 0.04000 Median : 0.04000 Median : 0.0000
## Mean : 0.05226 Mean : 0.05575 Mean : 0.1032
## 3rd Qu.: 0.22500 3rd Qu.: 0.23000 3rd Qu.: 0.6000
## Max. : 1.00000 Max. : 1.00000 Max. : 11.8000
## Odds_delta Sum_delta
## Min. :-12.9500 Min. :-9.790
## 1st Qu.: -1.9400 1st Qu.: 2.495
## Median : -0.7800 Median : 4.040
## Mean : -0.8598 Mean : 4.133
## 3rd Qu.: 0.5750 3rd Qu.: 5.535
## Max. : 7.3800 Max. :21.490
str(UFC)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1315 obs. of 15 variables:
## $ Events : chr "UFC 159 - Jones vs. Sonnen" "UFC 159 - Jones vs. Sonnen" "UFC Fight Night 34 - Saffiedine vs. Lim" "UFC Fight Night 91 - McDonald vs. Lineker" ...
## $ Favourite : chr "Jon Jones" "Leonard Garcia" "Mairbek Taisumov" "Cody Pfister" ...
## $ Underdog : chr "Chael Sonnen" "Cody McKenzie" "Tae Hyun Bang" "Scott Holtzman" ...
## $ Label : chr "Favourite" "Underdog" "Favourite" "Underdog" ...
## $ REACH_delta: num 10 -3 2 4 2 2 2 -10 0 -3 ...
## $ SLPM_delta : num 1.17 1.03 0.54 -3.15 0.02 0.6 0.39 -1.57 0.76 0.73 ...
## $ SAPM_delta : num 0.9 2.29 0.08 -0.85 0.86 ...
## $ STRA_delta : num 0.12 -0.1 0.05 -0.24 0.13 -0.17 0.08 -0.06 0.09 0.16 ...
## $ STRD_delta : num 0.03 -0.15 -0.05 -0.06 -0.06 ...
## $ TD_delta : num -1.56 -2.2 1.75 0.55 -0.08 ...
## $ TDA_delta : num -0.07 0.01 0.44 -0.27 0.51 -0.34 0.11 0 0.4 0 ...
## $ TDD_delta : num 0.28 0.28 0.28 -0.58 0.37 0.05 0.15 0.27 0.17 -0.16 ...
## $ SUBA_delta : num 0.2 -2 -0.5 -0.4 -0.5 -0.2 -0.1 -0.7 -1 0 ...
## $ Odds_delta : num -7.87 1.4 -2.89 6.89 0.81 -1.75 -1.27 0.71 0.05 0.72 ...
## $ Sum_delta : num 2.6 2 5.04 -0.86 3.18 1.03 3.3 1.98 5.61 5.35 ...
dim(UFC)
## [1] 1315 15
Check the number of unique values in each column:
uniques <- lapply(UFC, function(x) length(unique(x)))
uniques2 <- uniques %>%
as.data.frame
uniques3 <- gather(uniques2, key="feature", value="unique_values")
ggplot(uniques3, aes(x=reorder(feature,-unique_values),y=unique_values)) +
geom_bar(stat="identity",fill="red") +
geom_text(aes( label = unique_values, hjust = -.1)) +
coord_flip()+
theme_bw()
Check for missing values. This dataset has no missing values, so we can proceed with our investigation.
missing_values <- UFC %>%
summarize_all(funs(sum(is.na(.))/n()))
missing_values
## # A tibble: 1 x 15
## Events Favourite Underdog Label REACH_delta SLPM_delta SAPM_delta
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 0 0 0 0 0 0
## # ... with 8 more variables: STRA_delta <dbl>, STRD_delta <dbl>,
## # TD_delta <dbl>, TDA_delta <dbl>, TDD_delta <dbl>, SUBA_delta <dbl>,
## # Odds_delta <dbl>, Sum_delta <dbl>
Compare the different variables to see how the deltas affect the winners.
# Idea taken from https://ggplot2.tidyverse.org/reference/geom_histogram.html
UFClong <- reshape2::melt(UFC)
## Using Events, Favourite, Underdog, Label as id variables
ggplot(UFClong, aes(value)) +
facet_wrap(~variable, scales = 'free_x', ncol = 3, shrink = FALSE) +
geom_histogram(binwidth = function(x) 2 * IQR(x) / (length(x)^(1/3)), alpha=.4, position="dodge", aes(color=Label, fill=Label ) )
For this graph, either I don't understand the Odds_delta variable or the data is incorrect.
It seems as though any negative value of Odds_delta would correspond to a win for the underdog and any positive value to a win for the favourite, but the graph does not show that happening. One possible explanation is that Odds_delta is simply the difference in the two fighters' decimal odds, which would make it mostly negative by construction (the favourite's decimal odds are lower) rather than an indicator of who won.
ggplot(UFC, aes(x=Odds_delta, fill=Label)) +
geom_histogram(bins=100, color="grey", alpha=0.4, position = 'identity') +
scale_fill_manual(values=c("blue", "green")) +
labs(fill="")
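One quick way to check that intuition numerically (a sketch added here for clarity, not part of Jason's workflow) is to cross-tabulate the label against the sign of Odds_delta:
# Count how often each label occurs with negative vs. non-negative Odds_delta
UFC %>%
  mutate(odds_sign = if_else(Odds_delta < 0, "negative", "non-negative")) %>%
  count(Label, odds_sign)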
ggplot(UFC, aes(x=Sum_delta, fill=Label)) +
geom_freqpoly(bins=100,aes(color=Label))
set.seed(123)
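# 80/20 split; note that createDataPartition stratifies on Odds_delta here rather than on the class label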
training.samples <- UFC$Odds_delta %>%
createDataPartition(p = 0.8, list = FALSE)
train <- UFC[training.samples, ]
test <- UFC[-training.samples, ]
dim(train)
## [1] 1054 15
train_UFC <- train %>%
mutate(Label_Num = case_when(Label=="Favourite" ~ 1,
Label=="Underdog" ~ 0))
corr_table <- train %>%
select(-Favourite, -Underdog, -Label) %>%
select_if(is.numeric) %>%
cor(use="everything") %>%
corrplot(method = "square")
train_model <- rpart(formula = Label ~.,
data = train,
method = "class")
We're unable to use predict() here because the test set contains factor levels (event and fighter names) that the training set has never encountered. The names and events are not important for prediction, so I'm creating new dataframes without those columns.
train_noNames <- train %>%
select(-Favourite, -Underdog, -Events)
test_noNames <- test %>%
select(-Favourite, -Underdog, -Events)
train_model <- rpart(formula = Label ~.,
data = train_noNames,
method = "class")
# Generate predicted classes using the model object
class_prediction <- predict(object = train_model,
newdata = test_noNames,
type = "class")
# Calculate the confusion matrix for the test set
confusionMatrix(data = class_prediction,
reference = test$Label)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Favourite Underdog
## Favourite 121 55
## Underdog 45 40
##
## Accuracy : 0.6169
## 95% CI : (0.5549, 0.6761)
## No Information Rate : 0.636
## P-Value [Acc > NIR] : 0.7613
##
## Kappa : 0.1534
## Mcnemar's Test P-Value : 0.3681
##
## Sensitivity : 0.7289
## Specificity : 0.4211
## Pos Pred Value : 0.6875
## Neg Pred Value : 0.4706
## Prevalence : 0.6360
## Detection Rate : 0.4636
## Detection Prevalence : 0.6743
## Balanced Accuracy : 0.5750
##
## 'Positive' Class : Favourite
##
The confusion matrix shows that the decision tree achieves only 61.69% accuracy on the test set, which is actually below the 63.6% no-information rate we would get by simply always picking the favourite.
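For reference, that no-information baseline can be computed directly from the test labels (a small sketch):
# Accuracy of always predicting the favourite on the test set
mean(test$Label == "Favourite")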
UFC_model_gini <- rpart(formula = Label ~ .,
data = train_noNames,
method = "class",
parms = list(split = "gini"))
pred1 <- predict(object = UFC_model_gini,
newdata = test_noNames,
type = "class")
# classification error
ce(actual = test_noNames$Label,
predicted = pred1)
## [1] 0.3831418
UFC_model_info <- rpart(formula = Label ~ .,
data = train_noNames,
method = "class",
parms = list(split = "information"))
# Generate predictions on the validation set using the information model
pred2 <- predict(object = UFC_model_info,
newdata = test_noNames,
type = "class")
#Classification Error
ce(actual = test_noNames$Label,
predicted = pred2)
## [1] 0.3716475
Both splits have very similar classification error rates, but the information split is slightly lower. Its classification error of 0.3716 means roughly 37% of the test samples are misclassified (false positives and false negatives), or roughly a 63% accuracy rate.
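Since ce() returns the proportion of misclassified samples, the corresponding accuracy is simply its complement (a quick sketch):
# Test-set accuracy of the information-split tree
1 - ce(actual = test_noNames$Label, predicted = pred2)
mean(pred2 == test_noNames$Label)  # equivalent check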
rpart_model2 <- rpart(as.factor(Label) ~ .,
data = train_noNames,
cp = .02
)
rpart.plot(x = rpart_model2, type = 3)
This Rpart plot was one of the most valuable pieces of this whole analysis.
rpart_model2_df <- data.frame(importance = rpart_model2$variable.importance) %>%
rownames_to_column() %>%
rename("variable" = rowname)
rpart_model2_df <- rpart_model2_df %>%
arrange(importance) %>%
mutate(variable = factor(variable, levels = .$variable)) %>%
mutate_if(is.numeric, round, digits = 2)
ggplot(rpart_model2_df, aes(variable, importance)) +
geom_bar(stat="identity",fill="red") +
geom_text(aes( label = importance, hjust = -.1)) +
coord_flip()+
theme_bw()
The importance values generated from rpart put the decision tree information in a different frame of reference. Odds_delta is clearly the largest factor, with “strikes attempted per minute” having the second most important value and the sum of deltas the third. The second and third most important variables each drop off by roughly half from the next most important value.
These values will be important to consider for future parameter hyper-tuning.
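As a possible next step (sketched here but not run as part of this analysis), caret's train() can cross-validate the tree's complexity parameter cp, and the importance ranking above can guide which features to keep:
set.seed(123)
# Grid of candidate cp values for the classification tree
cp_grid <- expand.grid(cp = seq(0.001, 0.05, by = 0.002))
tuned_tree <- train(as.factor(Label) ~ .,
                    data = train_noNames,
                    method = "rpart",
                    metric = "Accuracy",
                    trControl = trainControl(method = "cv", number = 10),
                    tuneGrid = cp_grid)
tuned_tree$bestTune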
I was unable to achieve higher prediction accuracy than Jason did. Upon further research, he used a multilayer perceptron neural network to reach his high percentage. With additional parameter tuning I could potentially come close to or beat the 70.4% accuracy, but that will require additional analysis and time.
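For comparison, a single-hidden-layer network can be fit in R with the nnet package. This is only a rough sketch in that direction with arbitrary settings, not Jason's actual multilayer perceptron or its architecture:
library(nnet)
set.seed(123)
# Small single-hidden-layer network on the same predictors
# (size and decay are arbitrary illustrative choices)
nn_model <- nnet(as.factor(Label) ~ ., data = train_noNames,
                 size = 5, decay = 0.01, maxit = 500, trace = FALSE)
nn_pred <- predict(nn_model, newdata = test_noNames, type = "class")
mean(nn_pred == test_noNames$Label)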
# Jairzinho Rozenstruik def. Alistair Overeem F
# Marina Rodriguez vs. Cynthia Calvillo D
# Ben Rothwell def. Stefan Struve F
# Aspen Ladd def. Yana Kunitskaya F
# Cody Stamann vs. Song Yadong D
# Rob Font def. Ricky Simon F
# Tim Means def. Thiago Alves F
# Billy Quarantillo def. Jacob Kilburn F
# Bryce Mitchell def. Matt Sayles U (+110)
# Joe Solecki def. Matt Wiman F
# Virna Jandiroba def. Mallory Martin F
# Makhmud Muradov def. Trevor Smith F
# Favorites won 90% (9 of the 10 decided fights, with 2 draws)
# o More up to date data
# o Scrape additional data
# Fight camp
# missed weight
# Record
# Record (Last 3 or 5 fights)
# Current streak
# Speciality
# Location (Of event, training camp, hometown)
# Time since last fight
Again, much of this project was inspired and influenced by Jason Chan Jin An. His work can be found on his website (https://ufcmmapredictor.herokuapp.com/) and in his project notebook (https://github.com/jasonchanhku/UFC-MMA-Predictor/blob/master/UFC%20MMA%20Predictor%20Workflow.ipynb).