My final project attempts to predict the outcome of fights in the Ultimate Fighting Championship (UFC) using historical data.
The UFC is an individual sport, which makes it hard to predict for several reasons:
Time spent fighting. Very active fighters will fight three times a year. Championship fights last at most five 5-minute rounds, so even if all three of those fights are championship fights, that's only 75 minutes of fighting a year, and most fighters fight well under that. For comparison, just last season NBA star James Harden played 2,867 minutes and NFL star quarterback Andrew Luck played 1,120 snaps. The minimal time spent fighting, and therefore accumulating data, gives us less to base our predictions on.
Puncher's chance. MMA is unlike boxing in that boxers wear large padded gloves ranging from 8 to 16 ounces, while MMA gloves are only 4 ounces and there is no padding on the elbows, knees, or legs, all of which can also be used as weapons. A seemingly superior fighter could be winning a fight for 24 minutes, then make one small mistake in the very last minute of the very last round and lose the whole fight.
Originally, I planned on scraping several websites to gather the data I needed, but in my research I stumbled upon someone who had attempted a similar project in Python. Jason Chan Jin An created his ‘UFC MMA Predictor Webapp’ at https://ufcmmapredictor.herokuapp.com/
Using Python, he was able to predict the outcome of fights with 70.4% accuracy. Using the same data he scraped, I will give a brief run-down of the data and the steps Jason took to clean and compile it before I begin my effort to beat his 70.4% accuracy.
Jason scraped historical fight data from http://www.fightmetric.com/statistics/events/completed
He scraped the betting odds from http://www.betmma.tips/mma_betting_favorites_vs_underdogs.php
Jason then merged the two datasets into a single dataframe that records whether the favourite or the underdog won each fight and what the betting odds were before the fight started.
# The final dataset included the following columns:
#
# Label - This is the response variable. Either Favourite or Underdog will win
# REACH - Fighter's reach (probably the least important feature)
# SLPM - Significant Strikes Landed per Minute
# STRA. - Significant Striking Accuracy
# SAPM - Significant Strikes Absorbed per Minute
# STRD - Significant Strike Defence (the % of opponents strikes that did not land)
# TD - Average Takedowns Landed per 15 minutes
# TDA - Takedown Accuracy
# TDD - Takedown Defense (the % of opponents TD attempts that did not land)
# SUBA - Average Submissions Attempted per 15 minutes
# Odds - Fighter's decimal odds spread for that specific matchup
The major point to take away from how the data is structured is that a positive delta means the favourite has the advantage in that statistic, while a negative delta means the underdog has the advantage.
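To make this convention concrete, here is a minimal sketch of how such deltas could be computed from per-fighter statistics. This is an illustration only, not Jason's actual merging code; the favourite_stats and underdog_stats dataframes and their columns are hypothetical.
library(dplyr)
# Hypothetical per-fighter stat tables, joined per fight and then differenced
# as favourite minus underdog to produce the *_delta columns
fight_deltas <- favourite_stats %>%
  inner_join(underdog_stats, by = "Events", suffix = c("_fav", "_dog")) %>%
  mutate(REACH_delta = REACH_fav - REACH_dog,
         SLPM_delta  = SLPM_fav  - SLPM_dog)
  # ...and likewise for the remaining statistics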
Jason does not cover all the data cleansing he performed (if any), but I will summarize the data and perform any additional cleaning before conducting exploratory research and then, finally, attempting to predict UFC fights at a rate of 71% or higher.
library(mlbench)
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
library(caret)
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(corrplot)
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
library(Metrics) #For ce() function
##
## Attaching package: 'Metrics'
## The following object is masked from 'package:MicrosoftML':
##
## logLoss
library(rpart)
library(rpart.plot)
library(dplyr)
library(hrbrthemes)
library(knitr)
library(rmarkdown)
library(readxl)
UFC <- read_excel("C:/Users/Trevor/OneDrive/School/Data Analysis 530/Datasets/UFC.xlsx")
summary(UFC)
## Events Favourite Underdog
## Length:1315 Length:1315 Length:1315
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## Label REACH_delta SLPM_delta SAPM_delta
## Length:1315 Min. :-10.000 Min. :-6.0200 Min. :-10.5600
## Class :character 1st Qu.: -2.000 1st Qu.:-0.6200 1st Qu.: -1.1100
## Mode :character Median : 0.000 Median : 0.3000 Median : -0.2000
## Mean : 0.219 Mean : 0.2642 Mean : -0.3021
## 3rd Qu.: 2.000 3rd Qu.: 1.1750 3rd Qu.: 0.6050
## Max. : 12.000 Max. : 7.4800 Max. : 7.9100
## STRA_delta STRD_delta TD_delta
## Min. :-0.49000 Min. :-0.35000 Min. :-10.7500
## 1st Qu.:-0.06000 1st Qu.:-0.05000 1st Qu.: -0.9000
## Median : 0.01000 Median : 0.02000 Median : 0.2400
## Mean : 0.01334 Mean : 0.01789 Mean : 0.2809
## 3rd Qu.: 0.08000 3rd Qu.: 0.09000 3rd Qu.: 1.3650
## Max. : 0.57000 Max. : 0.42000 Max. : 9.7800
## TDA_delta TDD_delta SUBA_delta
## Min. :-1.00000 Min. :-1.00000 Min. :-12.1000
## 1st Qu.:-0.12000 1st Qu.:-0.12000 1st Qu.: -0.5000
## Median : 0.04000 Median : 0.04000 Median : 0.0000
## Mean : 0.05226 Mean : 0.05575 Mean : 0.1032
## 3rd Qu.: 0.22500 3rd Qu.: 0.23000 3rd Qu.: 0.6000
## Max. : 1.00000 Max. : 1.00000 Max. : 11.8000
## Odds_delta Sum_delta
## Min. :-12.9500 Min. :-9.790
## 1st Qu.: -1.9400 1st Qu.: 2.495
## Median : -0.7800 Median : 4.040
## Mean : -0.8598 Mean : 4.133
## 3rd Qu.: 0.5750 3rd Qu.: 5.535
## Max. : 7.3800 Max. :21.490
str(UFC)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1315 obs. of 15 variables:
## $ Events : chr "UFC 159 - Jones vs. Sonnen" "UFC 159 - Jones vs. Sonnen" "UFC Fight Night 34 - Saffiedine vs. Lim" "UFC Fight Night 91 - McDonald vs. Lineker" ...
## $ Favourite : chr "Jon Jones" "Leonard Garcia" "Mairbek Taisumov" "Cody Pfister" ...
## $ Underdog : chr "Chael Sonnen" "Cody McKenzie" "Tae Hyun Bang" "Scott Holtzman" ...
## $ Label : chr "Favourite" "Underdog" "Favourite" "Underdog" ...
## $ REACH_delta: num 10 -3 2 4 2 2 2 -10 0 -3 ...
## $ SLPM_delta : num 1.17 1.03 0.54 -3.15 0.02 0.6 0.39 -1.57 0.76 0.73 ...
## $ SAPM_delta : num 0.9 2.29 0.08 -0.85 0.86 ...
## $ STRA_delta : num 0.12 -0.1 0.05 -0.24 0.13 -0.17 0.08 -0.06 0.09 0.16 ...
## $ STRD_delta : num 0.03 -0.15 -0.05 -0.06 -0.06 ...
## $ TD_delta : num -1.56 -2.2 1.75 0.55 -0.08 ...
## $ TDA_delta : num -0.07 0.01 0.44 -0.27 0.51 -0.34 0.11 0 0.4 0 ...
## $ TDD_delta : num 0.28 0.28 0.28 -0.58 0.37 0.05 0.15 0.27 0.17 -0.16 ...
## $ SUBA_delta : num 0.2 -2 -0.5 -0.4 -0.5 -0.2 -0.1 -0.7 -1 0 ...
## $ Odds_delta : num -7.87 1.4 -2.89 6.89 0.81 -1.75 -1.27 0.71 0.05 0.72 ...
## $ Sum_delta : num 2.6 2 5.04 -0.86 3.18 1.03 3.3 1.98 5.61 5.35 ...
dim(UFC)
## [1] 1315 15
Check the number of unique values in each column:
uniques <- lapply(UFC, function(x) length(unique(x)))
uniques2 <- uniques %>%
as.data.frame
uniques3 <- gather(uniques2, key="feature", value="unique_values")
ggplot(uniques3, aes(x=reorder(feature,-unique_values),y=unique_values)) +
geom_bar(stat="identity",fill="red") +
geom_text(aes( label = unique_values, hjust = -.1)) +
coord_flip()+
theme_bw()
Check for missing values. This dataset has no missing values, so we can proceed with our investigation.
missing_values <- UFC %>%
summarize_all(funs(sum(is.na(.))/n()))
missing_values
## # A tibble: 1 x 15
## Events Favourite Underdog Label REACH_delta SLPM_delta SAPM_delta
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 0 0 0 0 0 0
## # ... with 8 more variables: STRA_delta <dbl>, STRD_delta <dbl>,
## # TD_delta <dbl>, TDA_delta <dbl>, TDD_delta <dbl>, SUBA_delta <dbl>,
## # Odds_delta <dbl>, Sum_delta <dbl>
Compare the different variables to see how the deltas affect the winners.
# Idea taken from https://ggplot2.tidyverse.org/reference/geom_histogram.html
UFClong <- reshape2::melt(UFC)
## Using Events, Favourite, Underdog, Label as id variables
ggplot(UFClong, aes(value)) +
facet_wrap(~variable, scales = 'free_x', ncol = 3, shrink = FALSE) +
geom_histogram(binwidth = function(x) 2 * IQR(x) / (length(x)^(1/3)), alpha=.4, position="dodge", aes(color=Label, fill=Label ) )
For this graph, either I don't understand the Odds_delta variable or the data is incorrect.
It seems as though any negative value of Odds_delta would correspond to a win for the underdog and any positive value to a win for the favourite, but the graph does not show that happening. One possible explanation is that Odds_delta is simply the difference in the two fighters' decimal odds, which would make it mostly negative by construction (the favourite's decimal odds are lower) rather than an indicator of who won.
ggplot(UFC, aes(x=Odds_delta, fill=Label)) +
geom_histogram(bins=100, color="grey", alpha=0.4, position = 'identity') +
scale_fill_manual(values=c("blue", "green")) +
labs(fill="")
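One quick way to check that intuition numerically (a sketch added here for clarity, not part of Jason's workflow) is to cross-tabulate the label against the sign of Odds_delta:
# Count how often each label occurs with negative vs. non-negative Odds_delta
UFC %>%
  mutate(odds_sign = if_else(Odds_delta < 0, "negative", "non-negative")) %>%
  count(Label, odds_sign)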
ggplot(UFC, aes(x=Sum_delta, fill=Label)) +
geom_freqpoly(bins=100,aes(color=Label))
set.seed(123)
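# 80/20 split; note that createDataPartition stratifies on Odds_delta here rather than on the class label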
training.samples <- UFC$Odds_delta %>%
createDataPartition(p = 0.8, list = FALSE)
train <- UFC[training.samples, ]
test <- UFC[-training.samples, ]
dim(train)
## [1] 1054 15
train_UFC <- train %>%
mutate(Label_Num = case_when(Label=="Favourite" ~ 1,
Label=="Underdog" ~ 0))
corr_table <- train %>%
select(-Favourite, -Underdog, -Label) %>%
select_if(is.numeric) %>%
cor(use="everything") %>%
corrplot(method = "square")
train_model <- rpart(formula = Label ~.,
data = train,
method = "class")
We're unable to use predict() here because the test set contains factor levels (event and fighter names) that the training set has never encountered. The names and events are not important for prediction, so I'm creating new dataframes without those columns.
train_noNames <- train %>%
select(-Favourite, -Underdog, -Events)
test_noNames <- test %>%
select(-Favourite, -Underdog, -Events)
train_model <- rpart(formula = Label ~.,
data = train_noNames,
method = "class")
# Generate predicted classes using the model object
class_prediction <- predict(object = train_model,
newdata = test_noNames,
type = "class")
# Calculate the confusion matrix for the test set
confusionMatrix(data = class_prediction,
reference = test$Label)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Favourite Underdog
## Favourite 121 55
## Underdog 45 40
##
## Accuracy : 0.6169
## 95% CI : (0.5549, 0.6761)
## No Information Rate : 0.636
## P-Value [Acc > NIR] : 0.7613
##
## Kappa : 0.1534
## Mcnemar's Test P-Value : 0.3681
##
## Sensitivity : 0.7289
## Specificity : 0.4211
## Pos Pred Value : 0.6875
## Neg Pred Value : 0.4706
## Prevalence : 0.6360
## Detection Rate : 0.4636
## Detection Prevalence : 0.6743
## Balanced Accuracy : 0.5750
##
## 'Positive' Class : Favourite
##
The confusion matrix shows that the decision tree achieves only 61.69% accuracy on the test set, which is actually below the 63.6% no-information rate we would get by simply always picking the favourite.
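For reference, that no-information baseline can be computed directly from the test labels (a small sketch):
# Accuracy of always predicting the favourite on the test set
mean(test$Label == "Favourite")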
UFC_model_gini <- rpart(formula = Label ~ .,
data = train_noNames,
method = "class",
parms = list(split = "gini"))
pred1 <- predict(object = UFC_model_gini,
newdata = test_noNames,
type = "class")
# classification error
ce(actual = test_noNames$Label,
predicted = pred1)
## [1] 0.3831418
UFC_model_info <- rpart(formula = Label ~ .,
data = train_noNames,
method = "class",
parms = list(split = "information"))
# Generate predictions on the validation set using the information model
pred2 <- predict(object = UFC_model_info,
newdata = test_noNames,
type = "class")
#Classification Error
ce(actual = test_noNames$Label,
predicted = pred2)
## [1] 0.3716475
Both splits have very similar classification error rates, but the information split is slightly lower. Its classification error of 0.3716 means roughly 37% of the test samples are misclassified (false positives and false negatives), or roughly a 63% accuracy rate.
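Since ce() returns the proportion of misclassified samples, the corresponding accuracy is simply its complement (a quick sketch):
# Test-set accuracy of the information-split tree
1 - ce(actual = test_noNames$Label, predicted = pred2)
mean(pred2 == test_noNames$Label)  # equivalent check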
rpart_model2 <- rpart(as.factor(Label) ~ .,
data = train_noNames,
cp = .02
)
rpart.plot(x = rpart_model2, type = 3)
This Rpart plot was one of the most valuable pieces of this whole analysis.
rpart_model2_df <- data.frame(importance = rpart_model2$variable.importance) %>%
rownames_to_column() %>%
rename("variable" = rowname)
rpart_model2_df <- rpart_model2_df %>%
arrange(importance) %>%
mutate(variable = factor(variable, levels = .$variable)) %>%
mutate_if(is.numeric, round, digits = 2)
ggplot(rpart_model2_df, aes(variable, importance)) +
geom_bar(stat="identity",fill="red") +
geom_text(aes( label = importance, hjust = -.1)) +
coord_flip()+
theme_bw()
The importance values generated from rpart put the decision tree information in a different frame of reference. Odds_delta is clearly the largest factor, with “strikes attempted per minute” having the second most important value and the sum of deltas the third. The second and third most important variables each drop off by roughly half from the next most important value.
These values will be important to consider for future parameter hyper-tuning.
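As a possible next step (sketched here but not run as part of this analysis), caret's train() can cross-validate the tree's complexity parameter cp, and the importance ranking above can guide which features to keep:
set.seed(123)
# Grid of candidate cp values for the classification tree
cp_grid <- expand.grid(cp = seq(0.001, 0.05, by = 0.002))
tuned_tree <- train(as.factor(Label) ~ .,
                    data = train_noNames,
                    method = "rpart",
                    metric = "Accuracy",
                    trControl = trainControl(method = "cv", number = 10),
                    tuneGrid = cp_grid)
tuned_tree$bestTune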
I was unable to achieve higher prediction accuracy than Jason did. Upon further research, he used a multilayer perceptron neural network to reach his high percentage. With additional parameter tuning I could potentially come close to or beat the 70.4% accuracy, but that will require additional analysis and time.
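For comparison, a single-hidden-layer network can be fit in R with the nnet package. This is only a rough sketch in that direction with arbitrary settings, not Jason's actual multilayer perceptron or its architecture:
library(nnet)
set.seed(123)
# Small single-hidden-layer network on the same predictors
# (size and decay are arbitrary illustrative choices)
nn_model <- nnet(as.factor(Label) ~ ., data = train_noNames,
                 size = 5, decay = 0.01, maxit = 500, trace = FALSE)
nn_pred <- predict(nn_model, newdata = test_noNames, type = "class")
mean(nn_pred == test_noNames$Label)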
# Jairzinho Rozenstruik def. Alistair Overeem F
# Marina Rodriguez vs. Cynthia Calvillo D
# Ben Rothwell def. Stefan Struve F
# Aspen Ladd def. Yana Kunitskaya F
# Cody Stamann vs. Song Yadong D
# Rob Font def. Ricky Simon F
# Tim Means def. Thiago Alves F
# Billy Quarantillo def. Jacob Kilburn F
# Bryce Mitchell def. Matt Sayles U (+110)
# Joe Solecki def. Matt Wiman F
# Virna Jandiroba def. Mallory Martin F
# Makhmud Muradov def. Trevor Smith F
# Favorites won 90% (9 of the 10 decided fights, with 2 draws)
# o More up to date data
# o Scrape additional data
# Fight camp
# missed weight
# Record
# Record (Last 3 or 5 fights)
# Current streak
# Speciality
# Location (Of event, training camp, hometown)
# Time since last fight
Again, much of this project was inspired and influenced by Jason Chan Jin An. His work can be found on his website (https://ufcmmapredictor.herokuapp.com/) and in his project notebook (https://github.com/jasonchanhku/UFC-MMA-Predictor/blob/master/UFC%20MMA%20Predictor%20Workflow.ipynb).