A dataset consisting of play-by-play data on MMA fights provides an exciting opportunity to quantify a fighter’s probability of winning both the current round and the fight overall. This information is valuable to spectators watching the broadcast, and to fighters and coaching staff analysing or predicting performance.
Markov chains have previously been used to predict MMA results; Holmes, McHale and Zychaluk (2023) estimated fighter skills to generate transition probabilities and simulate fights. They reported favourable results compared with benchmark models and identified a profitable strategy for the betting markets.
Lamas et al. (2024) used sequential data to analyse the frequency of actions, transitions and the reward-risk balance amongst elite no-gi Brazilian jiu-jitsu competitors in the 2019 World Submission Fighting Championship. They incorporated Bayesian methods for inference and identified transitions associated with winning, deriving valuable insights that fighters and coaches can use to shape strategy. Whilst the literature on Markov chains and Bayesian methods in MMA is limited, many examples exist in other sports. Holden et al. (2022) used Markov processes to generate transition matrices and simulate Australian Rules Football matches, making both pre-match and in-play predictions. They found the Markov model useful for evaluating kick-in strategy, ultimately informing team tactics.
This analysis seeks to build on previous research and provide a foundation for applying Markov chains to MMA. The primary difference is the sequential element, which adds a layer of complexity. Lamas et al. (2024) focused on jiu-jitsu grapplers, a subset of elite MMA fighters, whereas this dataset includes grapplers, strikers and everything in between. The transition matrices generated from this data will be closer to the work of Holmes, McHale and Zychaluk (2023); however, the sequential aspect offers other key applications. Simulating a fight many times from these transition matrices can inform strategy by highlighting fighter strengths and weaknesses. It can also be a valuable addition to the broadcast, introducing new metrics and visualisations that predict the winner of the current round and of the overall fight.
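To make the idea of simulating from a transition matrix concrete, the following is a minimal sketch in base R. The state names (`standing`, `clinch`, `ground`), the toy sequence and the helper `simulate_chain()` are hypothetical and are not taken from the fightgeek data; a real model would be built from the recorded play-by-play events.

```r
# Hypothetical toy sequence of fight states (invented, not from the dataset)
states <- c('standing', 'clinch', 'ground')
toy_seq <- c('standing', 'standing', 'clinch', 'ground', 'ground',
             'standing', 'clinch', 'clinch', 'standing', 'ground',
             'standing', 'standing')

# Count observed transitions: current state (rows) to next state (columns)
from <- factor(head(toy_seq, -1), levels = states)
to   <- factor(tail(toy_seq, -1), levels = states)
counts <- table(from, to)

# Row-normalise the counts to get estimated transition probabilities
trans_mat <- prop.table(counts, margin = 1)

# Simulate a path of n_steps states from transition matrix P (hypothetical helper)
simulate_chain <- function(P, start, n_steps) {
  path <- character(n_steps)
  path[1] <- start
  for (i in seq_len(n_steps - 1)) {
    # draw the next state using the row of P for the current state
    path[i + 1] <- sample(colnames(P), size = 1, prob = P[path[i], ])
  }
  path
}

set.seed(1)
sim_path <- simulate_chain(trans_mat, start = 'standing', n_steps = 20)
```

Repeating the simulation many times and summarising the simulated paths is what turns a transition matrix into round-level or fight-level probabilities.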
The aim of this project is to quantify an elite MMA fighter’s probability of winning both the current round and the fight overall, using a combination of Markov chains and Bayesian methods.
This data has been provided by fightgeek, an organisation that provides combat sports statistics and analytics. Data collectors use a console-like controller to record play-by-play events.
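One simple way Bayesian ideas can be combined with a Markov chain is to place a Dirichlet prior on each row of the transition matrix. The sketch below is an illustration only and is not necessarily the approach taken in this project; the counts, the state names, the symmetric prior and the helper `draw_dirichlet_rows()` are all assumptions.

```r
# Hypothetical transition counts between three illustrative states
states <- c('standing', 'clinch', 'ground')
counts <- matrix(c(12, 5, 3,
                    4, 6, 2,
                    3, 0, 9),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(from = states, to = states))

# Symmetric Dirichlet(1, 1, 1) prior on every row (an assumption)
prior_alpha <- 1

# Each row's posterior is Dirichlet(counts + prior); this is its mean
posterior_alpha <- counts + prior_alpha
posterior_mean <- posterior_alpha / rowSums(posterior_alpha)

# Draw one transition matrix from the posterior:
# independent Gamma draws normalised within each row give Dirichlet samples
draw_dirichlet_rows <- function(alpha_mat) {
  g <- matrix(rgamma(length(alpha_mat), shape = alpha_mat),
              nrow = nrow(alpha_mat), dimnames = dimnames(alpha_mat))
  g / rowSums(g)
}

set.seed(1)
posterior_draw <- draw_dirichlet_rows(posterior_alpha)
```

The prior keeps unobserved transitions, such as `ground` to `clinch` in the toy counts, from being assigned a probability of exactly zero.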
The dataset features events from 9 elite fighting organisations including the UFC and ONE FC.
# load & import ----
#getwd()
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(eeptools)
#install.packages('markovchain')
library(markovchain)
## Warning: package 'markovchain' was built under R version 4.3.3
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
##
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Package: markovchain
## Version: 0.10.0
## Date: 2024-11-14 00:00:02 UTC
## BugReport: https://github.com/spedygiorgio/markovchain/issues
##
##
## Attaching package: 'markovchain'
##
## The following object is masked from 'package:lubridate':
##
## period
# install.packages('ufcstatr')
# library(ufcstatr)
pbp_data <- read.csv('Data/Precision_Data_NC_All.csv')
event_data <- read.csv('Data/Precision_Data_NC_All__BOUTS.csv')
This analysis uses only data collected from the UFC promotion.
The analysis considers only fights conducted after the UFC adopted the unified rules in November 2000. After filtering, the fights in this dataset range from 23 September 2007 to 30 July 2022, so none predate the rule change.
The final variable requiring filtering is the win method. Only fights decided by the judges’ scorecards are analysed.
After filtering, the mean age of fighters is 29.4 years, ranging from 20.5 to 43.9 years. A total of 487 UFC fights remain after filtering: 105 female fights and 382 male fights. There are 486 unique fighters in total: 84 female and 402 male.
# filtering ----
# filter to only include UFC fights
event_data <- event_data %>%
filter(PROMOTION_NAME == 'UFC')
# fights have been erroneously coded with UFC as the PROMOTION_NAME
# eliminate if UFC not found in BOUT_NAME
# first, eliminate test bout, BOUT_ID == 7793
event_data <- event_data %>% filter(BOUT_ID != 7793)
event_data <- event_data %>%
filter(!grepl('MTFL', BOUT_NAME) &
!grepl('ARES', BOUT_NAME) &
!grepl('MT-', BOUT_NAME))
### Filter out fights previous to the unified rules
# (November 2000)
event_data <- event_data %>%
mutate(EVENT_DATE = as.Date(EVENT_DATE))
# there is an incorrect date for bout 8051
event_data$EVENT_DATE[event_data$BOUT_ID == 8051] <- as.Date('2016-07-10')
# remove bout 8555 as it is a test bout
event_data <- event_data %>% filter(BOUT_ID != 8555)
# now the earliest date is 2007-05-27
### only fights that went the distance
# could filter by WIN_METHOD
judges_decisions <- c('Majority Decision', 'Majority Draw', 'Split Decision',
'Split Draw', 'Unanimous Decision')
event_data <- event_data %>%
filter(WIN_METHOD %in% judges_decisions)
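The filtering code above does not apply an explicit date cut-off, since the earliest remaining fight already falls well after November 2000. A more defensive version could state the rule directly. The following is a minimal base R sketch on invented rows; the column names mirror the dataset, while the `'KO/TKO'` label and the exact cut-off day (the 1st) are assumptions, as the text only gives the month.

```r
# Toy bouts table (invented rows; column names mirror the real data)
toy_bouts <- data.frame(
  BOUT_ID = 1:4,
  EVENT_DATE = as.Date(c('1999-03-05', '2007-09-23', '2015-01-03', '2022-07-30')),
  WIN_METHOD = c('Unanimous Decision', 'Split Decision', 'KO/TKO', 'Majority Draw')
)

# Assumed cut-off: first day of the month in which the unified rules were adopted
unified_rules_start <- as.Date('2000-11-01')

judges_decisions <- c('Majority Decision', 'Majority Draw', 'Split Decision',
                      'Split Draw', 'Unanimous Decision')

# Keep only bouts on or after the cut-off that were decided by the judges
kept <- subset(toy_bouts,
               EVENT_DATE >= unified_rules_start & WIN_METHOD %in% judges_decisions)
```

Here bout 1 is dropped for its date and bout 3 for its win method, leaving bouts 2 and 4.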
Data cleaning involved:
* Removing test bouts
* Handling improperly coded gender and round-ended values
* Converting variable types to dates, factors or numeric
### cleaning ----
# some more incorrect values
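# FIGHTER_1_NAME == Brad Pickett, DOB seems to be incorrect
# should actually be 1978 not 0078, BOUT_ID == 7778
# FIGHTER_1_DOB is still a character column here (as.Date() is applied later),
# so assign the corrected date as a string to avoid coercion to a day count
event_data$FIGHTER_1_DOB[event_data$BOUT_ID == 7778] <- '1978-09-24'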
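# gender
# some bouts list the two fighters with different genders;
# the code below recodes the mismatched fighter as 'Female'
# fighter 2 names where fighter 1 is coded Female but fighter 2 is not
g2_females <- event_data %>%
  filter(FIGHTER_1_GENDER == 'Female' & FIGHTER_2_GENDER != 'Female') %>%
  pull(FIGHTER_2_NAME)
# fighter 1 names coded Male whose opponent is not coded Male
g1_males <- event_data %>%
  filter(FIGHTER_1_GENDER == 'Male' & FIGHTER_2_GENDER != 'Male') %>%
  pull(FIGHTER_1_NAME)
# recode both groups as Female
event_data <- event_data %>%
  mutate(
    FIGHTER_2_GENDER = ifelse(FIGHTER_2_NAME %in% g2_females, 'Female', FIGHTER_2_GENDER),
    FIGHTER_1_GENDER = ifelse(FIGHTER_1_NAME %in% g1_males, 'Female', FIGHTER_1_GENDER)
  )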
# convert variables to dates or factors
event_data <- event_data %>%
mutate(FIGHTER_1_DOB = as.Date(FIGHTER_1_DOB),
FIGHTER_2_DOB = as.Date(FIGHTER_2_DOB),
EVENT_DATE = as.Date(EVENT_DATE),
WIN_METHOD = as.factor(WIN_METHOD),
CHAMPIONSHIP_BOUT = as.factor(CHAMPIONSHIP_BOUT),
WEIGHTCLASS = as.factor(WEIGHTCLASS),
CATCH_WEIGHT = as.factor(CATCH_WEIGHT))
# convert anthropometric variables to numeric
event_data <- event_data %>%
mutate(FIGHTER_1_HEIGHT = as.numeric(FIGHTER_1_HEIGHT),
FIGHTER_2_HEIGHT = as.numeric(FIGHTER_2_HEIGHT),
FIGHTER_1_ARM_REACH = as.numeric(FIGHTER_1_ARM_REACH),
FIGHTER_2_ARM_REACH = as.numeric(FIGHTER_2_ARM_REACH),
FIGHTER_1_LEG_REACH = as.numeric(FIGHTER_1_LEG_REACH),
FIGHTER_2_LEG_REACH = as.numeric(FIGHTER_2_LEG_REACH))
## Warning: There were 6 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `FIGHTER_1_HEIGHT = as.numeric(FIGHTER_1_HEIGHT)`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 5 remaining warnings.
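The coercion warning above means that some height or reach entries could not be parsed as numbers and were set to NA. Before accepting that, it can help to list the offending raw values. The following is a minimal sketch on an invented vector; the values are hypothetical and only stand in for a column such as FIGHTER_1_HEIGHT.

```r
# Toy height column as it might be read from a CSV (invented values)
raw_height <- c('180', '175.5', '', 'N/A', '6ft', NA)

# Coerce quietly, keeping the raw values for comparison
parsed <- suppressWarnings(as.numeric(raw_height))

# Flag entries that were present and non-empty but failed to parse
failed <- !is.na(raw_height) & raw_height != '' & is.na(parsed)

# The distinct raw values that would become NA
bad_values <- unique(raw_height[failed])
```

Inspecting `bad_values` shows whether the NAs come from placeholders such as `'N/A'` or from values recorded in a different unit that could be recovered.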
# some bouts have an incorrectly coded ROUND_ENDED;
# keep only bouts that ended in round 3 or round 5
event_data <- event_data %>%
  filter(ROUND_ENDED %in% c(3, 5))
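Once the cleaning steps have run, a quick consistency check can confirm that the gender recoding left no bout with mismatched or missing gender values. This is a minimal sketch on invented rows; only the column names are taken from the dataset.

```r
# Toy bouts (invented rows; column names mirror the real data)
toy_bouts <- data.frame(
  BOUT_ID = 1:4,
  FIGHTER_1_GENDER = c('Female', 'Male', 'Male', 'Female'),
  FIGHTER_2_GENDER = c('Female', 'Male', 'Female', NA)
)

g1 <- toy_bouts$FIGHTER_1_GENDER
g2 <- toy_bouts$FIGHTER_2_GENDER

# A bout is flagged if either gender is missing or the two values differ
mismatch <- is.na(g1) | is.na(g2) | g1 != g2
mismatched_ids <- toy_bouts$BOUT_ID[mismatch]
```

An empty `mismatched_ids` on the real data would indicate that every remaining bout has two fighters of the same recorded gender.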
For exploratory data analysis, we first create an age variable to describe the population.
# deriving new variables
# age
# the date for BOUT_ID == 8451 seems to be incorrect, as Brad Katona was not 15 on debut
event_data$EVENT_DATE[event_data$BOUT_ID == 8451] <- as.Date('2018-07-06')
event_data$EVENT_NAME[event_data$BOUT_ID == 8451] <- 'TUF-27_FINALE'
event_data <- event_data %>%
mutate(FIGHTER_1_AGE = round(as.numeric((EVENT_DATE-FIGHTER_1_DOB)/365),1),
FIGHTER_2_AGE = round(as.numeric((EVENT_DATE-FIGHTER_2_DOB)/365),1))
### EDA ----
# mean age
bind_rows(
event_data %>%
select(AGE = FIGHTER_1_AGE),
event_data %>%
select(AGE = FIGHTER_2_AGE)
) %>%
summarise(mean_age = mean(AGE),
min_age = min(AGE),
max_age = max(AGE))
## mean_age min_age max_age
## 1 29.44004 20.5 43.9
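The ages summarised above are computed by dividing the number of days by 365, which ignores leap days and so slightly overstates age. Dividing by 365.25 is a common approximation that accounts for them. The following is a minimal sketch comparing the two on invented dates; it is an alternative for illustration, not the calculation used to produce the figures above.

```r
# Invented birth dates and event dates
dob        <- as.Date(c('1978-09-24', '1990-02-28', '2000-01-15'))
event_date <- as.Date(c('2018-07-06', '2018-07-06', '2022-07-30'))

# Elapsed time in days between birth and the event
days_elapsed <- as.numeric(difftime(event_date, dob, units = 'days'))

# Age with a flat 365-day year, as in the analysis code
age_flat <- round(days_elapsed / 365, 1)

# Age with an average year length of 365.25 days to allow for leap years
age_leap <- round(days_elapsed / 365.25, 1)

# Difference between the two definitions for each fighter
age_diff <- age_flat - age_leap
```

The gap grows with age but stays small relative to the one-decimal rounding, so the choice matters mainly for fighters near a reporting threshold.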
# time period and collection methodology
event_data %>%
summarise(
min_date = min(EVENT_DATE),
max_date = max(EVENT_DATE)
)
## min_date max_date
## 1 2007-09-23 2022-07-30
# number of observations
event_data %>%
group_by(FIGHTER_1_GENDER) %>%
count()
## # A tibble: 2 × 2
## # Groups: FIGHTER_1_GENDER [2]
## FIGHTER_1_GENDER n
## <chr> <int>
## 1 Female 105
## 2 Male 382
# unique fighters
names_and_gender <- bind_rows(event_data %>%
select(name = FIGHTER_1_NAME, gender = FIGHTER_1_GENDER) %>%
distinct(),
event_data %>%
select(name = FIGHTER_2_NAME, gender = FIGHTER_2_GENDER) %>%
distinct())
names_and_gender %>%
filter(gender == 'Female') %>%
distinct() %>%
nrow()
## [1] 84
names_and_gender %>%
filter(gender == 'Male') %>%
distinct() %>%
nrow()
## [1] 402
The dataset predominantly contains male fighters, with 402 unique male fighters and only 84 unique female fighters.
# number of observations, athletes/teams/participants -
# temporal coverage (seasons, matches, events)
## fights per year
event_data$year <- format(as.Date(event_data$EVENT_DATE, format = '%Y-%m-%d'), '%Y')
year_plot <- event_data %>%
group_by(year) %>%
count() %>%
ggplot(aes(year, n)) +
geom_col()
year_plot + theme(axis.text.x = element_text(angle = 45))
The bulk of the data comes from fights held between 2018 and 2020.
weight_plot <- event_data %>%
group_by(WEIGHTCLASS) %>%
count() %>%
ggplot(aes(reorder(WEIGHTCLASS, desc(n)), n)) +
geom_col()
weight_plot + theme(axis.text.x = element_text(angle = 45))
Featherweight is the weight class with the most fights.
facet_weight <- event_data %>%
count(WEIGHTCLASS, FIGHTER_1_GENDER) %>%
ggplot(aes(reorder(WEIGHTCLASS, desc(n)), n)) +
geom_col() +
facet_grid(~FIGHTER_1_GENDER) +
labs(x = "Weight Class", y = "Count")
facet_weight + theme(axis.text.x = element_text(angle = 45))
Some weight classes contain fights for both male and female fighters.
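The weight classes shared by both genders can also be listed directly rather than read off the faceted plot. This is a minimal sketch using a cross-tabulation on invented rows; the column names follow the dataset, and the weight class labels are hypothetical examples.

```r
# Toy bouts (invented rows; column names mirror the real data)
toy_bouts <- data.frame(
  WEIGHTCLASS = c('Flyweight', 'Flyweight', 'Bantamweight',
                  'Bantamweight', 'Heavyweight', 'Strawweight'),
  FIGHTER_1_GENDER = c('Male', 'Female', 'Male', 'Female', 'Male', 'Female')
)

# Cross-tabulate weight class (rows) by gender (columns)
gender_tab <- table(toy_bouts$WEIGHTCLASS, toy_bouts$FIGHTER_1_GENDER)

# Weight classes with at least one bout in every gender column
shared_classes <- rownames(gender_tab)[rowSums(gender_tab > 0) == ncol(gender_tab)]
```

Applied to the real data, `shared_classes` would name the weight classes that appear in both panels of the plot.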