Introduction

I am a huge MMA fan, and the UFC is the largest MMA promotion, so when I saw this dataset on Kaggle I had to play with it. The following is an attempt to model which elements of fights and fighters matter most to winning fights. At the end, I compare the statistics with some real-life intuition to see how the numbers stack up against popular wisdom, add some caveats to the results, and suggest potential next steps.

The fight-level dataset contains super granular data on every fight from 1993 to 2019. Our goal is to build a model that:

  1. gives us an understanding of what factors contribute the most, and in what ways, towards one fighter winning over another. In other words, this is an exercise in understanding how fighters compare across a variety of attributes, and evaluating to what extent each comparison matters.
  2. is able to predict which fighter will win

Before we start building models on top of it, we have to wrangle the data to make it make sense for modeling:

  1. Removing uninformative dimensions: Some dimensions in our fight-level dataset, such as the referee or state, do not help us understand the fighters and how they compare. To that end, we start by removing these details from our training dataset. Even though they might marginally improve our model’s evaluation metrics, they do not contribute to our understanding of each fighter, and are therefore unlikely to meaningfully explain why one fighter wins over another. We’re also removing rows that have NAs: these fighters tend not to be representative of the typical UFC fighter, and the missing values raise too many data concerns. We’re left with over 3,000 fights to work with. (A quick comparison of the two datasets shows that the majority of excluded fights took place between 1993 and 1997, likely before some of these metrics were recorded.)

  2. Factorizing: Intuitively, it makes sense that different attributes will matter differently depending on the weight class. We’ll recast the weight class column as a factor to make it easier to work with.

  3. Probability of red winning: This prediction problem is a binary classification: we’re looking at whether the blue or red corner wins. If we’re using a regression technique (like logistic regression), we’ll have to reframe the problem as “what is the probability that __ corner wins?”. Given that for most fights the UFC gives the red corner to the fighter with the most name recognition, we’ll assess the probability that the red corner wins.

  4. Difference between fighters: Finally, at its core, our analysis seeks to understand how fighters compare against each other. To this end, it is useful to have a dataset that explicitly captures the differences between fighters for each stat. We’re going to be looking at the relative difference \(\frac{\text{Red Fighter} - \text{Blue Fighter}}{\text{Red Fighter}}\).
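
The relative difference in step 4 can be sketched as follows. The original analysis appears to have been run in R; this is a plain Python illustration only. The helper name `relative_diff`, the example numbers, and the choice to return `None` when the red fighter's value is zero are all assumptions, not details taken from the original code.

```python
from typing import Optional


def relative_diff(red: float, blue: float) -> Optional[float]:
    """Return (red - blue) / red, or None when red is 0 (ratio undefined)."""
    if red == 0:
        # Assumption: an undefined ratio is treated as a missing value.
        return None
    return (red - blue) / red


# Hypothetical stats for one fight (made-up numbers, not rows from the dataset).
red_stats = {"wins": 4, "losses": 1, "Reach_cms": 180.0}
blue_stats = {"wins": 2, "losses": 3, "Reach_cms": 185.0}

# One diff_* column per stat, mirroring the naming used in the fight-level table.
diffs = {
    f"diff_{name}": relative_diff(red_stats[name], blue_stats[name])
    for name in red_stats
}
print(diffs)
```

Because the denominator is the red fighter's value, the measure is not symmetric between corners, and a small red value can produce a very large magnitude (the fight-level table contains entries such as -15.8).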

The Data

The two datasets we’re working with are the following:

Fighter-level averages (first six rows)
fighter current_lose_streak current_win_streak draw avg_BODY_att avg_BODY_landed avg_CLINCH_att avg_CLINCH_landed avg_DISTANCE_att avg_DISTANCE_landed avg_GROUND_att avg_GROUND_landed avg_HEAD_att avg_HEAD_landed avg_KD avg_LEG_att avg_LEG_landed avg_PASS avg_REV avg_SIG_STR_att avg_SIG_STR_landed avg_SIG_STR_pct avg_SUB_ATT avg_TD_att avg_TD_landed avg_TD_pct avg_TOTAL_STR_att avg_TOTAL_STR_landed longest_win_streak losses avg_opp_BODY_att avg_opp_BODY_landed avg_opp_CLINCH_att avg_opp_CLINCH_landed avg_opp_DISTANCE_att avg_opp_DISTANCE_landed avg_opp_GROUND_att avg_opp_GROUND_landed avg_opp_HEAD_att avg_opp_HEAD_landed avg_opp_KD avg_opp_LEG_att avg_opp_LEG_landed avg_opp_PASS avg_opp_REV avg_opp_SIG_STR_att avg_opp_SIG_STR_landed avg_opp_SIG_STR_pct avg_opp_SUB_ATT avg_opp_TD_att avg_opp_TD_landed avg_opp_TD_pct avg_opp_TOTAL_STR_att avg_opp_TOTAL_STR_landed total_rounds_fought total_time_fought total_title_bouts win_by_Decision_Majority win_by_Decision_Split win_by_Decision_Unanimous win_by_KO/TKO win_by_Submission win_by_TKO_Doctor_Stoppage wins Height_cms Reach_cms Weight_lbs age stance
Aaron Phillips 1.0000000 0.000000 0 14.000000 12.000000 6.000000 3.000000 26.00000 9.00000 8.000000 6.000000 23.00000 5.00000 0.000000 3.000000 1.000000 1.0000000 1.0000000 40.00000 18.00000 0.4500000 1.0000000 0.000000 0.0000000 0.0000000 137.00000 109.00000 0.0000000 1.000000 13.000000 8.000000 6.000000 4.00000 31.00000 12.00000 31.000000 21.000000 53.00000 28.00000 0.0000000 2.000000 1.000000 7.0000000 1.0000000 68.00000 37.00000 0.5400000 1.0000000 8.000000 5.0000000 0.6200000 129.00000 95.00000 3.000000 900.0000 0 0 0.0000000 0.000000 0.000000 0.00 0 0.000000 175.26 177.80 135 25.00000 Southpaw
Aaron Riley 0.7916667 0.375000 0 14.886905 12.436905 26.699405 15.647024 71.04464 24.41488 2.977381 2.159524 70.40595 18.85536 0.000000 15.428571 10.929167 0.8041667 0.0000000 100.72143 42.22143 0.4049583 0.3553571 2.839881 0.8511905 0.3019940 133.25357 72.51012 0.7083333 2.916667 13.207143 10.910714 24.312500 13.35714 83.91071 28.59226 6.923214 3.514286 97.66131 32.18810 0.4607143 4.277976 2.364881 0.0000000 0.0000000 115.14643 45.46369 0.3908869 0.0000000 3.826786 2.3178571 0.3226190 132.08155 62.15536 9.458333 646.6417 0 0 0.0000000 1.416667 0.000000 0.00 0 1.416667 172.72 175.26 155 28.37500 Southpaw
Aaron Rosa 0.0000000 1.000000 0 23.500000 21.000000 31.500000 26.000000 107.50000 40.00000 0.000000 0.000000 110.50000 40.50000 0.000000 5.000000 4.500000 0.0000000 0.0000000 139.00000 66.00000 0.4950000 0.0000000 0.500000 0.0000000 0.0000000 284.50000 195.50000 1.0000000 1.000000 9.500000 8.000000 15.500000 13.00000 94.00000 36.50000 4.000000 4.000000 92.50000 36.00000 0.0000000 11.500000 9.500000 1.0000000 0.0000000 113.50000 53.50000 0.4700000 0.0000000 7.000000 1.0000000 0.1750000 189.50000 116.50000 6.000000 793.0000 0 1 0.0000000 0.000000 0.000000 0.00 0 1.000000 193.04 198.12 205 28.00000 Orthodox
Aaron Simpson 0.4761905 1.309524 0 10.122449 7.128685 17.370975 11.932341 35.32381 11.43931 15.311111 8.886990 51.50334 19.08965 0.543424 6.380102 6.040306 0.8887755 0.0000000 68.00590 32.25865 0.5123336 0.1816043 6.382313 2.6608560 0.6325468 95.37177 56.16344 2.7857143 1.452381 4.392942 3.372931 22.182171 12.92883 37.15958 11.10768 1.512840 1.233702 51.35788 17.92857 0.0000000 5.103770 3.968707 0.0150794 0.0953231 60.85459 25.27021 0.3638418 0.3374717 1.787869 0.3284297 0.0643577 73.02200 36.78424 12.166667 521.2818 0 0 0.8571429 1.023810 1.928571 0.00 0 3.809524 182.88 185.42 170 36.04762 Orthodox
Abdul Razak Alhassan 0.2500000 1.000000 0 1.708333 1.104167 5.416667 4.104167 41.72917 16.97917 2.083333 1.562500 45.12500 20.16667 1.770833 2.395833 1.375000 0.0000000 0.2708333 49.22917 22.64583 0.4862500 0.0000000 0.437500 0.1458333 0.0481250 51.91667 24.37500 1.2500000 0.750000 4.854167 3.687500 5.395833 3.93750 46.95833 13.39583 6.687500 3.979167 52.52083 15.95833 0.0000000 1.666667 1.666667 0.8125000 0.0000000 59.04167 21.31250 0.2770833 0.0000000 2.854167 1.6875000 0.2100000 76.16667 34.39583 4.000000 323.6042 0 0 0.0000000 0.000000 1.750000 0.00 0 1.750000 177.80 185.42 170 32.00000 Orthodox
Abel Trujillo 0.5833333 1.000000 0 12.142130 10.475066 15.108532 11.621958 37.90476 10.65483 14.117527 11.861243 51.90794 20.58221 0.428373 3.080754 3.080754 0.5609788 0.0000000 67.13082 34.13803 0.4720298 0.5920635 1.994114 1.3677249 0.7785245 84.07176 49.26739 1.8333333 2.333333 3.482077 2.533664 7.300926 4.11541 32.79987 11.86587 6.188823 4.033664 37.66369 13.90536 0.0000000 5.143849 3.575926 0.9390873 0.0000000 46.28962 20.01495 0.4701184 1.1196429 9.544511 5.6351190 0.4725694 58.51455 29.74497 11.583333 578.2272 0 0 0.0000000 0.250000 2.250000 0.25 0 3.166667 172.72 177.80 155 31.25000 Orthodox
Fight-level differences between fighters (first six rows)
fight_id diff_wins diff_Weight_lbs diff_total_time_fought diff_total_rounds_fought diff_Reach_cms diff_losses diff_longest_win_streak diff_Height_cms diff_current_win_streak diff_avg_TOTAL_STR_landed diff_avg_TOTAL_STR_att diff_avg_TD_pct diff_avg_TD_landed diff_avg_TD_att diff_avg_SU_ATT diff_avg_SIG_STR_pct diff_avg_SIG_STR_landed diff_avg_SIG_STR_att diff_avg_REV diff_avg_PASS diff_avg_opp_TOTAL_STR_landed diff_avg_opp_TOTAL_STR_att diff_avg_opp_TD_pct diff_avg_opp_TD_landed diff_avg_opp_TD_att diff_avg_opp_SU_ATT diff_avg_opp_SIG_STR_pct diff_avg_opp_SIG_STR_landed diff_avg_opp_SIG_STR_att diff_avg_opp_REV diff_avg_opp_PASS diff_avg_opp_LEG_landed diff_avg_opp_LEG_att diff_avg_opp_KD diff_avg_opp_HEAD_landed diff_avg_opp_HEAD_att diff_avg_opp_GROUND_landed diff_avg_opp_GROUND_att diff_avg_opp_DISTANCE_landed diff_avg_opp_DISTANCE_att diff_avg_opp_CLINCH_landed diff_avg_opp_CLINCH_att diff_avg_opp_ODY_landed diff_avg_opp_ODY_att diff_avg_LEG_landed diff_avg_LEG_att diff_avg_KD diff_avg_HEAD_landed diff_avg_HEAD_att diff_avg_GROUND_landed diff_avg_GROUND_att diff_avg_DISTANCE_landed diff_avg_DISTANCE_att diff_avg_CLINCH_landed diff_avg_CLINCH_att diff_avg_ODY_landed diff_avg_ODY_att diff_age red_win
1 0.5000000 0.0000000 0.4352276 0.6666667 -0.0468750 0.5000000 0.0000000 -0.0312500 0.0000000 0.6584660 0.4888376 0.7816594 0.8947368 0.8490566 -3.0000000 0.0000000 0.4863636 0.3550296 0.0000000 0.6666667 0.5565820 0.5131222 -1.0000000 -3.0000000 -0.1111111 0.0000000 0.2976190 0.4347826 0.4655870 0.0000000 0.0000000 0.2131148 0.2765957 -1.0000000 0.4566474 0.4796321 0.3333333 0.2500000 0.3507463 0.4342541 0.8823529 0.8666667 0.5454545 0.5187970 -0.4594595 -0.4339623 -1.0000000 0.5313808 0.3450135 0.6923077 0.7234043 0.2226415 0.1653333 1.0000000 0.9882353 0.6341463 0.5799087 0.0312500 1
2 0.2000000 0.0000000 0.2005650 -0.1600000 0.0000000 -2.0000000 -0.5000000 -0.0153846 -0.5000000 0.3233333 0.0177719 0.6258907 0.7941176 0.8055556 -0.6333333 0.3069479 0.1367788 -0.1988903 1.0000000 0.5333333 0.0836806 0.0420054 -0.5699029 -0.0500000 0.3000000 -1.4500000 0.0666667 -0.0846645 -0.0608158 0.0000000 -0.2250000 0.5370079 0.5916667 0.0000000 -1.4781609 -0.5794457 -0.6100000 -0.2600000 0.0175439 -0.0317391 -0.2218182 -0.2108108 0.3424242 0.4709302 0.3067961 0.3622222 0.0000000 0.1351351 -0.3243243 0.8843478 0.8697674 -0.1648221 -0.4147488 -0.0645833 -0.2707692 -0.1796296 -0.2166667 -0.0322581 1
3 -0.6428571 0.0000000 0.0372750 -1.0606061 0.0394737 -7.0000000 0.2727273 -0.0281690 0.7272727 0.1711611 0.2202280 -0.0654467 -0.6935484 -1.3156682 0.6451613 -0.1537884 0.1567153 0.2198391 0.2741935 -1.8064516 -0.2894869 0.0170976 0.5161290 0.6612903 0.2685671 -0.4516129 -0.3330171 -0.2548146 0.0407528 0.7580645 0.9032258 -0.7008798 -0.5268817 0.1532258 -0.0984427 0.1357092 0.6658986 0.6420681 -0.1821278 0.0715770 -6.9618768 -2.0069124 -0.4595452 -0.2375502 -0.0887097 -0.0194175 -2.2258065 0.3466836 0.3007047 0.1078629 -0.0342742 0.2200678 0.2703048 -1.5310174 -1.3518380 -0.3064516 -0.1073201 -0.0285714 1
4 0.3333333 0.0000000 0.0554147 0.5500000 0.0147059 1.0000000 0.2000000 -0.0468750 0.0000000 -0.7093596 -0.3184239 -1.7710843 -1.0000000 -0.1111111 0.0000000 -0.5017065 -0.5355191 -0.2192394 0.0000000 -3.0000000 0.2992327 0.0901194 0.0000000 0.0000000 -0.8947368 0.0000000 0.2436975 0.2924791 0.0800451 0.0000000 0.0000000 0.6179775 0.4528302 -1.0000000 -0.0552147 -0.0662359 -0.6666667 -1.0000000 0.3051948 0.0860606 0.3333333 0.1272727 0.5514019 0.3950617 0.7826087 0.7692308 -0.3333333 -1.2395833 -0.4440994 -15.8000000 -12.0000000 -0.1890244 -0.0441001 -1.6666667 -1.3404255 -0.3658537 0.0684932 0.1034483 0
5 0.6666667 0.0530303 -1.7226319 -0.1428571 0.0266667 0.0000000 0.6666667 0.0405405 0.0000000 -0.8931298 -2.2125984 0.0000000 0.0000000 1.0000000 0.0000000 0.4311927 -0.8923077 -2.2690763 0.0000000 1.0000000 -2.2432432 -2.3966942 0.0000000 0.0000000 0.5000000 0.0000000 -0.0817610 -3.0000000 -2.7363636 0.0000000 1.0000000 -1.6666667 -0.9090909 1.0000000 -3.0357143 -2.5260116 1.0000000 1.0000000 -4.5076923 -3.7953216 0.8571429 0.8888889 -3.9473684 -5.8000000 0.3333333 0.4666667 1.0000000 -0.9780220 -2.6354680 1.0000000 1.0000000 -1.4040404 -2.9605911 0.7241379 0.7727273 -1.1481481 -1.1935484 -0.2307692 0
6 0.1111111 0.0000000 0.3402531 0.2812500 0.0845070 -0.3333333 0.0000000 0.0149254 0.0000000 0.4460641 0.2578313 0.4411178 0.6666667 0.4843750 0.1111111 -0.0187075 0.2057416 0.0641234 0.0000000 0.6111111 -0.2290749 -0.3691830 0.2284768 0.5384615 0.0370370 1.0000000 -0.3160763 -1.4423963 -0.8349929 1.0000000 0.9000000 -0.0185185 0.1200000 1.0000000 -2.6261682 -1.0460653 0.0000000 0.0909091 -2.2258065 -1.0228758 0.6666667 0.5362319 -0.5535714 -0.4766355 0.0546875 0.0000000 0.0000000 -0.0285714 -0.0568783 0.7092199 0.6839378 -0.1784703 -0.1699196 0.6917293 0.5654762 0.5936073 0.4039735 -0.1034483 1

Initial Analysis

We plug our modelling dataset into a random forest of 500 trees, randomly selecting 20 candidate variables at each split, and plot how much splitting on each variable decreased the Gini impurity, averaged across all the trees.

## 
## Call:
##  randomForest(formula = red_win ~ ., data = forest_df, ntree = 500,      mtry = 20) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 20
## 
##         OOB estimate of  error rate: 36.66%
## Confusion matrix:
##     0    1 class.error
## 0 279  907   0.7647555
## 1 267 1749   0.1324405
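
To make the importance measure concrete, here is a minimal Python sketch of the Gini impurity decrease produced by a single split. The toy node and its labels are invented for illustration. A random forest implementation typically accumulates decreases like this one for every split made on a variable and averages them over the trees, so treat this as the building block rather than the full calculation.

```python
from collections import Counter
from typing import Sequence


def gini_impurity(labels: Sequence[int]) -> float:
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(parent: Sequence[int],
                  left: Sequence[int],
                  right: Sequence[int]) -> float:
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted_children = (
        len(left) / n * gini_impurity(left)
        + len(right) / n * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_children


# Toy node (hypothetical): 10 fights, 6 won by the red corner (label 1).
parent = [1] * 6 + [0] * 4
# Hypothetical split on some variable, e.g. "red fighter is younger".
left = [1] * 5 + [0] * 1
right = [1] * 1 + [0] * 3

decrease = gini_decrease(parent, left, right)
print(round(gini_impurity(parent), 4), round(decrease, 4))
```

A split that separates red wins from blue wins cleanly yields a large decrease; a split that leaves both children as mixed as the parent yields a decrease close to zero.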

It looks like the following variables are the key drivers of fight outcomes:

  1. % Difference in age – by far the most important variable. Its importance is super interesting: much of the MMA media discusses fighters in relation to “their prime” (generally around 28-30). Fighters like Randy Couture, Fedor and Yoel Romero have become legends by fighting, and succeeding, well past their “primes”. That this narrative is backed by some statistical analysis suggests that there is something to this general perception.

  2. Percent of significant strikes landed – the importance of significant strikes is not surprising: they are called out in the judges’ scorecards, which contribute to how judges score fights that go to a decision. What is surprising, however, is that the % of significant strikes is so important (vs. overall volume). This suggests that fighters with accurate striking are favoured over those with volume.

  3. Number of ground strikes landed – contrasting the number of ground strikes with the % of significant strikes presents an interesting nuance. Ground strikes are those landed against a grounded opponent (typically brought to the ground by a takedown or trip). What’s interesting about the number of ground strikes being more important than the % of ground strikes landed is that it suggests overall ground control time is potentially more important than actual damage done on the ground. Interestingly, barring actively working for a submission (which referees sometimes don’t catch), ground strikes are the only way to keep your opponent on the ground without the referee standing you both up (due to inactivity). I would be curious how important ground strikes are relative to overall ground control time.

  4. % Difference in reach – this makes intuitive sense: a fighter’s reach advantage over another presents the opportunity to land strikes with less risk of getting hit. Shorter-stature fighters like BJ Penn made their names using technique, physicality and a strong chin to overcome this disadvantage. I would be curious to see how this variable becomes more or less important if you subset the data for the more proficient athletes (potentially filtering by only UFC champions).

  5. Number of head strikes landed – head strikes are some of the most damaging strikes to receive. Most KO/TKOs via strikes will be due to at least one head strike. Sports like boxing focus heavily on strikes to the head, so fighters coming in from this background will carry that instinct. In practice, head strikes are set up by shots in longer combinations (shots to the body, elbows in the clinch etc.). It would be great to understand whether the importance of head strikes is a function of one’s ability to land long combinations (and so potentially a proxy for % significant strikes), or whether the damage of a head strike is inherently important.

  6. % Difference in time fought – again, this makes intuitive sense. MMA is a crazy sport, and time in the cage contributes to a fighter’s “fight IQ” (ability to make adjustments on the fly), endurance to go all 3 (or 5) rounds, ability to handle stress, etc.

Another interesting comparison is how many of the key drivers were physical attributes vs. in-fight strategies. A lesson for future fighters would be to take their age and reach relative to their peers seriously. Fighters who are particularly young with large reaches would be especially advantaged. The analysis also highlights the importance of conditioning for MMA athletes, particularly when fighting more experienced fighters.

One piece I would love to understand more is the importance of submissions. Submissions are a key element of the game, and a great submission game can neutralize threats of reach or size (famously proved by Royce Gracie in the very first UFC tournament in 1993). I was surprised at how striking-heavy the important variables turned out to be. I would love to see how the important variables shift as we subset for current or ex-champions: my hypothesis is that as the calibre of fighter goes up, physicality becomes less important and technique-driven variables will emerge (such as % sig strikes, submission attempts, etc.).

Neural Network

As a predictive exercise, we’ll feed these variables into a small neural net with 5 hidden units. Neural nets have the advantage of learning, through their hidden layers, which variables (and/or which combinations of variables) are most important to predicting the outcome. While this makes large neural nets strong algorithms, it is very hard to interpret them and get an intuition for which variables are important and why. Therefore, in this case, we feed the neural net with variables we already know from our random forest (and our intuition) are important. The neural net improves marginally on the random forest, achieving an overall accuracy of 65%. Looking at the confusion matrix, recall for red-corner wins comes out at about 89% (1,800 of 2,016) and precision at about 67% (1,800 of 2,692), both slightly above the random forest’s 87% and 66%. Like the random forest, the neural net still overpredicts the red corner winning: it correctly identifies only 294 of the 1,186 blue-corner wins.
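
The training log in this section reports `# weights:  41`. As a rough sanity check on the network's size, the sketch below counts the weights of a fully connected net with one hidden layer and lists the input/hidden-unit combinations that give 41. The single-hidden-layer architecture, the bias terms, the single output unit and the absence of skip connections are all assumptions on my part, since the model-fitting code is not shown.

```python
from typing import List, Tuple


def count_weights(n_inputs: int, n_hidden: int, n_outputs: int = 1) -> int:
    """Weights (including biases) of a net with one fully connected hidden layer."""
    input_to_hidden = (n_inputs + 1) * n_hidden    # +1 for each hidden unit's bias
    hidden_to_output = (n_hidden + 1) * n_outputs  # +1 for each output unit's bias
    return input_to_hidden + hidden_to_output


# Every (inputs, hidden units) pair that yields exactly 41 weights.
candidates: List[Tuple[int, int]] = [
    (n_inputs, n_hidden)
    for n_hidden in range(1, 41)
    for n_inputs in range(1, 41)
    if count_weights(n_inputs, n_hidden) == 41
]
print(candidates)
```

Under these assumptions, one of the matching configurations is six inputs with five hidden units, which would line up with feeding the six key drivers from the random forest into the net. That is an inference from the log, not something the post states.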

Confusion Matrix

## # weights:  41
## initial  value 2523.180341 
## iter  10 value 2039.947608
## iter  20 value 2021.179624
## iter  30 value 2010.110831
## iter  40 value 2002.729564
## iter  50 value 1998.996950
## iter  60 value 1997.313435
## iter  70 value 1994.292583
## iter  80 value 1989.469108
## iter  90 value 1988.625653
## iter 100 value 1988.417170
## final  value 1988.417170 
## stopped after 100 iterations
##                red_win
## neural_predicts    0    1
##               0  294  216
##               1  892 1800
## [1] 65.39663
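
The headline metrics can be recomputed from the confusion-matrix counts printed in this post. The sketch below is plain Python; it treats a red-corner win (class 1) as the positive class, and the helper name `summarize` is made up for illustration.

```python
from typing import Dict


def summarize(tn: int, fp: int, fn: int, tp: int) -> Dict[str, float]:
    """Accuracy, precision and recall, with red_win = 1 as the positive class."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),    # of predicted red wins, share that were red wins
        "recall": tp / (tp + fn),       # of actual red wins, share that were predicted
        "blue_recall": tn / (tn + fp),  # of actual blue wins, share that were predicted
    }


# Counts copied from the random forest's OOB confusion matrix (rows = actual class).
random_forest = summarize(tn=279, fp=907, fn=267, tp=1749)
# Counts copied from the neural net's confusion matrix (rows = predicted class).
neural_net = summarize(tn=294, fp=892, fn=216, tp=1800)

# Baseline: always predict the red corner.
always_red_accuracy = (267 + 1749) / (279 + 907 + 267 + 1749)

for name, metrics in [("random forest", random_forest), ("neural net", neural_net)]:
    print(name, {key: round(value, 4) for key, value in metrics.items()})
print("always red", round(always_red_accuracy, 4))
```

On these counts, always predicting the red corner would be right about 63% of the time (2,016 of 3,202 fights), which is a useful yardstick for the 65% figure.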

65% accuracy is decent, although the model’s ability to predict upsets is far weaker than one would like from a predictive model. MMA is an inherently chaotic and surprising sport (part of why I love it so much), so I am honestly surprised that a model is able to predict with even this level of accuracy. If the community has refinements to my modelling or additional techniques to try, I would be very open to them.
