I have always wondered whether there is a magical set of traits that determines whether you win a fight. There has not been much research on this. I found two people who attempted a fight prediction model using Python. They both used a lot of machine learning techniques that are beyond me. The data is split up by red corner and blue corner; I refer to them as the red fighter and the blue fighter.
I obtained the data from Kaggle. It was raw data and needed a lot of cleaning: it had a total of 895 columns! I added many columns to help with the process. I added a column each for the red fighter and the blue fighter, with the value either Winner or Loser. I also made a column for weight division, computed the difference in age, weight, and height between the winner and the loser, and added a column for year. The columns I added were for use in graphs. The hardest columns to work with were the different types of strikes landed and attempted in each round. I decided to extract just those columns into a data frame using select_if(is.numeric). Then I wrote that data frame to a CSV file, “doit567.csv”, and used Excel to add up the columns for all the rounds per fighter into one total column. I was very proud of that. Body strikes landed and ground strikes landed per round are now Total Body Strikes Landed and Total Ground Strikes Landed per fight and per fighter (Red and Blue).
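The same per-round totals could also be computed in R instead of Excel. The following is only a minimal sketch on toy data: the per-round column names and the helper `total_over_rounds` are hypothetical (the real Kaggle headers differ), and it assumes that a missing value means the fight ended before that round.

```r
# Toy data with HYPOTHETICAL per-round column names (not the real Kaggle headers).
fights <- data.frame(
  B_Round1_Body_Strikes_Landed = c(3, 0, 5),
  B_Round2_Body_Strikes_Landed = c(4, NA, 2),
  B_Round3_Body_Strikes_Landed = c(4, NA, 1),
  R_Round1_Body_Strikes_Landed = c(10, 2, 0),
  R_Round2_Body_Strikes_Landed = c(20, NA, 7)
)

# Hypothetical helper: sum every per-round column for one corner and one stat.
# Assumption: NA means the round was never fought, so it counts as zero.
total_over_rounds <- function(df, corner, stat) {
  pattern <- paste0("^", corner, "_Round[0-9]+_", stat, "$")
  cols <- grep(pattern, names(df), value = TRUE)
  rowSums(df[, cols, drop = FALSE], na.rm = TRUE)
}

fights$B_Total_Body_Strikes_Landed <- total_over_rounds(fights, "B", "Body_Strikes_Landed")
fights$R_Total_Body_Strikes_Landed <- total_over_rounds(fights, "R", "Body_Strikes_Landed")

print(fights[, c("B_Total_Body_Strikes_Landed", "R_Total_Body_Strikes_Landed")])
```

Doing this step in code keeps the cleaning reproducible, since the totals can be regenerated from the raw file at any time.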
The chi-squared test had a significantly low p-value. I also checked every variable for normality, and each test returned a significantly low p-value. This is strong evidence against the null hypothesis in each case (independence for the chi-squared test, normality for the Shapiro-Wilk test). If any relationship exists, it is probably not a simple linear one.
## 'data.frame': 1443 obs. of 34 variables:
## $ B_Age : int 23 32 38 23 30 38 30 27 34 33 ...
## $ R_Age : int 27 29 32 25 28 30 30 30 31 28 ...
## $ B_Weight : num 185 154 154 123 134 154 154 154 154 170 ...
## $ R_Weight : num 185 154 154 123 134 154 154 154 154 170 ...
## $ B_Height : int 182 175 172 170 167 180 182 177 177 182 ...
## $ R_Height : int 187 182 177 175 170 180 187 180 177 182 ...
## $ winner : Factor w/ 2 levels "blue","red": 2 1 2 1 2 2 2 1 2 1 ...
## $ B_Total_Body_Strikes_Attempts : int 11 0 41 0 35 0 25 0 8 49 ...
## $ B_Total_Body_Strikes_Landed : int 11 0 27 0 21 0 15 0 3 33 ...
## $ R_Total_Body_Strikes_Attempts : int 65 0 17 128 90 61 50 2 21 107 ...
## $ R_Total_Body_Strikes_Landed : int 52 0 16 108 72 38 31 2 12 88 ...
## $ B_Total_Clinch_Strikes_Attempts: int 1 0 14 0 25 0 18 0 16 52 ...
## $ B_Total_Clinch_Strikes_Landed : int 1 0 12 0 10 0 12 0 8 42 ...
## $ R_Total_Clinch_Strikes_Attempts: int 32 0 14 101 173 129 30 23 25 196 ...
## $ R_Total_Clinch_Strikes_Landed : int 27 0 13 79 132 98 20 12 17 161 ...
## $ B_Total_Ground_Strikes_Attempts: int 63 0 12 0 107 0 44 0 4 54 ...
## $ B_Total_Ground_Strikes_Landed : int 47 0 9 0 84 0 37 0 3 41 ...
## $ R_Total_Ground_Strikes_Attempts: int 210 0 26 327 42 63 21 135 17 137 ...
## $ R_Total_Ground_Strikes_Landed : int 162 0 19 244 35 37 13 114 11 120 ...
## $ Year : num 2017 2014 2015 2016 2016 ...
## $ Class : Factor w/ 9 levels "Broke Weight",..: 8 6 6 3 4 6 6 6 6 7 ...
## $ theWinner : num 1 0 1 0 1 1 1 0 1 0 ...
## $ Red_Fighter : Factor w/ 2 levels "Loser","Winner": 2 1 2 1 2 2 2 1 2 1 ...
## $ Blue_Fighter : Factor w/ 2 levels "Loser","Winner": 1 2 1 2 1 1 1 2 1 2 ...
## $ winner_weight : num 185 154 154 123 134 154 154 154 154 170 ...
## $ loser_weight : num 185 154 154 123 134 154 154 154 154 170 ...
## $ Weight_Diff : num 0 0 0 0 0 0 0 0 0 0 ...
## $ winner_height : num 187 175 177 170 170 180 187 177 177 182 ...
## $ loser_height : num 182 182 172 175 167 180 182 180 177 182 ...
## $ Height_Diff : num 5 -7 5 -5 3 0 5 -3 0 0 ...
## $ winner_Age : num 27 32 32 23 28 30 30 27 31 33 ...
## $ loser_Age : num 23 29 38 25 30 38 30 30 34 28 ...
## $ Age_Diff : num 4 3 -6 -2 -2 -8 0 -3 -3 5 ...
## $ W.diff.abs : num 0 0 0 0 0 0 0 0 0 0 ...
##
## Pearson's Chi-squared test
##
## data: age_cor.tbl
## X-squared = 52.369, df = 29, p-value = 0.004957
##
## Shapiro-Wilk normality test
##
## data: ufc_df$Age_Diff
## W = 0.99547, p-value = 0.0002538
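For reference, the two tests shown above can be sketched in base R as follows. This uses simulated toy data, and it assumes that `age_cor.tbl` was a contingency table of age difference against winner; the report does not say how that table was built, so the construction here is a guess.

```r
set.seed(1)
n <- 500

# Simulated stand-in for ufc_df; the values are random, not real fight data.
ufc_toy <- data.frame(
  Age_Diff = sample(-10:10, n, replace = TRUE),
  winner   = factor(sample(c("blue", "red"), n, replace = TRUE))
)

# Assumed construction of the contingency table (age difference by winner).
age_tbl <- table(ufc_toy$Age_Diff, ufc_toy$winner)

# Chi-squared test of independence; the null hypothesis is "no association".
chi <- chisq.test(age_tbl)

# Shapiro-Wilk test; the null hypothesis is "the sample is normally distributed".
sw <- shapiro.test(ufc_toy$Age_Diff)

print(chi)
print(sw)
```

A low p-value from `shapiro.test` says only that the variable is not normally distributed; it does not by itself say anything about the relationship between two variables.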
This was tough. I am trying to predict a binary outcome, the winner (blue or red), so I went for logistic regression. However, the model wasn’t that great. I got an AIC of 1919. After removing some variables, I got the AIC down to 1883, but the logistic regression model still wasn’t good. After I converted the coefficients to odds ratios, they were almost one for both fighters. Reading the output with theWinner = 1 as a red win, each extra year of blue-fighter age multiplies the odds of a red win by about 1.11, and each extra year of red-fighter age multiplies them by about 0.94, so the older fighter is slightly less likely to win.
##
## Call:
## glm(formula = theWinner ~ ., family = binomial(link = "logit"),
## data = ufc_log_reg_df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2643 -1.1676 0.8764 0.9762 1.5292
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.214e+00 1.307e+00 0.928 0.3532
## B_Age 1.052e-01 1.538e-02 6.841 7.87e-12 ***
## R_Age -6.050e-02 1.469e-02 -4.119 3.80e-05 ***
## B_Height -7.832e-03 1.026e-02 -0.763 0.4453
## R_Height -3.483e-03 1.012e-02 -0.344 0.7306
## B_Total_Body_Strikes_Attempts -2.376e-03 8.013e-03 -0.297 0.7668
## B_Total_Body_Strikes_Landed -9.987e-05 1.045e-02 -0.010 0.9924
## R_Total_Body_Strikes_Attempts -1.894e-02 7.973e-03 -2.376 0.0175 *
## R_Total_Body_Strikes_Landed 2.604e-02 1.037e-02 2.511 0.0120 *
## B_Total_Clinch_Strikes_Attempts -1.261e-02 1.229e-02 -1.026 0.3050
## B_Total_Clinch_Strikes_Landed 1.483e-02 1.487e-02 0.997 0.3188
## R_Total_Clinch_Strikes_Attempts 1.306e-03 1.189e-02 0.110 0.9126
## R_Total_Clinch_Strikes_Landed -2.197e-03 1.475e-02 -0.149 0.8816
## Weight_Diff 1.006e-02 7.647e-03 1.316 0.1883
## Height_Diff -4.083e-03 9.572e-03 -0.427 0.6697
## Age_Diff 6.548e-02 1.188e-02 5.510 3.58e-08 ***
## W.diff.abs -6.771e-03 8.536e-03 -0.793 0.4276
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1944.6 on 1442 degrees of freedom
## Residual deviance: 1849.9 on 1426 degrees of freedom
## AIC: 1883.9
##
## Number of Fisher Scoring iterations: 4
## (Intercept) B_Age
## 3.3655434 1.1109758
## R_Age B_Height
## 0.9412941 0.9921988
## R_Height B_Total_Body_Strikes_Attempts
## 0.9965228 0.9976267
## B_Total_Body_Strikes_Landed R_Total_Body_Strikes_Attempts
## 0.9999001 0.9812372
## R_Total_Body_Strikes_Landed B_Total_Clinch_Strikes_Attempts
## 1.0263773 0.9874718
## B_Total_Clinch_Strikes_Landed R_Total_Clinch_Strikes_Attempts
## 1.0149378 1.0013066
## R_Total_Clinch_Strikes_Landed Weight_Diff
## 0.9978050 1.0101102
## Height_Diff Age_Diff
## 0.9959255 1.0676760
## W.diff.abs
## 0.9932518
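As a quick check on the two outputs above, the second block matches the exponentiated coefficients of the first (the odds ratios), and the reported AIC matches the residual deviance plus twice the number of estimated coefficients. The count k = 17 (intercept plus 16 predictors) is read off the coefficient table.

```latex
\begin{align*}
e^{0.1052}  &\approx 1.1110 && \text{(B\_Age)} \\
e^{-0.0605} &\approx 0.9413 && \text{(R\_Age)} \\
e^{0.06548} &\approx 1.0677 && \text{(Age\_Diff)} \\[4pt]
\mathrm{AIC} &= D_{\text{residual}} + 2k = 1849.9 + 2 \times 17 = 1883.9
\end{align*}
```

An odds ratio of 1.11 means the odds of the modelled outcome change by about 11% per unit of the predictor, which is why values this close to one indicate weak effects.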
I then went directly to a decision tree model. I also used only striking data greater than zero to see if I would get better results. Using the whole data set for the decision tree did not give me a nice result: I had many nodes and some were confusing, and the tree centered on the ages of the fighters and their age difference. I decided to try one more thing. I filtered out all the rows where the weight difference was zero and got a nicer tree. I will use this one in the presentation. I will also try a random forest model.
## n= 303
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 303 123 red (0.40594059 0.59405941)
## 2) B_Total_Body_Strikes_Landed>=48.5 30 9 blue (0.70000000 0.30000000)
## 4) B_Total_Body_Strikes_Attempts< 125.5 21 3 blue (0.85714286 0.14285714) *
## 5) B_Total_Body_Strikes_Attempts>=125.5 9 3 red (0.33333333 0.66666667) *
## 3) B_Total_Body_Strikes_Landed< 48.5 273 102 red (0.37362637 0.62637363)
## 6) Class=Broke Weight,Flyweight,Featherweight,Middleweight 76 36 blue (0.52631579 0.47368421)
## 12) B_Total_Ground_Strikes_Landed< 25.5 61 25 blue (0.59016393 0.40983607)
## 24) Weight_Diff< 12 40 13 blue (0.67500000 0.32500000)
## 48) R_Weight>=177.5 11 0 blue (1.00000000 0.00000000) *
## 49) R_Weight< 177.5 29 13 blue (0.55172414 0.44827586)
## 98) B_Height< 173.5 18 5 blue (0.72222222 0.27777778) *
## 99) B_Height>=173.5 11 3 red (0.27272727 0.72727273) *
## 25) Weight_Diff>=12 21 9 red (0.42857143 0.57142857)
## 50) R_Weight< 177.5 13 4 blue (0.69230769 0.30769231) *
## 51) R_Weight>=177.5 8 0 red (0.00000000 1.00000000) *
## 13) B_Total_Ground_Strikes_Landed>=25.5 15 4 red (0.26666667 0.73333333) *
## 7) Class=Bantamweight,Lightweight,Welterweight,Light Heavyweight,Heavyweight 197 62 red (0.31472081 0.68527919)
## 14) B_Total_Clinch_Strikes_Landed>=12.5 44 21 red (0.47727273 0.52272727)
## 28) Height_Diff>=-7.5 37 16 blue (0.56756757 0.43243243)
## 56) B_Height>=173.5 29 9 blue (0.68965517 0.31034483) *
## 57) B_Height< 173.5 8 1 red (0.12500000 0.87500000) *
## 29) Height_Diff< -7.5 7 0 red (0.00000000 1.00000000) *
## 15) B_Total_Clinch_Strikes_Landed< 12.5 153 41 red (0.26797386 0.73202614)
## 30) B_Age< 32.5 81 29 red (0.35802469 0.64197531)
## 60) Age_Diff< 0.5 35 14 blue (0.60000000 0.40000000)
## 120) R_Age>=30.5 18 1 blue (0.94444444 0.05555556) *
## 121) R_Age< 30.5 17 4 red (0.23529412 0.76470588) *
## 61) Age_Diff>=0.5 46 8 red (0.17391304 0.82608696)
## 122) R_Age< 28.5 8 2 blue (0.75000000 0.25000000) *
## 123) R_Age>=28.5 38 2 red (0.05263158 0.94736842) *
## 31) B_Age>=32.5 72 12 red (0.16666667 0.83333333) *
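A tree like the one above could be fitted along these lines. This is a sketch on simulated data, assuming the `rpart` package was used (the printed output has its format, but the report does not name the package); the toy columns only borrow names from the real data set.

```r
library(rpart)

set.seed(42)
n <- 400

# Simulated stand-in for the UFC data frame; values are random, not real fights.
toy <- data.frame(
  B_Age = sample(20:40, n, replace = TRUE),
  R_Age = sample(20:40, n, replace = TRUE),
  Weight_Diff = sample(c(0, 0, 0, -15, -5, 5, 15), n, replace = TRUE),
  B_Total_Body_Strikes_Landed = rpois(n, 20)
)
toy$winner <- factor(
  ifelse(toy$B_Total_Body_Strikes_Landed + rnorm(n, sd = 8) > 22, "blue", "red")
)

# Keep only fights where the weight difference is not zero, as described above.
nonzero <- subset(toy, Weight_Diff != 0)

# Classification tree with every remaining column as a candidate predictor.
fit <- rpart(winner ~ ., data = nonzero, method = "class")
print(fit)
```

In the printed tree, each line shows the split, the number of fights in the node, the number misclassified, the predicted class, and the class probabilities in the order (blue, red).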
I used a random forest model on the UFC data. I used the numeric variables I believed wouldn’t cause faulty outputs in my model. After I fed everything into the model, it came back with a confusion matrix and an out-of-bag error rate of about 42%. This is not a good model. It predicts the red fighter’s wins much better than the blue fighter’s: the class error is 75% for blue versus about 21% for red.
##
## Call:
## randomForest(x = ufcX, y = ufcY)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 42.62%
## Confusion matrix:
## blue red class.error
## blue 145 435 0.7500000
## red 180 683 0.2085747
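The error rates in this output can be recomputed directly from the confusion matrix. The sketch below retypes the four counts printed above (rows are the actual winner, columns the predicted winner) and also computes, as an added comparison, the error of always predicting the more common class.

```r
# Counts copied from the confusion matrix above (rows = actual, columns = predicted).
conf <- matrix(
  c(145, 180, 435, 683),
  nrow = 2,
  dimnames = list(actual = c("blue", "red"), predicted = c("blue", "red"))
)

# Overall error: share of fights that fall off the diagonal.
overall_error <- 1 - sum(diag(conf)) / sum(conf)

# Per-class error: share of each actual class that was predicted wrongly.
class_error <- 1 - diag(conf) / rowSums(conf)

# Baseline for comparison: always predict the more common actual class.
baseline_error <- 1 - max(rowSums(conf)) / sum(conf)

print(round(overall_error, 4))
print(round(class_error, 4))
print(round(baseline_error, 4))
```

On these counts the baseline error comes out at about 40%, slightly lower than the forest's 42.62%, which supports the conclusion that the model is not a good one.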
The results were not as reliable as I would have liked. I had to take subsets of the data to get meaningful results. For example, I had to keep only the weight differences that are not zero in order to get a meaningful decision tree. I only used certain variables in each model, and only certain value ranges of those variables. So overall, there wasn’t a good trend.
My model can only predict the outcome of a fight if both fighters’ data fall within the restrictions I put on the variables. For example, if the fighters have the exact same weight, then my decision tree wouldn’t work. The random forest model is the best model in that sense, because it has no data range restriction.
This data was scraped from the FightMetric website. It has a lot of UFC fight and fighter data; however, it is not accessible to just anyone for analysis. You have to get access to their API, and to get that you have to be a graduate researcher. I wanted to put in a request, but it would take months. I would have liked additional data such as each fighter’s discipline, best winning streak, and updated win-loss record. The data set I have has no information on the fighters before 2014. If I had a bigger dataset, then maybe I would see a better trend.
Not really. I chose to use only the fights where the weight difference is not zero, because in almost all fights the two fighters weigh exactly the same. The only weight division where you see some type of trend is the heavyweight division. This is because the division ranges from 205 lbs to 265 lbs.
The age analysis didn’t show any trends. The heavier the red fighter is, the more likely the red fighter is to win. If we had more data, then a trend might appear.
It depends on what ranges of traits the fighters have. Is their weight difference zero? If not, we can use the decision tree model. If so, we can use the random forest. Age difference doesn’t seem to have any effect.