Question: Are there any traits, such as the age difference or weight difference between fighters, that can predict the outcome of a fight?

1. How did you develop your question and what relevant research has already been completed on this topic?

I have always wondered whether there is a magical set of traits that tells whether you will win a fight. There has not been much research on this. I found two people attempting fight prediction models using Python. They both used many machine learning techniques that are beyond my current level. The data is split up by red corner and blue corner; I refer to the two sides as the red fighter and the blue fighter.

2. How did you gather and prepare the data for analysis?

I obtained the data from Kaggle. It was raw data and needed a lot of cleaning: it had a total of 895 columns! I added several columns to help with the analysis and the graphs. I added a column each for the red fighter and the blue fighter, with a value of either Winner or Loser. I also added columns for weight division and year, and I computed the differences in age, weight, and height between the winner and the loser. The hardest columns to work with were the different types of strikes landed and attempted in each round. I extracted just those columns into a data frame using select_if(is.numeric), wrote that data frame to a CSV file, “doit567.csv”, and used Excel to sum the per-round columns into one total column per fighter. I was very proud of that. For example, body strikes landed and ground strikes landed per round are now Total Body Strikes Landed and Total Ground Strikes Landed per fight and per fighter (red and blue).
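The derived columns described above can be sketched in a few lines. This is an illustrative Python sketch, not the R and Excel steps actually used for this report. The per-round column names (`B_Round1_Body_Strikes_Landed` and so on) and the per-round values are hypothetical stand-ins. The differences are assumed to be winner minus loser, which matches the first rows of the data summary for age and height.

```python
def total_over_rounds(row, corner, stat, n_rounds=5):
    """Sum one per-round stat into a single fight total for one corner."""
    total = 0
    for rnd in range(1, n_rounds + 1):
        # Hypothetical naming scheme; the real Kaggle columns are named differently.
        value = row.get(f"{corner}_Round{rnd}_{stat}")
        # Rounds that were never fought are missing or empty; count them as 0.
        if value not in (None, ""):
            total += int(value)
    return total


def add_derived_columns(row):
    """Return a copy of the row with totals, a 0/1 outcome, and differences."""
    out = dict(row)
    for corner in ("B", "R"):
        out[f"{corner}_Total_Body_Strikes_Landed"] = total_over_rounds(
            row, corner, "Body_Strikes_Landed"
        )
    red_won = row["winner"] == "red"
    win, lose = ("R", "B") if red_won else ("B", "R")
    out["theWinner"] = 1 if red_won else 0
    # Assumption: each difference is winner minus loser.
    out["Age_Diff"] = row[f"{win}_Age"] - row[f"{lose}_Age"]
    out["Height_Diff"] = row[f"{win}_Height"] - row[f"{lose}_Height"]
    out["Weight_Diff"] = row[f"{win}_Weight"] - row[f"{lose}_Weight"]
    return out


# One fight: ages, heights, and weights follow the first row of the summary;
# the split of strikes across rounds is made up for the example.
fight = {
    "winner": "red",
    "B_Age": 23, "R_Age": 27,
    "B_Height": 182, "R_Height": 187,
    "B_Weight": 185, "R_Weight": 185,
    "B_Round1_Body_Strikes_Landed": "4",
    "B_Round2_Body_Strikes_Landed": "7",
    "R_Round1_Body_Strikes_Landed": "20",
    "R_Round2_Body_Strikes_Landed": "32",
}
prepared = add_derived_columns(fight)
print(prepared["Age_Diff"], prepared["Height_Diff"], prepared["Weight_Diff"])
```

Doing the summing in code rather than in a spreadsheet makes the step repeatable if the raw file is downloaded again.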

Correlation

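The chi-squared test on the age table had a low p-value (about 0.005). I also checked each variable for normality, and every Shapiro-Wilk test returned a very low p-value; the result for the age difference is shown as an example. This is strong evidence against the null hypotheses of independence (chi-squared) and normality (Shapiro-Wilk). If any relationship exists, it is probably not a simple linear one.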

## 'data.frame':    1443 obs. of  34 variables:
##  $ B_Age                          : int  23 32 38 23 30 38 30 27 34 33 ...
##  $ R_Age                          : int  27 29 32 25 28 30 30 30 31 28 ...
##  $ B_Weight                       : num  185 154 154 123 134 154 154 154 154 170 ...
##  $ R_Weight                       : num  185 154 154 123 134 154 154 154 154 170 ...
##  $ B_Height                       : int  182 175 172 170 167 180 182 177 177 182 ...
##  $ R_Height                       : int  187 182 177 175 170 180 187 180 177 182 ...
##  $ winner                         : Factor w/ 2 levels "blue","red": 2 1 2 1 2 2 2 1 2 1 ...
##  $ B_Total_Body_Strikes_Attempts  : int  11 0 41 0 35 0 25 0 8 49 ...
##  $ B_Total_Body_Strikes_Landed    : int  11 0 27 0 21 0 15 0 3 33 ...
##  $ R_Total_Body_Strikes_Attempts  : int  65 0 17 128 90 61 50 2 21 107 ...
##  $ R_Total_Body_Strikes_Landed    : int  52 0 16 108 72 38 31 2 12 88 ...
##  $ B_Total_Clinch_Strikes_Attempts: int  1 0 14 0 25 0 18 0 16 52 ...
##  $ B_Total_Clinch_Strikes_Landed  : int  1 0 12 0 10 0 12 0 8 42 ...
##  $ R_Total_Clinch_Strikes_Attempts: int  32 0 14 101 173 129 30 23 25 196 ...
##  $ R_Total_Clinch_Strikes_Landed  : int  27 0 13 79 132 98 20 12 17 161 ...
##  $ B_Total_Ground_Strikes_Attempts: int  63 0 12 0 107 0 44 0 4 54 ...
##  $ B_Total_Ground_Strikes_Landed  : int  47 0 9 0 84 0 37 0 3 41 ...
##  $ R_Total_Ground_Strikes_Attempts: int  210 0 26 327 42 63 21 135 17 137 ...
##  $ R_Total_Ground_Strikes_Landed  : int  162 0 19 244 35 37 13 114 11 120 ...
##  $ Year                           : num  2017 2014 2015 2016 2016 ...
##  $ Class                          : Factor w/ 9 levels "Broke Weight",..: 8 6 6 3 4 6 6 6 6 7 ...
##  $ theWinner                      : num  1 0 1 0 1 1 1 0 1 0 ...
##  $ Red_Fighter                    : Factor w/ 2 levels "Loser","Winner": 2 1 2 1 2 2 2 1 2 1 ...
##  $ Blue_Fighter                   : Factor w/ 2 levels "Loser","Winner": 1 2 1 2 1 1 1 2 1 2 ...
##  $ winner_weight                  : num  185 154 154 123 134 154 154 154 154 170 ...
##  $ loser_weight                   : num  185 154 154 123 134 154 154 154 154 170 ...
##  $ Weight_Diff                    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ winner_height                  : num  187 175 177 170 170 180 187 177 177 182 ...
##  $ loser_height                   : num  182 182 172 175 167 180 182 180 177 182 ...
##  $ Height_Diff                    : num  5 -7 5 -5 3 0 5 -3 0 0 ...
##  $ winner_Age                     : num  27 32 32 23 28 30 30 27 31 33 ...
##  $ loser_Age                      : num  23 29 38 25 30 38 30 30 34 28 ...
##  $ Age_Diff                       : num  4 3 -6 -2 -2 -8 0 -3 -3 5 ...
##  $ W.diff.abs                     : num  0 0 0 0 0 0 0 0 0 0 ...

## 
##  Pearson's Chi-squared test
## 
## data:  age_cor.tbl
## X-squared = 52.369, df = 29, p-value = 0.004957

## 
##  Shapiro-Wilk normality test
## 
## data:  ufc_df$Age_Diff
## W = 0.99547, p-value = 0.0002538
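For readers unfamiliar with the chi-squared output above, here is a minimal Python sketch of how the statistic and the degrees of freedom are computed from a table of counts. The counts are made up, because the contents of `age_cor.tbl` are not shown in this report. The reported df = 29 is consistent with a table of 2 rows by 30 columns (or the transpose), since df = (rows - 1) x (columns - 1).

```python
def chi_squared(table):
    """Pearson chi-squared statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


# Made-up counts: rows could be the winner's corner, columns three age bands.
toy_table = [[10, 20, 30], [20, 20, 20]]
stat, df = chi_squared(toy_table)
print(round(stat, 3), df)
```

Turning the statistic into a p-value needs the chi-squared distribution, which is left out here to keep the sketch dependency-free.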

3. How did you select and determine the correct regression model to answer your question?

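This was tough. I am trying to predict a binary outcome, the winner (blue or red), so I went with logistic regression. However, the model wasn’t that great. The first fit had an AIC of 1919. After removing some variables, I got the AIC down to 1883.9, but the model still wasn’t good. After I exponentiated the coefficients, most of the odds ratios were close to one. The clearest effects are for age. Because theWinner is 1 when the red fighter wins, the odds ratio of about 1.11 for B_Age means each extra year of age for the blue fighter raises the odds of a red win by about 11%, and the odds ratio of about 0.94 for R_Age means each extra year for the red fighter lowers those odds by about 6%, holding the other variables fixed. In this model, then, being older slightly hurts a fighter’s chances, whichever corner he is in.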

Logistic regression

## 
## Call:
## glm(formula = theWinner ~ ., family = binomial(link = "logit"), 
##     data = ufc_log_reg_df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2643  -1.1676   0.8764   0.9762   1.5292  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                      1.214e+00  1.307e+00   0.928   0.3532    
## B_Age                            1.052e-01  1.538e-02   6.841 7.87e-12 ***
## R_Age                           -6.050e-02  1.469e-02  -4.119 3.80e-05 ***
## B_Height                        -7.832e-03  1.026e-02  -0.763   0.4453    
## R_Height                        -3.483e-03  1.012e-02  -0.344   0.7306    
## B_Total_Body_Strikes_Attempts   -2.376e-03  8.013e-03  -0.297   0.7668    
## B_Total_Body_Strikes_Landed     -9.987e-05  1.045e-02  -0.010   0.9924    
## R_Total_Body_Strikes_Attempts   -1.894e-02  7.973e-03  -2.376   0.0175 *  
## R_Total_Body_Strikes_Landed      2.604e-02  1.037e-02   2.511   0.0120 *  
## B_Total_Clinch_Strikes_Attempts -1.261e-02  1.229e-02  -1.026   0.3050    
## B_Total_Clinch_Strikes_Landed    1.483e-02  1.487e-02   0.997   0.3188    
## R_Total_Clinch_Strikes_Attempts  1.306e-03  1.189e-02   0.110   0.9126    
## R_Total_Clinch_Strikes_Landed   -2.197e-03  1.475e-02  -0.149   0.8816    
## Weight_Diff                      1.006e-02  7.647e-03   1.316   0.1883    
## Height_Diff                     -4.083e-03  9.572e-03  -0.427   0.6697    
## Age_Diff                         6.548e-02  1.188e-02   5.510 3.58e-08 ***
## W.diff.abs                      -6.771e-03  8.536e-03  -0.793   0.4276    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1944.6  on 1442  degrees of freedom
## Residual deviance: 1849.9  on 1426  degrees of freedom
## AIC: 1883.9
## 
## Number of Fisher Scoring iterations: 4

##                     (Intercept)                           B_Age 
##                       3.3655434                       1.1109758 
##                           R_Age                        B_Height 
##                       0.9412941                       0.9921988 
##                        R_Height   B_Total_Body_Strikes_Attempts 
##                       0.9965228                       0.9976267 
##     B_Total_Body_Strikes_Landed   R_Total_Body_Strikes_Attempts 
##                       0.9999001                       0.9812372 
##     R_Total_Body_Strikes_Landed B_Total_Clinch_Strikes_Attempts 
##                       1.0263773                       0.9874718 
##   B_Total_Clinch_Strikes_Landed R_Total_Clinch_Strikes_Attempts 
##                       1.0149378                       1.0013066 
##   R_Total_Clinch_Strikes_Landed                     Weight_Diff 
##                       0.9978050                       1.0101102 
##                     Height_Diff                        Age_Diff 
##                       0.9959255                       1.0676760 
##                      W.diff.abs 
##                       0.9932518
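The second table above is the first one exponentiated: a logistic regression coefficient is a change in log-odds, and exp(coefficient) is the corresponding odds ratio. A short Python sketch of that conversion follows, using three coefficients copied from the summary. It is an illustration, not the R call used for this report, and the small differences from the printed odds ratios come from the rounding of the coefficients.

```python
import math

# Coefficients copied from the glm summary (log-odds of theWinner == 1,
# which corresponds to a red win in the data summary).
coefficients = {"B_Age": 0.1052, "R_Age": -0.0605, "Age_Diff": 0.06548}

# exp() turns a change in log-odds into a multiplicative change in odds.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    pct_change = (ratio - 1) * 100
    print(f"{name}: odds ratio {ratio:.3f}, {pct_change:+.1f}% odds of a red win per unit")
```

An odds ratio above 1 raises the odds of a red win per one-unit increase in the variable; below 1 lowers them.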

Decision Tree Classification

Next, I went to a decision tree model. I used only striking data greater than zero to see whether I would get better results. Using the whole data set for the decision tree did not give a nice result: the tree had many nodes, and some were confusing. It centered on the ages of the fighters and their age difference. I then tried one more thing: I filtered out all fights where the weight difference was zero, which left 303 fights, and got a nicer tree. I will use this one in the presentation. I will also try a random forest model.

## n= 303 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 303 123 red (0.40594059 0.59405941)  
##     2) B_Total_Body_Strikes_Landed>=48.5 30   9 blue (0.70000000 0.30000000)  
##       4) B_Total_Body_Strikes_Attempts< 125.5 21   3 blue (0.85714286 0.14285714) *
##       5) B_Total_Body_Strikes_Attempts>=125.5 9   3 red (0.33333333 0.66666667) *
##     3) B_Total_Body_Strikes_Landed< 48.5 273 102 red (0.37362637 0.62637363)  
##       6) Class=Broke Weight,Flyweight,Featherweight,Middleweight 76  36 blue (0.52631579 0.47368421)  
##        12) B_Total_Ground_Strikes_Landed< 25.5 61  25 blue (0.59016393 0.40983607)  
##          24) Weight_Diff< 12 40  13 blue (0.67500000 0.32500000)  
##            48) R_Weight>=177.5 11   0 blue (1.00000000 0.00000000) *
##            49) R_Weight< 177.5 29  13 blue (0.55172414 0.44827586)  
##              98) B_Height< 173.5 18   5 blue (0.72222222 0.27777778) *
##              99) B_Height>=173.5 11   3 red (0.27272727 0.72727273) *
##          25) Weight_Diff>=12 21   9 red (0.42857143 0.57142857)  
##            50) R_Weight< 177.5 13   4 blue (0.69230769 0.30769231) *
##            51) R_Weight>=177.5 8   0 red (0.00000000 1.00000000) *
##        13) B_Total_Ground_Strikes_Landed>=25.5 15   4 red (0.26666667 0.73333333) *
##       7) Class=Bantamweight,Lightweight,Welterweight,Light Heavyweight,Heavyweight 197  62 red (0.31472081 0.68527919)  
##        14) B_Total_Clinch_Strikes_Landed>=12.5 44  21 red (0.47727273 0.52272727)  
##          28) Height_Diff>=-7.5 37  16 blue (0.56756757 0.43243243)  
##            56) B_Height>=173.5 29   9 blue (0.68965517 0.31034483) *
##            57) B_Height< 173.5 8   1 red (0.12500000 0.87500000) *
##          29) Height_Diff< -7.5 7   0 red (0.00000000 1.00000000) *
##        15) B_Total_Clinch_Strikes_Landed< 12.5 153  41 red (0.26797386 0.73202614)  
##          30) B_Age< 32.5 81  29 red (0.35802469 0.64197531)  
##            60) Age_Diff< 0.5 35  14 blue (0.60000000 0.40000000)  
##             120) R_Age>=30.5 18   1 blue (0.94444444 0.05555556) *
##             121) R_Age< 30.5 17   4 red (0.23529412 0.76470588) *
##            61) Age_Diff>=0.5 46   8 red (0.17391304 0.82608696)  
##             122) R_Age< 28.5 8   2 blue (0.75000000 0.25000000) *
##             123) R_Age>=28.5 38   2 red (0.05263158 0.94736842) *
##          31) B_Age>=32.5 72  12 red (0.16666667 0.83333333) *
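Each line of the tree output lists the number of fights in the node (n), how many of them do not match the node's predicted class (loss), the predicted class (yval), and the class shares (yprob, blue first). The Python sketch below recomputes the share for three terminal nodes copied from the output; the helper name `leaf_summary` is made up for this example. These are shares within the 303 fights used to grow the tree, not accuracy on new fights, and several leaves hold fewer than 20 fights.

```python
def leaf_summary(n, loss, predicted):
    """Share of fights in a node whose outcome matches the node's predicted class."""
    return {"predicted": predicted, "n": n, "share_matching": (n - loss) / n}


# (n, loss, yval) copied from three terminal nodes of the tree output.
leaves = {4: (21, 3, "blue"), 48: (11, 0, "blue"), 123: (38, 2, "red")}

summaries = {node: leaf_summary(*values) for node, values in leaves.items()}
for node, info in summaries.items():
    print(f"node {node}: predicts {info['predicted']}, "
          f"{info['share_matching']:.3f} of {info['n']} fights match")
```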

Random Forest Model

I used a random forest model on the UFC data. I used the numeric variables that I believed wouldn’t cause faulty output in my model. The fitted model came back with a confusion matrix and an out-of-bag error rate of about 43%. This is not a good model: always predicting red would be wrong for 580 of the 1,443 fights, or about 40%. It also predicts the red fighter’s wins much better than the blue fighter’s, misclassifying 75% of blue wins but only about 21% of red wins.

## 
## Call:
##  randomForest(x = ufcX, y = ufcY) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 42.62%
## Confusion matrix:
##      blue red class.error
## blue  145 435   0.7500000
## red   180 683   0.2085747
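The error rates in this output can be recomputed directly from the four counts. The Python sketch below does that and adds a simple baseline for comparison: always predicting red, the more common winner. It is an illustration, not the R code used for this report. Rows of the matrix are the actual winners and columns are the predictions.

```python
# confusion[actual][predicted], copied from the randomForest output.
confusion = {
    "blue": {"blue": 145, "red": 435},
    "red": {"blue": 180, "red": 683},
}

total = sum(sum(row.values()) for row in confusion.values())
correct = sum(confusion[label][label] for label in confusion)
overall_error = (total - correct) / total

# Per-class error: share of each actual class that was predicted wrongly.
class_error = {
    label: 1 - confusion[label][label] / sum(confusion[label].values())
    for label in confusion
}

# Baseline: always predict red, so every actual blue win is an error.
baseline_error = sum(confusion["blue"].values()) / total

print(f"overall error:  {overall_error:.4f}")
print(f"class error:    {class_error}")
print(f"baseline error: {baseline_error:.4f}")
```

Comparing against the baseline shows how much the model adds over simply guessing the more frequent class.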

4. How reliable are your results?

The results are not as reliable as I would have liked. I had to take a subset of the data to get meaningful results. For example, I had to keep only the fights with a nonzero weight difference in order to get a meaningful decision tree. I used only certain variables in each model, and only certain value ranges of those variables. Overall, there wasn’t a strong trend.

5. What predictions can you make with your model? Examples

My model can only predict the outcome of a fight if both fighters’ data fall within the restrictions I put on the variables. For example, if the fighters have exactly the same weight, my decision tree wouldn’t work. The random forest model is the most widely applicable model because it has no data range restriction.

6. What additional information or analysis might improve your model results or work to control limitations?

This data was scraped from the FightMetric website. The site has a lot of UFC fight and fighter data; however, it is not accessible to just anyone for analysis. You have to get access to their API, and for that you have to be a graduate researcher. I wanted to put in a request, but it would take months. Additional data such as each fighter’s discipline, best winning streak, and an updated win-loss record would help. The data set I have has no information on the fighters before 2014. With a bigger data set, I might see a clearer trend.

Let’s do some analysis

Are heavier fighters more likely to win?

Not really. I chose to compare only the fights where the weight difference is not zero, because in almost all fights the two fighters weigh exactly the same. The only weight division where you see some type of trend is the heavyweight division. This is because the division ranges from 205 lbs to 265 lbs.

Does being older give you a better chance of winning?

The age analysis didn’t show any clear trends. The heavier the red fighter is, the more likely the red fighter is to win. With a bigger data set, a trend might appear.

The answer to the question.

It depends on what ranges of traits the fighters have. Is their weight difference zero? If not, we can use the decision tree model. If so, we can use the random forest model. Age difference doesn’t seem to have any effect.

CLopez_DS_Final

Cesar Lopez

12/7/2017