1. How did you develop your question and what relevant research has already been completed on this topic?

There has not been much prior research on this topic. I found two people who attempted a fight-prediction model using Python, and both used a lot of machine learning techniques that are above my current level. The data is split up by red corner and blue corner; I refer to them as the blue fighter and the red fighter.

2. How did you gather and prepare the data for analysis?

I obtained the data from Kaggle. It was raw data and needed a lot of cleaning; it had a total of 895 columns! I added many columns to help me with the process. I added a column for the red fighter and one for the blue fighter, where the value is either winner or loser. I also made a column for weight division, computed the differences in age, weight, and height between the winner and loser, and added a column for year. The columns I added were for use in the graphs. The hardest columns to work with were the different types of strikes landed and attempted per round. I decided to extract just those columns into a data frame using the select_if(is.numeric) function, then wrote that data frame to a CSV, “doit567.csv”, and used Excel to add up the per-round columns into one total column per fighter. I was very proud of that. Body strikes landed and ground strikes landed per round are now rolled up into total body strikes and total ground strikes landed per fight and per fighter (Red and Blue). A sketch of these steps appears below.
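Here is a minimal sketch of the cleaning steps described above, assuming a raw data frame named ufc_raw; the column names used here (Winner, Date, and the B_/R_ weight columns) are illustrative assumptions, not necessarily the exact names in the Kaggle file.

library(dplyr)

# Label each corner as winner or loser and add the difference and year columns
ufc_df <- ufc_raw %>%
  mutate(R_result = ifelse(Winner == "Red", "winner", "loser"),
         B_result = ifelse(Winner == "Blue", "winner", "loser"),
         Age_Diff    = B_Age - R_Age,
         Height_Diff = B_Height - R_Height,
         Weight_Diff = B_Weight - R_Weight,
         Year        = format(as.Date(Date), "%Y"))

# Pull out only the numeric strike columns and write them out for the Excel step
strikes_df <- ufc_raw %>% select_if(is.numeric)
write.csv(strikes_df, "doit567.csv", row.names = FALSE)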

Correlation

This test had a significantly low p-value. I also checked every variable for normality, and each Shapiro-Wilk test returned a significantly low p-value, which is strong evidence against the null hypothesis of normality. If there exists any relationship, it is definitely not linear.

# Chi-squared test of association between age difference and winner; R warns that the approximation may be incorrect (small expected cell counts)
age_cor.tbl <- table(ufc_df$Age_Diff, ufc_df$winner)

agex.tbl <- chisq.test(age_cor.tbl)
## Warning in chisq.test(age_cor.tbl): Chi-squared approximation may be
## incorrect
agex.tbl
## 
##  Pearson's Chi-squared test
## 
## data:  age_cor.tbl
## X-squared = 52.369, df = 29, p-value = 0.004957
#View(agex.tbl$residuals)
shapiro.test(ufc_df$Age_Diff)
## 
##  Shapiro-Wilk normality test
## 
## data:  ufc_df$Age_Diff
## W = 0.99547, p-value = 0.0002538

3. How did you select and determine the correct regression model to answer your question?

This was tough. I knew I was trying to predict a binary outcome, the winner (blue or red), so I went with logistic regression. However, the model wasn’t that great: I got an AIC of 1919. After removing some variables, I got the AIC down to 1883, but the logistic regression model still wasn’t good. After I converted the coefficients to odds ratios (see the sketch after the model summary below), the odds were almost one for both fighters: if the blue fighter is one year older, then the blue fighter is slightly more likely to win, and the same goes for the red fighter.

Logistic regression

## 
## Call:
## glm(formula = theWinner ~ ., family = binomial(link = "logit"), 
##     data = ufc_log_reg_df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2643  -1.1676   0.8764   0.9762   1.5292  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                      1.214e+00  1.307e+00   0.928   0.3532    
## B_Age                            1.052e-01  1.538e-02   6.841 7.87e-12 ***
## R_Age                           -6.050e-02  1.469e-02  -4.119 3.80e-05 ***
## B_Height                        -7.832e-03  1.026e-02  -0.763   0.4453    
## R_Height                        -3.483e-03  1.012e-02  -0.344   0.7306    
## B_Total_Body_Strikes_Attempts   -2.376e-03  8.013e-03  -0.297   0.7668    
## B_Total_Body_Strikes_Landed     -9.987e-05  1.045e-02  -0.010   0.9924    
## R_Total_Body_Strikes_Attempts   -1.894e-02  7.973e-03  -2.376   0.0175 *  
## R_Total_Body_Strikes_Landed      2.604e-02  1.037e-02   2.511   0.0120 *  
## B_Total_Clinch_Strikes_Attempts -1.261e-02  1.229e-02  -1.026   0.3050    
## B_Total_Clinch_Strikes_Landed    1.483e-02  1.487e-02   0.997   0.3188    
## R_Total_Clinch_Strikes_Attempts  1.306e-03  1.189e-02   0.110   0.9126    
## R_Total_Clinch_Strikes_Landed   -2.197e-03  1.475e-02  -0.149   0.8816    
## Weight_Diff                      1.006e-02  7.647e-03   1.316   0.1883    
## Height_Diff                     -4.083e-03  9.572e-03  -0.427   0.6697    
## Age_Diff                         6.548e-02  1.188e-02   5.510 3.58e-08 ***
## W.diff.abs                      -6.771e-03  8.536e-03  -0.793   0.4276    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1944.6  on 1442  degrees of freedom
## Residual deviance: 1849.9  on 1426  degrees of freedom
## AIC: 1883.9
## 
## Number of Fisher Scoring iterations: 4
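Here is a minimal sketch of the odds-ratio conversion mentioned above. The formula matches the Call in the summary, but the object name ufc_logit is an assumption.

# Refit the model from the Call above and convert log-odds to odds ratios
ufc_logit <- glm(theWinner ~ ., family = binomial(link = "logit"),
                 data = ufc_log_reg_df)
exp(coef(ufc_logit))  # e.g. exp(0.105) for B_Age is about 1.11, i.e. close to 1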

Decision Tree Classification

I went directly to a decision tree model. I also used only striking data greater than zero to see if I would get better results. Using the whole data set for the decision tree did not give me a nice result: there were many nodes and some were confusing, and the tree centered around the fighters’ ages and age differences. I decided to try one more thing: I filtered out all the fights where the weight difference was zero and got a nicer tree (a sketch of how it might be fit, and the resulting tree, follow). I will use this one in the presentation. I will also try a random forest model.
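Here is a minimal sketch of how the filtered tree might be fit with rpart; the original chunk was not shown, so the data frame name ufc_df and the response column winner are assumptions based on the variables in the printed tree.

library(rpart)

# Keep only the fights where the two fighters do not weigh exactly the same
ufc_tree_df <- subset(ufc_df, Weight_Diff != 0)

# Classification tree for the winning corner
ufc_tree <- rpart(winner ~ ., data = ufc_tree_df, method = "class")
ufc_tree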

## n= 303 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 303 123 red (0.40594059 0.59405941)  
##     2) B_Total_Body_Strikes_Landed>=48.5 30   9 blue (0.70000000 0.30000000)  
##       4) B_Total_Body_Strikes_Attempts< 125.5 21   3 blue (0.85714286 0.14285714) *
##       5) B_Total_Body_Strikes_Attempts>=125.5 9   3 red (0.33333333 0.66666667) *
##     3) B_Total_Body_Strikes_Landed< 48.5 273 102 red (0.37362637 0.62637363)  
##       6) Class=Broke Weight,Flyweight,Featherweight,Middleweight 76  36 blue (0.52631579 0.47368421)  
##        12) B_Total_Ground_Strikes_Landed< 25.5 61  25 blue (0.59016393 0.40983607)  
##          24) Weight_Diff< 12 40  13 blue (0.67500000 0.32500000)  
##            48) R_Weight>=177.5 11   0 blue (1.00000000 0.00000000) *
##            49) R_Weight< 177.5 29  13 blue (0.55172414 0.44827586)  
##              98) B_Height< 173.5 18   5 blue (0.72222222 0.27777778) *
##              99) B_Height>=173.5 11   3 red (0.27272727 0.72727273) *
##          25) Weight_Diff>=12 21   9 red (0.42857143 0.57142857)  
##            50) R_Weight< 177.5 13   4 blue (0.69230769 0.30769231) *
##            51) R_Weight>=177.5 8   0 red (0.00000000 1.00000000) *
##        13) B_Total_Ground_Strikes_Landed>=25.5 15   4 red (0.26666667 0.73333333) *
##       7) Class=Bantamweight,Lightweight,Welterweight,Light Heavyweight,Heavyweight 197  62 red (0.31472081 0.68527919)  
##        14) B_Total_Clinch_Strikes_Landed>=12.5 44  21 red (0.47727273 0.52272727)  
##          28) Height_Diff>=-7.5 37  16 blue (0.56756757 0.43243243)  
##            56) B_Height>=173.5 29   9 blue (0.68965517 0.31034483) *
##            57) B_Height< 173.5 8   1 red (0.12500000 0.87500000) *
##          29) Height_Diff< -7.5 7   0 red (0.00000000 1.00000000) *
##        15) B_Total_Clinch_Strikes_Landed< 12.5 153  41 red (0.26797386 0.73202614)  
##          30) B_Age< 32.5 81  29 red (0.35802469 0.64197531)  
##            60) Age_Diff< 0.5 35  14 blue (0.60000000 0.40000000)  
##             120) R_Age>=30.5 18   1 blue (0.94444444 0.05555556) *
##             121) R_Age< 30.5 17   4 red (0.23529412 0.76470588) *
##            61) Age_Diff>=0.5 46   8 red (0.17391304 0.82608696)  
##             122) R_Age< 28.5 8   2 blue (0.75000000 0.25000000) *
##             123) R_Age>=28.5 38   2 red (0.05263158 0.94736842) *
##          31) B_Age>=32.5 72  12 red (0.16666667 0.83333333) *

Random Forest Model

I used a random forest model on the UFC data. I used the many numeric variables that I believed wouldn’t cause faulty outputs in my model. After I fed all of the information into the model, it came back with a prediction/confusion matrix with an out-of-bag error rate of about 42%. This is not a good model. It predicts the red fighter’s outcome much better than the blue fighter’s.
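Here is a minimal sketch of the fit whose output appears below. The predictor matrix ufcX and factor response ufcY match the Call, but exactly which columns went into ufcX, and the seed, are assumptions.

library(randomForest)

# Numeric predictors only, with the winning corner as a factor response
ufcX <- ufc_df[sapply(ufc_df, is.numeric)]
ufcY <- as.factor(ufc_df$winner)

set.seed(1)  # assumed for reproducibility; the original chunk was not shown
ufc_rf <- randomForest(x = ufcX, y = ufcY)
ufc_rf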

## 
## Call:
##  randomForest(x = ufcX, y = ufcY) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 42.55%
## Confusion matrix:
##      blue red class.error
## blue  146 434   0.7482759
## red   180 683   0.2085747

4. How reliable are your results?

The results are not as reliable as I would like them to be. I had to take a subset of the data to get meaningful results; for example, I had to keep only the fights where the weight difference is not zero in order to have a meaningful decision tree. I only used certain variables in each model, and only certain value ranges from those variables. So overall, there wasn’t a good trend.

5. What predictions can you make with your model? Examples

My model can only predict the outcome of a fight if both fighters’ data fall within the restrictions I put on the variables. For example, if the fighters have the exact same weight, then my decision tree wouldn’t work. The random forest model is the best model in this respect because it has no data-range restrictions; a sketch of making a prediction with it follows.
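Here is a minimal sketch of scoring a fight with the random forest, assuming the ufc_rf and ufcX objects from above; the example row simply reuses an existing fight as illustrative new data, not a real upcoming bout.

# Pretend the first fight in the data is a new, upcoming fight
new_fight <- ufcX[1, ]

# Predicted winning corner and the class probabilities
predict(ufc_rf, newdata = new_fight)
predict(ufc_rf, newdata = new_fight, type = "prob")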

6. What additional information or analysis might improve your model results or work to control limitations?

This data was scraped from the FightMetric website. It has a lot of UFC fight and fighter data; however, it is not accessible to just anyone for analysis: you have to use their API, and to get API access you have to be a graduate researcher. I wanted to put in a request, but it would take months. Additional data, such as each fighter’s discipline, best winning streak, and an updated win-loss record, would help. The data set I have has no information on the fighters before 2014. If I had a bigger dataset, then maybe I would see a better trend.

Let’s do some analysis

Are fighters with higher weight more likely to win?

Not really. I chose to compare only the fights where the weight difference is not equal to zero, because almost all fights have two fighters who weigh exactly the same. The only weight division where you see some type of trend is the heavyweight division. This is because that division ranges from 205 lbs to 265 lbs.
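Here is a minimal sketch of the kind of graph this comparison is based on, assuming ggplot2 and the Weight_Diff, winner, and Class columns used earlier; the exact plot was not shown.

library(ggplot2)

# Weight difference (non-zero fights only) by winning corner, split by division
ggplot(subset(ufc_df, Weight_Diff != 0),
       aes(x = winner, y = Weight_Diff)) +
  geom_boxplot() +
  facet_wrap(~ Class) +
  labs(x = "Winning corner", y = "Weight difference (lbs)")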

Does being older give you a better chance at winning?

The age analysis didn’t show any trends. The heavier the red fighter is, the more likely the red fighter is to win. If we had more data, then a trend might appear.