What makes a CHAMPION? – An analysis on factors affecting the outcome of a fight

Introduction

The Ultimate Fighting Championship (UFC) is one of the biggest, if not the biggest, mixed martial arts (MMA) promotions in the world. Since the UFC was founded, the popularity of MMA has grown to rival that of boxing, at times even surpassing it.

In a UFC fight, two fighters step into an octagonal cage determined to prove who is the better fighter. The rules are rather open: a fighter can incorporate boxing, kickboxing, Muay Thai, Sanshou, Brazilian Jiu-Jitsu, Judo, and more. Punches, kicks, elbows, knees, clinching, and grappling are all allowed. Fights are typically scheduled for 3 rounds in preliminary events and 5 rounds in main events, with each round usually lasting five minutes.

Motivation

People love to drink and gamble while watching all manner of sports, with expert and amateur fans alike frequently trying to predict the results. When it comes to fighting, the number one advantage is reach. As Coach Jeff Ruth says, “If you cannot touch your opponent, then what good is your technique?” Speed and power are of course crucial. Yet precision beats power and timing beats speed, says the reigning UFC lightweight champion, the Notorious Conor McGregor. Needless to say, various factors might give a fighter an edge, but none is absolute. What distinguishes a top fighter from the rest? Is there a factor that outweighs the others? So far, there has been little scientific research on the factors contributing to the outcome of a fight (win/lose/draw, knockout, technical knockout, submission, etc.), so we want to run a more formal analysis of those factors. While the result of a UFC fight is judged round by round, we want to look at how overall performance relates to the result.

About the Data

We found a dataset on Kaggle containing 1477 matchups from recent years. For each matchup, the dataset lists basic information about the two fighters side by side, along with detailed statistics from the fight immediately preceding the matchup for each fighter. Round-by-round performance is included for each fighter.
Source: https://www.kaggle.com/calmdownkarm/ufcdataset

ufc_raw <- read.csv("ufc fight data.csv")  # read the file once and reuse it
str(ufc_raw)
summary(ufc_raw)
head(ufc_raw, n=1)

Preparation of the Data

For our project, we created several data frames from the original file, each convenient for a different analysis.

We first picked out the variables we are most interested in and created a data frame. For the variables related to fight statistics, we summed them across all rounds. We also added a new column, BMI, since we think that at the same weight the taller fighter (who likely has a longer reach) has some advantage. We also roughly combined the weight divisions into 5 classes for further use. Rows with NA values were removed. We then stacked the information about the fighters from the blue corner and the red corner together.

The 28 variables are:
- Name
- PreFights, number of fights previous to the matchup
- Streak, consecutive wins previous to the matchup
- Age, measured in years
- Height, measured in cm
- Hometown
- Weight, measured in kg
- A.Sub, attempted submissions
- A.TD, attempted takedowns
- TD, successful takedowns
- A.Body, attempted total body strikes
- Body, landed total body strikes
- A.Head, attempted total head strikes
- Head, landed total head strikes
- A.Leg, attempted total leg strikes
- Leg, landed total leg strikes
- A.Kick, attempted total kicks
- Kick, landed total kicks
- A.Punch, attempted total punches
- Punch, landed total punches
- A.Strike, attempted total strikes
- Strike, landed total strikes
- KD, landed knockdowns
- Result, win or lose (draw and no contest are removed)
- By, method of victory: decision, submission, or KO/TKO
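- Rounds, maximum rounds of the fight
- BMI, body mass index, computed as weight in kg divided by squared height in m
- class, combined weight class, coded 1 to 5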

# ufcfightdata.df<-data.frame(read.csv("ufc fight data.csv"))
# ## Blue corner
# subm.attp1<-rowSums(cbind(ufcfightdata.df[, c(12, 99, 186, 273, 360)]), na.rm = TRUE)
# td.attp1<-rowSums(cbind(ufcfightdata.df[, c(13, 100, 187, 274, 361)]), na.rm = TRUE)
# td1<-rowSums(cbind(ufcfightdata.df[, c(14, 101, 188, 275, 362)]), na.rm = TRUE)
# bodytot.attp1<-rowSums(cbind(ufcfightdata.df[, c(17, 104, 191, 278, 365)]), na.rm = TRUE)
# bodytot1<-rowSums(cbind(ufcfightdata.df[, c(18, 105, 192, 279, 366)]), na.rm = TRUE)
# head.attp1<-rowSums(cbind(ufcfightdata.df[, c(67, 154, 241, 328, 415)]), na.rm = TRUE)
# head1<-rowSums(cbind(ufcfightdata.df[, c(68, 155, 242, 329, 416)]), na.rm = TRUE)
# leg.attp1<-rowSums(cbind(ufcfightdata.df[, c(72, 159, 246, 333, 420)]), na.rm = TRUE)
# leg1<-rowSums(cbind(ufcfightdata.df[, c(73, 160, 247, 334, 421)]), na.rm = TRUE)
# kick.attp1<-rowSums(cbind(ufcfightdata.df[, c(69, 156, 243, 330, 417)]), na.rm = TRUE)
# kick1<-rowSums(cbind(ufcfightdata.df[, c(70, 157, 244, 331, 418)]), na.rm = TRUE)
# punch.attp1<-rowSums(cbind(ufcfightdata.df[, c(78, 165, 252, 339, 426)]), na.rm = TRUE)
# punch1<-rowSums(cbind(ufcfightdata.df[, c(79, 166, 253, 340, 427)]), na.rm = TRUE)
# strike.attp1<-rowSums(cbind(ufcfightdata.df[, c(82, 169, 256, 343, 430)]), na.rm = TRUE)
# strike1<-rowSums(cbind(ufcfightdata.df[, c(83, 170, 257, 344, 431)]), na.rm = TRUE)
# KD1<-rowSums(cbind(ufcfightdata.df[, c(71, 158, 245, 332, 419)]), na.rm = TRUE)
# ## Red corner
# subm.attp2<-rowSums(cbind(ufcfightdata.df[, c(461, 548, 635, 722, 809)]), na.rm = TRUE)
# td.attp2<-rowSums(cbind(ufcfightdata.df[, c(462, 549, 636, 723, 810)]), na.rm = TRUE)
# td2<-rowSums(cbind(ufcfightdata.df[, c(463, 550, 637, 724, 811)]), na.rm = TRUE)
# bodytot.attp2<-rowSums(cbind(ufcfightdata.df[, c(466, 553, 640, 727, 814)]), na.rm = TRUE)
# bodytot2<-rowSums(cbind(ufcfightdata.df[, c(467, 554, 641, 728, 815)]), na.rm = TRUE)
# head.attp2<-rowSums(cbind(ufcfightdata.df[, c(516, 603, 690, 777, 864)]), na.rm = TRUE)
# head2<-rowSums(cbind(ufcfightdata.df[, c(517, 604, 691, 778, 865)]), na.rm = TRUE)
# leg.attp2<-rowSums(cbind(ufcfightdata.df[, c(521, 608, 695, 782, 869)]), na.rm = TRUE)
# leg2<-rowSums(cbind(ufcfightdata.df[, c(522, 609, 696, 783, 870)]), na.rm = TRUE)
# kick.attp2<-rowSums(cbind(ufcfightdata.df[, c(518, 605, 692, 779, 866)]), na.rm = TRUE)
# kick2<-rowSums(cbind(ufcfightdata.df[, c(519, 606, 693, 780, 867)]), na.rm = TRUE)
# punch.attp2<-rowSums(cbind(ufcfightdata.df[, c(527, 614, 701, 788, 875)]), na.rm = TRUE)
# punch2<-rowSums(cbind(ufcfightdata.df[, c(528, 615, 702, 789, 876)]), na.rm = TRUE)
# strike.attp2<-rowSums(cbind(ufcfightdata.df[, c(531, 618, 705, 792, 879)]), na.rm = TRUE)
# strike2<-rowSums(cbind(ufcfightdata.df[, c(532, 619, 706, 793, 880)]), na.rm = TRUE)
# KD2<-rowSums(cbind(ufcfightdata.df[, c(520, 607, 694, 781, 868)]), na.rm = TRUE)
# # convert B/R result to W/L/D/NC for each fighter
# result1<-rep("a", 1477)
# result2<-rep("a", 1477)
# for (i in c(1:1477)) {
#   if (ufcfightdata.df$winner[i]=="red") {
#     result1[i]<-"L"
#     result2[i]<-"W"
#   } else if (ufcfightdata.df$winner[i]=="blue") {
#     result1[i]<-"W"
#     result2[i]<-"L"
#   } else if (ufcfightdata.df$winner[i]=="draw") {
#     result1[i]<-"D"
#     result2[i]<-"D"
#   } else {
#     result1[i]<-"NC"
#     result2[i]<-"NC"
#   }
# }
# 
# Blue<-cbind(ufcfightdata.df$B_Name, ufcfightdata.df[, c(1:5)], ufcfightdata.df$B_Weight, subm.attp1,td.attp1, td1, bodytot.attp1, bodytot1, head.attp1, head1, leg.attp1, leg1, kick.attp1, kick1, punch.attp1, punch1, strike.attp1, strike1, KD1, result1, ufcfightdata.df$winby, ufcfightdata.df$Max_round)
# Red<-cbind(ufcfightdata.df$R_Name, ufcfightdata.df[, c(450:454)], ufcfightdata.df$R_Weight, subm.attp2, td.attp2, td2, bodytot.attp2, bodytot2, head.attp2, head2, leg.attp2, leg2, kick.attp2, kick2, punch.attp2, punch2, strike.attp2, strike2, KD2, result2, ufcfightdata.df$winby, ufcfightdata.df$Max_round)
# colnames(Blue)<-c("Name", "PreFights", "Streak", "Age", "Height", "Hometown", "Weight", "A.Sub", "A.TD", "TD", "A.Body", "Body", "A.Head", "Head", "A.Leg", "Leg", "A.Kick", "Kick","A.Punch", "Punch", "A.Strike", "Strike", "KD", "Result", "By", "Rounds")
# colnames(Red)<-c("Name", "PreFights", "Streak", "Age", "Height", "Hometown", "Weight", "A.Sub","A.TD", "TD", "A.Body", "Body", "A.Head", "Head", "A.Leg", "Leg", "A.Kick", "Kick", "A.Punch", "Punch", "A.Strike", "Strike", "KD", "Result", "By", "Rounds")
# ## update data frame: interleave the Blue and Red rows of each matchup
# newdata<-data.frame()
# for (j in c(1:1477)) {
#   newdata<-rbind(newdata, Blue[j, ], Red[j, ])
# }
# ## height and weight, BMI (computed only after newdata exists)
# height_m<-newdata$Height/100
# BMI<-newdata$Weight/(height_m*height_m)
# ## add BMI column and delete rows not W/L
# thedata<-cbind(newdata, BMI)
# thedata<-thedata[-c(which(thedata$Result!="W" & thedata$Result!="L")), ]
# thedata<-thedata[-c(893, 894, 1121,1122,1523,1524,1849,1850,2129,2130), ] ## remove NA's
# class<-rep(0, 2892)
# for (i in c(1:2892)) {
#   if (thedata$Weight[i]<=65) {
#     class[i]<-1
#   } else if (thedata$Weight[i]>65 & thedata$Weight[i]<=80) {
#     class[i]<-2
#   } else if (thedata$Weight[i]>80 & thedata$Weight[i]<=95) {
#     class[i]<-3
#   } else if (thedata$Weight[i]>95 & thedata$Weight[i]<=110) {
#     class[i]<-4
#   } else {
#     class[i]<-5
#   }
# }
# thedata<-cbind(thedata, class)
# write.csv(thedata, file="/Users/mozhuning/我的文件/GWU/Classes/Master/Fall 2017/DATS-Introduction to Data Science/Project/Project II/thedata.csv",row.names = FALSE)
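
The BMI and weight-class steps above can also be written without loops. The sketch below is only an illustration, not the code that produced thedata.csv: the `toy` data frame is an assumption built from the first four Height and Weight values that appear in the str(thedata) output, and `cut()` is used with the same boundaries (65, 80, 95, 110) as the if/else chain.

```r
# First four Height (cm) and Weight (kg) values from the str() output of thedata
toy <- data.frame(Height = c(182, 187, 175, 182),
                  Weight = c(84, 84, 70, 70))

# BMI = weight in kg divided by squared height in m
toy$BMI <- toy$Weight / (toy$Height / 100)^2

# Same boundaries as the if/else chain:
# <=65 -> 1, (65,80] -> 2, (80,95] -> 3, (95,110] -> 4, >110 -> 5
toy$class <- cut(toy$Weight,
                 breaks = c(-Inf, 65, 80, 95, 110, Inf),
                 labels = FALSE)

print(toy)
```

Rounded to one decimal, the BMI values agree with the BMI column of the str() output (25.4, 24, 22.9, 21.1), and the classes agree with its class column (3, 3, 2, 2).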

Second, we created another data frame in which all the original fight statistics are summed across rounds, NA’s are replaced with 0, draws and no contests are removed, and BMI is added. This data frame retains the structure of the original file.

# ufcfightdata.df[is.na(ufcfightdata.df)]<-0    ##replace NA with 0
# for (x in c(10:96)) {
#   ufcfightdata.df[,x]<-rowSums(cbind(ufcfightdata.df[, c(x,x+87, x+174, x+261, x+348)]), na.rm = TRUE)
# }
# for (y in c(459:545)) {
#   ufcfightdata.df[,y]<-rowSums(cbind(ufcfightdata.df[, c(y,y+87, y+174, y+261, y+348)]), na.rm = TRUE)
# }
# ufcfightdata.df<-ufcfightdata.df[, -c(97:444, 546:893)]
# ufcfightdata.df<-ufcfightdata.df[-c(which(ufcfightdata.df$winner!="blue" & ufcfightdata.df$winner!="red")), ]
# ufcfightdata.df<-ufcfightdata.df[-c(which(rowSums(ufcfightdata.df[, c(10:96,111:197)])==0)),]
# height_b<-ufcfightdata.df$B_Height/100
# BMI_b<-ufcfightdata.df$B_Weight/(height_b*height_b)
# height_r<-ufcfightdata.df$R_Height/100
# BMI_r<-ufcfightdata.df$R_Weight/(height_r*height_r)
# ufcfightdata.df<-cbind(ufcfightdata.df, BMI_b, BMI_r)
# write.csv(ufcfightdata.df, file="/Users/mozhuning/我的文件/GWU/Classes/Master/Fall 2017/DATS-Introduction to Data Science/Project/Project II/NEWUFCPCA.csv", row.names = FALSE)

Lastly, we created another version of this dataset in which the fighters in each matchup are split up and stacked. The sum of all attempted strikes (sum1) and the sum of all landed strikes (sum2) are added to the data frame.

# newresult1<-rep("a", 1215)
# newresult2<-rep("a", 1215)
# for (i in c(1:1215)) {
#   if (ufcfightdata.df$winner[i]=="red") {
#     newresult1[i]<-"L"
#     newresult2[i]<-"W"
#   } else {
#     newresult1[i]<-"W"
#     newresult2[i]<-"L"
#   }
# }
# corner1<-rep("B", 1215)
# corner2<-rep("R", 1215)
# bluenew<-data.frame(cbind(ufcfightdata.df[,c(1:96,200)], newresult1, corner1))
# rednew<-data.frame(cbind(ufcfightdata.df[,c(102:197,201)], newresult2, corner2))
# colnames(rednew)<-colnames(bluenew)
# newufc_split<-data.frame(rbind(bluenew,rednew))
# newclass<-rep(0, 2430)
# for (l in c(1:2430)) {
#   if (newufc_split$B_Weight[l]<=65) {
#     newclass[l]<-1
#   } else if (newufc_split$B_Weight[l]>65 & newufc_split$B_Weight[l]<=80) {
#     newclass[l]<-2
#   } else if (newufc_split$B_Weight[l]>80 & newufc_split$B_Weight[l]<=95) {
#     newclass[l]<-3
#   } else if (newufc_split$B_Weight[l]>95 & newufc_split$B_Weight[l]<=110) {
#     newclass[l]<-4
#   } else {
#     newclass[l]<-5
#   }
# }
# newufc_split<-cbind(newufc_split, newclass)
# s1<-c(seq(from=13, to=69, by=2), seq(from=72, to=82, by=2))
# s2<-c(seq(from=14, to=70, by=2), seq(from=73, to=83, by=2))
# sum1<-rowSums(cbind(newufc_split[, s1])) #attempts
# sum2<-rowSums(cbind(newufc_split[, s2])) #landed
# newufc_split<-cbind(newufc_split, sum1,sum2)
# write.csv(newufc_split, file="/Users/mozhuning/我的文件/GWU/Classes/Master/Fall 2017/DATS-Introduction to Data Science/Project/Project II/newufc_split.csv",
#           row.names = FALSE)

Here is a look at the new data frames. We will only display the smaller dataset here.

thedata<-data.frame(read.csv("thedata.csv"))
NEW_Dataset <- data.frame(read.csv("NEWUFC.csv"))
newufc_split<-data.frame(read.csv("newufc_split.csv"))

str(thedata)

## 'data.frame':    2892 obs. of  28 variables:
##  $ Name     : Factor w/ 844 levels "Aaron Phillips",..: 544 73 120 165 755 399 102 505 252 656 ...
##  $ PreFights: int  1 6 0 0 2 2 0 6 3 5 ...
##  $ Streak   : int  1 1 0 0 0 0 0 4 1 2 ...
##  $ Age      : int  23 27 32 29 38 32 23 25 30 28 ...
##  $ Height   : int  182 187 175 182 172 177 170 175 167 170 ...
##  $ Hometown : Factor w/ 643 levels "","Aarhus Denmark",..: 577 247 95 134 254 75 568 221 526 111 ...
##  $ Weight   : int  84 84 70 70 70 70 56 56 61 61 ...
##  $ A.Sub    : int  1 4 0 0 0 0 0 15 1 4 ...
##  $ A.TD     : int  1 39 0 0 0 8 0 31 7 16 ...
##  $ TD       : int  1 19 0 0 0 2 0 11 2 5 ...
##  $ A.Body   : int  11 65 0 0 41 17 0 128 35 90 ...
##  $ Body     : int  11 52 0 0 27 16 0 108 21 72 ...
##  $ A.Head   : int  57 385 0 0 208 133 0 699 448 439 ...
##  $ Head     : int  39 201 0 0 94 56 0 342 171 189 ...
##  $ A.Leg    : int  0 0 0 0 0 0 0 3 0 13 ...
##  $ Leg      : int  0 0 0 0 0 0 0 3 0 13 ...
##  $ A.Kick   : int  0 0 0 0 15 3 0 0 0 44 ...
##  $ Kick     : int  0 0 0 0 12 2 0 0 0 42 ...
##  $ A.Punch  : int  0 0 0 0 242 50 0 0 0 149 ...
##  $ Punch    : int  0 0 0 0 118 32 0 0 0 58 ...
##  $ A.Strike : int  68 469 0 0 261 152 0 864 495 578 ...
##  $ Strike   : int  50 270 0 0 131 74 0 486 203 304 ...
##  $ KD       : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ Result   : Factor w/ 2 levels "L","W": 1 2 2 1 1 2 2 1 1 2 ...
##  $ By       : Factor w/ 4 levels "","DEC","KO/TKO",..: 2 2 4 4 3 3 4 4 2 2 ...
##  $ Rounds   : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ BMI      : num  25.4 24 22.9 21.1 23.7 ...
##  $ class    : int  3 3 2 2 2 2 1 1 1 1 ...

summary(thedata)

##                  Name        PreFights          Streak      
##  Donald Cerrone    :  13   Min.   : 0.000   Min.   :0.0000  
##  Neil Magny        :  12   1st Qu.: 0.000   1st Qu.:0.0000  
##  Beneil Dariush    :  11   Median : 1.000   Median :0.0000  
##  Gegard Mousasi    :  11   Mean   : 1.939   Mean   :0.6909  
##  Derrick Lewis     :  10   3rd Qu.: 3.000   3rd Qu.:1.0000  
##  Francisco Trinaldo:  10   Max.   :12.000   Max.   :9.0000  
##  (Other)           :2825                                    
##       Age            Height                          Hometown   
##  Min.   :20.00   Min.   :152.0   Rio de Janeiro Brazil   :  75  
##  1st Qu.:28.00   1st Qu.:172.0   Sao Paulo Brazil        :  40  
##  Median :31.00   Median :177.0   Dublin Ireland          :  30  
##  Mean   :31.16   Mean   :177.5   Dagestan Russia         :  24  
##  3rd Qu.:34.00   3rd Qu.:182.0   Phoenix, Arizona USA    :  24  
##  Max.   :46.00   Max.   :213.0   Milwaukee, Wisconsin USA:  22  
##                                  (Other)                 :2677  
##      Weight           A.Sub              A.TD          TD        
##  Min.   : 52.00   Min.   : 0.0000   Min.   : 0   Min.   : 0.000  
##  1st Qu.: 65.00   1st Qu.: 0.0000   1st Qu.: 0   1st Qu.: 0.000  
##  Median : 70.00   Median : 0.0000   Median : 2   Median : 1.000  
##  Mean   : 73.84   Mean   : 0.7604   Mean   : 6   Mean   : 2.275  
##  3rd Qu.: 84.00   3rd Qu.: 1.0000   3rd Qu.: 8   3rd Qu.: 3.000  
##  Max.   :120.00   Max.   :16.0000   Max.   :69   Max.   :31.000  
##                                                                  
##      A.Body            Body            A.Head            Head       
##  Min.   :  0.00   Min.   :  0.00   Min.   :   0.0   Min.   :  0.00  
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:   0.0   1st Qu.:  0.00  
##  Median : 16.00   Median : 12.00   Median :  95.0   Median : 41.00  
##  Mean   : 30.17   Mean   : 22.94   Mean   : 151.5   Mean   : 67.34  
##  3rd Qu.: 44.00   3rd Qu.: 33.00   3rd Qu.: 225.0   3rd Qu.:101.00  
##  Max.   :321.00   Max.   :282.00   Max.   :1596.0   Max.   :608.00  
##                                                                     
##      A.Leg             Leg             A.Kick            Kick        
##  Min.   : 0.000   Min.   : 0.000   Min.   :  0.00   Min.   :  0.000  
##  1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.:  0.00   1st Qu.:  0.000  
##  Median : 0.000   Median : 0.000   Median :  0.00   Median :  0.000  
##  Mean   : 2.135   Mean   : 1.745   Mean   : 11.21   Mean   :  7.794  
##  3rd Qu.: 0.000   3rd Qu.: 0.000   3rd Qu.: 14.00   3rd Qu.:  9.000  
##  Max.   :62.000   Max.   :53.000   Max.   :178.00   Max.   :129.000  
##                                                                      
##     A.Punch           Punch           A.Strike          Strike     
##  Min.   :  0.00   Min.   :  0.00   Min.   :   0.0   Min.   :  0.0  
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:   0.0   1st Qu.:  0.0  
##  Median :  0.00   Median :  0.00   Median : 129.0   Median : 64.0  
##  Mean   : 50.48   Mean   : 25.55   Mean   : 198.9   Mean   :104.5  
##  3rd Qu.: 73.00   3rd Qu.: 38.00   3rd Qu.: 302.0   3rd Qu.:159.0  
##  Max.   :749.00   Max.   :400.00   Max.   :1970.0   Max.   :875.0  
##                                                                    
##        KD        Result        By           Rounds          BMI       
##  Min.   :0.000   L:1446         :   4   Min.   :3.00   Min.   :17.45  
##  1st Qu.:0.000   W:1446   DEC   :1428   1st Qu.:3.00   1st Qu.:21.33  
##  Median :0.000            KO/TKO: 906   Median :3.00   Median :22.50  
##  Mean   :0.397            SUB   : 554   Mean   :3.21   Mean   :23.24  
##  3rd Qu.:1.000                          3rd Qu.:3.00   3rd Qu.:24.54  
##  Max.   :8.000                          Max.   :5.00   Max.   :38.30  
##                                                                       
##      class      
##  Min.   :1.000  
##  1st Qu.:1.000  
##  Median :2.000  
##  Mean   :1.997  
##  3rd Qu.:3.000  
##  Max.   :5.000  
##

str(NEW_Dataset)
summary(NEW_Dataset)

str(newufc_split)
summary(newufc_split)

Analysis

library(ggplot2)
library(ResourceSelection)
library(pROC)
library(pscl)
library(corrplot)
library(caTools)
library(caret)
library(e1071)
library(cluster)
library(leaps)
library(ISLR)

EDA

The variables in our data are not normally distributed. Weight is shown here as an example: the Shapiro-Wilk test rejects normality.

shapiro.test(thedata$Weight)

## 
##  Shapiro-Wilk normality test
## 
## data:  thedata$Weight
## W = 0.89954, p-value < 2.2e-16

qqnorm(thedata$Weight)

hist(thedata$Weight)

Correlation

We want to see whether some of the variables are correlated with each other.

ttcor<-cor(thedata[,c(2:5,7:23,27)])
par(xpd=TRUE)
corrplot(ttcor, type = "lower", order = "hclust", 
         tl.col = "black", tl.srt = 90, mar = c(1,1,.5,.5))
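
Besides reading the plot, the highly correlated pairs can be listed directly from the correlation matrix. The sketch below is a minimal illustration on simulated data: the `toy` data frame, its column values, and the 0.7 cutoff are all assumptions made for the example, not values taken from our dataset.

```r
set.seed(42)
n <- 200

# Simulated stand-ins: landed strikes depend on attempted strikes, age does not
attempted <- rpois(n, 100)
landed <- rbinom(n, attempted, 0.8)
toy <- data.frame(A.Strike = attempted,
                  Strike = landed,
                  Age = rnorm(n, 31, 4))

cm <- cor(toy)

# Keep only the lower triangle so each pair is reported once
cm[upper.tri(cm, diag = TRUE)] <- NA

# Locate the pairs whose absolute correlation exceeds the (assumed) cutoff
high <- which(abs(cm) > 0.7, arr.ind = TRUE)
pairs <- data.frame(var1 = rownames(cm)[high[, "row"]],
                    var2 = colnames(cm)[high[, "col"]],
                    r = cm[high])
print(pairs)
```

Applied to ttcor, the same steps would give a list of candidate variables to drop before fitting a regression.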

Logistic Regression

This is our main model. We want to see whether any of the variables can be used to predict the result of the fight (win or lose in our case).
We trimmed the dataset by removing character variables and highly correlated variables. We then used regsubsets for variable selection and plotted both the adjusted R^2 and the BIC.

trim<-thedata[, -c(1,5,6,7,9,11,13,15,17,19,21, 25, 26, 28)]
bestselect <- regsubsets(as.factor(Result)~., data = trim, nvmax = 14)
plot(bestselect, scale = "adjr2", main = "Adjusted R^2")

plot(bestselect, scale = "bic", main = "BIC")

summary(bestselect)

## Subset selection object
## Call: regsubsets.formula(as.factor(Result) ~ ., data = trim, nvmax = 14)
## 13 Variables  (and intercept)
##           Forced in Forced out
## PreFights     FALSE      FALSE
## Streak        FALSE      FALSE
## Age           FALSE      FALSE
## A.Sub         FALSE      FALSE
## TD            FALSE      FALSE
## Body          FALSE      FALSE
## Head          FALSE      FALSE
## Leg           FALSE      FALSE
## Kick          FALSE      FALSE
## Punch         FALSE      FALSE
## Strike        FALSE      FALSE
## KD            FALSE      FALSE
## BMI           FALSE      FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
##           PreFights Streak Age A.Sub TD  Body Head Leg Kick Punch Strike
## 1  ( 1 )  " "       " "    "*" " "   " " " "  " "  " " " "  " "   " "   
## 2  ( 1 )  " "       " "    "*" " "   " " " "  " "  " " " "  "*"   " "   
## 3  ( 1 )  " "       " "    "*" " "   " " " "  " "  " " " "  "*"   " "   
## 4  ( 1 )  " "       "*"    "*" " "   " " " "  " "  " " " "  "*"   " "   
## 5  ( 1 )  " "       "*"    "*" " "   " " " "  " "  "*" " "  "*"   " "   
## 6  ( 1 )  " "       "*"    "*" " "   " " " "  "*"  " " " "  "*"   "*"   
## 7  ( 1 )  " "       "*"    "*" " "   " " "*"  "*"  "*" " "  "*"   " "   
## 8  ( 1 )  " "       "*"    "*" "*"   " " "*"  "*"  "*" " "  "*"   " "   
## 9  ( 1 )  " "       "*"    "*" "*"   "*" "*"  "*"  "*" " "  "*"   " "   
## 10  ( 1 ) " "       "*"    "*" "*"   "*" " "  "*"  "*" "*"  "*"   "*"   
## 11  ( 1 ) " "       "*"    "*" "*"   "*" " "  "*"  "*" "*"  "*"   "*"   
## 12  ( 1 ) " "       "*"    "*" "*"   "*" "*"  "*"  "*" "*"  "*"   "*"   
## 13  ( 1 ) "*"       "*"    "*" "*"   "*" "*"  "*"  "*" "*"  "*"   "*"   
##           KD  BMI
## 1  ( 1 )  " " " "
## 2  ( 1 )  " " " "
## 3  ( 1 )  " " "*"
## 4  ( 1 )  " " "*"
## 5  ( 1 )  " " "*"
## 6  ( 1 )  " " "*"
## 7  ( 1 )  " " "*"
## 8  ( 1 )  " " "*"
## 9  ( 1 )  " " "*"
## 10  ( 1 ) " " "*"
## 11  ( 1 ) "*" "*"
## 12  ( 1 ) "*" "*"
## 13  ( 1 ) "*" "*"
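
As far as we understand, regsubsets performs best-subset selection for linear regression, so with a two-level outcome it can only serve as a rough screen. An alternative that works on the logistic model itself is backward selection by AIC with the base R function step(). The sketch below shows the idea on simulated data; the `toy` data frame and the coefficients used to generate it are assumptions, and this is not the selection procedure used in this report.

```r
set.seed(2017)
n <- 600

# Simulated predictors with names borrowed from our data frame
toy <- data.frame(Age = rnorm(n, 31, 4),
                  Punch = rpois(n, 25),
                  BMI = rnorm(n, 23, 2),
                  Streak = rpois(n, 0.7))

# Simulated outcome: only Age and Punch influence the chance of winning
lin <- 1.4 - 0.05 * toy$Age + 0.003 * toy$Punch
toy$Result <- factor(ifelse(runif(n) < plogis(lin), "W", "L"),
                     levels = c("L", "W"))

# Full logistic model, then backward elimination by AIC
full <- glm(Result ~ ., family = binomial(link = "logit"), data = toy)
selected <- step(full, direction = "backward", trace = 0)

print(formula(selected))
print(c(full = AIC(full), selected = AIC(selected)))
```

Because step() only drops a term when doing so lowers the AIC, the selected model never has a higher AIC than the full one.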

WLlogit4 <- glm(Result~Age+Punch+BMI+Streak+Leg+Head, binomial(link = "logit"), data = thedata)
summary(WLlogit4)

## 
## Call:
## glm(formula = Result ~ Age + Punch + BMI + Streak + Leg + Head, 
##     family = binomial(link = "logit"), data = thedata)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.70811  -1.15905  -0.08079   1.16634   1.54679  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.9047311  0.3580628   2.527   0.0115 *  
## Age         -0.0527010  0.0099707  -5.286 1.25e-07 ***
## Punch        0.0026571  0.0010465   2.539   0.0111 *  
## BMI          0.0263533  0.0131633   2.002   0.0453 *  
## Streak       0.0462406  0.0402402   1.149   0.2505    
## Leg         -0.0096158  0.0074580  -1.289   0.1973    
## Head         0.0006362  0.0006389   0.996   0.3194    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4009.2  on 2891  degrees of freedom
## Residual deviance: 3957.3  on 2885  degrees of freedom
## AIC: 3971.3
## 
## Number of Fisher Scoring iterations: 4

WLlogit3 <- glm(Result~Age+Punch, binomial(link = "logit"), data = thedata)
summary(WLlogit3)

## 
## Call:
## glm(formula = Result ~ Age + Punch, family = binomial(link = "logit"), 
##     data = thedata)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.63534  -1.16607  -0.04189   1.16847   1.52692  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.4084898  0.2928307   4.810 1.51e-06 ***
## Age         -0.0478415  0.0093308  -5.127 2.94e-07 ***
## Punch        0.0032463  0.0007997   4.059 4.92e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4009.2  on 2891  degrees of freedom
## Residual deviance: 3966.9  on 2889  degrees of freedom
## AIC: 3972.9
## 
## Number of Fisher Scoring iterations: 4
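
Since the two-variable model is nested in the six-variable model and both are fitted to thedata, they can be compared with a likelihood ratio test. The sketch below does the arithmetic from the residual deviances and degrees of freedom printed in the two summaries above; with the fitted objects available, anova(WLlogit3, WLlogit4, test = "Chisq") should give the same comparison.

```r
# Residual deviances and degrees of freedom copied from the two summaries
dev_small <- 3966.9   # Result ~ Age + Punch
df_small  <- 2889
dev_large <- 3957.3   # Result ~ Age + Punch + BMI + Streak + Leg + Head
df_large  <- 2885

# Likelihood ratio statistic: drop in deviance from adding the four extra terms
lr_stat <- dev_small - dev_large
lr_df   <- df_small - df_large

# Upper-tail chi-square probability
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

print(c(statistic = lr_stat, df = lr_df, p = p_value))
```

A p-value close to 0.05 here would mean that the four extra variables add only a borderline improvement over Age and Punch alone.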

We also tried different combinations of variables in training and testing, and found that Age, Punch, BMI, Streak, Strike, and Head result in a lower AIC.

train <- thedata[1:2024, ]
test<-thedata[2025:2892,]
WLlogit2 <- glm(Result~Age+Punch+BMI+Streak+Strike+Head, binomial(link = "logit"), data = train)
summary(WLlogit2)

## 
## Call:
## glm(formula = Result ~ Age + Punch + BMI + Streak + Strike + 
##     Head, family = binomial(link = "logit"), data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7027  -1.1557  -0.1217   1.1731   1.5138  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.785393   0.432068   1.818   0.0691 .  
## Age         -0.046741   0.011882  -3.934 8.37e-05 ***
## Punch        0.002576   0.001209   2.131   0.0331 *  
## BMI          0.023204   0.015870   1.462   0.1437    
## Streak       0.053099   0.048548   1.094   0.2741    
## Strike      -0.002707   0.001597  -1.695   0.0901 .  
## Head         0.004662   0.002475   1.884   0.0596 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2805.9  on 2023  degrees of freedom
## Residual deviance: 2768.7  on 2017  degrees of freedom
## AIC: 2782.7
## 
## Number of Fisher Scoring iterations: 4

WLlogit <- glm(Result~Age+Punch+BMI+Streak+Strike+Head, binomial(link = "logit"), data = thedata)
summary(WLlogit)

## 
## Call:
## glm(formula = Result ~ Age + Punch + BMI + Streak + Strike + 
##     Head, family = binomial(link = "logit"), data = thedata)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.78732  -1.15622  -0.08918   1.16844   1.54465  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.922652   0.357950   2.578  0.00995 ** 
## Age         -0.052431   0.009974  -5.257 1.47e-07 ***
## Punch        0.002415   0.001018   2.371  0.01772 *  
## BMI          0.025366   0.013175   1.925  0.05419 .  
## Streak       0.053355   0.040393   1.321  0.18654    
## Strike      -0.002108   0.001363  -1.546  0.12204    
## Head         0.003631   0.002075   1.750  0.08020 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4009.2  on 2891  degrees of freedom
## Residual deviance: 3956.6  on 2885  degrees of freedom
## AIC: 3970.6
## 
## Number of Fisher Scoring iterations: 4

hoslem.test(thedata$Result, fitted(WLlogit))

## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  thedata$Result, fitted(WLlogit)
## X-squared = 2892, df = 8, p-value < 2.2e-16
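
The statistic above is exactly the number of observations (2892), which is the pattern that typically appears when the observed outcome is passed to hoslem.test as a factor instead of a numeric 0/1 vector, so this output should not be read as a real goodness-of-fit result. The sketch below shows the call with a numeric outcome on simulated data; the `toy` data frame is an assumption, and for our model the corresponding call would be hoslem.test(as.numeric(thedata$Result == "W"), fitted(WLlogit)).

```r
library(ResourceSelection)

set.seed(7)
n <- 400

# Simulated data: the chance of winning decreases slightly with age
toy <- data.frame(Age = rnorm(n, 31, 4))
p_win <- plogis(1.5 - 0.05 * toy$Age)
toy$Result <- factor(ifelse(runif(n) < p_win, "W", "L"),
                     levels = c("L", "W"))

fit <- glm(Result ~ Age, family = binomial(link = "logit"), data = toy)

# Convert the factor to 0/1 before calling the test
y01 <- as.numeric(toy$Result == "W")
hl <- hoslem.test(y01, fitted(fit), g = 10)
print(hl)
```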

prob=predict(WLlogit, type = c("response"))
h <- roc(thedata$Result~prob, data=thedata)
h

## 
## Call:
## roc.formula(formula = thedata$Result ~ prob, data = thedata)
## 
## Data: prob in 1446 controls (thedata$Result L) < 1446 cases (thedata$Result W).
## Area under the curve: 0.5746

plot(h)

Although the ROC curve lies close to the diagonal and the AUC is only about 0.57, the p-values for the Age and Punch variables are below 0.05. Punch has a small positive coefficient, while Age has a small negative one.
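
To make the size of these two effects easier to read, the coefficients can be converted to odds ratios. The sketch below uses the Age and Punch estimates copied from the summary(WLlogit) output; with the fitted object available, exp(coef(WLlogit)) gives the same numbers for all terms.

```r
# Estimates copied from the summary(WLlogit) output (log-odds scale)
coefs <- c(Age = -0.052431, Punch = 0.002415)

# Odds ratio per one-unit increase in each variable
odds_ratios <- exp(coefs)

# The same information as a percentage change in the odds of winning
pct_change <- 100 * (odds_ratios - 1)

print(round(odds_ratios, 4))
print(round(pct_change, 2))
```

Each additional year of age lowers the odds of winning by about 5%, while each additional landed punch raises them by well under 1%.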

PCA

In a bid to improve our model, we turned to dimensionality reduction instead of variable selection. Our choice was Principal Component Analysis (PCA). To keep things simple, we kept 2 components and then re-ran the logistic regression on them.

The first step was to read in the dataset after trimming it in Excel to remove columns such as name, id, and hometown.

We split the dataset into 80% for training and the remaining 20% for testing.

NEW_Dataset <- read.csv("NEWUFC.csv")
NEW_Dataset <- data.frame(NEW_Dataset)
library(caTools)
set.seed(123)
split = sample.split(NEW_Dataset$winner, SplitRatio = 0.8)
training_set = subset(NEW_Dataset, split == TRUE)
test_set = subset(NEW_Dataset, split == FALSE)

PCA requires standardized data, so we scaled the dataset using the scale function.
Because PCA is an unsupervised algorithm, we scaled only the independent variables, excluding the dependent variable, in both the training and the testing sets.

training_set[-183] = scale(training_set[-183])
test_set[-183] = scale(test_set[-183])
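
The two lines above standardize the test set with its own means and standard deviations. A common alternative, sketched here as an assumption rather than as what was run for this report, is to reuse the centering and scaling values of the training set so that both sets are on exactly the same scale. The `toy` data frame and the 80/20 row split are made up for the illustration.

```r
set.seed(123)

# Simulated predictors standing in for the fight statistics
toy <- data.frame(strikes = rpois(100, 60),
                  age = rnorm(100, 31, 4))

# Simple 80/20 split by row index
train_idx <- sample(seq_len(nrow(toy)), size = 80)
train <- toy[train_idx, ]
test <- toy[-train_idx, ]

# Standardize the training set and keep the parameters that were used
train_scaled <- scale(train)
center <- attr(train_scaled, "scaled:center")
spread <- attr(train_scaled, "scaled:scale")

# Apply the training parameters to the test set
test_scaled <- scale(test, center = center, scale = spread)

print(round(colMeans(train_scaled), 3))
print(round(colMeans(test_scaled), 3))
```

The test set means are then close to, but not exactly, zero, which is expected because the test set is measured against the training distribution.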

We then used the preProcess function to fit the PCA on the training set, keeping only the top 2 components, and re-arranged the training set so that the dependent variable comes last.

pca = preProcess(x = training_set[-183], method = 'pca', pcaComp = 2)
training_set = predict(pca, training_set)
training_set = training_set[c(2, 3, 1)]

We then projected the test set onto the same 2 components using the PCA fitted on the training set, and re-arranged the test set in the same way.

test_set = predict(pca, test_set)
test_set = test_set[c(2, 3, 1)]
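
The same fit-on-training, project-the-test-set pattern can be sketched in base R with prcomp. This is a swapped-in equivalent for illustration only, not the caret call used above, and the `toy` data frame, the `winner` vector, and the 96/24 split are simulated assumptions.

```r
set.seed(123)
n <- 120

# Simulated predictors; column d is deliberately correlated with column a
toy <- data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n))
toy$d <- toy$a + rnorm(n, sd = 0.3)
winner <- rbinom(n, 1, 0.5)

train_rows <- seq_len(96)   # first 80% of rows for training

# Fit the PCA on the training rows only (centering and scaling included)
pca_fit <- prcomp(toy[train_rows, ], center = TRUE, scale. = TRUE)

# Project both sets onto the first 2 components of the training PCA
train_pc <- data.frame(predict(pca_fit, toy[train_rows, ])[, 1:2],
                       winner = winner[train_rows])
test_pc <- data.frame(predict(pca_fit, toy[-train_rows, ])[, 1:2],
                      winner = winner[-train_rows])

# Share of variance retained by the 2 components
print(summary(pca_fit)$importance["Cumulative Proportion", 2])
head(train_pc)
```

Checking the cumulative proportion of variance is a quick way to judge how much information 2 components retain.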

Next, we fit Logistic Regression to the Training set.

classifier = glm(formula = winner ~ .,
                 family = binomial,
                 data = training_set)
summary(classifier)

## 
## Call:
## glm(formula = winner ~ ., family = binomial, data = training_set)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5728  -1.3241   0.9679   1.0257   1.3553  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.355628   0.065415   5.436 5.43e-08 ***
## PC1         -0.010093   0.009049  -1.115   0.2647    
## PC2         -0.031551   0.013084  -2.411   0.0159 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1317.6  on 971  degrees of freedom
## Residual deviance: 1310.4  on 969  degrees of freedom
## AIC: 1316.4
## 
## Number of Fisher Scoring iterations: 4

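The AIC (1316.4) is lower than that of the previous model (3970.6). However, this model was fitted to 972 training observations rather than all 2892, so the two values are not directly comparable. PCA served as our dimensionality reduction step, compressing the predictors into 2 components.
We then predicted the test set results and made the confusion matrix. The model is about 60% accurate (145 of 243 test observations classified correctly), only slightly above the 59% obtained by always predicting class 1.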

prob_pred = predict(classifier, type = 'response', newdata = test_set[-3])
y_pred = ifelse(prob_pred > 0.5, 1, 0)
cm = table(test_set[, 3], y_pred > 0.5)
cm

##    
##     FALSE TRUE
##   0     5   95
##   1     3  140
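
The accuracy and a few related measures can be computed directly from the confusion matrix. The sketch below re-enters the four counts shown above by hand, so it runs without the fitted model; the object names are our own.

```r
# Counts copied from the confusion matrix (rows: actual, columns: predicted)
cm <- matrix(c(5, 3, 95, 140), nrow = 2,
             dimnames = list(actual = c("0", "1"),
                             predicted = c("FALSE", "TRUE")))

# Share of correct predictions: (5 + 140) / 243
accuracy <- sum(diag(cm)) / sum(cm)

# Accuracy of always predicting the more frequent class: 143 / 243
baseline <- max(rowSums(cm)) / sum(cm)

# How well each class is recognized
sensitivity <- cm["1", "TRUE"] / sum(cm["1", ])    # 140 / 143
specificity <- cm["0", "FALSE"] / sum(cm["0", ])   # 5 / 100

print(round(c(accuracy = accuracy, baseline = baseline,
              sensitivity = sensitivity, specificity = specificity), 3))
```

The model predicts class 1 for almost every observation, which is why the accuracy is so close to the baseline.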

We decided to plot the ROC as well.

probn=predict(classifier, type = c("response"))
roc(test_set$winner~prob_pred, data=test_set)

## 
## Call:
## roc.formula(formula = test_set$winner ~ prob_pred, data = test_set)
## 
## Data: prob_pred in 100 controls (test_set$winner 0) < 143 cases (test_set$winner 1).
## Area under the curve: 0.5838

plot(roc(test_set$winner~prob_pred, data=test_set))
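
For readers unfamiliar with the AUC, it equals the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case. The sketch below computes it from ranks in base R on a four-observation example; the function name `auc_rank` and the example values are our own assumptions, and pROC remains the tool used for the results in this report.

```r
# AUC from the rank-sum (Mann-Whitney) formula
auc_rank <- function(labels, scores) {
  r <- rank(scores)              # average ranks are used for tied scores
  n1 <- sum(labels == 1)         # number of positive cases
  n0 <- sum(labels == 0)         # number of negative cases
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

labels <- c(0, 0, 1, 1)
scores <- c(0.1, 0.4, 0.35, 0.8)

# Three of the four positive-negative pairs are ordered correctly
auc_rank(labels, scores)   # 0.75
```

An AUC of 0.5 corresponds to random ordering, so values around 0.58 indicate only a weak ability to rank winners above losers.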

K-Means

We want to see if there are significant differences between the fighting styles, strategies, and technical soundness (accuracy of strikes, etc.) of winners and losers, so we used K-Means to look for clusters.
Plots (scatterplots and boxplots) are graphed as an initial visual check. Only a few are shown here.

ggplot(data=newufc_split, aes(x=newufc_split[,83], y=newufc_split[,82], color = newufc_split$newresult1)) + geom_point()+labs(x="Landed Total Strikes", y="Attempted Total Strikes")+scale_fill_continuous(guide = guide_legend(title = NULL))

ggplot(data=newufc_split, aes(x=newufc_split$newresult1, newufc_split[,83])) + 
  geom_boxplot()+labs(x="Result", y="Landed Total Strikes")

ggplot(data=newufc_split, aes(x=newufc_split$newresult1, newufc_split$B_Age)) + 
  geom_boxplot()+labs(x="Result", y="Age")

###according to weight for interest
ggplot(data=newufc_split, aes(x=newufc_split[,83], y=newufc_split[,82], color = newufc_split$newclass)) + geom_point()+labs(x="Landed Total Strikes", y="Attempted Total Strikes")

As seen in the displayed graphs, the difference is rather small.

We then used the elbow method to find the optimal number of clusters.

set.seed(6)
wcss = vector()
for (i in 1:10) wcss[i] = sum(kmeans(newufc_split[,82:83], i)$withinss)
par(mar=c(4,4,4,4))
plot(1:10,
     wcss,
     type = 'b',
     main = paste('The Elbow Method'),
     xlab = 'Number of clusters',
     ylab = 'WCSS')
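Because k-means starts from random centers, a single run per value of k can settle in a poor local optimum and make the elbow harder to read. The sketch below is a variant of the loop above, run on synthetic data (the `toy` matrix is an assumption standing in for the two strike columns), that uses `nstart = 20` restarts and the `tot.withinss` component.

```r
set.seed(6)
# Synthetic stand-in for two numeric columns (three loose groups).
toy <- rbind(matrix(rnorm(100, mean = 0),  ncol = 2),
             matrix(rnorm(100, mean = 5),  ncol = 2),
             matrix(rnorm(100, mean = 10), ncol = 2))

# Total within-cluster sum of squares for k = 1..10, keeping the best of 20 starts.
wcss <- sapply(1:10, function(k) kmeans(toy, centers = k, nstart = 20)$tot.withinss)

plot(1:10, wcss, type = "b",
     main = "The Elbow Method (synthetic data)",
     xlab = "Number of clusters", ylab = "WCSS")
```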

# Fitting K-Means to the dataset
# (the result is stored as kmeans_fit so that it does not mask the kmeans() function)
set.seed(29)
kmeans_fit = kmeans(x = newufc_split[,82:83], centers = 5)
y_kmeans = kmeans_fit$cluster

# Visualising the clusters
clusplot(newufc_split[,82:83],
         y_kmeans,
         lines = 0,
         shade = TRUE,
         color = TRUE,
         labels = 2,
         plotchar = FALSE,
         span = TRUE,
         main = paste('Clusters'),
         xlab = 'ld',
         ylab = 'attp')

set.seed(1)
ufccluster <- kmeans(newufc_split[, c(82,83)], 4, nstart = 20)
#ufccluster

table(ufccluster$cluster, newufc_split$newresult1)

##    
##       L   W
##   1 596 537
##   2 395 404
##   3 184 199
##   4  40  75

ggplot(newufc_split, aes(x=newufc_split[,83], y=newufc_split[,82], color = as.factor(ufccluster$cluster))) + geom_point()

set.seed(1)
ufccluster1 <- kmeans(newufc_split[, c(101,102)], 4, nstart = 20)
table(ufccluster1$cluster, newufc_split$newresult1)

##    
##       L   W
##   1 402 403
##   2  23  51
##   3 646 583
##   4 144 178

We applied the above procedure to various pairs of variables, but unfortunately no clear conclusions can be drawn. As the example scatterplots and the cross tabulations show, winners and losers are spread fairly evenly across the larger clusters. Only the smallest cluster in each table leans toward winners (40 L vs. 75 W, and 23 L vs. 51 W), so there is no evident division between winners and losers.
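One way to put a number on how evenly winners and losers are spread across clusters is a chi-squared test of independence together with an effect size. This was not part of the original analysis. The sketch re-enters the counts of the first cross tabulation by hand and computes Cramér's V from the test statistic.

```r
# Counts copied from the first cross tabulation above (cluster x result).
tab <- matrix(c(596, 395, 184, 40,
                537, 404, 199, 75),
              ncol = 2,
              dimnames = list(cluster = c("1", "2", "3", "4"),
                              result  = c("L", "W")))

res <- chisq.test(tab)   # chi-squared test of independence
res

# Cramer's V: effect size in [0, 1]; values near 0 indicate a weak association.
cramers_v <- sqrt(unname(res$statistic) / (sum(tab) * (min(dim(tab)) - 1)))
cramers_v
```

With roughly 2,400 rows, even a weak association can produce a small p-value, so the effect size is the more informative number here.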

We also fitted a linear regression of landed strikes on attempted strikes for winners and losers separately. Winners generally have a higher slope, and thus higher accuracy, but the difference from the losers' slope is small.
Here are some examples.

#body total strikes
summary(lm(formula = newufc_split[which(newufc_split$newresult1=="W"),18]
           ~newufc_split[which(newufc_split$newresult1=="W"),17], 
           data = newufc_split)) ##0.789235

## 
## Call:
## lm(formula = newufc_split[which(newufc_split$newresult1 == "W"), 
##     18] ~ newufc_split[which(newufc_split$newresult1 == "W"), 
##     17], data = newufc_split)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -66.327  -1.580   0.888   2.056  31.122 
## 
## Coefficients:
##                                                          Estimate
## (Intercept)                                             -0.888086
## newufc_split[which(newufc_split$newresult1 == "W"), 17]  0.789235
##                                                         Std. Error t value
## (Intercept)                                               0.241714  -3.674
## newufc_split[which(newufc_split$newresult1 == "W"), 17]   0.004196 188.088
##                                                         Pr(>|t|)    
## (Intercept)                                             0.000249 ***
## newufc_split[which(newufc_split$newresult1 == "W"), 17]  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.346 on 1213 degrees of freedom
## Multiple R-squared:  0.9668, Adjusted R-squared:  0.9668 
## F-statistic: 3.538e+04 on 1 and 1213 DF,  p-value: < 2.2e-16

summary(lm(formula = newufc_split[which(newufc_split$newresult1=="L"),18]
           ~newufc_split[which(newufc_split$newresult1=="L"),17], 
           data = newufc_split)) ##0.758835

## 
## Call:
## lm(formula = newufc_split[which(newufc_split$newresult1 == "L"), 
##     18] ~ newufc_split[which(newufc_split$newresult1 == "L"), 
##     17], data = newufc_split)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -64.446  -1.855   0.151   2.056  33.602 
## 
## Coefficients:
##                                                          Estimate
## (Intercept)                                             -0.150762
## newufc_split[which(newufc_split$newresult1 == "L"), 17]  0.758835
##                                                         Std. Error t value
## (Intercept)                                               0.240352  -0.627
## newufc_split[which(newufc_split$newresult1 == "L"), 17]   0.004651 163.146
##                                                         Pr(>|t|)    
## (Intercept)                                                0.531    
## newufc_split[which(newufc_split$newresult1 == "L"), 17]   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.303 on 1213 degrees of freedom
## Multiple R-squared:  0.9564, Adjusted R-squared:  0.9564 
## F-statistic: 2.662e+04 on 1 and 1213 DF,  p-value: < 2.2e-16

###kicks
summary(lm(formula = newufc_split[which(newufc_split$newresult1=="W"),70]
           ~newufc_split[which(newufc_split$newresult1=="W"),69], 
           data = newufc_split)) ##0.722423

## 
## Call:
## lm(formula = newufc_split[which(newufc_split$newresult1 == "W"), 
##     70] ~ newufc_split[which(newufc_split$newresult1 == "W"), 
##     69], data = newufc_split)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.617   0.291   0.291   0.323  22.372 
## 
## Coefficients:
##                                                          Estimate
## (Intercept)                                             -0.290954
## newufc_split[which(newufc_split$newresult1 == "W"), 69]  0.722423
##                                                         Std. Error t value
## (Intercept)                                               0.127182  -2.288
## newufc_split[which(newufc_split$newresult1 == "W"), 69]   0.004404 164.019
##                                                         Pr(>|t|)    
## (Intercept)                                               0.0223 *  
## newufc_split[which(newufc_split$newresult1 == "W"), 69]   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.821 on 1213 degrees of freedom
## Multiple R-squared:  0.9569, Adjusted R-squared:  0.9568 
## F-statistic: 2.69e+04 on 1 and 1213 DF,  p-value: < 2.2e-16

summary(lm(formula = newufc_split[which(newufc_split$newresult1=="L"),70]
           ~newufc_split[which(newufc_split$newresult1=="L"),69], 
           data = newufc_split)) ##0.684034

## 
## Call:
## lm(formula = newufc_split[which(newufc_split$newresult1 == "L"), 
##     70] ~ newufc_split[which(newufc_split$newresult1 == "L"), 
##     69], data = newufc_split)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -38.789  -0.031  -0.031  -0.031  27.732 
## 
## Coefficients:
##                                                         Estimate
## (Intercept)                                             0.031238
## newufc_split[which(newufc_split$newresult1 == "L"), 69] 0.684034
##                                                         Std. Error t value
## (Intercept)                                               0.117722   0.265
## newufc_split[which(newufc_split$newresult1 == "L"), 69]   0.004451 153.672
##                                                         Pr(>|t|)    
## (Intercept)                                                0.791    
## newufc_split[which(newufc_split$newresult1 == "L"), 69]   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.651 on 1213 degrees of freedom
## Multiple R-squared:  0.9511, Adjusted R-squared:  0.9511 
## F-statistic: 2.362e+04 on 1 and 1213 DF,  p-value: < 2.2e-16
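The slopes reported in the four summaries above can be compared with a simple z statistic for the difference of two coefficients. This sketch is not part of the original analysis. It re-enters the estimates and standard errors by hand and assumes that the winner and loser regressions are independent, which is only an approximation since both fighters in a matchup appear in the data.

```r
# Slopes and standard errors copied from the regression summaries above.
slopes <- data.frame(
  strike  = c("body", "kicks"),
  b_win   = c(0.789235, 0.722423),
  se_win  = c(0.004196, 0.004404),
  b_loss  = c(0.758835, 0.684034),
  se_loss = c(0.004651, 0.004451)
)

# z statistic for the difference of two independently estimated slopes.
slopes$diff <- slopes$b_win - slopes$b_loss
slopes$z    <- slopes$diff / sqrt(slopes$se_win^2 + slopes$se_loss^2)
slopes$p    <- 2 * pnorm(-abs(slopes$z))   # two-sided normal approximation

slopes[, c("strike", "diff", "z", "p")]
```

A difference can be statistically detectable and still small in practical terms: here the gap between the slopes is about three to four landed strikes per hundred attempted.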

Conclusion

Summary Report

Here is a summary of our project:
- How did you develop your question and what relevant research has already been completed on this topic?
  We want to analyze what contributes to a win in a fight, as there is currently no established research on this subject.
- How did you gather and prepare the data for analysis? We downloaded a dataset on UFC fights from Kaggle and manipulated it in various ways to assist our analysis.
- How did you select and determine the correct regression model to answer your question? We tried several different models, but our primary regression model was logistic regression. Each fighter came with many different variables, and logistic regression made it possible to examine which ones mattered most for a dichotomous outcome, especially as the relationship was not necessarily linear and the variables were not normally distributed.
- How reliable are your results? Our results can be described as fairly reliable. Due to the large number of possible variables in our dataset, it is difficult to choose the right ones for conducting an analysis. We determined which ones were the most important to a fighter, but not whether or not they will determine the outcome of a fight.
- What predictions can you make with your model? Currently, we cannot make any real predictions with our model. After trying several different types of analysis, the only thing we have been able to determine is that there is little measurable difference between the loser and the winner of a fight.
- What additional information or analysis might improve your model results or work to control limitations? The dataset that we ended up working with is in many ways rather limited. It lacks many physical measurements, a fighter’s overall record and career statistics, their most commonly used moves, etc. If we had more information that can be described as “lifetime statistics”, we could probably build a better model.

With the given dataset, we have some reason to believe that age and the total number of punches landed affect the outcome of a fight more than the other variables we examined. However, since the overall model is not significant, we cannot conclude that any particular factor has a definite effect. Hopefully a more comprehensive dataset can lead to a better model. An ideal dataset would include more complete measurements (including arm reach and leg reach), fight records, full career statistics, stance, training academy, injuries, demographics, etc. At present such data exist only as raw records scattered across different sources, and due to limitations of time and resources we did not attempt to collect and compile them.
One thing is certain: it is hard to predict who will win a fight. By the time a fighter is capable of competing in the UFC, their skill level is so high that a fight is decided by very small differences in skill and/or by chance. But this lack of easy predictability makes the UFC and the larger world of MMA more interesting and exciting. A recent major UFC event is a case in point: in all three main events, the challengers/underdogs defeated the champions.

Project II

Jason Achonu, Zachary Stein, Darshan Dilip Kasat, Nina Mozhu Ning

December 13, 2017