The Ultimate Fighting Championship (UFC) is one of the biggest mixed martial arts (MMA) promotions in the world, if not the biggest. Since the founding of the UFC, the popularity of MMA has grown to rival that of boxing, at times even surpassing it.
In a UFC fight, two fighters step into an octagonal cage determined to prove who is the better fighter. The rules for determining this are rather open. A fighter can incorporate boxing, kickboxing, Muay Thai, Sanshou, Brazilian Jiu-Jitsu, Judo, and more. Punches, kicks, elbows, knees, clinching, and grappling are all allowed. Fights typically run for three rounds in preliminary events and five in main events, with each round usually lasting five minutes.
People love to drink and gamble while watching all manner of sports, with expert and amateur fans alike frequently trying to predict the results. When it comes to fighting, reach is often cited as the number one advantage. As Coach Jeff Ruth says, “If you cannot touch your opponent, then what good is your technique?” Speed and power are of course crucial. Yet precision beats power and timing beats speed, says the reigning UFC lightweight champion, the Notorious Conor McGregor. Needless to say, various factors might give a fighter an edge, but none is absolute. What distinguishes a top fighter from the rest? Is there a factor that outweighs the others? As of now, there has been little scientific research on the factors contributing to the outcome of a fight (win/lose/draw, knockout, technical knockout, submission, etc.), so we want to run a more formal analysis of those factors. While the result of a UFC fight is judged round by round, we want to look at how overall performance is related to the result.
We found a dataset on Kaggle containing 1477 matchups from recent years. The dataset lists side by side basic information about the two fighters in each matchup, as well as detailed statistics of the immediate previous fight for each fighter prior to the matchup. The round-by-round performance is included for each fighter.
Source: https://www.kaggle.com/calmdownkarm/ufcdataset
str(read.csv("ufc fight data.csv"))
summary(read.csv("ufc fight data.csv"))
head(read.csv("ufc fight data.csv"), n=1)
For our project, we created several data frames from the original file to make the different analyses more convenient.
We first picked out the variables we are most interested in and created a data frame. For the variables related to fight statistics, we summed the values across all rounds. We also added a new column, BMI, since we think that at the same weight, the taller fighter (who is thus likely to have a longer reach) has some advantage. We also roughly combined the weight divisions into 5 classes for further use. Rows containing NA values are removed. We then stacked the information about the fighters from the blue corner and the red corner together.
# ufcfightdata.df<-data.frame(read.csv("ufc fight data.csv"))
# ## Blue corner
# subm.attp1<-rowSums(cbind(ufcfightdata.df[, c(12, 99, 186, 273, 360)]), na.rm = TRUE)
# td.attp1<-rowSums(cbind(ufcfightdata.df[, c(13, 100, 187, 274, 361)]), na.rm = TRUE)
# td1<-rowSums(cbind(ufcfightdata.df[, c(14, 101, 188, 275, 362)]), na.rm = TRUE)
# bodytot.attp1<-rowSums(cbind(ufcfightdata.df[, c(17, 104, 191, 278, 365)]), na.rm = TRUE)
# bodytot1<-rowSums(cbind(ufcfightdata.df[, c(18, 105, 192, 279, 366)]), na.rm = TRUE)
# head.attp1<-rowSums(cbind(ufcfightdata.df[, c(67, 154, 241, 328, 415)]), na.rm = TRUE)
# head1<-rowSums(cbind(ufcfightdata.df[, c(68, 155, 242, 329, 416)]), na.rm = TRUE)
# leg.attp1<-rowSums(cbind(ufcfightdata.df[, c(72, 159, 246, 333, 420)]), na.rm = TRUE)
# leg1<-rowSums(cbind(ufcfightdata.df[, c(73, 160, 247, 334, 421)]), na.rm = TRUE)
# kick.attp1<-rowSums(cbind(ufcfightdata.df[, c(69, 156, 243, 330, 417)]), na.rm = TRUE)
# kick1<-rowSums(cbind(ufcfightdata.df[, c(70, 157, 244, 331, 418)]), na.rm = TRUE)
# punch.attp1<-rowSums(cbind(ufcfightdata.df[, c(78, 165, 252, 339, 426)]), na.rm = TRUE)
# punch1<-rowSums(cbind(ufcfightdata.df[, c(79, 166, 253, 340, 427)]), na.rm = TRUE)
# strike.attp1<-rowSums(cbind(ufcfightdata.df[, c(82, 169, 256, 343, 430)]), na.rm = TRUE)
# strike1<-rowSums(cbind(ufcfightdata.df[, c(83, 170, 257, 344, 431)]), na.rm = TRUE)
# KD1<-rowSums(cbind(ufcfightdata.df[, c(71, 158, 245, 332, 419)]), na.rm = TRUE)
# ## Red corner
# subm.attp2<-rowSums(cbind(ufcfightdata.df[, c(461, 548, 635, 722, 809)]), na.rm = TRUE)
# td.attp2<-rowSums(cbind(ufcfightdata.df[, c(462, 549, 636, 723, 810)]), na.rm = TRUE)
# td2<-rowSums(cbind(ufcfightdata.df[, c(463, 550, 637, 724, 811)]), na.rm = TRUE)
# bodytot.attp2<-rowSums(cbind(ufcfightdata.df[, c(466, 553, 640, 727, 814)]), na.rm = TRUE)
# bodytot2<-rowSums(cbind(ufcfightdata.df[, c(467, 554, 641, 728, 815)]), na.rm = TRUE)
# head.attp2<-rowSums(cbind(ufcfightdata.df[, c(516, 603, 690, 777, 864)]), na.rm = TRUE)
# head2<-rowSums(cbind(ufcfightdata.df[, c(517, 604, 691, 778, 865)]), na.rm = TRUE)
# leg.attp2<-rowSums(cbind(ufcfightdata.df[, c(521, 608, 695, 782, 869)]), na.rm = TRUE)
# leg2<-rowSums(cbind(ufcfightdata.df[, c(522, 609, 696, 783, 870)]), na.rm = TRUE)
# kick.attp2<-rowSums(cbind(ufcfightdata.df[, c(518, 605, 692, 779, 866)]), na.rm = TRUE)
# kick2<-rowSums(cbind(ufcfightdata.df[, c(519, 606, 693, 780, 867)]), na.rm = TRUE)
# punch.attp2<-rowSums(cbind(ufcfightdata.df[, c(527, 614, 701, 788, 875)]), na.rm = TRUE)
# punch2<-rowSums(cbind(ufcfightdata.df[, c(528, 615, 702, 789, 876)]), na.rm = TRUE)
# strike.attp2<-rowSums(cbind(ufcfightdata.df[, c(531, 618, 705, 792, 879)]), na.rm = TRUE)
# strike2<-rowSums(cbind(ufcfightdata.df[, c(532, 619, 706, 793, 880)]), na.rm = TRUE)
# KD2<-rowSums(cbind(ufcfightdata.df[, c(520, 607, 694, 781, 868)]), na.rm = TRUE)
# # convert B/R result to W/L/D/NC for each fighter
# result1<-rep("a", 1477)
# result2<-rep("a", 1477)
# for (i in c(1:1477)) {
# if (ufcfightdata.df$winner[i]=="red") {
# result1[i]<-"L"
# result2[i]<-"W"
# } else if (ufcfightdata.df$winner[i]=="blue") {
# result1[i]<-"W"
# result2[i]<-"L"
# } else if (ufcfightdata.df$winner[i]=="draw") {
# result1[i]<-"D"
# result2[i]<-"D"
# } else {
# result1[i]<-"NC"
# result2[i]<-"NC"
# }
# }
#
# Blue<-cbind(ufcfightdata.df$B_Name, ufcfightdata.df[, c(1:5)], ufcfightdata.df$B_Weight, subm.attp1,td.attp1, td1, bodytot.attp1, bodytot1, head.attp1, head1, leg.attp1, leg1, kick.attp1, kick1, punch.attp1, punch1, strike.attp1, strike1, KD1, result1, ufcfightdata.df$winby, ufcfightdata.df$Max_round)
# Red<-cbind(ufcfightdata.df$R_Name, ufcfightdata.df[, c(450:454)], ufcfightdata.df$R_Weight, subm.attp2, td.attp2, td2, bodytot.attp2, bodytot2, head.attp2, head2, leg.attp2, leg2, kick.attp2, kick2, punch.attp2, punch2, strike.attp2, strike2, KD2, result2, ufcfightdata.df$winby, ufcfightdata.df$Max_round)
# colnames(Blue)<-c("Name", "PreFights", "Streak", "Age", "Height", "Hometown", "Weight", "A.Sub", "A.TD", "TD", "A.Body", "Body", "A.Head", "Head", "A.Leg", "Leg", "A.Kick", "Kick","A.Punch", "Punch", "A.Strike", "Strike", "KD", "Result", "By", "Rounds")
# colnames(Red)<-c("Name", "PreFights", "Streak", "Age", "Height", "Hometown", "Weight", "A.Sub","A.TD", "TD", "A.Body", "Body", "A.Head", "Head", "A.Leg", "Leg", "A.Kick", "Kick", "A.Punch", "Punch", "A.Strike", "Strike", "KD", "Result", "By", "Rounds")
# ## stack the two corners, alternating Blue and Red rows for each matchup
# newdata<-data.frame()
# for (j in c(1:1477)) {
# newdata<-rbind(newdata, Blue[j, ], Red[j, ])
# }
# ## height and weight, BMI (computed once newdata exists)
# height_m<-newdata$Height/100
# BMI<-newdata$Weight/(height_m*height_m)
# ##update data frame, add BMI column and delete rows not W/L
# thedata<-cbind(newdata, BMI)
# thedata<-thedata[-c(which(thedata$Result!="W" & thedata$Result!="L")), ]
# thedata<-thedata[-c(893, 894, 1121,1122,1523,1524,1849,1850,2129,2130), ] ## remove NA's
# class<-rep(0, 2892)
# for (i in c(1:2892)) {
# if (thedata$Weight[i]<=65) {
# class[i]<-1
# } else if (thedata$Weight[i]>65 & thedata$Weight[i]<=80) {
# class[i]<-2
# } else if (thedata$Weight[i]>80 & thedata$Weight[i]<=95) {
# class[i]<-3
# } else if (thedata$Weight[i]>95 & thedata$Weight[i]<=110) {
# class[i]<-4
# } else {
# class[i]<-5
# }
# }
# thedata<-cbind(thedata, class)
# write.csv(thedata, file="/Users/mozhuning/我的文件/GWU/Classes/Master/Fall 2017/DATS-Introduction to Data Science/Project/Project II/thedata.csv",row.names = FALSE)
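The loops above for converting the winner label and for binning weights can also be written in vectorized form. The following is a minimal sketch on a toy data frame, not the code used to produce thedata.csv. The `toy` data frame and its values are made up for illustration; the cut points (65, 80, 95, 110) and the BMI formula follow the code above.

```r
# Toy stand-in for the fighter table (values are illustrative only)
toy <- data.frame(
  Height = c(182, 187, 175, 170, 193),   # cm
  Weight = c(84, 84, 70, 56, 120),       # kg
  winner = c("red", "blue", "draw", "blue", "no contest"),
  stringsAsFactors = FALSE
)

# Result from the blue corner's point of view: W, L, D, or NC
toy$Result <- ifelse(toy$winner == "blue", "W",
              ifelse(toy$winner == "red",  "L",
              ifelse(toy$winner == "draw", "D", "NC")))

# BMI = weight (kg) / height (m)^2
toy$BMI <- toy$Weight / (toy$Height / 100)^2

# Five rough weight classes; intervals are closed on the right,
# so 65 falls in class 1 and 80 in class 2, as in the loop version
toy$class <- cut(toy$Weight,
                 breaks = c(-Inf, 65, 80, 95, 110, Inf),
                 labels = FALSE)

# Keep only decisive fights (wins and losses)
toy_wl <- toy[toy$Result %in% c("W", "L"), ]
print(toy_wl)
```

Using `cut()` keeps the class boundaries in a single vector, which makes it harder for two branches to disagree about a cut point.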
Second, we created another data frame with all the original fight statistics summed across all rounds, with all NA’s replaced with 0, draws and no contests removed, and BMI added. This data frame retains the structure of the original file.
# ufcfightdata.df[is.na(ufcfightdata.df)]<-0 ##replace NA with 0
# for (x in c(10:96)) {
# ufcfightdata.df[,x]<-rowSums(cbind(ufcfightdata.df[, c(x,x+87, x+174, x+261, x+348)]), na.rm = TRUE)
# }
# for (y in c(459:545)) {
# ufcfightdata.df[,y]<-rowSums(cbind(ufcfightdata.df[, c(y,y+87, y+174, y+261, y+348)]), na.rm = TRUE)
# }
# ufcfightdata.df<-ufcfightdata.df[, -c(97:444, 546:893)]
# ufcfightdata.df<-ufcfightdata.df[-c(which(ufcfightdata.df$winner!="blue" & ufcfightdata.df$winner!="red")), ]
# ufcfightdata.df<-ufcfightdata.df[-c(which(rowSums(ufcfightdata.df[, c(10:96,111:197)])==0)),]
# height_b<-ufcfightdata.df$B_Height/100
# BMI_b<-ufcfightdata.df$B_Weight/(height_b*height_b)
# height_r<-ufcfightdata.df$R_Height/100
# BMI_r<-ufcfightdata.df$R_Weight/(height_r*height_r)
# ufcfightdata.df<-cbind(ufcfightdata.df, BMI_b, BMI_r)
# write.csv(ufcfightdata.df, file="/Users/mozhuning/我的文件/GWU/Classes/Master/Fall 2017/DATS-Introduction to Data Science/Project/Project II/NEWUFCPCA.csv", row.names = FALSE)
Lastly, another version of this dataset has the fighters in each matchup split up and stacked. The sum of all attempted strikes sum1 and that of all landed strikes sum2 are added to the data frame.
# newresult1<-rep("a", 1215)
# newresult2<-rep("a", 1215)
# for (i in c(1:1215)) {
# if (ufcfightdata.df$winner[i]=="red") {
# newresult1[i]<-"L"
# newresult2[i]<-"W"
# } else {
# newresult1[i]<-"W"
# newresult2[i]<-"L"
# }
# }
# corner1<-rep("B", 1215)
# corner2<-rep("R", 1215)
# bluenew<-data.frame(cbind(ufcfightdata.df[,c(1:96,200)], newresult1, corner1))
# rednew<-data.frame(cbind(ufcfightdata.df[,c(102:197,201)], newresult2, corner2))
# colnames(rednew)<-colnames(bluenew)
# newufc_split<-data.frame(rbind(bluenew,rednew))
# newclass<-rep(0, 2430)
# for (l in c(1:2430)) {
# if (newufc_split$B_Weight[l]<=65) {
# newclass[l]<-1
# } else if (newufc_split$B_Weight[l]>65 & newufc_split$B_Weight[l]<=80) {
# newclass[l]<-2
# } else if (newufc_split$B_Weight[l]>80 & newufc_split$B_Weight[l]<=95) {
# newclass[l]<-3
# } else if (newufc_split$B_Weight[l]>95 & newufc_split$B_Weight[l]<=110) {
# newclass[l]<-4
# } else {
# newclass[l]<-5
# }
# }
# newufc_split<-cbind(newufc_split, newclass)
# s1<-c(seq(from=13, to=69, by=2), seq(from=72, to=82, by=2))
# s2<-c(seq(from=14, to=70, by=2), seq(from=73, to=83, by=2))
# sum1<-rowSums(cbind(newufc_split[, s1])) #attempts
# sum2<-rowSums(cbind(newufc_split[, s2])) #landed
# newufc_split<-cbind(newufc_split, sum1,sum2)
# write.csv(newufc_split, file="/Users/mozhuning/我的文件/GWU/Classes/Master/Fall 2017/DATS-Introduction to Data Science/Project/Project II/newufc_split.csv",
# row.names = FALSE)
Here is a look at the new data frames. We will only display the smaller dataset here.
thedata<-data.frame(read.csv("thedata.csv"))
NEW_Dataset <- data.frame(read.csv("NEWUFC.csv"))
newufc_split<-data.frame(read.csv("newufc_split.csv"))
str(thedata)
## 'data.frame': 2892 obs. of 28 variables:
## $ Name : Factor w/ 844 levels "Aaron Phillips",..: 544 73 120 165 755 399 102 505 252 656 ...
## $ PreFights: int 1 6 0 0 2 2 0 6 3 5 ...
## $ Streak : int 1 1 0 0 0 0 0 4 1 2 ...
## $ Age : int 23 27 32 29 38 32 23 25 30 28 ...
## $ Height : int 182 187 175 182 172 177 170 175 167 170 ...
## $ Hometown : Factor w/ 643 levels "","Aarhus Denmark",..: 577 247 95 134 254 75 568 221 526 111 ...
## $ Weight : int 84 84 70 70 70 70 56 56 61 61 ...
## $ A.Sub : int 1 4 0 0 0 0 0 15 1 4 ...
## $ A.TD : int 1 39 0 0 0 8 0 31 7 16 ...
## $ TD : int 1 19 0 0 0 2 0 11 2 5 ...
## $ A.Body : int 11 65 0 0 41 17 0 128 35 90 ...
## $ Body : int 11 52 0 0 27 16 0 108 21 72 ...
## $ A.Head : int 57 385 0 0 208 133 0 699 448 439 ...
## $ Head : int 39 201 0 0 94 56 0 342 171 189 ...
## $ A.Leg : int 0 0 0 0 0 0 0 3 0 13 ...
## $ Leg : int 0 0 0 0 0 0 0 3 0 13 ...
## $ A.Kick : int 0 0 0 0 15 3 0 0 0 44 ...
## $ Kick : int 0 0 0 0 12 2 0 0 0 42 ...
## $ A.Punch : int 0 0 0 0 242 50 0 0 0 149 ...
## $ Punch : int 0 0 0 0 118 32 0 0 0 58 ...
## $ A.Strike : int 68 469 0 0 261 152 0 864 495 578 ...
## $ Strike : int 50 270 0 0 131 74 0 486 203 304 ...
## $ KD : int 0 0 0 0 0 0 0 1 0 0 ...
## $ Result : Factor w/ 2 levels "L","W": 1 2 2 1 1 2 2 1 1 2 ...
## $ By : Factor w/ 4 levels "","DEC","KO/TKO",..: 2 2 4 4 3 3 4 4 2 2 ...
## $ Rounds : int 3 3 3 3 3 3 3 3 3 3 ...
## $ BMI : num 25.4 24 22.9 21.1 23.7 ...
## $ class : int 3 3 2 2 2 2 1 1 1 1 ...
summary(thedata)
## Name PreFights Streak
## Donald Cerrone : 13 Min. : 0.000 Min. :0.0000
## Neil Magny : 12 1st Qu.: 0.000 1st Qu.:0.0000
## Beneil Dariush : 11 Median : 1.000 Median :0.0000
## Gegard Mousasi : 11 Mean : 1.939 Mean :0.6909
## Derrick Lewis : 10 3rd Qu.: 3.000 3rd Qu.:1.0000
## Francisco Trinaldo: 10 Max. :12.000 Max. :9.0000
## (Other) :2825
## Age Height Hometown
## Min. :20.00 Min. :152.0 Rio de Janeiro Brazil : 75
## 1st Qu.:28.00 1st Qu.:172.0 Sao Paulo Brazil : 40
## Median :31.00 Median :177.0 Dublin Ireland : 30
## Mean :31.16 Mean :177.5 Dagestan Russia : 24
## 3rd Qu.:34.00 3rd Qu.:182.0 Phoenix, Arizona USA : 24
## Max. :46.00 Max. :213.0 Milwaukee, Wisconsin USA: 22
## (Other) :2677
## Weight A.Sub A.TD TD
## Min. : 52.00 Min. : 0.0000 Min. : 0 Min. : 0.000
## 1st Qu.: 65.00 1st Qu.: 0.0000 1st Qu.: 0 1st Qu.: 0.000
## Median : 70.00 Median : 0.0000 Median : 2 Median : 1.000
## Mean : 73.84 Mean : 0.7604 Mean : 6 Mean : 2.275
## 3rd Qu.: 84.00 3rd Qu.: 1.0000 3rd Qu.: 8 3rd Qu.: 3.000
## Max. :120.00 Max. :16.0000 Max. :69 Max. :31.000
##
## A.Body Body A.Head Head
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.0 1st Qu.: 0.00
## Median : 16.00 Median : 12.00 Median : 95.0 Median : 41.00
## Mean : 30.17 Mean : 22.94 Mean : 151.5 Mean : 67.34
## 3rd Qu.: 44.00 3rd Qu.: 33.00 3rd Qu.: 225.0 3rd Qu.:101.00
## Max. :321.00 Max. :282.00 Max. :1596.0 Max. :608.00
##
## A.Leg Leg A.Kick Kick
## Min. : 0.000 Min. : 0.000 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 0.000 Median : 0.000 Median : 0.00 Median : 0.000
## Mean : 2.135 Mean : 1.745 Mean : 11.21 Mean : 7.794
## 3rd Qu.: 0.000 3rd Qu.: 0.000 3rd Qu.: 14.00 3rd Qu.: 9.000
## Max. :62.000 Max. :53.000 Max. :178.00 Max. :129.000
##
## A.Punch Punch A.Strike Strike
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.0 1st Qu.: 0.0
## Median : 0.00 Median : 0.00 Median : 129.0 Median : 64.0
## Mean : 50.48 Mean : 25.55 Mean : 198.9 Mean :104.5
## 3rd Qu.: 73.00 3rd Qu.: 38.00 3rd Qu.: 302.0 3rd Qu.:159.0
## Max. :749.00 Max. :400.00 Max. :1970.0 Max. :875.0
##
## KD Result By Rounds BMI
## Min. :0.000 L:1446 : 4 Min. :3.00 Min. :17.45
## 1st Qu.:0.000 W:1446 DEC :1428 1st Qu.:3.00 1st Qu.:21.33
## Median :0.000 KO/TKO: 906 Median :3.00 Median :22.50
## Mean :0.397 SUB : 554 Mean :3.21 Mean :23.24
## 3rd Qu.:1.000 3rd Qu.:3.00 3rd Qu.:24.54
## Max. :8.000 Max. :5.00 Max. :38.30
##
## class
## Min. :1.000
## 1st Qu.:1.000
## Median :2.000
## Mean :1.997
## 3rd Qu.:3.000
## Max. :5.000
##
str(NEW_Dataset)
summary(NEW_Dataset)
str(newufc_split)
summary(newufc_split)
library(ggplot2)
library(ResourceSelection)
library(pROC)
library(pscl)
library(corrplot)
library(caTools)
library(caret)
library(e1071)
library(cluster)
library(leaps)
library(ISLR)
The variables in our data are not normally distributed. Here is an example using Weight, for which the Shapiro-Wilk test rejects normality.
shapiro.test(thedata$Weight)
##
## Shapiro-Wilk normality test
##
## data: thedata$Weight
## W = 0.89954, p-value < 2.2e-16
qqnorm(thedata$Weight)
hist(thedata$Weight)
We want to see if some of the variables are correlated with each other.
ttcor<-cor(thedata[,c(2:5,7:23,27)])
par(xpd=TRUE)
corrplot(ttcor, type = "lower", order = "hclust",
tl.col = "black", tl.srt = 90, mar = c(1,1,.5,.5))
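To complement the plot, strongly correlated pairs can also be listed explicitly. The sketch below uses simulated data as a stand-in for the numeric columns of thedata; the variable values and the 0.7 cutoff are assumptions chosen for illustration, not values used in our analysis.

```r
set.seed(1)
# Simulated stand-in for a few numeric columns
n <- 200
attempted <- rpois(n, 100)
toy <- data.frame(
  A.Strike = attempted,
  Strike   = rbinom(n, attempted, 0.8),  # landed strikes track attempts closely
  Age      = round(rnorm(n, 31, 4))
)

r <- cor(toy)
# Look only above the diagonal so each pair is listed once
r[lower.tri(r, diag = TRUE)] <- NA
high <- which(abs(r) > 0.7, arr.ind = TRUE)   # 0.7 is an arbitrary cutoff
high_pairs <- data.frame(
  var1 = rownames(r)[high[, "row"]],
  var2 = colnames(r)[high[, "col"]],
  r    = r[high]
)
print(high_pairs)
```

A list like this makes it easier to decide which member of each attempted/landed pair to drop before model fitting.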
This is our main model. We want to see if any of the variables can be used to predict the result of the fight (win or lose in our case).
We trimmed the dataset by removing character variables and highly correlated variables. We then used regsubsets for variable selection and made both the adjusted R^2 and BIC plots.
trim<-thedata[, -c(1,5,6,7,9,11,13,15,17,19,21, 25, 26, 28)]
bestselect <- regsubsets(as.factor(Result)~., data = trim, nvmax = 14)
plot(bestselect, scale = "adjr2", main = "Adjusted R^2")
plot(bestselect, scale = "bic", main = "BIC")
summary(bestselect)
## Subset selection object
## Call: regsubsets.formula(as.factor(Result) ~ ., data = trim, nvmax = 14)
## 13 Variables (and intercept)
## Forced in Forced out
## PreFights FALSE FALSE
## Streak FALSE FALSE
## Age FALSE FALSE
## A.Sub FALSE FALSE
## TD FALSE FALSE
## Body FALSE FALSE
## Head FALSE FALSE
## Leg FALSE FALSE
## Kick FALSE FALSE
## Punch FALSE FALSE
## Strike FALSE FALSE
## KD FALSE FALSE
## BMI FALSE FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
## PreFights Streak Age A.Sub TD Body Head Leg Kick Punch Strike
## 1 ( 1 ) " " " " "*" " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " "*" " " " " " " " " " " " " "*" " "
## 3 ( 1 ) " " " " "*" " " " " " " " " " " " " "*" " "
## 4 ( 1 ) " " "*" "*" " " " " " " " " " " " " "*" " "
## 5 ( 1 ) " " "*" "*" " " " " " " " " "*" " " "*" " "
## 6 ( 1 ) " " "*" "*" " " " " " " "*" " " " " "*" "*"
## 7 ( 1 ) " " "*" "*" " " " " "*" "*" "*" " " "*" " "
## 8 ( 1 ) " " "*" "*" "*" " " "*" "*" "*" " " "*" " "
## 9 ( 1 ) " " "*" "*" "*" "*" "*" "*" "*" " " "*" " "
## 10 ( 1 ) " " "*" "*" "*" "*" " " "*" "*" "*" "*" "*"
## 11 ( 1 ) " " "*" "*" "*" "*" " " "*" "*" "*" "*" "*"
## 12 ( 1 ) " " "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## KD BMI
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " "*"
## 4 ( 1 ) " " "*"
## 5 ( 1 ) " " "*"
## 6 ( 1 ) " " "*"
## 7 ( 1 ) " " "*"
## 8 ( 1 ) " " "*"
## 9 ( 1 ) " " "*"
## 10 ( 1 ) " " "*"
## 11 ( 1 ) "*" "*"
## 12 ( 1 ) "*" "*"
## 13 ( 1 ) "*" "*"
WLlogit4 <- glm(Result~Age+Punch+BMI+Streak+Leg+Head, binomial(link = "logit"), data = thedata)
summary(WLlogit4)
##
## Call:
## glm(formula = Result ~ Age + Punch + BMI + Streak + Leg + Head,
## family = binomial(link = "logit"), data = thedata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.70811 -1.15905 -0.08079 1.16634 1.54679
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.9047311 0.3580628 2.527 0.0115 *
## Age -0.0527010 0.0099707 -5.286 1.25e-07 ***
## Punch 0.0026571 0.0010465 2.539 0.0111 *
## BMI 0.0263533 0.0131633 2.002 0.0453 *
## Streak 0.0462406 0.0402402 1.149 0.2505
## Leg -0.0096158 0.0074580 -1.289 0.1973
## Head 0.0006362 0.0006389 0.996 0.3194
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4009.2 on 2891 degrees of freedom
## Residual deviance: 3957.3 on 2885 degrees of freedom
## AIC: 3971.3
##
## Number of Fisher Scoring iterations: 4
WLlogit3 <- glm(Result~Age+Punch, binomial(link = "logit"), data = thedata)
summary(WLlogit3)
##
## Call:
## glm(formula = Result ~ Age + Punch, family = binomial(link = "logit"),
## data = thedata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.63534 -1.16607 -0.04189 1.16847 1.52692
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.4084898 0.2928307 4.810 1.51e-06 ***
## Age -0.0478415 0.0093308 -5.127 2.94e-07 ***
## Punch 0.0032463 0.0007997 4.059 4.92e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4009.2 on 2891 degrees of freedom
## Residual deviance: 3966.9 on 2889 degrees of freedom
## AIC: 3972.9
##
## Number of Fisher Scoring iterations: 4
We also tried different combinations of variables in training and testing, and found that Age, Punch, BMI, Streak, Strike, and Head give a slightly lower AIC on the full data (3970.6, compared with 3971.3 for the model with Leg in place of Strike).
train <- thedata[1:2024, ]
test<-thedata[2025:2892,]
WLlogit2 <- glm(Result~Age+Punch+BMI+Streak+Strike+Head, binomial(link = "logit"), data = train)
summary(WLlogit2)
##
## Call:
## glm(formula = Result ~ Age + Punch + BMI + Streak + Strike +
## Head, family = binomial(link = "logit"), data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7027 -1.1557 -0.1217 1.1731 1.5138
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.785393 0.432068 1.818 0.0691 .
## Age -0.046741 0.011882 -3.934 8.37e-05 ***
## Punch 0.002576 0.001209 2.131 0.0331 *
## BMI 0.023204 0.015870 1.462 0.1437
## Streak 0.053099 0.048548 1.094 0.2741
## Strike -0.002707 0.001597 -1.695 0.0901 .
## Head 0.004662 0.002475 1.884 0.0596 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2805.9 on 2023 degrees of freedom
## Residual deviance: 2768.7 on 2017 degrees of freedom
## AIC: 2782.7
##
## Number of Fisher Scoring iterations: 4
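The model fitted on `train` can be checked against the held-out rows. A minimal sketch of such a check follows. It uses simulated data with made-up effect sizes, a random 70/30 split rather than a split by row order, and a 0.5 classification cutoff; all of these are assumptions for illustration.

```r
set.seed(42)
n <- 1000
sim <- data.frame(Age = round(rnorm(n, 31, 4)),
                  Punch = rpois(n, 25))
# Simulated outcome: assumed small negative Age effect, small positive Punch effect
p <- plogis(1.4 - 0.05 * sim$Age + 0.003 * sim$Punch)
sim$Result <- factor(ifelse(runif(n) < p, "W", "L"), levels = c("L", "W"))

# Random 70/30 split
idx <- sample(seq_len(n), size = 0.7 * n)
train_sim <- sim[idx, ]
test_sim  <- sim[-idx, ]

fit <- glm(Result ~ Age + Punch, family = binomial(link = "logit"),
           data = train_sim)

# Predicted probability of "W" on held-out rows, then classify at 0.5
prob_test <- predict(fit, newdata = test_sim, type = "response")
pred_test <- factor(ifelse(prob_test > 0.5, "W", "L"), levels = c("L", "W"))

conf_mat <- table(Actual = test_sim$Result, Predicted = pred_test)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(conf_mat)
print(accuracy)
```

Fixing the factor levels of the predictions keeps the table 2 x 2 even when the model predicts only one class.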
WLlogit <- glm(Result~Age+Punch+BMI+Streak+Strike+Head, binomial(link = "logit"), data = thedata)
summary(WLlogit)
##
## Call:
## glm(formula = Result ~ Age + Punch + BMI + Streak + Strike +
## Head, family = binomial(link = "logit"), data = thedata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.78732 -1.15622 -0.08918 1.16844 1.54465
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.922652 0.357950 2.578 0.00995 **
## Age -0.052431 0.009974 -5.257 1.47e-07 ***
## Punch 0.002415 0.001018 2.371 0.01772 *
## BMI 0.025366 0.013175 1.925 0.05419 .
## Streak 0.053355 0.040393 1.321 0.18654
## Strike -0.002108 0.001363 -1.546 0.12204
## Head 0.003631 0.002075 1.750 0.08020 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4009.2 on 2891 degrees of freedom
## Residual deviance: 3956.6 on 2885 degrees of freedom
## AIC: 3970.6
##
## Number of Fisher Scoring iterations: 4
hoslem.test(thedata$Result, fitted(WLlogit))
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: thedata$Result, fitted(WLlogit)
## X-squared = 2892, df = 8, p-value < 2.2e-16
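The statistic above is exactly equal to the number of observations (2892). That is unusual for this test and may indicate that the factor Result was passed where a numeric 0/1 vector is expected. The sketch below shows the call with a 0/1 response on simulated data. It assumes the ResourceSelection package is installed and skips the test otherwise; the simulated effect size is made up.

```r
set.seed(7)
n <- 500
sim <- data.frame(Age = round(rnorm(n, 31, 4)))
sim$Result <- factor(ifelse(runif(n) < plogis(1.5 - 0.05 * sim$Age), "W", "L"),
                     levels = c("L", "W"))

fit <- glm(Result ~ Age, family = binomial, data = sim)

# Convert the factor to 0/1 so that 1 means "W", matching what glm modelled
y01 <- as.numeric(sim$Result == "W")

if (requireNamespace("ResourceSelection", quietly = TRUE)) {
  # g = 10 groups is the function's default
  print(ResourceSelection::hoslem.test(y01, fitted(fit), g = 10))
}
```

Applied to our data, the equivalent call would be `hoslem.test(as.numeric(thedata$Result == "W"), fitted(WLlogit))`.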
prob=predict(WLlogit, type = c("response"))
h <- roc(thedata$Result~prob, data=thedata)
h
##
## Call:
## roc.formula(formula = thedata$Result ~ prob, data = thedata)
##
## Data: prob in 1446 controls (thedata$Result L) < 1446 cases (thedata$Result W).
## Area under the curve: 0.5746
plot(h)
Although the ROC curve is rather flat and the AUC is only about .57, the p-values for the Age and Punch variables are below .05. The coefficient for Punch is slightly positive, while the coefficient for Age is slightly negative.
In our bid to improve our model, instead of selecting the variables we decided to go for dimensionality reduction. Our choice was Principal Component Analysis. To keep things simple we decided to select 2 components and then re-run the logistic regression on them.
The first step was to read in the dataset after trimming it and removing columns such as name, id, and hometown in Excel.
We split the dataset, using 80% for training and the remaining 20% for testing.
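To make the size of these effects concrete, the coefficients can be converted to odds ratios by exponentiating them. The sketch below uses the Age and Punch estimates reported in the WLlogit summary above; the 5-year and 50-punch differences in the last step are arbitrary examples.

```r
# Estimates copied from the WLlogit summary above
beta <- c(Age = -0.052431, Punch = 0.002415)

# Odds ratio for a one-unit increase in each predictor
odds_ratio <- exp(beta)
print(round(odds_ratio, 4))

# Percent change in the odds of winning per unit increase
pct_change <- 100 * (odds_ratio - 1)
print(round(pct_change, 2))

# Effect of a larger difference, e.g. 5 years of age or 50 landed punches
print(round(exp(beta * c(5, 50)), 3))
```

With the fitted model in memory, `exp(coef(WLlogit))` gives the same quantities for every term. Each additional year of age lowers the odds of winning by roughly 5%, holding the other variables fixed.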
NEW_Dataset <- read.csv("NEWUFC.csv")
NEW_Dataset <- data.frame(NEW_Dataset)
library(caTools)
set.seed(123)
split = sample.split(NEW_Dataset$winner, SplitRatio = 0.8)
training_set = subset(NEW_Dataset, split == TRUE)
test_set = subset(NEW_Dataset, split == FALSE)
PCA requires standardized data, so we scaled the dataset using the scale function.
PCA is an unsupervised algorithm, so we scaled only the independent variables, excluding the dependent variable, in both the training and the testing sets.
training_set[-183] = scale(training_set[-183])
test_set[-183] = scale(test_set[-183])
We then used the preProcess function to fit the PCA on the training set, keeping only the top 2 components, and re-arranged the training set so that the dependent variable comes last.
pca = preProcess(x = training_set[-183], method = 'pca', pcaComp = 2)
training_set = predict(pca, training_set)
training_set = training_set[c(2, 3, 1)]
We then projected the test set onto the same 2 components, using the PCA fitted on the training set, and re-arranged the test set in the same way.
test_set = predict(pca, test_set)
test_set = test_set[c(2, 3, 1)]
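In the code above, the test set is standardized with its own means and standard deviations before being projected. A common alternative is to reuse the training set's constants so that both sets are on the same scale. The sketch below illustrates this in base R with prcomp rather than preProcess, on simulated predictors; the data and the column names x1 to x4 are hypothetical.

```r
set.seed(123)
# Simulated numeric predictors standing in for the fight statistics
n <- 300
base <- rnorm(n)
X <- data.frame(x1 = base + rnorm(n, sd = 0.3),
                x2 = base + rnorm(n, sd = 0.3),
                x3 = rnorm(n),
                x4 = rnorm(n))

train_idx <- sample(seq_len(n), size = 0.8 * n)
X_train <- X[train_idx, ]
X_test  <- X[-train_idx, ]

# Fit PCA on the training rows only; prcomp stores the centering and
# scaling constants it used
pca_fit <- prcomp(X_train, center = TRUE, scale. = TRUE)

# Proportion of variance captured by each component
var_explained <- pca_fit$sdev^2 / sum(pca_fit$sdev^2)
print(round(var_explained, 3))

# predict() re-applies the training center/scale to new rows before rotating
pcs_train <- predict(pca_fit, newdata = X_train)[, 1:2]
pcs_test  <- predict(pca_fit, newdata = X_test)[, 1:2]
```

Checking `var_explained` also shows how much information is retained when only 2 components are kept.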
Next, we fit Logistic Regression to the Training set.
classifier = glm(formula = winner ~ .,
family = binomial,
data = training_set)
summary(classifier)
##
## Call:
## glm(formula = winner ~ ., family = binomial, data = training_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5728 -1.3241 0.9679 1.0257 1.3553
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.355628 0.065415 5.436 5.43e-08 ***
## PC1 -0.010093 0.009049 -1.115 0.2647
## PC2 -0.031551 0.013084 -2.411 0.0159 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1317.6 on 971 degrees of freedom
## Residual deviance: 1310.4 on 969 degrees of freedom
## AIC: 1316.4
##
## Number of Fisher Scoring iterations: 4
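The AIC (1316.4) is lower than that of the previous model (3970.6), but the two models were fitted to different numbers of observations (972 versus 2892), so the AIC values are not directly comparable. Of the two components, only PC2 is significant at the .05 level.
We then predicted the test set results and made the confusion matrix. The model is about 60% accurate (145 of 243 test fights classified correctly), although it predicts class 1 for almost every fight.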
prob_pred = predict(classifier, type = 'response', newdata = test_set[-3])
y_pred = ifelse(prob_pred > 0.5, 1, 0)
cm = table(test_set[, 3], y_pred > 0.5)
cm
##
## FALSE TRUE
## 0 5 95
## 1 3 140
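The accuracy figure can be recomputed from the confusion matrix, together with a few related measures. The sketch below re-enters the four counts shown above by hand; the majority-class baseline is added here for comparison.

```r
# Confusion matrix copied from the output above
# rows = actual winner (0/1), columns = predicted class (FALSE = 0, TRUE = 1)
cm <- matrix(c(5, 3, 95, 140), nrow = 2,
             dimnames = list(actual = c("0", "1"),
                             predicted = c("FALSE", "TRUE")))

accuracy    <- sum(diag(cm)) / sum(cm)            # (5 + 140) / 243
sensitivity <- cm["1", "TRUE"]  / sum(cm["1", ])  # actual 1 predicted as 1
specificity <- cm["0", "FALSE"] / sum(cm["0", ])  # actual 0 predicted as 0
baseline    <- max(rowSums(cm)) / sum(cm)         # always predict the majority class

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, baseline = baseline), 3))
```

The accuracy is close to the baseline because the classifier assigns nearly every test fight to class 1.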
We decided to plot the ROC as well.
probn=predict(classifier, type = c("response"))
roc(test_set$winner~prob_pred, data=test_set)
##
## Call:
## roc.formula(formula = test_set$winner ~ prob_pred, data = test_set)
##
## Data: prob_pred in 100 controls (test_set$winner 0) < 143 cases (test_set$winner 1).
## Area under the curve: 0.5838
plot(roc(test_set$winner~prob_pred, data=test_set))
We want to see if there are significant differences between winners and losers in fighting style, strategy, and technical soundness (accuracy of strikes, etc.), so we use K-Means to see if there are clusters.
Plots (scatterplots and boxplots) are graphed as an initial visual check. Only a few are shown here.
ggplot(data=newufc_split, aes(x=newufc_split[,83], y=newufc_split[,82], color = newufc_split$newresult1)) + geom_point()+labs(x="Landed Total Strikes", y="Attempted Total Strikes")+scale_fill_continuous(guide = guide_legend(title = NULL))
ggplot(data=newufc_split, aes(x=newufc_split$newresult1, newufc_split[,83])) +
geom_boxplot()+labs(x="Result", y="Landed Total Strikes")
ggplot(data=newufc_split, aes(x=newufc_split$newresult1, newufc_split$B_Age)) +
geom_boxplot()+labs(x="Result", y="Age")
###according to weight for interest
ggplot(data=newufc_split, aes(x=newufc_split[,83], y=newufc_split[,82], color = newufc_split$newclass)) + geom_point()+labs(x="Landed Total Strikes", y="Attempted Total Strikes")
As seen in the displayed graphs, the difference is rather small.
We then used the elbow method to find the optimal number of clusters.
set.seed(6)
wcss = vector()
for (i in 1:10) wcss[i] = sum(kmeans(newufc_split[,82:83], i)$withinss)
par(mar=c(4,4,4,4))
plot(1:10,
wcss,
type = 'b',
main = paste('The Elbow Method'),
xlab = 'Number of clusters',
ylab = 'WCSS')
# Fitting K-Means to the dataset
# (the result is named kmeans_fit so that the kmeans function is not shadowed)
set.seed(29)
kmeans_fit = kmeans(x = newufc_split[,82:83], centers = 5)
y_kmeans = kmeans_fit$cluster
# Visualising the clusters
clusplot(newufc_split[,82:83],
y_kmeans,
lines = 0,
shade = TRUE,
color = TRUE,
labels = 2,
plotchar = FALSE,
span = TRUE,
main = paste('Clusters'),
xlab = 'ld',
ylab = 'attp')
set.seed(1)
ufccluster <- kmeans(newufc_split[, c(82,83)], 4, nstart = 20)
#ufccluster
table(ufccluster$cluster, newufc_split$newresult1)
##
## L W
## 1 596 537
## 2 395 404
## 3 184 199
## 4 40 75
ggplot(newufc_split, aes(x=newufc_split[,83], y=newufc_split[,82], color = as.factor(ufccluster$cluster))) + geom_point()
set.seed(1)
ufccluster1 <- kmeans(newufc_split[, c(101,102)], 4, nstart = 20)
table(ufccluster1$cluster, newufc_split$newresult1)
##
## L W
## 1 402 403
## 2 23 51
## 3 646 583
## 4 144 178
We applied the above procedure to various variables, but unfortunately no significant results could be drawn. As seen from the example scatterplots and the cross tabulation of clustering, winners and losers are spread rather evenly in each cluster, suggesting that there is no evident division between winners and losers.
We also tried to fit a linear regression line of landed strikes against attempted strikes for winners and losers separately. Winners generally have a higher slope, and thus higher accuracy, but the difference from the slope of the losers is small (for example, 0.789 versus 0.759 for total body strikes).
Here are some examples.
#body total strikes
summary(lm(formula = newufc_split[which(newufc_split$newresult1=="W"),18]
~newufc_split[which(newufc_split$newresult1=="W"),17],
data = newufc_split)) ##0.789235
##
## Call:
## lm(formula = newufc_split[which(newufc_split$newresult1 == "W"),
## 18] ~ newufc_split[which(newufc_split$newresult1 == "W"),
## 17], data = newufc_split)
##
## Residuals:
## Min 1Q Median 3Q Max
## -66.327 -1.580 0.888 2.056 31.122
##
## Coefficients:
## Estimate
## (Intercept) -0.888086
## newufc_split[which(newufc_split$newresult1 == "W"), 17] 0.789235
## Std. Error t value
## (Intercept) 0.241714 -3.674
## newufc_split[which(newufc_split$newresult1 == "W"), 17] 0.004196 188.088
## Pr(>|t|)
## (Intercept) 0.000249 ***
## newufc_split[which(newufc_split$newresult1 == "W"), 17] < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.346 on 1213 degrees of freedom
## Multiple R-squared: 0.9668, Adjusted R-squared: 0.9668
## F-statistic: 3.538e+04 on 1 and 1213 DF, p-value: < 2.2e-16
summary(lm(formula = newufc_split[which(newufc_split$newresult1=="L"),18]
~newufc_split[which(newufc_split$newresult1=="L"),17],
data = newufc_split)) ##0.758835
##
## Call:
## lm(formula = newufc_split[which(newufc_split$newresult1 == "L"),
## 18] ~ newufc_split[which(newufc_split$newresult1 == "L"),
## 17], data = newufc_split)
##
## Residuals:
## Min 1Q Median 3Q Max
## -64.446 -1.855 0.151 2.056 33.602
##
## Coefficients:
## Estimate
## (Intercept) -0.150762
## newufc_split[which(newufc_split$newresult1 == "L"), 17] 0.758835
## Std. Error t value
## (Intercept) 0.240352 -0.627
## newufc_split[which(newufc_split$newresult1 == "L"), 17] 0.004651 163.146
## Pr(>|t|)
## (Intercept) 0.531
## newufc_split[which(newufc_split$newresult1 == "L"), 17] <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.303 on 1213 degrees of freedom
## Multiple R-squared: 0.9564, Adjusted R-squared: 0.9564
## F-statistic: 2.662e+04 on 1 and 1213 DF, p-value: < 2.2e-16
###kicks
summary(lm(formula = newufc_split[which(newufc_split$newresult1=="W"),70]
~newufc_split[which(newufc_split$newresult1=="W"),69],
data = newufc_split)) ##0.722423
##
## Call:
## lm(formula = newufc_split[which(newufc_split$newresult1 == "W"),
## 70] ~ newufc_split[which(newufc_split$newresult1 == "W"),
## 69], data = newufc_split)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.617 0.291 0.291 0.323 22.372
##
## Coefficients:
## Estimate
## (Intercept) -0.290954
## newufc_split[which(newufc_split$newresult1 == "W"), 69] 0.722423
## Std. Error t value
## (Intercept) 0.127182 -2.288
## newufc_split[which(newufc_split$newresult1 == "W"), 69] 0.004404 164.019
## Pr(>|t|)
## (Intercept) 0.0223 *
## newufc_split[which(newufc_split$newresult1 == "W"), 69] <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.821 on 1213 degrees of freedom
## Multiple R-squared: 0.9569, Adjusted R-squared: 0.9568
## F-statistic: 2.69e+04 on 1 and 1213 DF, p-value: < 2.2e-16
summary(lm(formula = newufc_split[which(newufc_split$newresult1=="L"),70]
~newufc_split[which(newufc_split$newresult1=="L"),69],
data = newufc_split)) ##0.684034
##
## Call:
## lm(formula = newufc_split[which(newufc_split$newresult1 == "L"),
## 70] ~ newufc_split[which(newufc_split$newresult1 == "L"),
## 69], data = newufc_split)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.789 -0.031 -0.031 -0.031 27.732
##
## Coefficients:
## Estimate
## (Intercept) 0.031238
## newufc_split[which(newufc_split$newresult1 == "L"), 69] 0.684034
## Std. Error t value
## (Intercept) 0.117722 0.265
## newufc_split[which(newufc_split$newresult1 == "L"), 69] 0.004451 153.672
## Pr(>|t|)
## (Intercept) 0.791
## newufc_split[which(newufc_split$newresult1 == "L"), 69] <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.651 on 1213 degrees of freedom
## Multiple R-squared: 0.9511, Adjusted R-squared: 0.9511
## F-statistic: 2.362e+04 on 1 and 1213 DF, p-value: < 2.2e-16
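The separate fits above can also be combined into a single model with an interaction term, which estimates the difference between the two slopes directly. The sketch below uses simulated data; the column names `attempted`, `landed`, and `result` and the accuracies of 0.70 and 0.75 are hypothetical and are not taken from our dataset.

```r
set.seed(10)
n <- 400
sim <- data.frame(
  attempted = rpois(n, 40),
  result    = factor(rep(c("L", "W"), each = n / 2), levels = c("L", "W"))
)
# Assumed accuracies for the simulation only: 0.70 for losers, 0.75 for winners
acc <- ifelse(sim$result == "W", 0.75, 0.70)
sim$landed <- rbinom(n, size = sim$attempted, prob = acc)

# One model for both groups; the attempted:resultW term is the
# difference between the winners' slope and the losers' slope
fit <- lm(landed ~ attempted * result, data = sim)
print(coef(summary(fit)))

slope_L <- coef(fit)["attempted"]
slope_W <- slope_L + coef(fit)["attempted:resultW"]
```

The t-test on the `attempted:resultW` row indicates whether the gap between the slopes is larger than would be expected by chance.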
With the given dataset, we have some reason to believe that age and the total number of punches landed have a more significant effect on a fight outcome than the other variables. However, since the overall model predicts poorly (AUC of about .57), we cannot conclude that certain factors have a definite effect. Hopefully a more comprehensive dataset can lead to a better model. An ideal dataset would include more complete measurements (including arm reach and leg reach), fight records, full career statistics, stance, training academy, injuries, demographics, etc. Such data exist only in raw form, scattered across different sources, and due to limited time and resources we did not attempt to collect and compile them.
One thing is certain: it is pretty hard to predict who the winner of a fight will be. By the time a fighter is capable of participating in the UFC, their skill level is so high that a fight is determined by very small differences in skill and/or chance. But the lack of easy predictability makes the UFC and the larger world of MMA more interesting and exciting. At the recent major UFC event, for example, the challengers/underdogs defeated the champions in all three main events.