UFC 208 took place in February 2017, with 9 out of 10 fights going to a decision. I wanted to look at that number in the context of UFC history.

library(ggplot2)
library(dplyr)
library(tidyr)
fighters = read.csv("C:/Users/Jason/Downloads/UFC-Fight-Card-Analysis-master/UFC-Fight-Card-Analysis-master/ALL UFC FIGHTERS 2%2F23%2F2016 SHERDOG.COM - Sheet1.csv", header=TRUE)
fights = read.csv("C:/Users/Jason/Downloads/UFC-Fight-Card-Analysis-master/UFC-Fight-Card-Analysis-master/ALL UFC FIGHTS 2%2F23%2F2016 SHERDOG.COM - Sheet1.csv", header=TRUE)

Some quick cleanup to drop the columns that hold URLs and convert the date from a character string to a Date object.

fights= fights[,-c(1,8,9)]
fighters = fighters[, -1]

fights$event_date = as.Date(fights$event_date, "%m/%d/%Y")

The fights data lists several different methods of victory, many of which are abnormalities (no contests, DQs, etc.). We need to keep only the fights that ended by one of the 4 main methods.

#Main methods of victory
methods=c("Submission", "TKO", "KO", "Decision")
fights_method_subset = fights %>% filter(method %in% methods)

An issue with the early UFCs was that the rules were constantly changing to try to appease different stakeholders. It wasn’t until Zuffa bought the company that the UFC really moved towards a consistent, (mostly) unified rules system. Let’s subset the data starting from UFC 30 (2/23/2001), which was the first event under Zuffa ownership.

fights_method_subset = fights_method_subset %>% filter(event_date >= as.Date("2001-02-23"))

We will attempt to answer the question of how likely a card with 90% decisions was by using hacker statistics. To do this, we will simulate a UFC card thousands of times to build a probability distribution of outcomes.

Start by creating vectors to store the number of results of each individual method of victory on each simulated card.

#initialize vectors for simulation
dec_sim=c()
sub_sim=c()
ko_sim =c()
tko_sim = c()

Set up the simulation parameters, including the number of iterations, the seed, and the number of fights on each simulated card (UFC 208 had 10 fights on it, so we’ll use 10).

(Note: My computer seems to churn through about 1000 iterations or so per minute. If you’re impatient, consider reducing the 20,000 iterations to a couple thousand. The more trials you run, though, the more stable the estimated probabilities will be.)

iterations=20000
set.seed(1)
# Set number of fights on the card
n_fights = 10

Create a Monte Carlo simulation, which randomly draws 10 fights from the data (with replacement) and then stores the count of each method of victory in the vectors we created.

for (i in 1:iterations){
  
  # create sample n-fight card  
  card = sample(fights_method_subset$method, n_fights, replace=TRUE)
  
  # assign no. of finishes of each type to a vector  
  for(method in methods){  
    
    #Loop through the 4 methods, get the sum of the number of finishes on the card, and assign it to the appropriate vector.
    x = sum(card == method)
    
    if(method == methods[2]){
      tko_sim[i] = x
    } else if(method == methods[1]){
      sub_sim[i] = x
    } else if(method == methods[3]){
      ko_sim[i] = x
    } else{
      dec_sim[i] = x
    }
    
  }
}

Move all of the simulation results into a data frame, sim_df. Then, run the summary stats. The average 10-fight card has about 4.5 decisions and 5.5 finishes.

sim_df = data.frame(dec = dec_sim, tko = tko_sim, ko = ko_sim, sub = sub_sim)
summary(sim_df)
      dec              tko              ko             sub       
 Min.   : 0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
 1st Qu.: 3.000   1st Qu.:1.000   1st Qu.:0.000   1st Qu.:1.000  
 Median : 4.000   Median :2.000   Median :1.000   Median :2.000  
 Mean   : 4.492   Mean   :2.212   Mean   :1.225   Mean   :2.071  
 3rd Qu.: 6.000   3rd Qu.:3.000   3rd Qu.:2.000   3rd Qu.:3.000  
 Max.   :10.000   Max.   :8.000   Max.   :7.000   Max.   :8.000  

The 4 methods are really just one variable, but they are spread across 4 columns. So this is a wide data frame that needs to be converted to long format for plotting, which can be done easily using tidyr’s gather function. (Despite its name, sim_wide holds the long-format result.)

sim_wide = gather(sim_df, method, n)
head(sim_wide)

View the frequency of each type of finish in the simulation. Looks like the average show typically clusters around 3-6 decisions, 0-2 knockouts, and 1-3 submissions and TKOs.

ggplot(sim_wide, aes(x=n, fill=as.factor(n)) ) + 
  geom_bar(aes(y = (..count..)), show.legend = F) +
  scale_x_continuous(breaks=0:10) + 
  facet_wrap(~method, ncol = 1)

Now let’s look at it as a cross-table.

sim_table = table(sim_wide$method, sim_wide$n)
sim_table
     
         0    1    2    3    4    5    6    7    8    9   10
  dec   54  391 1559 3275 4833 4738 3183 1454  423   79   11
  ko  5466 7492 4743 1781  427   78    9    4    0    0    0
  sub 2027 5039 6070 4148 1949  619  129   17    2    0    0
  tko 1634 4730 5903 4467 2282  766  183   34    1    0    0

So the probability of a 10-fight UFC card having exactly 9 decisions is approximately 79/20000 (or 90/20000 if you include the 11 times all 10 fights went to a decision). In decimal form, that comes out to…

#decimal probability
sum(sim_table["dec", c("9", "10")]) / iterations
[1] 0.0045
#One out of every x events
iterations / sum(sim_table["dec", c("9", "10")]) 
[1] 222.2222

So UFC 208 was, probabilistically, a one-in-222-shows event. For reference, UFC 208 was the 359th show of the Zuffa era, which means that, statistically speaking, a card like it should have occurred only about 1.6 times (359/222) during the entire 15-year sample. To visualize that outlier…

ggplot(subset(sim_wide, method=="dec"), aes(x=n, y= ..density..)) + 
  geom_freqpoly(binwidth=1, lwd=2) + 
  scale_x_continuous(breaks=0:10) + 
  geom_vline(xintercept=9, col="red", lwd=1.5) + 
  annotate("text", x=9, y=.2, hjust=-.25, color="red", size=6,
           label=paste("UFC 208 \n", 
                       sum(sim_table["dec", c("9", "10")])/iterations )) +
  ggtitle(paste("Probability of x Decisions on a", n_fights,  "Fight UFC Card")) +
  theme(plot.title = element_text(size = 16, face = "bold")) +
  xlab(paste("Decisions (out of ", n_fights, ")", sep=""))

Those results were in the simulated world we created. How does 9/10 decisions fare in the actual course of UFC history? We need to get the number of finishes on each actual show to find out.

#Get a list of all event names sampled
event_names = unique(fights_method_subset$event_name)
#Create a new DF that simply has each show in UFC history, and then a count of all the different finishes on that show (we'll populate those counts in the next loop)
events = data.frame(event = event_names, submissions = 0, decisions = 0, ko = 0, tko = 0, other = 0)
head(events)

Loop through the original fights DF (the subset we created got rid of all the “other” finishes, which we need to count again). In the loop, we’ll keep a running count of the number of different finishes on each show.

(the loop only takes about 30 seconds)

#loop through each event in the sample to count number of finishes per card
for(ev in event_names){
  
  #Since we threw out all the "other" finishes in the initial subset, we'll use the original fights DF to count the number of different results
  for(row in 1:nrow(fights)){
    
    #look for the proper row that corresponds to the looping event
    if(fights$event_name[row] == ev){
      
      #if the fight ended by submission, then add 1 to the new events DF we created under the subm. column
      if(fights$method[row] == "Submission"){
        events[events[,1] == ev, 2] = events[events[,1] == ev, 2] +1
        
      } else if(fights$method[row] == "Decision"){
        events[events[,1] == ev, 3] = events[events[,1] == ev, 3] +1
        
      } else if(fights$method[row] == "KO"){
        events[events[,1] == ev, 4] = events[events[,1] == ev, 4] +1
        
      } else if(fights$method[row] == "TKO"){
        events[events[,1] == ev, 5] = events[events[,1] == ev, 5] +1
        
      } else{
        events[events[,1] == ev, 6] = events[events[,1] == ev, 6] +1
      }
    } 
    
  }
}
head(events)

Now we can calculate the percentage of each finish on each card. First, we’ll count the number of fights on each card, and then use that to get the finish percentages.

#Fetch the sum of each row (removes only the first column, the event name)
events$n_fights = rowSums(events[-1])
#Calculate the percentages of finishes on each card
events = events %>%
  mutate(sub_pct = submissions/n_fights,
         dec_pct = decisions/n_fights,
         ko_pct = ko/n_fights,
         tko_pct = tko/n_fights
  ) %>%
  arrange(desc(dec_pct))

Plot the 10 cards in UFC history with the highest percentages of decisions. Note: to plot a categorical variable in anything other than the default order, R makes you re-level the factor, so I created a function that does that.

#Function to refactor a categorical variable by the order of a numeric variable
refactor = function(df, col_factor, col_sort){
  df[, col_factor] = factor(df[, col_factor], 
                            levels = df[,col_factor][order(-df[, col_sort])])   
  
  return(df)
}
#refactor the event list by highest decision percentage
events = refactor(events, "event", "dec_pct") %>% arrange(desc(dec_pct))
ggplot(events[1:10,], aes(x=event, y=dec_pct)) + 
  geom_col() + 
  theme(axis.text.x = element_text(face="italic", color="#993333", 
                                   size=8, angle=90)) +
  coord_flip(expand = T) +
  ggtitle("UFC Shows with Highest Percentage of Decisions") +
  scale_y_continuous(breaks = seq(0,.9,.1))

Only 3 of the 346 other shows analyzed had ever cleared 80% decisions, with the 83.3% of UFC 169 and UFC FN 36 being the highest. That really puts the 90% decisions of UFC 208 into perspective!

Here is the list of all shows with at least 70% decisions. Only 16 shows have even done that (which is only 4.6% of all shows)!

Now that we have devoted all this time to analyzing the anomaly that was UFC 208, one of the most boring UFC cards of all time, let’s finish on a high note by creating a watch list for UFC Fight Pass of UFC shows with the highest knockout percentage!

#refactor the event list by highest knockout percentage
events = refactor(events, "event", "ko_pct") %>% arrange(desc(ko_pct))
ggplot(events[1:10,], aes(x=event, y=ko_pct)) + 
  geom_col() + 
  theme(axis.text.x = element_text(face="italic", color="#993333", 
                           size=8, angle=90)) +
  coord_flip(expand = T) +
  ggtitle("UFC Shows with Highest Percentage of Knockouts") +
  scale_y_continuous(breaks = seq(0,.5,.05))

---
title: "UFC Fight Card Probability Analysis"
output:
  html_notebook: default
  html_document: default
---

```{r global_options, include=FALSE, echo=TRUE}
```

UFC 208 took place in February 2017, with 9 out of 10 fights going to a decision. I wanted to look at that number in the context of UFC history.

```{r, echo=TRUE}
library(ggplot2)
library(dplyr)
library(tidyr)

fighters = read.csv("C:/Users/Jason/Downloads/UFC-Fight-Card-Analysis-master/UFC-Fight-Card-Analysis-master/ALL UFC FIGHTERS 2%2F23%2F2016 SHERDOG.COM - Sheet1.csv", header=TRUE)
fights = read.csv("C:/Users/Jason/Downloads/UFC-Fight-Card-Analysis-master/UFC-Fight-Card-Analysis-master/ALL UFC FIGHTS 2%2F23%2F2016 SHERDOG.COM - Sheet1.csv", header=TRUE)

```

Some quick cleanup to drop the columns that hold URLs and convert the date from a character string to a Date object.
```{r}
fights= fights[,-c(1,8,9)]
fighters = fighters[, -1]

fights$event_date = as.Date(fights$event_date, "%m/%d/%Y")
```
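
As a quick illustration of the format string used above, this small sketch parses a single date written in the same month/day/year layout. The variable names are made up for the example, and the value is simply the UFC 30 date that comes up later in this analysis.

```r
# Hypothetical example value, in the same layout as the event_date column
raw_date <- "02/23/2001"

# %m = month, %d = day, %Y = four-digit year
parsed_date <- as.Date(raw_date, format = "%m/%d/%Y")

class(parsed_date)    # the class is now Date, not character
format(parsed_date)   # printed in ISO year-month-day form
```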

The fights data lists several different methods of victory, many of which are abnormalities (no contests, DQs, etc.). We need to keep only the fights that ended by one of the 4 main methods.
```{r}
methods=c("Submission", "TKO", "KO", "Decision")

fights_method_subset = fights %>% filter(method %in% methods)
```

An issue with the early UFCs was that the rules were constantly changing to try to appease different stakeholders. It wasn't until Zuffa bought the company that the UFC really moved towards a consistent, (mostly) unified rules system. Let's subset the data starting from UFC 30 (2/23/2001), which was the first event under Zuffa ownership.
```{r}
fights_method_subset = fights_method_subset %>% filter(event_date >= as.Date("2001-02-23"))
```

We will attempt to answer the question of how likely a card with 90% decisions was by using hacker statistics. To do this, we will simulate a UFC card thousands of times to build a probability distribution of outcomes.

Start by creating vectors to store the number of results of each individual method of victory on each simulated card.
```{r, echo=TRUE}
dec_sim=c()
sub_sim=c()
ko_sim =c()
tko_sim = c()
```

Set up the simulation parameters, including the number of iterations, the seed, and the number of fights on each simulated card (UFC 208 had 10 fights on it, so we'll use 10).

(Note: My computer seems to churn through about 1000 iterations or so per minute. If you're impatient, consider reducing the 20,000 iterations to a couple thousand. The more trials you run, though, the more stable the estimated probabilities will be.)
```{r}
iterations=20000
set.seed(1)
n_fights = 10
```
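
To make the trade-off in the note above concrete, here is a rough sketch of how the precision of a simulated probability scales with the number of iterations. It uses the usual standard error of a proportion, sqrt(p(1-p)/n), and assumes a true probability of 0.005 purely for illustration (that value is not taken from the data). All variable names are my own.

```r
# Assumed "true" probability of a rare outcome (illustrative only)
p_assumed <- 0.005

# Candidate iteration counts
n_iter <- c(2000, 20000)

# Standard error of a simulated proportion: sqrt(p * (1 - p) / n)
mc_se <- sqrt(p_assumed * (1 - p_assumed) / n_iter)

data.frame(iterations = n_iter, std_error = mc_se)
```

Ten times as many iterations shrinks the standard error by a factor of sqrt(10), or roughly 3.2, so cutting the run to a couple thousand iterations makes the estimate of a rare event noticeably noisier.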

Create a Monte Carlo simulation, which randomly draws 10 fights from the data (with replacement) and then stores the count of each method of victory in the vectors we created.
```{r}
for (i in 1:iterations){
  
  # create sample n-fight card  
  card = sample(fights_method_subset$method, n_fights, replace=TRUE)
  
  # assign no. of finishes of each type to a vector  
  for(method in methods){  
    
    #Loop through the 4 methods, get the sum of the number of finishes on the card, and assign it to the appropriate vector.
    x = sum(card == method)
    
    if(method == methods[2]){
      tko_sim[i] = x
    } else if(method == methods[1]){
      sub_sim[i] = x
    } else if(method == methods[3]){
      ko_sim[i] = x
    } else{
      dec_sim[i] = x
    }
    
  }
}
```
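
For comparison, the same kind of simulation can be written without growing four vectors inside a loop. This is only a sketch of one alternative, not the code used for the results in this post: `method_pool` is a made-up stand-in for `fights_method_subset$method` with invented proportions, and every name ending in `_demo` is hypothetical.

```r
set.seed(1)

methods_demo <- c("Submission", "TKO", "KO", "Decision")

# Hypothetical stand-in for fights_method_subset$method (made-up proportions)
method_pool <- sample(methods_demo, 3000, replace = TRUE,
                      prob = c(0.20, 0.22, 0.13, 0.45))

n_fights_demo   <- 10
iterations_demo <- 2000

# Count how many fights on one card ended by each method
count_methods <- function(card) {
  vapply(methods_demo, function(m) sum(card == m), integer(1))
}

# replicate() returns one column per simulated card; transpose to one row per card
sim_counts <- t(replicate(
  iterations_demo,
  count_methods(sample(method_pool, n_fights_demo, replace = TRUE))
))

sim_df_demo <- as.data.frame(sim_counts)
head(sim_df_demo)
```

Each row of `sim_df_demo` is one simulated card, with one column per method, so the row sums always equal the number of fights on the card.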

Move all of the simulation results into a data frame, sim_df. Then, run the summary stats. The average 10-fight card has about 4.5 decisions and 5.5 finishes.
```{r}
sim_df = data.frame(dec = dec_sim, tko = tko_sim, ko = ko_sim, sub = sub_sim)

summary(sim_df)
```

The 4 methods are really just one variable, but they are spread across 4 columns. So this is a wide data frame that needs to be converted to long format for plotting, which can be done easily using tidyr's gather function. (Despite its name, sim_wide holds the long-format result.)
```{r}
sim_wide = gather(sim_df, method, n)
head(sim_wide)
```
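
Newer versions of tidyr offer `pivot_longer()` as the successor to `gather()`. The sketch below shows the equivalent reshaping on a tiny made-up table (`wide_demo` is hypothetical, not the simulation output).

```r
library(tidyr)

# Small made-up wide table: one row per simulated card
wide_demo <- data.frame(dec = c(4, 6), tko = c(3, 1), ko = c(1, 1), sub = c(2, 2))

# Same reshaping as gather(sim_df, method, n), using the newer interface
long_demo <- pivot_longer(wide_demo, cols = everything(),
                          names_to = "method", values_to = "n")
long_demo
```

The rows come out card by card rather than column by column as with `gather()`, which makes no difference to the plots or the cross-table.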

View the frequency of each type of finish in the simulation. Looks like the average show typically clusters around 3-6 decisions, 0-2 knockouts, and 1-3 submissions and TKOs.
```{r}
ggplot(sim_wide, aes(x=n, fill=as.factor(n)) ) + 
  geom_bar(aes(y = (..count..)), show.legend = F) +
  scale_x_continuous(breaks=0:10) + 
  facet_wrap(~method, ncol = 1)

```

Now let's look at it as a cross-table.
```{r}
sim_table = table(sim_wide$method, sim_wide$n)
sim_table
```

So the probability of a 10-fight UFC card having exactly 9 decisions is approximately 79/20000 (or 90/20000 if you include the 11 times all 10 fights went to a decision). In decimal form, that comes out to...
```{r}
#decimal probability
sum(sim_table["dec", c("9", "10")]) / iterations

#One out of every x events
iterations / sum(sim_table["dec", c("9", "10")]) 

```
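
Because the simulation samples fights with replacement, the number of decisions on a simulated card should follow a binomial distribution, so the simulated figure can be sanity-checked analytically. The sketch below assumes a per-fight decision probability of 0.4492, taken from the simulated mean of 4.492 decisions per 10-fight card; the variable names are my own.

```r
# Per-fight decision probability, assumed from the simulated mean (4.492 / 10)
p_dec  <- 4.492 / 10
n_card <- 10

# P(exactly 9 decisions) and P(exactly 10 decisions)
p_nine <- dbinom(9, size = n_card, prob = p_dec)
p_ten  <- dbinom(10, size = n_card, prob = p_dec)

# P(9 or more decisions) is the upper tail beyond 8
p_tail <- pbinom(8, size = n_card, prob = p_dec, lower.tail = FALSE)

c(nine = p_nine, ten = p_ten, nine_or_more = p_tail)
```

Under that assumption the tail probability lands close to the simulated 0.0045, which suggests the simulation is behaving as expected.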

So UFC 208 was, probabilistically, a one-in-222-shows event. For reference, UFC 208 was the 359th show of the Zuffa era, which means that, statistically speaking, a card like it should have occurred only about 1.6 times (359/222) during the entire 15-year sample. To visualize that outlier...

```{r}
ggplot(subset(sim_wide, method=="dec"), aes(x=n, y= ..density..)) + 
  geom_freqpoly(binwidth=1, lwd=2) + 
  scale_x_continuous(breaks=0:10) + 
  geom_vline(xintercept=9, col="red", lwd=1.5) + 
  annotate("text", x=9, y=.2, hjust=-.25, color="red", size=6,
           label=paste("UFC 208 \n", 
                       sum(sim_table["dec", c("9", "10")])/iterations )) +
  ggtitle(paste("Probability of x Decisions on a", n_fights,  "Fight UFC Card")) +
  theme(plot.title = element_text(size = 16, face = "bold")) +
  xlab(paste("Decisions (out of ", n_fights, ")", sep=""))

```
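
The one-in-222 figure can also be turned into a rough chance of seeing at least one such card over the Zuffa era. This is only a sketch: it treats all 359 shows as independent 10-fight cards with the simulated probability of 0.0045, which is a simplification, since real cards vary in length. The variable names are hypothetical.

```r
p_card  <- 0.0045   # simulated probability of 9+ decisions on a 10-fight card
n_shows <- 359      # Zuffa-era shows up to UFC 208, as stated in the text

# Expected number of such cards across all shows
expected_cards <- n_shows * p_card

# Probability of at least one such card across all shows
p_at_least_one <- 1 - (1 - p_card)^n_shows

c(expected = expected_cards, at_least_one = p_at_least_one)
```

Under these assumptions, a card like UFC 208 turning up at least once by now is more likely than not, even though any single card is very unlikely to be it.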

Those results were in the simulated world we created. How does 9/10 decisions fare in the actual course of UFC history? We need to get the number of finishes on each actual show to find out.

```{r}
#Get a list of all event names sampled
event_names = unique(fights_method_subset$event_name)

#Create a new DF that simply has each show in UFC history, and then a count of all the different finishes on that show (we'll populate those counts in the next loop)
events = data.frame(event = event_names, submissions = 0, decisions = 0, ko = 0, tko = 0, other = 0)
head(events)
```

Loop through the original fights DF (the subset we created got rid of all the "other" finishes, which we need to count again). In the loop, we'll keep a running count of the number of different finishes on each show.

(the loop only takes about 30 seconds)

```{r}
#loop through each event in the sample to count number of finishes per card
for(ev in event_names){
  
  #Since we threw out all the "other" finishes in the initial subset, we'll use the original fights DF to count the number of different results
  for(row in 1:nrow(fights)){
    
    #look for the proper row that corresponds to the looping event
    if(fights$event_name[row] == ev){
      
      #if the fight ended by submission, then add 1 to the new events DF we created under the subm. column
      if(fights$method[row] == "Submission"){
        events[events[,1] == ev, 2] = events[events[,1] == ev, 2] +1
        
      } else if(fights$method[row] == "Decision"){
        events[events[,1] == ev, 3] = events[events[,1] == ev, 3] +1
        
      } else if(fights$method[row] == "KO"){
        events[events[,1] == ev, 4] = events[events[,1] == ev, 4] +1
        
      } else if(fights$method[row] == "TKO"){
        events[events[,1] == ev, 5] = events[events[,1] == ev, 5] +1
        
      } else{
        events[events[,1] == ev, 6] = events[events[,1] == ev, 6] +1
      }
    } 
    
  }
}

head(events)
```
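
The same per-event counts can be produced in a single pass with `table()`, which avoids the nested loop. The sketch below runs on a made-up miniature data frame (`toy_fights` and the other names are hypothetical), so treat it as an illustration of the idea, not a drop-in replacement.

```r
# Made-up miniature version of the fights data frame
toy_fights <- data.frame(
  event_name = c("Event A", "Event A", "Event A", "Event B", "Event B"),
  method     = c("Decision", "KO", "Decision", "No Contest", "Submission"),
  stringsAsFactors = FALSE
)

main_methods <- c("Submission", "Decision", "KO", "TKO")

# Collapse everything outside the 4 main methods into "other"
method_group <- ifelse(toy_fights$method %in% main_methods,
                       toy_fights$method, "other")
method_group <- factor(method_group, levels = c(main_methods, "other"))

# Cross-tabulate events against methods in one step
toy_counts <- table(toy_fights$event_name, method_group)
toy_counts
```

Fixing the factor levels up front keeps a column for every method, even one (like TKO here) that never occurs in the data.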

Now we can calculate the percentage of each finish on each card. First, we'll count the number of fights on each card, and then use that to get the finish percentages.

```{r}
#Fetch the sum of each row (removes only the first column, the event name)
events$n_fights = rowSums(events[-1])


#Calculate the percentages of finishes on each card
events = events %>%
  mutate(sub_pct = submissions/n_fights,
         dec_pct = decisions/n_fights,
         ko_pct = ko/n_fights,
         tko_pct = tko/n_fights
  ) %>%
  arrange(desc(dec_pct))

```

Plot the 10 cards in UFC history with the highest percentages of decisions. Note: to plot a categorical variable in anything other than the default order, R makes you re-level the factor, so I created a function that does that.

```{r}
#Function to refactor a categorical variable by the order of a numeric variable
refactor = function(df, col_factor, col_sort){
  df[, col_factor] = factor(df[, col_factor], 
                            levels = df[,col_factor][order(-df[, col_sort])])   
  
  return(df)
}

#refactor the event list by highest decision percentage
events = refactor(events, "event", "dec_pct") %>% arrange(desc(dec_pct))


ggplot(events[1:10,], aes(x=event, y=dec_pct)) + 
  geom_col() + 
  theme(axis.text.x = element_text(face="italic", color="#993333", 
                                   size=8, angle=90)) +
  coord_flip(expand = T) +
  ggtitle("UFC Shows with Highest Percentage of Decisions") +
  scale_y_continuous(breaks = seq(0,.9,.1))


```
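
Base R's `reorder()` can do the same kind of re-leveling as the helper above. Here is a minimal sketch on made-up data (`toy_events` is hypothetical), ordering the factor levels by descending decision percentage.

```r
# Made-up events with decision percentages
toy_events <- data.frame(
  event   = c("Event A", "Event B", "Event C"),
  dec_pct = c(0.5, 0.9, 0.7),
  stringsAsFactors = FALSE
)

# Order the factor levels by descending dec_pct (negate to reverse the sort)
toy_events$event <- reorder(factor(toy_events$event), -toy_events$dec_pct)

levels(toy_events$event)
```

The levels now run from the highest decision percentage to the lowest, which is the order ggplot2 uses when laying out the axis.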

Only 3 of the 346 other shows analyzed had ever cleared 80% decisions, with the 83.3% of UFC 169 and UFC FN 36 being the highest. That really puts the 90% decisions of UFC 208 into perspective!

Here is the list of all shows with at least 70% decisions. Only 16 shows have even done that (which is only 4.6% of all shows)!
```{r}
events %>% filter(dec_pct >= .7) %>%
  select("event", "dec_pct", "decisions", "n_fights") %>%
  arrange(desc(dec_pct), desc(decisions))
```

Now that we have devoted all this time to analyzing the anomaly that was UFC 208, one of the most boring UFC cards of all time, let's finish on a high note by creating a watch list for UFC Fight Pass of UFC shows with the highest knockout percentage!
```{r}
#refactor the event list by highest knockout percentage
events = refactor(events, "event", "ko_pct") %>% arrange(desc(ko_pct))


ggplot(events[1:10,], aes(x=event, y=ko_pct)) + 
  geom_col() + 
  theme(axis.text.x = element_text(face="italic", color="#993333", 
                           size=8, angle=90)) +
  coord_flip(expand = T) +
  ggtitle("UFC Shows with Highest Percentage of Knockouts") +
  scale_y_continuous(breaks = seq(0,.5,.05))
  
```


