Import the necessary packages.
library(tidyverse)
library(caret)     # confusion matrix
library(gam)       # generalized additive models
library(lubridate) # format and filter dates
library(e1071)
library(ggcorrplot)
library(car)
For this project, I will use UFC fight data to conduct an Exploratory Data Analysis and Inferential Statistics, looking at the relationship between the Red and Blue fighters and whether the winner can potentially be predicted. In the UFC, the Red Corner fighter is typically the higher-ranked fighter on the roster. They usually have more UFC fights and hold a higher rank because of their past success. The Blue Corner fighter is typically the challenger, often has fewer fights than the Red Corner fighter, and is considered the underdog.
Import the data.
ufc <- read.csv("data.csv")
This dataset (data.csv) comes from the Kaggle user RAJEEV WARRIER. Link here: https://www.kaggle.com/datasets/rajeevw/ufcdata
This is a dataset of every UFC fight in the history of the organisation until March 2021. Every row contains information about both fighters, fight details and the winner. The data was scraped from the UFC Stats website.
Each row is a compilation of both fighters' stats. Fighters are represented by ‘red’ and ‘blue’ (for red and blue corner). For instance, the red fighter's columns hold the compiled average stats from their past fights. The stats include damage done by the red fighter and the blue fighter to their past opponents. The target variable is ‘Winner’, which is the only column that tells you who won that fight. An example is included below, where we can see Max Holloway's opponents and the Average Submission Attempts (per fight) compiled over time.
ufc_max_holloway <- ufc %>% filter_all(any_vars(grepl("Max Holloway", .)))
max_holloway <- ufc_max_holloway[, c('R_fighter', 'B_fighter', 'R_avg_SUB_ATT','B_avg_SUB_ATT')]
max_holloway
## R_fighter B_fighter R_avg_SUB_ATT B_avg_SUB_ATT
## 1 Max Holloway Calvin Kattar 0.06716919 0.00000000
## 2 Alexander Volkanovski Max Holloway 0.18750000 0.13433838
## 3 Max Holloway Alexander Volkanovski 0.26867676 0.37500000
## 4 Max Holloway Frankie Edgar 0.53735352 0.13336754
## 5 Max Holloway Dustin Poirier 0.07470703 1.12734795
## 6 Max Holloway Brian Ortega 0.14941406 0.73437500
## 7 Max Holloway Jose Aldo 0.29882812 0.06445312
## 8 Jose Aldo Max Holloway 0.12890625 0.59765625
## 9 Max Holloway Anthony Pettis 1.19531250 0.54882812
## 10 Max Holloway Ricardo Lamas 0.39062500 0.22265625
## 11 Max Holloway Jeremy Stephens 0.78125000 0.01954269
## 12 Max Holloway Charles Oliveira 1.56250000 2.17553711
## 13 Cub Swanson Max Holloway 0.00781250 0.12500000
## 14 Max Holloway Cole Miller 0.25000000 0.83203125
## 15 Akira Corassani Max Holloway 0.00000000 0.50000000
## 16 Max Holloway Clay Collard 1.00000000 NA
## 17 Max Holloway Andre Fili 0.00000000 0.00000000
## 18 Max Holloway Will Chope 0.00000000 NA
## 19 Conor McGregor Max Holloway 0.00000000 0.00000000
## 20 Dennis Bermudez Max Holloway 1.62500000 0.00000000
## 21 Leonard Garcia Max Holloway 0.04687500 0.00000000
## 22 Justin Lawrence Max Holloway 0.00000000 0.00000000
## 23 Max Holloway Pat Schilling 0.00000000 0.00000000
## 24 Dustin Poirier Max Holloway 0.50000000 NA
Here we begin exploring and cleaning the data and examining the relationship between Red and Blue fighter wins. The EDA includes Red vs Blue wins over time, the Red fighter's ability to control their opponent on the ground (measured in seconds), the number of fights per weight class, and the Red fighter's average submission attempts by weight class.
Upper weight limit of each weight class (in lb and kg), for reference.
Heavyweight: 265 lb (120.2 kg)
Light Heavyweight: 205 lb (93.0 kg)
Middleweight: 185 lb (83.9 kg)
Welterweight: 170 lb (77.1 kg)
Lightweight: 155 lb (70.3 kg)
Featherweight: 145 lb (65.8 kg)
Bantamweight: 135 lb (61.2 kg)
Flyweight: 125 lb (56.7 kg)
Strawweight: 115 lb (52.2 kg)
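The kilogram figures can be checked with a one-line conversion. This is a minimal sketch; the helper name `lb_to_kg` and the vector `limits_lb` are made up for illustration and are not part of the original analysis.

```r
# Convert weight-class limits from pounds to kilograms.
# 1 lb = 0.45359237 kg (exact, by definition of the international pound).
lb_to_kg <- function(lb) round(lb * 0.45359237, 1)

# Upper limits in pounds, as listed in the reference table.
limits_lb <- c(Heavyweight = 265, `Light Heavyweight` = 205, Middleweight = 185,
               Welterweight = 170, Lightweight = 155, Featherweight = 145,
               Bantamweight = 135, Flyweight = 125, Strawweight = 115)

limits_kg <- lb_to_kg(limits_lb)
print(limits_kg)
```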
Here we observe the names and data types of our columns.
str(ufc)
## 'data.frame': 6012 obs. of 144 variables:
## $ R_fighter : chr "Adrian Yanez" "Trevin Giles" "Tai Tuivasa" "Cheyanne Buys" ...
## $ B_fighter : chr "Gustavo Lopez" "Roman Dolidze" "Harry Hunsucker" "Montserrat Conejo" ...
## $ Referee : chr "Chris Tognoni" "Herb Dean" "Herb Dean" "Mark Smith" ...
## $ date : chr "2021-03-20" "2021-03-20" "2021-03-20" "2021-03-20" ...
## $ location : chr "Las Vegas, Nevada, USA" "Las Vegas, Nevada, USA" "Las Vegas, Nevada, USA" "Las Vegas, Nevada, USA" ...
## $ Winner : chr "Red" "Red" "Red" "Blue" ...
## $ title_bout : chr "False" "False" "False" "False" ...
## $ weight_class : chr "Bantamweight" "Middleweight" "Heavyweight" "WomenStrawweight" ...
## $ B_avg_KD : num 0 0.5 NA NA 0.125 ...
## $ B_avg_opp_KD : num 0 0 NA NA 0 0 0.125 0 NA NA ...
## $ B_avg_SIG_STR_pct : num 0.42 0.66 NA NA 0.536 ...
## $ B_avg_opp_SIG_STR_pct : num 0.495 0.305 NA NA 0.579 ...
## $ B_avg_TD_pct : num 0.33 0.3 NA NA 0.185 ...
## $ B_avg_opp_TD_pct : num 0.36 0.5 NA NA 0.166 ...
## $ B_avg_SUB_ATT : num 0.5 1.5 NA NA 0.125 ...
## $ B_avg_opp_SUB_ATT : num 1 0 NA NA 0.188 ...
## $ B_avg_REV : num 0 0 NA NA 0.25 ...
## $ B_avg_opp_REV : num 0 0 NA NA 0 ...
## $ B_avg_SIG_STR_att : num 50 65.5 NA NA 109.2 ...
## $ B_avg_SIG_STR_landed : num 20 35 NA NA 57.9 ...
## $ B_avg_opp_SIG_STR_att : num 84 50 NA NA 50.6 ...
## $ B_avg_opp_SIG_STR_landed : num 45 16.5 NA NA 28.4 ...
## $ B_avg_TOTAL_STR_att : num 76.5 113.5 NA NA 170.4 ...
## $ B_avg_TOTAL_STR_landed : num 41 68.5 NA NA 105.6 ...
## $ B_avg_opp_TOTAL_STR_att : num 114 68.5 NA NA 74.4 ...
## $ B_avg_opp_TOTAL_STR_landed : num 64 29 NA NA 44.2 ...
## $ B_avg_TD_att : num 1.5 2.5 NA NA 5.38 ...
## $ B_avg_TD_landed : num 1 1.5 NA NA 1.5 ...
## $ B_avg_opp_TD_att : num 9 0.5 NA NA 2 ...
## $ B_avg_opp_TD_landed : num 6.5 0.5 NA NA 0.625 ...
## $ B_avg_HEAD_att : num 39.5 46 NA NA 77.4 ...
## $ B_avg_HEAD_landed : num 11 20 NA NA 31.4 ...
## $ B_avg_opp_HEAD_att : num 63 36 NA NA 41.6 ...
## $ B_avg_opp_HEAD_landed : num 27.5 7.5 NA NA 22.6 ...
## $ B_avg_BODY_att : num 7.5 12 NA NA 31.2 ...
## $ B_avg_BODY_landed : num 7 8 NA NA 26.2 ...
## $ B_avg_opp_BODY_att : num 12 8 NA NA 7.69 ...
## $ B_avg_opp_BODY_landed : num 9 3 NA NA 4.94 ...
## $ B_avg_LEG_att : num 3 7.5 NA NA 0.625 ...
## $ B_avg_LEG_landed : num 2 7 NA NA 0.375 ...
## $ B_avg_opp_LEG_att : num 9 6 NA NA 1.38 ...
## $ B_avg_opp_LEG_landed : num 8.5 6 NA NA 0.875 ...
## $ B_avg_DISTANCE_att : num 35 58 NA NA 33.6 ...
## $ B_avg_DISTANCE_landed : num 12.5 30 NA NA 11 ...
## $ B_avg_opp_DISTANCE_att : num 43.5 48 NA NA 32.1 ...
## $ B_avg_opp_DISTANCE_landed : num 17.5 15.5 NA NA 13.9 ...
## $ B_avg_CLINCH_att : num 10.5 0.5 NA NA 39.1 ...
## $ B_avg_CLINCH_landed : num 4.5 0.5 NA NA 28.8 ...
## $ B_avg_opp_CLINCH_att : num 4 0.5 NA NA 13.3 ...
## $ B_avg_opp_CLINCH_landed : num 3 0.5 NA NA 10.8 ...
## $ B_avg_GROUND_att : num 4.5 7 NA NA 36.6 ...
## $ B_avg_GROUND_landed : num 3 4.5 NA NA 18.1 ...
## $ B_avg_opp_GROUND_att : num 36.5 1.5 NA NA 5.25 ...
## $ B_avg_opp_GROUND_landed : num 24.5 0.5 NA NA 3.75 ...
## $ B_avg_CTRL_time.seconds. : num 34 220 NA NA 390 ...
## $ B_avg_opp_CTRL_time.seconds.: num 277.5 24.5 NA NA 156.3 ...
## $ B_total_time_fought.seconds.: num 532 578 NA NA 764 ...
## $ B_total_rounds_fought : int 4 4 0 0 11 10 28 23 0 0 ...
## $ B_total_title_bouts : int 0 0 0 0 1 0 0 0 0 0 ...
## $ B_current_win_streak : int 0 2 0 0 3 4 0 0 0 0 ...
## $ B_current_lose_streak : int 1 0 0 0 0 0 1 1 0 0 ...
## $ B_longest_win_streak : int 1 2 0 0 3 4 1 5 0 0 ...
## $ B_wins : int 1 2 0 0 4 4 4 8 0 0 ...
## $ B_losses : int 1 0 0 0 1 0 6 2 0 0 ...
## $ B_draw : int 0 0 0 0 0 0 0 0 0 0 ...
## $ B_win_by_Decision_Majority : int 0 0 0 0 0 0 1 0 0 0 ...
## $ B_win_by_Decision_Split : int 0 1 0 0 0 0 0 2 0 0 ...
## $ B_win_by_Decision_Unanimous : int 0 0 0 0 1 2 1 1 0 0 ...
## $ B_win_by_KO.TKO : int 0 1 0 0 2 0 2 3 0 0 ...
## $ B_win_by_Submission : int 1 0 0 0 1 2 0 2 0 0 ...
## $ B_win_by_TKO_Doctor_Stoppage: int 0 0 0 0 0 0 0 0 0 0 ...
## $ B_Stance : chr "Orthodox" "Orthodox" "Orthodox" "Southpaw" ...
## $ B_Height_cms : num 165 188 188 152 180 ...
## $ B_Reach_cms : num 170 193 190 155 183 ...
## $ B_Weight_lbs : num 135 205 241 115 135 145 170 185 135 125 ...
## $ R_avg_KD : num 1 1.031 0.547 NA 0 ...
## $ R_avg_opp_KD : num 0 0.0625 0.1875 NA 0.000977 ...
## $ R_avg_SIG_STR_pct : num 0.5 0.577 0.539 NA 0.403 ...
## $ R_avg_opp_SIG_STR_pct : num 0.46 0.381 0.599 NA 0.555 ...
## $ R_avg_TD_pct : num 0 0.406 0 NA 0.512 ...
## $ R_avg_opp_TD_pct : num 0 0.116 0.312 NA 0.629 ...
## $ R_avg_SUB_ATT : num 0 0.25 0 NA 0.231 ...
## $ R_avg_opp_SUB_ATT : num 0 1.1875 0.25 NA 0.0312 ...
## $ R_avg_REV : num 0 0.375 0 NA 0.0312 ...
## $ R_avg_opp_REV : num 0 0.25 0 NA 0.5 0.5 0.25 0 0 0.5 ...
## $ R_avg_SIG_STR_att : num 34 77.6 59.2 NA 109.3 ...
## $ R_avg_SIG_STR_landed : num 17 43.2 30.4 NA 44.4 ...
## $ R_avg_opp_SIG_STR_att : num 13 69.2 43.8 NA 148.8 ...
## $ R_avg_opp_SIG_STR_landed : num 6 27.6 24.8 NA 84.6 ...
## $ R_avg_TOTAL_STR_att : num 35 93.1 70.5 NA 137.2 ...
## $ R_avg_TOTAL_STR_landed : num 18 57.2 41.4 NA 70.2 ...
## $ R_avg_opp_TOTAL_STR_att : num 16 98.3 50.2 NA 172.5 ...
## $ R_avg_opp_TOTAL_STR_landed : num 9 52.5 30.9 NA 106.7 ...
## $ R_avg_TD_att : num 0 1.2812 0.0312 NA 2.2617 ...
## $ R_avg_TD_landed : num 0 0.781 0 NA 1.262 ...
## $ R_avg_opp_TD_att : num 3 4.69 2.84 NA 3.14 ...
## $ R_avg_opp_TD_landed : num 0 0.438 1.75 NA 1.771 ...
## $ R_avg_HEAD_att : num 32 71.1 42.5 NA 86.4 ...
## $ R_avg_HEAD_landed : num 15 38.1 16.8 NA 26 ...
## [list output truncated]
The code below drops every row that contains at least one missing (NA) value.
ufc = na.omit(ufc)
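Because na.omit() removes a whole row whenever any single column is missing, it can be worth checking how much data it discards. The sketch below uses a small made-up data frame (`toy`) in place of `ufc`; the values are illustrative only.

```r
# Toy data frame standing in for `ufc`; the values are made up.
toy <- data.frame(
  R_fighter     = c("A", "B", "C", "D"),
  R_avg_SUB_ATT = c(0.5, NA, 1.0, 0.0),
  B_avg_SUB_ATT = c(NA, NA, 0.25, 0.0)
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(toy))

# Row counts before and after listwise deletion.
rows_before  <- nrow(toy)
toy_complete <- na.omit(toy)
rows_after   <- nrow(toy_complete)

print(na_per_column)
cat("Dropped", rows_before - rows_after, "of", rows_before, "rows\n")
```

Running the same two checks on the real data before calling na.omit() would show which columns drive the row loss, for example debut fighters who have no averaged stats yet.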
Use substring to extract the year from date and store the values as ‘year’.
ufc$year <- substr(ufc$date, 1, 4)
Eliminate all fights that end in a ‘Draw’.
ufc <- filter(ufc, Winner != 'Draw')
Convert the data type of ‘date’ from Character to Date.
class(ufc$date)
## [1] "character"
ufc <- ufc %>% mutate(date = as.Date(date, format = "%Y-%m-%d"))
class(ufc$date)
## [1] "Date"
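Once the column is a Date, the year can also be derived from it directly as a number instead of being cut out of the string. A minimal sketch in base R, using made-up dates; lubridate's year() should give the same values.

```r
# Example dates in the same "%Y-%m-%d" format as the dataset (values made up).
dates <- as.Date(c("2021-03-20", "2016-01-01", "2010-12-11"), format = "%Y-%m-%d")

# Extract the year as an integer rather than a character string.
years <- as.integer(format(dates, "%Y"))

# Date comparisons are explicit when both sides are Date objects.
recent <- dates >= as.Date("2016-01-01")

print(years)
print(recent)
```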
Here we create our dependent variable ‘win_dummy’, with Red = 1 and Blue = 0. It will be used in our statistical models, and its mean gives the proportion of Red wins in the dataset. Below are a few insights into the distribution of win_dummy.
ufc$win_dummy = ifelse(ufc$Winner == 'Red', 1,0)
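With a 0/1 coding, the share of Red wins is simply the mean of the dummy. A small sketch on a made-up `winner` vector (not the real data):

```r
# Made-up outcomes standing in for ufc$Winner.
winner <- c("Red", "Blue", "Red", "Red", "Blue")

# Same coding as in the analysis: Red = 1, Blue = 0.
win_dummy <- ifelse(winner == "Red", 1, 0)

# The mean of a 0/1 variable is the proportion of ones (Red wins).
red_share <- mean(win_dummy)

# Full breakdown as proportions for both outcomes.
shares <- prop.table(table(win_dummy))

print(red_share)
print(shares)
```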
Below we observe the gap between Red vs Blue wins since 2010, expressed as percentage of wins for Red and Blue Corners, compiled by year totals.
ufc_2010 <- ufc %>% filter(date >= "2010-01-01")
g <- ufc_2010 %>%
group_by(win_dummy, year) %>%
summarise(cnt = n()) %>%
group_by(year) %>%
mutate(freq = round(cnt / sum(cnt), 3))
## `summarise()` regrouping output by 'win_dummy' (override with `.groups` argument)
g <- as.data.frame(g)
g$win_dummy <- as.character(g$win_dummy)
g$freq <- as.numeric(g$freq)
g$year <- as.numeric(as.vector(g$year))
gg1 <- ggplot(data=g, aes(x=year, y=freq, group=win_dummy, fill=win_dummy)) +
geom_bar(stat="identity",
width = 0.5,
position=position_dodge())+
scale_fill_manual(values=c(
"darkblue",
"red")) +
theme(axis.line = element_line(linetype = "solid"),
panel.grid.minor = element_line(linetype = "blank"),
axis.text = element_text(colour = "gray18"),
panel.background = element_rect(fill = "gray64"),
plot.background = element_rect(fill = "aliceblue"))+labs(title = "Red vs Blue Win Percentage (Grouped by Year)",
x = "Year", y = "Percentage", fill = "Red/Blue")
gg1 + scale_x_continuous(breaks = c(2010, 2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021))
The graph above shows that the Red fighter has consistently won a larger share of fights than the Blue fighter in every year. We also observe that from 2017 to 2021 the Blue fighter's win percentage has increased and moved closer to the Red fighter's.
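The per-year win shares behind this kind of chart can be cross-checked with a base R contingency table. The sketch below uses a tiny made-up data frame; the column names mirror the ones used in this analysis, but the numbers are illustrative only.

```r
# Made-up fights: year and outcome (1 = Red win, 0 = Blue win).
toy <- data.frame(
  year      = c("2019", "2019", "2019", "2020", "2020", "2020", "2020"),
  win_dummy = c(1, 1, 0, 1, 0, 0, 1)
)

# Count fights by year (rows) and outcome (columns).
counts <- table(toy$year, toy$win_dummy)

# margin = 1 normalises within each row, so each year's shares sum to 1.
shares <- round(prop.table(counts, margin = 1), 3)

print(shares)
```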
The graph below shows that the Red fighter's average control time on the ground (in seconds) has decreased steadily, even after 2010. Control time was not recorded until 2000, although the original dataset includes fights back to 1997 and UFC fights date back to 1993. One plausible reading of the decline is that fighters' ground defense and wrestling have steadily improved over the years. We will use 2016-2021 data from here on.
R_avg_CTRL_time.seconds_history <- ufc %>%
group_by(year) %>%
summarise_at(vars(R_avg_CTRL_time.seconds.), list(name = mean))
R_avg_CTRL_time.seconds_history <- as.data.frame(R_avg_CTRL_time.seconds_history)
R_avg_CTRL_time.seconds_history$year <- as.numeric(R_avg_CTRL_time.seconds_history$year)
r_ctrl <- ggplot(R_avg_CTRL_time.seconds_history, aes(year,name, group=year)) +
geom_bar(stat='identity', fill='red')+labs(y = "Red Fighter Average Control Time on Ground (seconds)")
r_ctrl + scale_x_continuous(breaks = c(1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021)) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
Now we restrict the dataset to fights from 2016 onward.
ufc_2 <- ufc %>% filter(date >= "2016-01-01")
Win counts for 2016-2021, where 1 = Red and 0 = Blue (the coding of win_dummy defined earlier).
table(ufc_2$win_dummy)
##
## 0 1
## 820 1068
gg2 <- ggplot(data=ufc_2, aes(weight_class)) + geom_bar(fill = "purple")+labs(x = "Weight Class", y = "Number of Fights (2016-2021)")
gg2 + coord_flip()
R_avg_SUB_ATT_wc <- ufc_2 %>%
group_by(weight_class) %>%
summarise_at(vars(R_avg_SUB_ATT), list(name = mean))
R_avg_SUB_ATT_wc <- as.data.frame(R_avg_SUB_ATT_wc)
ggplot(data=R_avg_SUB_ATT_wc, aes(x=weight_class,y=name)) + geom_bar(stat='identity', fill = "purple") + coord_flip()+labs(y = "Red Average Submissions Attempted (per fight)")+labs(title = "2016-2021", x = "Weight Class")
From the two graphs above we see which weight classes have more fights. We also see that the number of submission attempts decreases as the weight of the fighter increases. Heavyweight fighters attempt the fewest submissions.
We conclude that the gap between the Red and Blue fighters has narrowed, so the UFC has become more competitive over time. We also observe that weight classes are not all alike, and that fighters may fight differently depending on the weight class they are in.
We will now use ANOVA tests to determine whether Age and Submission attempts are associated with wins for the Red fighter. The tests use all 2016-2021 fights in ufc_2; only the histograms further below are restricted to fights the Red fighter won.
Analysis of variance (ANOVA) is a collection of statistical models and their associated estimation procedures (such as the “variation” among and between groups) used to analyze the differences among means. ANOVA was developed by the statistician Ronald Fisher. ANOVA is based on the law of total variance, where the observed variance in a particular variable is partitioned into components attributable to different sources of variation. In its simplest form, ANOVA provides a statistical test of whether two or more population means are equal, and therefore generalizes the t-test beyond two means. In other words, the ANOVA is used to test the difference between two or more means.
For more on ANOVA link here: https://statsandr.com/blog/anova-in-r/
In this case, if the Age variable below gives a p-value below 0.05, we will conclude that Age is statistically associated with the outcome of a fight and may be a useful predictor.
NOTE: A smaller p-value (Pr(>F)) means stronger evidence against the null hypothesis of no association. It does not by itself measure how strong the correlation is.
This ANOVA test gives a p-value of 6.68e-10, which is less than the alpha level of 0.05, so we conclude that the age of the Red fighter is significantly associated with Red winning. The effect is small, however: in the table below, Age accounts for about 2% of the variation in win_dummy (9.282 / (9.282 + 454.6) ≈ 0.02).
aov1 = aov(ufc_2$win_dummy ~ ufc_2$R_age)
summary(aov1)
## Df Sum Sq Mean Sq F value Pr(>F)
## ufc_2$R_age 1 9.3 9.282 38.51 6.68e-10 ***
## Residuals 1886 454.6 0.241
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
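A p-value says nothing about the size of an effect, so it helps to also compute the share of variance explained (eta squared) from the sums of squares. The sketch below does this on simulated data; the variables `age` and `win` and the simulation parameters are assumptions made for illustration, not values from the UFC dataset.

```r
# Simulated stand-in data: win probability declines mildly with age.
set.seed(1)
age <- round(runif(200, min = 20, max = 40))
win <- rbinom(200, size = 1, prob = plogis(2 - 0.08 * age))

# One-predictor ANOVA, same form as the tests in this section.
fit <- aov(win ~ age)
tab <- summary(fit)[[1]]

# Eta squared = effect sum of squares / total sum of squares.
ss     <- tab[["Sum Sq"]]
eta_sq <- ss[1] / sum(ss)

# With a single numeric predictor this equals R-squared from lm().
r_sq <- summary(lm(win ~ age))$r.squared

# Same arithmetic applied to the sums of squares reported for R_age.
eta_sq_reported <- 9.282 / (9.282 + 454.6)

print(c(eta_sq = eta_sq, r_sq = r_sq, reported = eta_sq_reported))
```

Because the predictor is numeric, aov() here fits the same model as a simple linear regression, which is why the two quantities agree.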
The same can be observed when testing the Blue fighter's age: the p-value of 5.74e-09 is also well below the alpha level of 0.05.
aov2 = aov(ufc_2$win_dummy ~ ufc_2$B_age)
summary(aov2)
## Df Sum Sq Mean Sq F value Pr(>F)
## ufc_2$B_age 1 8.3 8.271 34.24 5.74e-09 ***
## Residuals 1886 455.6 0.242
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
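Because win_dummy is binary, a logistic regression is a common alternative to the tests above. It is not what this analysis ran; the sketch below only shows the general shape of such a model on simulated data, with made-up variable names (`r_age`, `toy`) and parameters.

```r
# Simulated stand-in data: binary outcome whose probability falls with age.
set.seed(42)
n         <- 500
r_age     <- round(runif(n, min = 20, max = 40))
win_dummy <- rbinom(n, size = 1, prob = plogis(2.5 - 0.08 * r_age))
toy       <- data.frame(win_dummy, r_age)

# Logistic regression: models the log-odds of a win as linear in age.
fit <- glm(win_dummy ~ r_age, data = toy, family = binomial)

# Coefficient table with estimates, standard errors, z values and p-values.
coef_table <- summary(fit)$coefficients

# exp(coefficient) is the multiplicative change in odds per extra year of age.
odds_ratio <- exp(coef(fit)[["r_age"]])

print(coef_table)
print(odds_ratio)
```

Unlike the sums of squares in an ANOVA table, the odds ratio gives a direct reading of how the chance of winning shifts per year of age.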
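We can now look at the age distributions of the Red and Blue fighters in fights the Red fighter won, shown in the histograms below. Both distributions are roughly bell-shaped. A histogram alone cannot show how the chance of winning changes with age, so a linear relationship between wins and age remains a hypothesis at this point.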
ufc_r_age <- ufc_2 %>% filter(win_dummy != 0)
ggplot(ufc_r_age, aes(x=R_age))+
geom_histogram(color="black", fill="red")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(ufc_r_age, aes(x=B_age))+
geom_histogram(color="black", fill="blue")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
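We can run the same test for the Red fighter's average submission attempts, again on all 2016-2021 fights. The p-value of 0.0216 is below 0.05, but the evidence is much weaker than for age. The histogram that follows uses only fights the Red fighter won.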
aov3 = aov(ufc_2$win_dummy ~ ufc_2$R_avg_SUB_ATT)
summary(aov3)
## Df Sum Sq Mean Sq F value Pr(>F)
## ufc_2$R_avg_SUB_ATT 1 1.3 1.2962 5.285 0.0216 *
## Residuals 1886 462.6 0.2453
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ggplot(ufc_r_age, aes(x=R_avg_SUB_ATT))+
geom_histogram(color="black", fill="purple")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.