Required packages [R code]

library(dplyr)
library(tidyr)
library(readr)
library(outliers)
library(lubridate)
library(forecast)
library(magrittr)
library(deducorrect)

Executive Summary

During preprocessing, two datasets, fighters and fights, were loaded with read.csv(). They describe UFC fighters and the fights they took part in between 1993 and 2016; the fights table has 3569 records. The two tables were joined, and the required variables and observations were kept by subsetting. The subset was then converted to appropriate data types (dates and factors) and reshaped into tidy form. Missing values were imputed and outliers were capped. Finally, a square-root transformation was applied to numeric data (BMI) to reduce skewness.

Data

Source: https://github.com/jslucf/UFC-Fight-Card-Analysis/find/master

The data cover UFC shows from 1993 to 2016 and were scraped from Sherdog.com. Two files are used: Fighters.csv and fights.csv. The raw fighters table has 1561 observations of 11 variables and the raw fights table has 3569 observations of 20 variables; after unused columns are dropped they have 7 and 15 variables respectively.

Variables used after subsetting:

Eid: event ID number
Event_name: name of the event
Event_date: date of the event
Fid: fighter ID
Method: how the fight ended (Decision, TKO, KO, NC, ...; 14 levels)
Round: number of rounds
Time: duration of the match
name: name of the fighter (from the fighters table)
Birth_date: date of birth of the fighter
Height: height of the fighter
Weight: weight of the fighter in pounds
Class: weight class of the fighter
BMI: BMI of each fighter
Age: age of the fighter (2020 minus year of birth)
Name: fighter name (gathered from f1name and f2name)
Fighter: position of the fighter in the match (F1/F2)
Result: outcome of the match (win/loss/draw/NC)

Importing datasets

fighters <- read.csv("C:/Users/adity/Desktop/ALL UFC FIGHTERS Sheet1.csv")
fights <- read.csv("C:/Users/adity/Desktop/ALL UFC FIGHTS Sheet2.csv")

head(fighters,5)
##                                 url   fid               name
## 1     /fighter/Conor-McGregor-29688 29688     Conor McGregor
## 2          /fighter/Jon-Jones-27944 27944          Jon Jones
## 3         /fighter/Holly-Holm-75125 75125         Holly Holm
## 4      /fighter/Dominick-Cruz-12107 12107      Dominick Cruz
## 5 /fighter/Demetrious-Johnson-45452 45452 Demetrious Johnson
##                      nick birth_date height weight      association
## 1               Notorious  7/14/1988     68    145      SBG Ireland
## 2                   Bones  7/19/1987     76    205 Jackson-Wink MMA
## 3 The Preacher's Daughter 10/17/1981     68    135 Jackson-Wink MMA
## 4           The Dominator   9/3/1985     68    134     Alliance MMA
## 5            Mighty Mouse  8/13/1986     63    125   AMC Pankration
##               class                locality       country
## 1     Featherweight                  Dublin       Ireland
## 2 Light Heavyweight     Rochester, New York United States
## 3      Bantamweight Albuquerque, New Mexico United States
## 4      Bantamweight   San Diego, California United States
## 5         Flyweight    Kirkland, Washington United States
head(fights,5)
##                         pageurl eid mid            event_name
## 1 /events/UFC-1-The-Beginning-7   7   8 UFC 1 - The Beginning
## 2 /events/UFC-1-The-Beginning-7   7   7 UFC 1 - The Beginning
## 3 /events/UFC-1-The-Beginning-7   7   6 UFC 1 - The Beginning
## 4 /events/UFC-1-The-Beginning-7   7   5 UFC 1 - The Beginning
## 5 /events/UFC-1-The-Beginning-7   7   4 UFC 1 - The Beginning
##                        event_org event_date
## 1 Ultimate Fighting Championship 11/12/1993
## 2 Ultimate Fighting Championship 11/12/1993
## 3 Ultimate Fighting Championship 11/12/1993
## 4 Ultimate Fighting Championship 11/12/1993
## 5 Ultimate Fighting Championship 11/12/1993
##                                        event_place                  f1pageurl
## 1 McNichols Arena, Denver, Colorado, United States   /fighter/Royce-Gracie-19
## 2 McNichols Arena, Denver, Colorado, United States  /fighter/Jason-DeLucia-22
## 3 McNichols Arena, Denver, Colorado, United States   /fighter/Royce-Gracie-19
## 4 McNichols Arena, Denver, Colorado, United States /fighter/Gerard-Gordeau-15
## 5 McNichols Arena, Denver, Colorado, United States    /fighter/Ken-Shamrock-4
##                    f2pageurl         f1name         f2name f1result f2result
## 1 /fighter/Gerard-Gordeau-15   Royce Gracie Gerard Gordeau      win     loss
## 2  /fighter/Trent-Jenkins-23  Jason DeLucia  Trent Jenkins      win     loss
## 3    /fighter/Ken-Shamrock-4   Royce Gracie   Ken Shamrock      win     loss
## 4   /fighter/Kevin-Rosier-17 Gerard Gordeau   Kevin Rosier      win     loss
## 5  /fighter/Patrick-Smith-21   Ken Shamrock  Patrick Smith      win     loss
##   f1fid f2fid     method         method_d                  ref round time
## 1    19    15 Submission Rear-Naked Choke          Helio Vigio     1 1:44
## 2    22    23 Submission Rear-Naked Choke Joao Alberto Barreto     1 0:52
## 3    19     4 Submission Rear-Naked Choke          Helio Vigio     1 0:57
## 4    15    17        TKO  Corner Stoppage Joao Alberto Barreto     1 0:59
## 5     4    21 Submission        Heel Hook          Helio Vigio     1 1:49

Understand

The datasets were inspected with dim() and str(). The output shows that event_date and birth_date are stored as character strings rather than dates, so both were converted with as.Date(). The columns class, method and round were converted to factors, as were the columns Result and Fighter that are created later during tidying.

#check the dimensions of both datasets
dim(fighters)
## [1] 1561   11
dim(fights)
## [1] 3569   20
str(fighters, list.len = 6)
## 'data.frame':    1561 obs. of  11 variables:
##  $ url        : chr  "/fighter/Conor-McGregor-29688" "/fighter/Jon-Jones-27944" "/fighter/Holly-Holm-75125" "/fighter/Dominick-Cruz-12107" ...
##  $ fid        : int  29688 27944 75125 12107 45452 73073 8390 2245 11506 38393 ...
##  $ name       : chr  "Conor McGregor" "Jon Jones" "Holly Holm" "Dominick Cruz" ...
##  $ nick       : chr  "Notorious" "Bones" "The Preacher's Daughter" "The Dominator" ...
##  $ birth_date : chr  "7/14/1988" "7/19/1987" "10/17/1981" "9/3/1985" ...
##  $ height     : int  68 76 68 68 63 66 76 71 67 66 ...
##   [list output truncated]
str(fights, list.len = 6)
## 'data.frame':    3569 obs. of  20 variables:
##  $ pageurl    : chr  "/events/UFC-1-The-Beginning-7" "/events/UFC-1-The-Beginning-7" "/events/UFC-1-The-Beginning-7" "/events/UFC-1-The-Beginning-7" ...
##  $ eid        : int  7 7 7 7 7 7 7 7 8 8 ...
##  $ mid        : int  8 7 6 5 4 3 2 1 15 14 ...
##  $ event_name : chr  "UFC 1 - The Beginning" "UFC 1 - The Beginning" "UFC 1 - The Beginning" "UFC 1 - The Beginning" ...
##  $ event_org  : chr  "Ultimate Fighting Championship" "Ultimate Fighting Championship" "Ultimate Fighting Championship" "Ultimate Fighting Championship" ...
##  $ event_date : chr  "11/12/1993" "11/12/1993" "11/12/1993" "11/12/1993" ...
##   [list output truncated]
#remove Unwanted columns
fights = fights[, -c(1,5,7,8,9)]
fighters = fighters[,-c(1,4,8,10)]

#convert method and round to factor and event_date to Date
fights$method <- as.factor(fights$method)
fights$event_date<-as.Date(as.character(fights$event_date), format = "%m/%d/%Y")
fights$round<- as.factor(fights$round)

#work on a copy of the fighters table
fighters_subset <- fighters
#convert birth_date from character to Date
fighters_subset$birth_date <- as.Date(as.character(fighters_subset$birth_date), format = "%m/%d/%Y")
#remove column country (column 7)
fighters_subset <- fighters_subset[, -7]
fighters_subset$class <- as.factor(fighters_subset$class)
levels(fighters_subset$class)
##  [1] "Atomweight"        "Bantamweight"      "Featherweight"    
##  [4] "Flyweight"         "Heavyweight"       "Light Heavyweight"
##  [7] "Lightweight"       "Middleweight"      "N/A"              
## [10] "Strawweight"       "Super Heavyweight" "Welterweight"
str(fighters_subset, list.len = 6)
## 'data.frame':    1561 obs. of  6 variables:
##  $ fid       : int  29688 27944 75125 12107 45452 73073 8390 2245 11506 38393 ...
##  $ name      : chr  "Conor McGregor" "Jon Jones" "Holly Holm" "Dominick Cruz" ...
##  $ birth_date: Date, format: "1988-07-14" "1987-07-19" ...
##  $ height    : int  68 76 68 68 63 66 76 71 67 66 ...
##  $ weight    : int  145 205 135 134 125 135 242 170 145 145 ...
##  $ class     : Factor w/ 12 levels "Atomweight","Bantamweight",..: 3 6 2 2 4 2 5 12 3 3 ...
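
The date conversion above relies on the format string matching the raw text. As a minimal sketch, the example below parses month/day/year strings and counts the values that fail to parse. The vector raw_dates is made up for illustration and is not taken from the dataset.

```r
# Hypothetical sample of date strings in the same m/d/Y layout as birth_date
raw_dates <- c("7/14/1988", "10/17/1981", "N/A", "")

# With an explicit format, strings that do not match come back as NA
# instead of raising an error
parsed <- as.Date(raw_dates, format = "%m/%d/%Y")

# Number of values that could not be parsed
n_failed <- sum(is.na(parsed))

print(parsed)
print(n_failed)
```

Counting the NA values right after conversion is a cheap way to see how many dates were lost, which is worth knowing before any age calculation.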

Tidy & Manipulate Data I

#create column BMI from height (inches) and weight (pounds)
#note: the conventional imperial BMI factor is 703; the 708 used here gives slightly higher values
new_var <- mutate(fighters_subset, BMI = weight / height^2 * 708)
#View(new_var)
#create column Age as the difference between 2020 and the year of birth
new_var1 <- mutate(new_var, Age = as.integer(2020 - year(birth_date)))
head(new_var1,5)
##     fid               name birth_date height weight             class      BMI
## 1 29688     Conor McGregor 1988-07-14     68    145     Featherweight 22.20156
## 2 27944          Jon Jones 1987-07-19     76    205 Light Heavyweight 25.12812
## 3 75125         Holly Holm 1981-10-17     68    135      Bantamweight 20.67042
## 4 12107      Dominick Cruz 1985-09-03     68    134      Bantamweight 20.51730
## 5 45452 Demetrious Johnson 1986-08-13     63    125         Flyweight 22.29781
##   Age
## 1  32
## 2  33
## 3  39
## 4  35
## 5  34
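
For comparison, the sketch below computes BMI with the conventional imperial factor of 703 and an age in completed years on a chosen reference date. The helper names bmi_imperial and age_on, and the reference date, are assumptions made for this illustration; they are not part of the analysis above.

```r
# Hypothetical helper: BMI from pounds and inches with the conventional factor 703
bmi_imperial <- function(weight_lb, height_in) {
  703 * weight_lb / height_in^2
}

# Hypothetical helper: completed years between a birth date and a reference date
age_on <- function(birth_date, ref_date) {
  b <- as.POSIXlt(birth_date)
  r <- as.POSIXlt(ref_date)
  years <- r$year - b$year
  # Subtract one year if the birthday has not yet occurred in the reference year
  had_birthday <- (r$mon > b$mon) | (r$mon == b$mon & r$mday >= b$mday)
  as.integer(years - ifelse(had_birthday, 0L, 1L))
}

# Example values: 145 lb, 68 in, born 1988-07-14, reference date 2020-01-01
bmi <- bmi_imperial(145, 68)
age <- age_on(as.Date("1988-07-14"), as.Date("2020-01-01"))

print(bmi)
print(age)
```

A plain year difference ignores whether the birthday has passed, so it can be one year higher than the completed-years figure computed here.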
#rename column 9 (f1fid) to fid so it can serve as the join key
#note: this attaches fighter attributes for fighter 1 of each fight only
names(fights)[9] <- "fid"

mergedfight <- fights %>% left_join(new_var1, by = "fid")
mergedfight <- mergedfight %>% select(eid, mid, event_name, event_date, f1name:fid, method, round:name, birth_date:Age)
dim(mergedfight)
## [1] 3685   19
str(mergedfight, list.len = 6)
## 'data.frame':    3685 obs. of  19 variables:
##  $ eid       : int  7 7 7 7 7 7 7 7 7 7 ...
##  $ mid       : int  8 8 7 6 6 5 4 3 3 2 ...
##  $ event_name: chr  "UFC 1 - The Beginning" "UFC 1 - The Beginning" "UFC 1 - The Beginning" "UFC 1 - The Beginning" ...
##  $ event_date: Date, format: "1993-11-12" "1993-11-12" ...
##  $ f1name    : chr  "Royce Gracie" "Royce Gracie" "Jason DeLucia" "Royce Gracie" ...
##  $ f2name    : chr  "Gerard Gordeau" "Gerard Gordeau" "Trent Jenkins" "Ken Shamrock" ...
##   [list output truncated]
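
The merged table has 3685 rows although the fights table has 3569. A left join only adds rows when a key value occurs more than once in the right-hand table, so the fighters table most likely contains repeated fid values. The sketch below shows the effect and one possible remedy on toy data; fights_toy and fighters_toy are hypothetical, and keeping the first row per key is an assumption that may not suit every case.

```r
library(dplyr)

# Toy data (hypothetical): fid 19 appears twice in the lookup table
fights_toy   <- data.frame(mid = 1:3, fid = c(19, 22, 23))
fighters_toy <- data.frame(fid  = c(19, 19, 22),
                           name = c("A", "A", "B"))

# Key values that occur more than once in the lookup table
dup_keys <- fighters_toy %>% count(fid) %>% filter(n > 1)

# Joining as-is repeats every fight whose key is duplicated
joined_raw <- left_join(fights_toy, fighters_toy, by = "fid")

# Keeping one row per key preserves the row count of the left table
fighters_unique <- distinct(fighters_toy, fid, .keep_all = TRUE)
joined_clean <- left_join(fights_toy, fighters_unique, by = "fid")

print(dup_keys)
print(nrow(joined_raw))
print(nrow(joined_clean))
```

Comparing nrow() before and after a join is a quick check that no observations were duplicated.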

Tidy & Manipulate Data II

#keep events after 1 January 2013
subset_date <- mergedfight[mergedfight$event_date > "2013-01-01", ]

#tidy dataset
#f1name/f2name hold the fighter names and f1result/f2result hold the fighter results
#gather the two name columns into Fighter (f1/f2) and Name
names(subset_date)[5] <- "f1"
names(subset_date)[6] <- "f2"
tablet1 <- subset_date %>% gather('f1', 'f2', key = "Fighter", value = "Name")

#gather the two result columns into Fighter (F1/F2) and Result
names(tablet1)[5] <- "F1"
names(tablet1)[6] <- "F2"
tablet2 <- tablet1 %>% gather('F1', 'F2', key = "Fighter", value = "Result")
tablet2$Fighter <- as.factor(tablet2$Fighter)
tablet2$Result <- as.factor(tablet2$Result)
#keep events after 1 January 2015
tablet3 <- tablet2[tablet2$event_date > "2015-01-01", ]
#drop the match id (column 2), which is not required
tablet3 <- tablet3[, -2]
head(tablet3, 5)
##       eid                  event_name event_date   fid   method round time
## 915 38841 UFC 182 - Jones vs. Cormier 2015-01-03 27944 Decision     5 5:00
## 916 38841 UFC 182 - Jones vs. Cormier 2015-01-03 15105 Decision     3 5:00
## 917 38841 UFC 182 - Jones vs. Cormier 2015-01-03 33095 Decision     3 5:00
## 918 38841 UFC 182 - Jones vs. Cormier 2015-01-03 64413 Decision     3 5:00
## 919 38841 UFC 182 - Jones vs. Cormier 2015-01-03 11292       NC     3 5:00
##                name birth_date height weight             class      BMI Age
## 915       Jon Jones 1987-07-19     76    205 Light Heavyweight 25.12812  33
## 916  Donald Cerrone 1983-03-29     71    155       Lightweight 21.76949  37
## 917    Brad Tavares 1987-12-21     71    185      Middleweight 25.98294  33
## 918 Kyoji Horiguchi 1990-10-12     65    125         Flyweight 20.94675  30
## 919  Hector Lombard 1978-02-02     69    170      Welterweight 25.28040  42
##                Name Fighter Result
## 915       Jon Jones      F1    win
## 916  Donald Cerrone      F1    win
## 917    Brad Tavares      F1    win
## 918 Kyoji Horiguchi      F1    win
## 919  Hector Lombard      F1     NC

The fights dataset is untidy because the column headers f1name, f2name, f1result and f2result carry a value (which fighter, F1 or F2) rather than only a variable name. Following Hadley Wickham's tidy data principles, these columns were gathered so that each variable has its own column, each observation has its own row and each value has its own cell.
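
One point worth checking is that two gather() calls applied in sequence appear to pair every name row with both result columns, so Name and Result may not refer to the same fighter on every row. As a hedged alternative, the sketch below reshapes the name and result columns together with pivot_longer(), which keeps each fighter's name next to that fighter's result. The toy table wide and its values are hypothetical.

```r
library(tidyr)

# Toy fights table (hypothetical values) in the same wide layout
wide <- data.frame(
  mid      = c(1, 2),
  f1name   = c("A", "C"),
  f2name   = c("B", "D"),
  f1result = c("win", "loss"),
  f2result = c("loss", "win")
)

# The first regex group (f1 / f2) goes to the "fighter" column;
# ".value" makes the second group (name / result) the output column name
long <- pivot_longer(
  wide,
  cols = c(f1name, f2name, f1result, f2result),
  names_to = c("fighter", ".value"),
  names_pattern = "(f[12])(name|result)"
)

print(long)
```

This yields two rows per fight, one per fighter, instead of four.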

Scan I

#find missing values
colSums(is.na(tablet3))
##        eid event_name event_date        fid     method      round       time 
##          0          0          0          0          0          0          0 
##       name birth_date     height     weight      class        BMI        Age 
##         12         68         28         12         12         28         68 
##       Name    Fighter     Result 
##          0          0          0
#several columns contain missing values; impute the numeric ones with the column mean
tablet3$height <- as.numeric(tablet3$height)
tablet3$weight <- as.numeric(tablet3$weight)
tablet3$height[is.na(tablet3$height)] <- mean(tablet3$height, na.rm = TRUE)
tablet3$weight[is.na(tablet3$weight)] <- mean(tablet3$weight, na.rm = TRUE)
tablet3$BMI[is.na(tablet3$BMI)] <- mean(tablet3$BMI, na.rm = TRUE)
tablet3$Age[is.na(tablet3$Age)] <- mean(tablet3$Age, na.rm = TRUE)
colSums(is.na(tablet3))
##        eid event_name event_date        fid     method      round       time 
##          0          0          0          0          0          0          0 
##       name birth_date     height     weight      class        BMI        Age 
##         12         68          0          0         12          0          0 
##       Name    Fighter     Result 
##          0          0          0
#keep a copy of the imputed data for the before/after boxplots
tablet3_duplicate <- tablet3
#find special values
which(!is.finite(tablet3$Age))
## integer(0)
which(!is.finite(tablet3$fid))
## integer(0)
which(!is.finite(tablet3$BMI))
## integer(0)
#sapply(tablet3,is.special())
#sapply(tablet3,is.na())
levels(tablet3$Result)
## [1] "draw" "loss" "NC"   "win"
levels(tablet3$Fighter)
## [1] "F1" "F2"

Explanation: After joining the two datasets, we created two variables, BMI and Age. Several columns contain missing values (NA). For the numeric columns height, weight, BMI and Age, the missing entries were located with is.na() and replaced by the column mean; name, birth_date and class still contain missing values. The numeric columns were then checked for special values (NaN, Inf, -Inf) with is.finite(), and none were found.
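
The two steps described above can be sketched as small helpers. The names impute_mean and is_special and the toy data frame are assumptions made for this illustration.

```r
# Hypothetical helper: replace NA in a numeric vector with the mean of the observed values
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Hypothetical helper: flag NA, NaN, Inf and -Inf in numeric columns,
# and plain NA in columns of any other type
is_special <- function(x) {
  if (is.numeric(x)) !is.finite(x) else is.na(x)
}

# Toy data (hypothetical values)
toy <- data.frame(height = c(68, NA, 76),
                  ratio  = c(1, Inf, 2),
                  class  = c("Flyweight", NA, "Heavyweight"))

toy$height <- impute_mean(toy$height)

# Count special values per column
special_counts <- sapply(toy, function(col) sum(is_special(col)))

print(toy)
print(special_counts)
```

Applying the check to every column at once avoids testing variables one by one.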

Scan II

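Explanation: In this section, outliers in the numeric variables are handled by capping. Values that lie more than 1.5 times the interquartile range below the first quartile or above the third quartile are replaced by the 5th or 95th percentile respectively. The boxplots show each variable before (left) and after (right) capping.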

#cap outliers: values beyond the 1.5 * IQR fences are replaced by the 5th / 95th percentile
cap <- function(x) {
  quantiles <- quantile(x, c(0.05, 0.25, 0.75, 0.95))
  x[x < quantiles[2] - 1.5 * IQR(x)] <- quantiles[1]
  x[x > quantiles[3] + 1.5 * IQR(x)] <- quantiles[4]
  x
}

tablet3$height <- tablet3$height %>% cap()
par(mfrow = c(1,2))
boxplot(tablet3_duplicate$height)
boxplot(tablet3$height)

tablet3$weight <- tablet3$weight %>% cap()
par(mfrow = c(1,2))
boxplot(tablet3_duplicate$weight)
boxplot(tablet3$weight)

tablet3$BMI <- tablet3$BMI%>% cap()
par(mfrow = c(1,2))
boxplot(tablet3_duplicate$BMI)
boxplot(tablet3$BMI)

tablet3$Age <- tablet3$Age %>% cap()
par(mfrow = c(1,2))
boxplot(tablet3_duplicate$Age)
boxplot(tablet3$Age)
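
To make the capping rule concrete, the sketch below applies the same idea to a small made-up vector. The function name cap_outliers and the sample heights are hypothetical; unlike cap(), this version computes both fences before changing any value and tolerates missing values.

```r
# Hypothetical variant of cap(): fences are computed once, NA values are left untouched
cap_outliers <- function(x) {
  q <- quantile(x, c(0.05, 0.25, 0.75, 0.95), na.rm = TRUE)
  iqr <- q[[3]] - q[[2]]
  lower_fence <- q[[2]] - 1.5 * iqr
  upper_fence <- q[[3]] + 1.5 * iqr
  x[!is.na(x) & x < lower_fence] <- q[[1]]  # low outliers -> 5th percentile
  x[!is.na(x) & x > upper_fence] <- q[[4]]  # high outliers -> 95th percentile
  x
}

# Toy sample (hypothetical): one value far above the rest
heights <- c(66, 67, 68, 68, 69, 70, 71, 72, 73, 95)
capped <- cap_outliers(heights)

print(capped)
```

Only the extreme value is replaced; the other nine observations are unchanged.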

#Transformation

hist(tablet3$BMI, breaks = 10, col = blues9)

hist(sqrt(tablet3$BMI), breaks= 10, col = blues9)

Explanation: Comparing the two histograms, the square-root transformation reduces the skewness of BMI and brings its distribution closer to symmetric. The same transformation can be applied to the other numeric variables.
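
The effect of a transformation can also be checked numerically rather than only by eye. The sketch below computes a moment-based sample skewness before and after taking square roots. The helper skewness and the simulated sample are assumptions for illustration and do not use the UFC data.

```r
# Hypothetical helper: moment-based sample skewness (close to 0 for a symmetric sample)
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Simulated right-skewed sample (hypothetical), standing in for a variable such as BMI
set.seed(1)
x <- rlnorm(500, meanlog = 3, sdlog = 0.4)

skew_raw  <- skewness(x)
skew_sqrt <- skewness(sqrt(x))

print(skew_raw)
print(skew_sqrt)
```

A value nearer to zero after the transformation indicates a more symmetric distribution.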