Overview


This vignette will briefly how to apply a Multiple linear regression model to a boxing database made available on Kaggle https://www.kaggle.com/slonsky/boxing-bouts/downloads/boxing-bouts.zip/1. The primary purpose of this regression analysis is to:

  1. Identify the strength of predictors (identify the effect that independent variables have on a dependent variable)
  2. Forecast an effect (Understand how the dependent variable changes with changes to independent variables)

Import the data & loading the neccesary libraries

Use the link above to download the database to a local folder. Make sure that the file is extracted and is in a native .csv format. Make sure the working directory is set to where the file is extracted.

library(tidyverse)
library(dbplyr)
library(ggplot2)

setwd('/Users/desmond/Documents/36103/AT1')
boxdata <- read_csv('bouts_out_new.csv')
glimpse(boxdata)
## Observations: 387,427
## Variables: 26
## $ age_A    <int> 35, 26, 28, 25, 25, 24, 23, 23, 36, 27, 22, 21, 21, 2...
## $ age_B    <int> 27, 31, 26, 29, 35, 31, 31, 31, 23, 22, 28, 40, 32, 3...
## $ height_A <int> 179, 175, 176, 175, 175, 175, 175, 175, 173, 177, 175...
## $ height_B <int> 175, 185, 175, 174, 170, 175, 175, 177, 175, 175, 177...
## $ reach_A  <int> 178, 179, NA, 179, 179, 179, 179, 179, 183, 183, 179,...
## $ reach_B  <int> 179, 185, 179, 180, 170, 178, 188, 175, 179, 179, 175...
## $ stance_A <chr> "orthodox", "orthodox", "orthodox", "orthodox", "orth...
## $ stance_B <chr> "orthodox", "orthodox", "orthodox", "orthodox", "orth...
## $ weight_A <int> 160, 164, 154, 155, 155, NA, 155, 155, 152, NA, 154, ...
## $ weight_B <int> 160, 164, 154, 155, NA, NA, 155, NA, NA, NA, 153, 154...
## $ won_A    <int> 37, 48, 23, 46, 45, 44, 43, 42, 44, 26, 40, 39, 38, 3...
## $ won_B    <int> 49, 50, 47, 31, 40, 32, 19, 22, 42, 41, 30, 46, 33, 2...
## $ lost_A   <int> 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ lost_B   <int> 1, 2, 1, 3, 4, 1, 1, 3, 0, 0, 4, 7, 4, 4, 4, 4, 11, 1...
## $ drawn_A  <int> 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ drawn_B  <int> 1, 1, 1, 0, 0, 0, 2, 0, 1, 1, 0, 1, 1, 2, 0, 2, 2, 6,...
## $ kos_A    <int> 33, 34, 13, 32, 32, 31, 31, 30, 26, 14, 29, 29, 28, 2...
## $ kos_B    <int> 34, 32, 33, 19, 33, 28, 12, 18, 30, 30, 18, 39, 28, 1...
## $ result   <chr> "draw", "win_A", "win_B", "win_A", "win_A", "win_A", ...
## $ decision <chr> "SD", "UD", "KO", "KO", "UD", "KO", "SD", "TKO", "MD"...
## $ judge1_A <int> 110, 120, NA, 47, 118, NA, 115, 89, 116, 112, NA, 119...
## $ judge1_B <int> 118, 108, NA, 48, 110, NA, 113, 82, 112, 115, NA, 109...
## $ judge2_A <int> 115, 120, NA, 49, 119, NA, 117, 88, 114, 109, NA, 119...
## $ judge2_B <int> 113, 108, NA, 46, 109, NA, 111, 83, 114, 118, NA, 109...
## $ judge3_A <int> 114, 120, NA, 48, 117, NA, 113, 89, 117, 111, NA, 118...
## $ judge3_B <int> 114, 108, NA, 47, 111, NA, 115, 82, 111, 116, NA, 110...

Prepare the data

The glimpse function allowed us to see a short form summary of our data set. When this data set was imported, it imported the stance, result and decisions as character strings. We need to change them to leveled factors so we can apply a classification model. Lets set up the data dictionary as per below:

Variable Type of Data Description
age_x INT Years
height_x INT Height (cm)
reach_x INT Reach from one hand to another
stance_x Factor Orthodox / Southpaw
weight_x INT Pounds (lbs)
won_x INT Number of past wins
lost_x INT Number of past losses
drawn_x INT Number of past draws
kos_x INT Number of past wins by Knockout
result Factor Bout result (win_A/win_B/draw)
decision Factor Types of judged decisions
boxdata$stance_A <- as.factor(boxdata$stance_A)
boxdata$stance_B <- as.factor(boxdata$stance_B)
boxdata$result <- as.factor(boxdata$result)
boxdata$decision <- as.factor(boxdata$decision)
glimpse(boxdata)
## Observations: 387,427
## Variables: 26
## $ age_A    <int> 35, 26, 28, 25, 25, 24, 23, 23, 36, 27, 22, 21, 21, 2...
## $ age_B    <int> 27, 31, 26, 29, 35, 31, 31, 31, 23, 22, 28, 40, 32, 3...
## $ height_A <int> 179, 175, 176, 175, 175, 175, 175, 175, 173, 177, 175...
## $ height_B <int> 175, 185, 175, 174, 170, 175, 175, 177, 175, 175, 177...
## $ reach_A  <int> 178, 179, NA, 179, 179, 179, 179, 179, 183, 183, 179,...
## $ reach_B  <int> 179, 185, 179, 180, 170, 178, 188, 175, 179, 179, 175...
## $ stance_A <fct> orthodox, orthodox, orthodox, orthodox, orthodox, ort...
## $ stance_B <fct> orthodox, orthodox, orthodox, orthodox, orthodox, ort...
## $ weight_A <int> 160, 164, 154, 155, 155, NA, 155, 155, 152, NA, 154, ...
## $ weight_B <int> 160, 164, 154, 155, NA, NA, 155, NA, NA, NA, 153, 154...
## $ won_A    <int> 37, 48, 23, 46, 45, 44, 43, 42, 44, 26, 40, 39, 38, 3...
## $ won_B    <int> 49, 50, 47, 31, 40, 32, 19, 22, 42, 41, 30, 46, 33, 2...
## $ lost_A   <int> 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ lost_B   <int> 1, 2, 1, 3, 4, 1, 1, 3, 0, 0, 4, 7, 4, 4, 4, 4, 11, 1...
## $ drawn_A  <int> 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ drawn_B  <int> 1, 1, 1, 0, 0, 0, 2, 0, 1, 1, 0, 1, 1, 2, 0, 2, 2, 6,...
## $ kos_A    <int> 33, 34, 13, 32, 32, 31, 31, 30, 26, 14, 29, 29, 28, 2...
## $ kos_B    <int> 34, 32, 33, 19, 33, 28, 12, 18, 30, 30, 18, 39, 28, 1...
## $ result   <fct> draw, win_A, win_B, win_A, win_A, win_A, win_A, win_A...
## $ decision <fct> SD, UD, KO, KO, UD, KO, SD, TKO, MD, UD, TKO, UD, TKO...
## $ judge1_A <int> 110, 120, NA, 47, 118, NA, 115, 89, 116, 112, NA, 119...
## $ judge1_B <int> 118, 108, NA, 48, 110, NA, 113, 82, 112, 115, NA, 109...
## $ judge2_A <int> 115, 120, NA, 49, 119, NA, 117, 88, 114, 109, NA, 119...
## $ judge2_B <int> 113, 108, NA, 46, 109, NA, 111, 83, 114, 118, NA, 109...
## $ judge3_A <int> 114, 120, NA, 48, 117, NA, 113, 89, 117, 111, NA, 118...
## $ judge3_B <int> 114, 108, NA, 47, 111, NA, 115, 82, 111, 116, NA, 110...

Boxing is separated by different weight classes allowing fighters to have a fair fight. For this vignette, lets take a closer look at the Middle weight (154-160lb or 72.5-75kg inclusive).

boxmid <- as.data.frame(subset(boxdata, weight_A >= 154 & weight_A <= 160, select=c(age_A:decision)))
boxmid$result <- factor(boxmid$result, levels = c("win_A", "draw", "win_B"))

#lets seperate the data from opponent A & opponent B.
boxmid_A <- select(boxmid, age_A, height_A, reach_A, stance_A, weight_A, won_A, lost_A, drawn_A, kos_A, result, decision)
boxmid_B <- select(boxmid, age_B, height_B, reach_B, stance_B, weight_B, won_B, lost_B, drawn_B, kos_B, result, decision)
boxmid_Awin <-as.data.frame(subset(boxmid_A, result == "win_A"))
boxmid_Bwin <-as.data.frame(subset(boxmid_B, result == "win_B"))

lets plot all the variables against the data to see what we have for opponent A and if there is anything exciting. This is also a check for collinearity.

pairs(boxmid_A[, c("age_A","height_A","reach_A","stance_A","weight_A","won_A","lost_A","drawn_A","kos_A")])

Analysis


Lets make a linear regression model for all of opponent A’s and how their past record is a function of all the variables available.

data_A.lm = lm(formula = won_A ~ age_A + height_A + reach_A + stance_A + weight_A + lost_A + drawn_A + kos_A , data = boxmid)
summary(data_A.lm) 
## 
## Call:
## lm(formula = won_A ~ age_A + height_A + reach_A + stance_A + 
##     weight_A + lost_A + drawn_A + kos_A, data = boxmid)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -67.319  -4.694  -1.151   2.673 151.602 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -11.60316   15.18647  -0.764   0.4449    
## age_A              0.08654    0.04756   1.819   0.0689 .  
## height_A          -0.08118    0.05687  -1.427   0.1535    
## reach_A           -0.15795    0.03774  -4.185 2.91e-05 ***
## stance_Asouthpaw  -0.94934    0.50906  -1.865   0.0623 .  
## weight_A           0.34924    0.08586   4.068 4.85e-05 ***
## lost_A             0.89848    0.04241  21.184  < 2e-16 ***
## drawn_A            1.69510    0.03771  44.954  < 2e-16 ***
## kos_A              1.27686    0.01490  85.690  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.99 on 3773 degrees of freedom
##   (7512 observations deleted due to missingness)
## Multiple R-squared:  0.8588, Adjusted R-squared:  0.8585 
## F-statistic:  2869 on 8 and 3773 DF,  p-value: < 2.2e-16

From this summary, we can see that the variables: age, height and stance have a larger P-Value and therefore is creating more noise than predictive power. Lets see if the visual representation of the regression model will give us further insight.

plot(data_A.lm)

Further investigation


From the data above its understandable that the Height and Age of a boxer only creates noise for the model however, as i am also in this weight category, i wonder how my career as a professional boxer will go?

boxmid_Awin$height_factor <- as.factor(as.integer(boxmid_Awin$height_A / 10)*10)

datawithoutNA <- boxmid_Awin %>% filter(!height_factor == "NA")

fill <- "#4271AE"
line <- "#1F3552"
p1 <- ggplot(data = datawithoutNA, aes(x=height_factor, y=age_A, palette = "blues")) + geom_boxplot(fill = fill, colour = line, alpha = 0.7) 
p1 + 
  scale_x_discrete("Height (cm)") +
  scale_y_continuous("Age") +
  geom_jitter(width = 0.1, alpha = 0.1)

Unfortunately, as i’m 180cm, the peak of my career would have been around 5 years ago.

Conclusion


This is a simple approach to applying a linear regression model to a data set and understanding which variables have a key role in a models predictive capabilities. Having a result of 0.8588 multiple R-squared value proves that the model has low percentage of error.