This vignette will briefly how to apply a Multiple linear regression model to a boxing database made available on Kaggle https://www.kaggle.com/slonsky/boxing-bouts/downloads/boxing-bouts.zip/1. The primary purpose of this regression analysis is to:
Use the link above to download the database to a local folder. Make sure that the file is extracted and is in a native .csv format. Make sure the working directory is set to where the file is extracted.
library(tidyverse)
library(dbplyr)
library(ggplot2)
setwd('/Users/desmond/Documents/36103/AT1')
boxdata <- read_csv('bouts_out_new.csv')
glimpse(boxdata)
## Observations: 387,427
## Variables: 26
## $ age_A <int> 35, 26, 28, 25, 25, 24, 23, 23, 36, 27, 22, 21, 21, 2...
## $ age_B <int> 27, 31, 26, 29, 35, 31, 31, 31, 23, 22, 28, 40, 32, 3...
## $ height_A <int> 179, 175, 176, 175, 175, 175, 175, 175, 173, 177, 175...
## $ height_B <int> 175, 185, 175, 174, 170, 175, 175, 177, 175, 175, 177...
## $ reach_A <int> 178, 179, NA, 179, 179, 179, 179, 179, 183, 183, 179,...
## $ reach_B <int> 179, 185, 179, 180, 170, 178, 188, 175, 179, 179, 175...
## $ stance_A <chr> "orthodox", "orthodox", "orthodox", "orthodox", "orth...
## $ stance_B <chr> "orthodox", "orthodox", "orthodox", "orthodox", "orth...
## $ weight_A <int> 160, 164, 154, 155, 155, NA, 155, 155, 152, NA, 154, ...
## $ weight_B <int> 160, 164, 154, 155, NA, NA, 155, NA, NA, NA, 153, 154...
## $ won_A <int> 37, 48, 23, 46, 45, 44, 43, 42, 44, 26, 40, 39, 38, 3...
## $ won_B <int> 49, 50, 47, 31, 40, 32, 19, 22, 42, 41, 30, 46, 33, 2...
## $ lost_A <int> 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ lost_B <int> 1, 2, 1, 3, 4, 1, 1, 3, 0, 0, 4, 7, 4, 4, 4, 4, 11, 1...
## $ drawn_A <int> 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ drawn_B <int> 1, 1, 1, 0, 0, 0, 2, 0, 1, 1, 0, 1, 1, 2, 0, 2, 2, 6,...
## $ kos_A <int> 33, 34, 13, 32, 32, 31, 31, 30, 26, 14, 29, 29, 28, 2...
## $ kos_B <int> 34, 32, 33, 19, 33, 28, 12, 18, 30, 30, 18, 39, 28, 1...
## $ result <chr> "draw", "win_A", "win_B", "win_A", "win_A", "win_A", ...
## $ decision <chr> "SD", "UD", "KO", "KO", "UD", "KO", "SD", "TKO", "MD"...
## $ judge1_A <int> 110, 120, NA, 47, 118, NA, 115, 89, 116, 112, NA, 119...
## $ judge1_B <int> 118, 108, NA, 48, 110, NA, 113, 82, 112, 115, NA, 109...
## $ judge2_A <int> 115, 120, NA, 49, 119, NA, 117, 88, 114, 109, NA, 119...
## $ judge2_B <int> 113, 108, NA, 46, 109, NA, 111, 83, 114, 118, NA, 109...
## $ judge3_A <int> 114, 120, NA, 48, 117, NA, 113, 89, 117, 111, NA, 118...
## $ judge3_B <int> 114, 108, NA, 47, 111, NA, 115, 82, 111, 116, NA, 110...
The glimpse function allowed us to see a short form summary of our data set. When this data set was imported, it imported the stance, result and decisions as character strings. We need to change them to leveled factors so we can apply a classification model. Lets set up the data dictionary as per below:
| Variable | Type of Data | Description |
|---|---|---|
| age_x | INT | Years |
| height_x | INT | Height (cm) |
| reach_x | INT | Reach from one hand to another |
| stance_x | Factor | Orthodox / Southpaw |
| weight_x | INT | Pounds (lbs) |
| won_x | INT | Number of past wins |
| lost_x | INT | Number of past losses |
| drawn_x | INT | Number of past draws |
| kos_x | INT | Number of past wins by Knockout |
| result | Factor | Bout result (win_A/win_B/draw) |
| decision | Factor | Types of judged decisions |
boxdata$stance_A <- as.factor(boxdata$stance_A)
boxdata$stance_B <- as.factor(boxdata$stance_B)
boxdata$result <- as.factor(boxdata$result)
boxdata$decision <- as.factor(boxdata$decision)
glimpse(boxdata)
## Observations: 387,427
## Variables: 26
## $ age_A <int> 35, 26, 28, 25, 25, 24, 23, 23, 36, 27, 22, 21, 21, 2...
## $ age_B <int> 27, 31, 26, 29, 35, 31, 31, 31, 23, 22, 28, 40, 32, 3...
## $ height_A <int> 179, 175, 176, 175, 175, 175, 175, 175, 173, 177, 175...
## $ height_B <int> 175, 185, 175, 174, 170, 175, 175, 177, 175, 175, 177...
## $ reach_A <int> 178, 179, NA, 179, 179, 179, 179, 179, 183, 183, 179,...
## $ reach_B <int> 179, 185, 179, 180, 170, 178, 188, 175, 179, 179, 175...
## $ stance_A <fct> orthodox, orthodox, orthodox, orthodox, orthodox, ort...
## $ stance_B <fct> orthodox, orthodox, orthodox, orthodox, orthodox, ort...
## $ weight_A <int> 160, 164, 154, 155, 155, NA, 155, 155, 152, NA, 154, ...
## $ weight_B <int> 160, 164, 154, 155, NA, NA, 155, NA, NA, NA, 153, 154...
## $ won_A <int> 37, 48, 23, 46, 45, 44, 43, 42, 44, 26, 40, 39, 38, 3...
## $ won_B <int> 49, 50, 47, 31, 40, 32, 19, 22, 42, 41, 30, 46, 33, 2...
## $ lost_A <int> 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ lost_B <int> 1, 2, 1, 3, 4, 1, 1, 3, 0, 0, 4, 7, 4, 4, 4, 4, 11, 1...
## $ drawn_A <int> 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ drawn_B <int> 1, 1, 1, 0, 0, 0, 2, 0, 1, 1, 0, 1, 1, 2, 0, 2, 2, 6,...
## $ kos_A <int> 33, 34, 13, 32, 32, 31, 31, 30, 26, 14, 29, 29, 28, 2...
## $ kos_B <int> 34, 32, 33, 19, 33, 28, 12, 18, 30, 30, 18, 39, 28, 1...
## $ result <fct> draw, win_A, win_B, win_A, win_A, win_A, win_A, win_A...
## $ decision <fct> SD, UD, KO, KO, UD, KO, SD, TKO, MD, UD, TKO, UD, TKO...
## $ judge1_A <int> 110, 120, NA, 47, 118, NA, 115, 89, 116, 112, NA, 119...
## $ judge1_B <int> 118, 108, NA, 48, 110, NA, 113, 82, 112, 115, NA, 109...
## $ judge2_A <int> 115, 120, NA, 49, 119, NA, 117, 88, 114, 109, NA, 119...
## $ judge2_B <int> 113, 108, NA, 46, 109, NA, 111, 83, 114, 118, NA, 109...
## $ judge3_A <int> 114, 120, NA, 48, 117, NA, 113, 89, 117, 111, NA, 118...
## $ judge3_B <int> 114, 108, NA, 47, 111, NA, 115, 82, 111, 116, NA, 110...
Boxing is separated by different weight classes allowing fighters to have a fair fight. For this vignette, lets take a closer look at the Middle weight (154-160lb or 72.5-75kg inclusive).
boxmid <- as.data.frame(subset(boxdata, weight_A >= 154 & weight_A <= 160, select=c(age_A:decision)))
boxmid$result <- factor(boxmid$result, levels = c("win_A", "draw", "win_B"))
#lets seperate the data from opponent A & opponent B.
boxmid_A <- select(boxmid, age_A, height_A, reach_A, stance_A, weight_A, won_A, lost_A, drawn_A, kos_A, result, decision)
boxmid_B <- select(boxmid, age_B, height_B, reach_B, stance_B, weight_B, won_B, lost_B, drawn_B, kos_B, result, decision)
boxmid_Awin <-as.data.frame(subset(boxmid_A, result == "win_A"))
boxmid_Bwin <-as.data.frame(subset(boxmid_B, result == "win_B"))
lets plot all the variables against the data to see what we have for opponent A and if there is anything exciting. This is also a check for collinearity.
pairs(boxmid_A[, c("age_A","height_A","reach_A","stance_A","weight_A","won_A","lost_A","drawn_A","kos_A")])
Lets make a linear regression model for all of opponent A’s and how their past record is a function of all the variables available.
data_A.lm = lm(formula = won_A ~ age_A + height_A + reach_A + stance_A + weight_A + lost_A + drawn_A + kos_A , data = boxmid)
summary(data_A.lm)
##
## Call:
## lm(formula = won_A ~ age_A + height_A + reach_A + stance_A +
## weight_A + lost_A + drawn_A + kos_A, data = boxmid)
##
## Residuals:
## Min 1Q Median 3Q Max
## -67.319 -4.694 -1.151 2.673 151.602
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11.60316 15.18647 -0.764 0.4449
## age_A 0.08654 0.04756 1.819 0.0689 .
## height_A -0.08118 0.05687 -1.427 0.1535
## reach_A -0.15795 0.03774 -4.185 2.91e-05 ***
## stance_Asouthpaw -0.94934 0.50906 -1.865 0.0623 .
## weight_A 0.34924 0.08586 4.068 4.85e-05 ***
## lost_A 0.89848 0.04241 21.184 < 2e-16 ***
## drawn_A 1.69510 0.03771 44.954 < 2e-16 ***
## kos_A 1.27686 0.01490 85.690 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.99 on 3773 degrees of freedom
## (7512 observations deleted due to missingness)
## Multiple R-squared: 0.8588, Adjusted R-squared: 0.8585
## F-statistic: 2869 on 8 and 3773 DF, p-value: < 2.2e-16
From this summary, we can see that the variables: age, height and stance have a larger P-Value and therefore is creating more noise than predictive power. Lets see if the visual representation of the regression model will give us further insight.
plot(data_A.lm)
From the data above its understandable that the Height and Age of a boxer only creates noise for the model however, as i am also in this weight category, i wonder how my career as a professional boxer will go?
boxmid_Awin$height_factor <- as.factor(as.integer(boxmid_Awin$height_A / 10)*10)
datawithoutNA <- boxmid_Awin %>% filter(!height_factor == "NA")
fill <- "#4271AE"
line <- "#1F3552"
p1 <- ggplot(data = datawithoutNA, aes(x=height_factor, y=age_A, palette = "blues")) + geom_boxplot(fill = fill, colour = line, alpha = 0.7)
p1 +
scale_x_discrete("Height (cm)") +
scale_y_continuous("Age") +
geom_jitter(width = 0.1, alpha = 0.1)
Unfortunately, as i’m 180cm, the peak of my career would have been around 5 years ago.
This is a simple approach to applying a linear regression model to a data set and understanding which variables have a key role in a models predictive capabilities. Having a result of 0.8588 multiple R-squared value proves that the model has low percentage of error.