library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.0 ✔ readr 2.2.0
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
pl <- read_csv("C:/Users/bfunk/Downloads/E0.csv")
## Rows: 380 Columns: 120
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): Div, Date, HomeTeam, AwayTeam, FTR, HTR, Referee
## dbl (112): FTHG, FTAG, HTHG, HTAG, HS, AS, HST, AST, HF, AF, HC, AC, HY, AY...
## time (1): Time
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Here I create a binary home-win column to work with: 1 if the home team won, 0 for a draw or an away win. In this data dive I try to determine which variables most strongly affect home wins, and I build a logistic regression model to predict them.
pl <- pl |>
  mutate(home_win = if_else(FTR == "H", 1, 0))
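A quick sanity check on the recoding might look like the sketch below. The toy `FTR` values are made up for illustration and are not taken from E0.csv; the only point is that draws ("D") and away wins ("A") both end up as 0.

```r
# Hypothetical full-time results, NOT taken from E0.csv
toy <- data.frame(FTR = c("H", "D", "A", "H", "A"))

# Same recoding rule as above, written in base R
toy$home_win <- ifelse(toy$FTR == "H", 1, 0)

# Cross-tabulate the result code against the binary outcome
table(toy$FTR, toy$home_win)
```

Running the same `table()` call on `pl$FTR` and `pl$home_win` should show every "H" row in the 1 column and every other row in the 0 column.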
This plots each game by whether the home team won (1 = home win, 0 = any other result) against the number of shots on target the home team had in that game.
pl |>
  ggplot(aes(x = HST, y = home_win)) +
  geom_jitter(width = 0, height = 0.1, shape = "O", size = 3) +
  labs(x = "Home Shots on Target",
       y = "Home Win") +
  scale_y_continuous(breaks = c(0, 1))
I added away shots on target (AST) and the Bet365 home-win odds (B365H) to the logistic model, two more variables that I believe affect winning.
The coefficients tell me that home shots on target increase the chance of a home win, while more away shots on target or higher Bet365 home odds decrease it. This is about what we would expect.
model <- glm(home_win ~ HST + AST + B365H,
             data = pl,
             family = binomial(link = "logit"))
model$coefficients
## (Intercept) HST AST B365H
## 0.1686033 0.3473562 -0.3107811 -0.4223579
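To see how these coefficients turn into a prediction, here is a minimal sketch that plugs them into the logistic formula by hand. The coefficients are copied from the output above; the example match (5 home shots on target, 3 away shots on target, home odds of 2.0) is a made-up assumption, not a row from the data.

```r
# Coefficients copied from the model output above
b <- c(intercept = 0.1686033, HST = 0.3473562,
       AST = -0.3107811, B365H = -0.4223579)

# Hypothetical match (assumption): leading 1 is for the intercept,
# then HST = 5, AST = 3, B365H = 2.0
x <- c(1, 5, 3, 2.0)

log_odds <- sum(b * x)         # linear predictor on the log-odds scale
p_home   <- plogis(log_odds)   # inverse logit: 1 / (1 + exp(-log_odds))
p_home
```

With the fitted object available, `predict(model, newdata = data.frame(HST = 5, AST = 3, B365H = 2), type = "response")` should give the same probability without copying coefficients by hand.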
summary(model)
##
## Call:
## glm(formula = home_win ~ HST + AST + B365H, family = binomial(link = "logit"),
## data = pl)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.16860 0.50313 0.335 0.737544
## HST 0.34736 0.06160 5.639 1.71e-08 ***
## AST -0.31078 0.06869 -4.524 6.06e-06 ***
## B365H -0.42236 0.12615 -3.348 0.000814 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 513.82 on 379 degrees of freedom
## Residual deviance: 382.20 on 376 degrees of freedom
## AIC: 390.2
##
## Number of Fisher Scoring iterations: 5
Null deviance = 513.8
Residual deviance = 382.2
AIC = 390.2
The residual deviance is much lower than the null deviance (a drop of 131.62 on 3 degrees of freedom), telling me the three predictors fit the data much better than an intercept-only model.
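The deviance comparison can be turned into a formal likelihood-ratio test. The sketch below uses only the numbers printed in the summary; the variable names are my own, and the pseudo R-squared line assumes the outcome is ungrouped 0/1 data, where the deviance equals minus twice the log-likelihood.

```r
# Values copied from the summary(model) output
null_dev  <- 513.82; null_df  <- 379
resid_dev <- 382.20; resid_df <- 376

lr_stat <- null_dev - resid_dev   # drop in deviance
lr_df   <- null_df - resid_df     # number of predictors added

# Upper-tail chi-squared probability for the drop in deviance
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# Deviance-based (McFadden-style) pseudo R-squared
pseudo_r2 <- 1 - resid_dev / null_dev

c(lr_stat = lr_stat, lr_df = lr_df, p_value = p_value, pseudo_r2 = pseudo_r2)
```

With the model object available, `anova(model, test = "Chisq")` reports the same kind of test term by term.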
exp(coef(model))
## (Intercept) HST AST B365H
## 1.1836505 1.4153207 0.7328743 0.6554994
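Exponentiating the coefficients gives odds ratios. Each additional home shot on target multiplies the odds of a home win by 1.415, a 41.5% increase. Each additional away shot on target brings a 26.7% decrease in the odds; as expected, the more shots on target you give up, the more likely you are to lose. Each one-unit increase in B365H brings a 34.5% decrease in the odds. Per unit, B365H has the largest effect, but the predictors are on different scales (shots on target versus betting odds), so these per-unit changes are not directly comparable. Judged by the z values in the summary, HST has the strongest evidence of an effect (z = 5.64), followed by AST (z = -4.52) and B365H (z = -3.35).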
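The percent changes come from subtracting 1 from each odds ratio and multiplying by 100. A small sketch using the odds ratios printed above (the variable names are my own):

```r
# Odds ratios copied from the exp(coef(model)) output
odds_ratio <- c(HST = 1.4153207, AST = 0.7328743, B365H = 0.6554994)

# Percent change in the odds of a home win per one-unit increase
pct_change <- (odds_ratio - 1) * 100
round(pct_change, 1)
```

A positive value is an increase in the odds and a negative value is a decrease.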
coef(summary(model))
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.1686033 0.50313186 0.3351076 7.375440e-01
## HST 0.3473562 0.06159630 5.6392377 1.708046e-08
## AST -0.3107811 0.06869199 -4.5242699 6.060434e-06
## B365H -0.4223579 0.12614842 -3.3481026 8.136688e-04
All three p-values are very small (each below 0.001), meaning HST, AST, and B365H are statistically significant. For each variable we can reject the null hypothesis that it has no association with the home result, holding the other two predictors fixed.
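The z values and p-values in the table are Wald statistics: each estimate divided by its standard error, compared against a standard normal distribution. A sketch that reproduces them from the printed estimates and standard errors:

```r
# Estimates and standard errors copied from coef(summary(model))
est <- c(HST = 0.3473562, AST = -0.3107811, B365H = -0.4223579)
se  <- c(HST = 0.06159630, AST = 0.06869199, B365H = 0.12614842)

z <- est / se               # Wald z statistic
p <- 2 * pnorm(-abs(z))     # two-sided p-value under a standard normal

cbind(z = z, p = p)
```

Small differences in the last digits are expected because the inputs are rounded.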
beta_hst <- coef(summary(model))["HST", "Estimate"]
se_hst <- coef(summary(model))["HST", "Std. Error"]
ci_hst <- c(beta_hst - 1.96 * se_hst,
            beta_hst + 1.96 * se_hst)
ci_hst
## [1] 0.2266274 0.4680849
exp(ci_hst)
## [1] 1.254362 1.596933
On the log-odds scale, the 95% confidence interval for the HST coefficient is 0.227 to 0.468. Since it does not contain 0, HST is significant at the 5% level when predicting home wins.
On the odds-ratio scale the interval is 1.254 to 1.597, so each additional home shot on target increases the odds of a home win by roughly 25% to 60%.
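The same Wald interval can be computed for all three predictors at once. This is a sketch under the same normal approximation used above, with estimates and standard errors copied from the coefficient table; it uses `qnorm(0.975)` in place of the rounded 1.96, so the last digits may differ slightly.

```r
# Estimates and standard errors copied from coef(summary(model))
est <- c(HST = 0.3473562, AST = -0.3107811, B365H = -0.4223579)
se  <- c(HST = 0.06159630, AST = 0.06869199, B365H = 0.12614842)

z_crit <- qnorm(0.975)   # about 1.96 for a 95% interval

# Wald intervals on the log-odds scale
ci_log_odds <- cbind(lower = est - z_crit * se,
                     upper = est + z_crit * se)

# Exponentiate to get intervals on the odds-ratio scale
ci_odds_ratio <- exp(ci_log_odds)

round(ci_log_odds, 3)
round(ci_odds_ratio, 3)
```

With the fitted object available, `confint.default(model)` should return the same log-odds intervals directly.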