library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ✔ readr 2.1.5
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(stats)
library(ggthemes)
library(purrr)
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
data <- read.csv("~/Documents/Rdocs/matches.csv", stringsAsFactors = TRUE)
head(data)
For this logistic regression model, I will use the toss_decision column as the binary target variable, encoding “field” as 1 and “bat” as 0. I chose this variable because toss decision has more variance (a more even split between its two values) than super_over in this dataset. Let's build the model using result_margin, target_runs, and result as explanatory variables. Since result is categorical, let's convert it to a factor, and filter out the rows where result_margin or target_runs is missing.
data <- data %>%
  mutate(toss_decision_binary = ifelse(toss_decision == "field", 1, 0))
data <- data %>%
  filter(!is.na(result_margin) & !is.na(target_runs))
data$result <- as.factor(data$result)
str(data)
## 'data.frame': 1076 obs. of 21 variables:
## $ id : int 335982 335983 335984 335985 335986 335987 335988 335989 335990 335991 ...
## $ season : Factor w/ 17 levels "2007/08","2009",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ city : Factor w/ 36 levels "Abu Dhabi","Ahmedabad",..: 3 8 11 27 24 19 17 9 17 8 ...
## $ date : Factor w/ 823 levels "2008-04-18","2008-04-19",..: 1 2 2 3 3 4 5 6 7 8 ...
## $ match_type : Factor w/ 8 levels "3rd Place Play-Off",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ player_of_match : Factor w/ 291 levels "A Chandila","A Kumble",..: 38 151 152 175 58 263 278 158 288 114 ...
## $ venue : Factor w/ 58 levels "Arun Jaitley Stadium",..: 24 41 17 56 15 47 43 28 43 41 ...
## $ team1 : Factor w/ 19 levels "Chennai Super Kings",..: 17 7 4 11 9 14 2 1 2 7 ...
## $ team2 : Factor w/ 19 levels "Chennai Super Kings",..: 9 1 14 17 2 7 4 11 14 11 ...
## $ toss_winner : Factor w/ 19 levels "Chennai Super Kings",..: 17 1 14 11 2 7 2 11 14 11 ...
## $ toss_decision : Factor w/ 2 levels "bat","field": 2 1 1 1 1 1 1 2 2 2 ...
## $ winner : Factor w/ 19 levels "Chennai Super Kings",..: 9 1 4 17 9 14 4 1 14 7 ...
## $ result : Factor w/ 4 levels "no result","runs",..: 2 2 4 4 4 4 4 2 4 2 ...
## $ result_margin : int 140 33 9 5 5 6 9 6 3 66 ...
## $ target_runs : int 223 241 130 166 111 167 143 209 215 183 ...
## $ target_overs : num 20 20 20 20 20 20 20 20 20 20 ...
## $ super_over : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ method : Factor w/ 1 level "D/L": NA NA NA NA NA NA NA NA NA NA ...
## $ umpire1 : Factor w/ 62 levels "A Deshmukh","A Nand Kishore",..: 8 35 6 52 11 6 25 19 8 6 ...
## $ umpire2 : Factor w/ 62 levels "A Deshmukh","A Nand Kishore",..: 42 53 16 15 25 41 5 16 31 5 ...
## $ toss_decision_binary: num 1 0 0 0 0 0 0 1 1 1 ...
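The encoding step can be sanity-checked on a small example. The sketch below uses a made-up five-element vector as a stand-in for the toss_decision column (the values are illustrative and not taken from matches.csv) and cross-tabulates labels against codes to confirm that each label maps to exactly one code.

```r
# Toy stand-in for the toss_decision column (illustrative values only)
toss_decision <- factor(c("field", "bat", "bat", "field", "field"))

# Same encoding as in the analysis: "field" -> 1, "bat" -> 0
toss_decision_binary <- ifelse(toss_decision == "field", 1, 0)

# Cross-tabulate labels against codes; each row should have one non-zero cell
check <- table(toss_decision, toss_decision_binary)
print(check)
```

On the real data, the same check would be `table(data$toss_decision, data$toss_decision_binary)`.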
toss_decision_binary has been added as a numeric 0/1 column. Before fitting the logistic regression model, let's summarize the variables that go into it.
summary(data %>% select(toss_decision_binary, result_margin, target_runs, result))
## toss_decision_binary result_margin target_runs result
## Min. :0.0000 Min. : 1.00 Min. : 43.0 no result: 0
## 1st Qu.:0.0000 1st Qu.: 6.00 1st Qu.:146.0 runs :498
## Median :1.0000 Median : 8.00 Median :166.0 tie : 0
## Mean :0.6431 Mean : 17.26 Mean :165.8 wickets :578
## 3rd Qu.:1.0000 3rd Qu.: 20.00 3rd Qu.:187.0
## Max. :1.0000 Max. :146.00 Max. :288.0
table(data$toss_decision_binary)
##
## 0 1
## 384 692
Because “field” is coded as 1, the toss decision has 692 “field” and 384 “bat” decisions, so teams chose to field in about 64% of these matches. Let's model the toss decision with the explanatory variables result_margin, target_runs, and result.
model <- glm(toss_decision_binary ~ result_margin + target_runs + result,
             data = data, family = binomial)
summary(model)
##
## Call:
## glm(formula = toss_decision_binary ~ result_margin + target_runs +
## result, family = binomial, data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.875662 0.385253 -2.273 0.023 *
## result_margin -0.001293 0.003674 -0.352 0.725
## target_runs 0.008452 0.002162 3.910 9.22e-05 ***
## resultwickets 0.175578 0.157407 1.115 0.265
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1402.2 on 1075 degrees of freedom
## Residual deviance: 1386.2 on 1072 degrees of freedom
## AIC: 1394.2
##
## Number of Fisher Scoring iterations: 4
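These coefficients indicate the effect of each variable on the likelihood of choosing to field after winning the toss.
In logistic regression, each coefficient represents the change in the log-odds of choosing to field (as opposed to bat) for a one-unit increase in the predictor, holding the other predictors fixed. Exponentiating a coefficient gives the corresponding odds ratio.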
coef_summary <- summary(model)$coefficients
coef_summary
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.875661663 0.385252521 -2.2729551 2.302889e-02
## result_margin -0.001292919 0.003673809 -0.3519288 7.248916e-01
## target_runs 0.008452262 0.002161518 3.9103366 9.216761e-05
## resultwickets 0.175578184 0.157406839 1.1154419 2.646611e-01
The intercept estimate of -0.88, with a standard error of 0.39 and a z-value of -2.27, gives the baseline log-odds of fielding when result_margin and target_runs are both 0 and the result is the reference level “runs”. The z-value is significant at the 5% level (p = 0.023), although a match with a target of 0 runs is outside the range of the data, so this baseline is mainly a reference point. The probability is e to the power of the log-odds, divided by 1 plus e to the power of the log-odds, so the baseline probability of fielding is around 29%.
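The log-odds to probability conversion can be reproduced in a few lines. This is a minimal sketch that hard-codes the intercept from the summary output rather than reading it from the fitted model; `plogis()` is base R's inverse-logit function.

```r
# Intercept (baseline log-odds) copied from the model summary
intercept <- -0.875662

# Inverse logit written out: p = e^x / (1 + e^x)
p_manual <- exp(intercept) / (1 + exp(intercept))

# Base R's plogis() computes the same inverse logit
p_builtin <- plogis(intercept)

print(round(c(manual = p_manual, builtin = p_builtin), 3))
```

With the fitted model in the session, `plogis(coef(model)[["(Intercept)"]])` should give the same baseline probability.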
The coefficient for result_margin is -0.0013. A negative coefficient means that as the result margin increases, the log-odds of choosing to field decrease, so lower result margins would be associated with a slight preference to field. However, with a p-value of 0.725 this estimate is not statistically distinguishable from zero, so the model gives no real evidence of a tactical link between result margin and the toss decision.
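The target_runs coefficient of 0.008 is the only clearly significant predictor (p < 0.001): each additional run in the target is associated with an increase of about 0.008 in the log-odds of fielding. Higher target scores are therefore associated with the decision to field, possibly because teams prefer to chase in high-scoring conditions.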
The coefficient for each result category indicates its effect on the toss decision relative to the base result type, which here is “runs”. The 0.18 coefficient for the “wickets” result points toward fielding being more common in matches won by wickets, which would be consistent with the toss winner chasing successfully, but with a p-value of 0.265 this effect is not statistically significant.
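Coefficients on the log-odds scale are easier to read as odds ratios. The sketch below hard-codes the three slope estimates from the coefficient table (instead of calling `coef(model)`) and exponentiates them; the 10-run scaling is an illustrative choice, not something computed in the original analysis.

```r
# Estimates copied from the coefficient table (log-odds scale)
log_odds <- c(result_margin = -0.001292919,
              target_runs   =  0.008452262,
              resultwickets =  0.175578184)

# Exponentiating a log-odds coefficient gives an odds ratio
odds_ratios <- exp(log_odds)
print(round(odds_ratios, 4))

# Odds ratio for a 10-run increase in target_runs: exp(10 * beta)
or_10_runs <- exp(10 * log_odds[["target_runs"]])
print(round(or_10_runs, 3))
```

An odds ratio above 1 means higher odds of fielding and below 1 means lower odds. Under this model, a target 10 runs higher corresponds to roughly 9% higher odds of having chosen to field.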
Now let's build 95% confidence intervals for the model coefficients. For reference, the standard error of the result_margin coefficient is about 0.0037.
confint(model)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -1.635834987 -0.123675807
## result_margin -0.008407196 0.006039256
## target_runs 0.004244723 0.012728498
## resultwickets -0.133351410 0.484074250
ci_result_margin <- confint(model)["result_margin", ]
## Waiting for profiling to be done...
cat("95% Confidence Interval for result_margin:", ci_result_margin)
## 95% Confidence Interval for result_margin: -0.008407196 0.006039256
The confidence interval for the result_margin coefficient ranges from -0.008 to 0.006. This means we are 95% confident that the true effect of a one-unit increase in result margin on the log-odds of choosing to field lies between -0.008 and 0.006.
Because this interval contains 0, result_margin does not have a statistically significant effect on the toss decision at the 5% level. There is no strong evidence that result_margin is meaningfully associated with the likelihood of choosing to field or bat after winning the toss.
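As a cross-check, the interval can be approximated by hand. `confint()` on a glm reports profile-likelihood intervals, so the sketch below, which builds a simpler Wald interval (estimate plus or minus about 1.96 standard errors) from the hard-coded estimate and standard error, is expected to be close to the reported interval but not identical to it.

```r
# Estimate and standard error for result_margin, copied from the summary
estimate  <- -0.001292919
std_error <-  0.003673809

# Wald 95% interval: estimate +/- z * SE, with z = qnorm(0.975) (about 1.96)
z <- qnorm(0.975)
wald_ci <- c(lower = estimate - z * std_error,
             upper = estimate + z * std_error)
print(round(wald_ci, 6))

# Same interval on the odds-ratio scale; it spans 1 when the log-odds interval spans 0
print(round(exp(wald_ci), 4))
```

With the fitted model available, `confint.default(model)` should return Wald intervals for all coefficients at once.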