Let's start by loading the libraries required for this notebook.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ✔ readr 2.1.5
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(stats)
library(ggthemes)
library(purrr)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:purrr':
##
## some
##
## The following object is masked from 'package:dplyr':
##
## recode
library(pwr)
library(broom)
Let's read in the data and inspect its structure to get a basic idea of its contents.
matches_data <- read.csv("~/Documents/Rdocs/matches.csv", stringsAsFactors = TRUE)
# The original call was str(data), which printed the signature of base R's data() function instead of the structure of the data frame.
str(matches_data)
Let's convert the toss decision and super-over columns to factors and build a linear model of result margin.
matches_data$toss_decision <- as.factor(matches_data$toss_decision)
matches_data$super_over <- as.factor(matches_data$super_over)
lm_model <- lm(result_margin ~ target_runs + toss_decision + result, data=matches_data)
# Display model summary
summary(lm_model)
##
## Call:
## lm(formula = result_margin ~ target_runs + toss_decision + result,
## data = matches_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.082 -7.971 -1.549 4.033 111.332
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.95182 3.26674 1.516 0.130
## target_runs 0.14146 0.01766 8.012 2.94e-15 ***
## toss_decisionfield -0.41587 1.13671 -0.366 0.715
## resultwickets -20.25238 1.17633 -17.217 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.73 on 1072 degrees of freedom
## (19 observations deleted due to missingness)
## Multiple R-squared: 0.3395, Adjusted R-squared: 0.3377
## F-statistic: 183.7 on 3 and 1072 DF, p-value: < 2.2e-16
The model has statistically significant predictors: target_runs and result are both significant at the 0.001 level, while toss_decision is not (p = 0.715). However, the R-squared of about 0.34 means that roughly two-thirds of the variation in result margin is not explained by target runs, toss decision, and result type alone.
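To make the R-squared figure concrete, here is a minimal sketch on simulated data. The data frame `sim` and its columns are hypothetical and are not the matches data. It shows that the value reported by `summary()` equals one minus the ratio of the residual sum of squares to the total sum of squares.

```r
# Simulated data for illustration only; this is NOT the matches dataset.
set.seed(42)
sim <- data.frame(x = runif(200, 100, 250))
sim$y <- 5 + 0.14 * sim$x + rnorm(200, sd = 15)

fit <- lm(y ~ x, data = sim)

# R-squared = 1 - RSS / TSS
rss <- sum(residuals(fit)^2)           # variation left unexplained by the model
tss <- sum((sim$y - mean(sim$y))^2)    # total variation in the response
r2_manual <- 1 - rss / tss

# The hand-computed value should agree with the one from summary()
r2_summary <- summary(fit)$r.squared
print(all.equal(r2_manual, r2_summary))
```

Read this way, an R-squared of 0.34 says the residual sum of squares is still about 66% of the total.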
Let's check for multicollinearity and plot the diagnostic visuals for further analysis.
vif(lm_model)
## target_runs toss_decision result
## 1.192633 1.014956 1.177402
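All three values are close to 1, which points to little collinearity among the predictors. As a rough illustration of what a VIF measures, the sketch below computes it by hand on simulated numeric predictors as 1 / (1 - R²), where R² comes from regressing one predictor on the others. The names `sim`, `x1`, `x2`, `x3` and `manual_vif` are hypothetical. Note that `car::vif()` reports a generalized VIF for factor terms with more than one degree of freedom, which this sketch does not cover.

```r
# Simulated data for illustration only; this is NOT the matches dataset.
set.seed(1)
n  <- 300
x1 <- rnorm(n)
x2 <- 0.4 * x1 + rnorm(n)   # mildly correlated with x1 by construction
x3 <- rnorm(n)              # independent of the others
sim <- data.frame(x1, x2, x3)

# VIF for one predictor: regress it on the remaining predictors
manual_vif <- function(data, target, others) {
  aux <- lm(reformulate(others, response = target), data = data)
  1 / (1 - summary(aux)$r.squared)
}

predictors <- c("x1", "x2", "x3")
vifs <- sapply(predictors, function(p) {
  manual_vif(sim, p, setdiff(predictors, p))
})
print(round(vifs, 3))
```

For purely numeric predictors the same values appear on the diagonal of the inverse correlation matrix, `diag(solve(cor(sim[predictors])))`.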
par(mfrow=c(2,2))
plot(lm_model)
1. Residuals vs Fitted Plot: This plot helps to check the assumptions of linearity and homoscedasticity. Ideally, it should show a random scatter of points around zero. Here there appears to be a slight pattern, indicating potential non-linearity or non-constant variance (heteroscedasticity) of the residuals.
2. Normal Q-Q Plot: The purpose of this plot is to verify the assumption that residuals are normally distributed. If residuals are normally distributed, they should fall approximately along the reference line. The Q-Q plot deviates from the line at the ends, suggesting that residuals may have heavy tails and may not be normally distributed.
3. Scale-Location Plot: This plot shows whether residuals are spread equally across the range of fitted values (homoscedasticity). The pattern seen here, with the spread increasing for larger fitted values, indicates possible heteroscedasticity.
4. Residuals vs Leverage Plot: This helps to identify influential cases that might have an undue influence on the model. The plot indicates a few points with higher leverage, but they don’t appear to be influential enough to impact the regression line significantly.
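The visual impressions above could be complemented with formal tests. The following is a sketch under stated assumptions, using only base R on simulated data (the variables `x`, `y` and `fit` are hypothetical, not the matches model). It runs a Shapiro-Wilk test on the residuals and computes a studentized Breusch-Pagan statistic by hand as n times the R² from regressing the squared residuals on the predictor.

```r
# Simulated data for illustration only; this is NOT the matches dataset.
set.seed(7)
n <- 400
x <- runif(n, 100, 250)
# Error spread grows with x, so the data are heteroscedastic by construction
y <- 5 + 0.14 * x + rnorm(n, sd = 0.1 * x)
fit <- lm(y ~ x)

# Normality of residuals: a small p-value suggests non-normal residuals
sw <- shapiro.test(residuals(fit))

# Studentized Breusch-Pagan statistic computed by hand:
# regress squared residuals on the predictor and use n * R-squared
res_sq  <- residuals(fit)^2
aux     <- lm(res_sq ~ x)
bp_stat <- n * summary(aux)$r.squared
bp_df   <- 1  # number of predictors in the auxiliary regression
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

cat("Shapiro-Wilk p-value:", format.pval(sw$p.value), "\n")
cat("Breusch-Pagan statistic:", round(bp_stat, 2),
    "p-value:", format.pval(bp_p), "\n")
```

A small Breusch-Pagan p-value would support the heteroscedasticity suggested by the Scale-Location plot. In that case heteroscedasticity-robust standard errors or a transformed response may be worth considering.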
coef_summary <- summary(lm_model)$coefficients
coef_summary
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.9518153 3.2667374 1.5158290 1.298574e-01
## target_runs 0.1414642 0.0176573 8.0116541 2.935078e-15
## toss_decisionfield -0.4158705 1.1367104 -0.3658544 7.145459e-01
## resultwickets -20.2523842 1.1763288 -17.2166014 7.757482e-59
Coefficient for target_runs: The coefficient is positive (0.14) and highly significant (p < 0.001), so higher targets are associated with larger result margins, although the change per run is small. This may suggest that matches with high targets tend to be decided by larger margins.
Coefficient Insight: Holding toss decision and result type constant, each additional run in the target is associated with an increase of about 0.14 in the expected result margin.
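The per-unit reading of a slope can be checked directly with `predict()`. This sketch uses simulated data with the hypothetical columns `target` and `margin` (not the matches data). It shows that two predictions whose inputs differ by 10 differ by exactly 10 times the fitted slope.

```r
# Simulated data for illustration only; this is NOT the matches dataset.
set.seed(123)
sim <- data.frame(target = runif(150, 100, 250))
sim$margin <- 5 + 0.14 * sim$target + rnorm(150, sd = 10)

fit   <- lm(margin ~ target, data = sim)
slope <- coef(fit)[["target"]]

# Predict the margin at two targets that are 10 runs apart
new_points <- data.frame(target = c(160, 170))
preds      <- predict(fit, newdata = new_points)
diff_pred  <- unname(preds[2] - preds[1])

# The difference in predictions equals 10 * slope
print(all.equal(diff_pred, 10 * slope))
```

Applied to the fitted coefficient of 0.14, a target that is 10 runs higher corresponds to an expected margin about 1.4 higher, other predictors held fixed.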
resultwic_coef <- coef(lm_model)["resultwickets"]
cat("Holding the other predictors constant, matches won by wickets have an expected result margin that differs from the reference result level by", resultwic_coef, "units.")
## Holding the other predictors constant, matches won by wickets have an expected result margin that differs from the reference result level by -20.25238 units.
The coefficient for resultwickets is -20.25. Because result is a categorical predictor, this coefficient does not describe a change per run. Instead, it compares matches won by wickets with the reference level of result (presumably matches won by runs). Holding target runs and toss decision constant, the expected result margin is about 20 units lower for wins by wickets than for the reference level. This is not surprising: a win by wickets has a margin of at most 10, whereas a win by runs can have a much larger margin. The two kinds of margin are measured in different units, so whether they should be modelled together in a single response requires further investigation.
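To illustrate how a coefficient on a factor level should be read, here is a minimal sketch on simulated data. The data frame `sim`, its columns, and the margin ranges are all hypothetical. In a model that contains only a two-level factor, the coefficient for the non-reference level equals the difference between the two group means.

```r
# Simulated data for illustration only; this is NOT the matches dataset.
set.seed(99)
sim <- data.frame(
  result = factor(rep(c("runs", "wickets"), each = 100))
)
# Hypothetical margins: run margins span a wide range, wicket margins lie in 1..10
sim$margin <- ifelse(sim$result == "runs",
                     round(runif(200, 1, 80)),
                     sample(1:10, 200, replace = TRUE))

fit <- lm(margin ~ result, data = sim)

# Coefficient for the "wickets" level versus the difference in group means
group_means  <- tapply(sim$margin, sim$result, mean)
coef_wickets <- coef(fit)[["resultwickets"]]
mean_diff    <- unname(group_means["wickets"] - group_means["runs"])

print(all.equal(coef_wickets, mean_diff))
print(levels(sim$result)[1])  # the reference level is the first factor level
```

With additional predictors in the model, the coefficient becomes an adjusted difference between the levels rather than a raw difference in means, but the comparison against the reference level is the same.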