library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
data <- read.csv("~/Documents/Rdocs/matches.csv", stringsAsFactors = TRUE)

head(data)

Let's include three variables based on their potential relationships with the response variable, result_margin.

1. target_runs (already included last time): a continuous predictor likely to influence result_margin.
2. super_over: a binary predictor that indicates whether the match went to a tie-breaking over because both teams finished on the same score. Since tied games have very small result margins, this might affect result_margin.
3. toss_decision: converted to binary (bat = 1, field = 0) to see whether the decision to bat or field first influences the outcome margin. It showed only a weak relationship when checked last time.

Interaction term between target_runs and super_over: tests whether the relationship between target_runs and result_margin differs when the match goes to a super over.

# Remove missing values for these variables
data <- na.omit(data[, c("result_margin", "target_runs", "super_over", "toss_decision")])

# Convert 'super_over' to a binary numeric variable (N = 0, Y = 1)
data$super_over <- as.numeric(data$super_over == "Y")

# Convert 'toss_decision' to binary (bat = 1, field = 0)
data$toss_decision <- ifelse(data$toss_decision == "bat", 1, 0)

lr_model <- lm(result_margin ~ target_runs + super_over + toss_decision + target_runs:super_over, data = data)
summary(lr_model)
## 
## Call:
## lm(formula = result_margin ~ target_runs + super_over + toss_decision + 
##     target_runs:super_over, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.762 -11.929  -5.659   5.039 116.952 
## 
## Coefficients: (2 not defined because of singularities)
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -26.21774    3.19065  -8.217 5.96e-16 ***
## target_runs              0.25946    0.01838  14.119  < 2e-16 ***
## super_over                    NA         NA      NA       NA    
## toss_decision            1.28726    1.28241   1.004    0.316    
## target_runs:super_over        NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.02 on 1073 degrees of freedom
## Multiple R-squared:  0.1569, Adjusted R-squared:  0.1553 
## F-statistic: 99.85 on 2 and 1073 DF,  p-value: < 2.2e-16
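
The two NA rows in the summary above mean that lm() could not estimate super_over or its interaction. A common cause is a predictor that takes only one value in the rows that remain after na.omit(). The sketch below reproduces that situation on a synthetic data frame (toy and all of its values are made up for illustration, not taken from matches.csv) and shows two quick checks: listing the NA coefficients and counting distinct values per column.

```r
# Synthetic stand-in for the match data; every value here is invented.
set.seed(1)
n <- 200
toy <- data.frame(
  target_runs   = round(runif(n, 80, 240)),
  super_over    = rep(0, n),             # assumption: no super-over rows survive
  toss_decision = rbinom(n, 1, 0.4)
)
toy$result_margin <- 0.25 * toy$target_runs + rnorm(n, sd = 20)

toy_fit <- lm(result_margin ~ target_runs + super_over + toss_decision +
                target_runs:super_over, data = toy)

# Coefficients that lm() could not estimate are returned as NA.
dropped <- names(coef(toy_fit))[is.na(coef(toy_fit))]
print(dropped)

# Count distinct values per column; a count of 1 means the column is constant.
n_distinct_vals <- sapply(toy, function(x) length(unique(x)))
print(n_distinct_vals)
```

Running the same two checks on the real data (for example, table(data$super_over) after the na.omit() step) would show whether this explanation applies here.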

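target_runs: Keeps the relationship between the target score and result margin; it’s reasonable to expect higher target runs to lead to wider winning margins.
super_over: A tie-breaking game is usually close, so we would expect matches with a super over to have lower margins. In the fitted model, however, this coefficient is reported as NA (not defined because of singularities), so its effect could not be estimated.
toss_decision: Whether a team chose to bat first might influence their margin of win or loss. Its coefficient (1.29, p = 0.316) is not statistically significant.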

Interaction term (target_runs:super_over): Tests whether the relationship between target score and result margin changes in a super over scenario. It is also reported as NA in the output above.

A VIF value above 5 indicates potential multicollinearity issues. Let's also diagnose the model using several plots.
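
The VIF check itself can be sketched as follows. With the car package loaded, the usual call would be car::vif(model), although it is expected to stop with an error when the model has aliased (NA) coefficients, as this one does. The base R version below computes the same quantity by hand, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors. The data frame toy, the helper manual_vif, and the column target_copy are hypothetical and exist only to show one predictor crossing the threshold of 5.

```r
# Synthetic data; target_copy is a deliberately collinear, made-up column.
set.seed(2)
n <- 300
toy <- data.frame(
  target_runs   = runif(n, 80, 240),
  toss_decision = rbinom(n, 1, 0.4)
)
toy$target_copy <- toy$target_runs + rnorm(n, sd = 5)

# VIF for each predictor: regress it on the remaining predictors.
manual_vif <- function(df, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = df))$r.squared
    1 / (1 - r2)
  })
}

vifs <- manual_vif(toy, c("target_runs", "toss_decision", "target_copy"))
print(round(vifs, 2))

# Predictors above the rule-of-thumb threshold of 5.
high_vif <- names(vifs)[vifs > 5]
print(high_vif)
```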

These five plots were discussed in class. Let's use them to identify issues in this model.

Residuals vs. Fitted values: Assesses linearity and homoscedasticity (constant variance).
Residuals vs. Predictor values: Checks if residuals are randomly distributed across predictor variables.
Residual Histogram: Examines the normality of residuals.
Q-Q Plot: Checks if residuals follow a normal distribution.
Cook’s Distance by Observation: Identifies influential points that may disproportionately affect the model.

Let's plot these five visualizations to diagnose this model.

model <- lm(result_margin ~ target_runs + super_over + toss_decision, data = data)

plot(model$fitted.values, model$residuals, 
     main = "Residuals vs Fitted Values",
     xlab = "Fitted Values",
     ylab = "Residuals")
abline(h = 0, col = "red")

The residuals show a pattern, which suggests that the model does not adequately capture the relationship and that additional predictors might be needed. Had there been no pattern, the linearity and homoscedasticity assumptions would likely be met.

plot(data$target_runs, model$residuals, 
     main = "Residuals vs Target Runs",
     xlab = "Target Runs",
     ylab = "Residuals")
abline(h = 0, col = "red")

Residuals should show no systematic pattern across target_runs. If residuals fan out or form a shape, it may indicate heteroscedasticity or a non-linear relationship. Because we can see a shape in this plot, the assumption does not appear to be met, and this pattern is an issue that requires attention.

hist(model$residuals, breaks = 20, 
     main = "Histogram of Residuals",
     xlab = "Residuals")

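The histogram shows significant skewness, which suggests deviations from normality. This is consistent with the residual summary above (median -5.659, maximum 116.952), which points to a long right tail. Such deviations may call for data transformations or robust regression methods.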
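
As one possible illustration of a transformation, the sketch below fits a model on a right-skewed synthetic response with and without log1p() and compares the skewness of the residuals. Everything here is an assumption made for the example: the toy data, the lognormal noise, and the small skewness helper (base R has no built-in skewness function). Whether a log transform suits the real result_margin would need to be checked on the actual data.

```r
# Synthetic, right-skewed response; values are invented for illustration.
set.seed(3)
n <- 500
toy <- data.frame(target_runs = runif(n, 80, 240))
toy$result_margin <- exp(1.5 + 0.01 * toy$target_runs + rnorm(n, sd = 0.5))

# Simple moment-based sample skewness (0 for a symmetric distribution).
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

fit_raw <- lm(result_margin ~ target_runs, data = toy)
fit_log <- lm(log1p(result_margin) ~ target_runs, data = toy)  # log1p tolerates zeros

skew_raw <- skewness(residuals(fit_raw))
skew_log <- skewness(residuals(fit_log))
print(c(raw = skew_raw, log1p = skew_log))
```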

qqnorm(model$residuals)
qqline(model$residuals, col = "red")

Points close to the line indicate normality, while deviations, particularly at the ends, suggest non-normality. Here, most points follow the line, but some points at the ends deviate from it, which suggests the residuals are not perfectly normal.
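
A formal test can complement the visual reading of the Q-Q plot. The sketch below applies shapiro.test() from base R's stats package to the residuals of a model fitted on synthetic data with skewed noise (toy and its values are made up for illustration). A small p-value is evidence against normally distributed residuals. Note that shapiro.test() accepts between 3 and 5000 observations.

```r
# Synthetic data with deliberately skewed (shifted exponential) noise.
set.seed(4)
n <- 300
toy <- data.frame(target_runs = runif(n, 80, 240))
toy$result_margin <- 0.25 * toy$target_runs + (rexp(n, rate = 1 / 20) - 20)

toy_fit <- lm(result_margin ~ target_runs, data = toy)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal.
sw <- shapiro.test(residuals(toy_fit))
print(sw)
```

On the real model, the equivalent call would be shapiro.test(residuals(model)).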

plot(cooks.distance(model), 
     main = "Cook's Distance by Observation",
     xlab = "Observation", 
     ylab = "Cook's Distance")
abline(h = 4/(nrow(data)-length(model$coefficients)-1), col = "red") 

Observations with a high Cook’s distance are potentially influential. Observations that significantly exceed the red threshold line should be examined further and may need to be removed. Numerous or extreme values would suggest that the model is sensitive to specific data points, which could skew the results. Here, influential points are minimal.
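
To examine the points above the line rather than only view them, the flagged rows can be listed directly. The sketch below uses synthetic data with one planted outlier in row 1 (all values are invented) and the same threshold formula as the plot above. Refitting without the flagged rows is shown only as a sensitivity check, not as a recommendation to delete them.

```r
# Synthetic data with one planted high-leverage outlier in row 1.
set.seed(5)
n <- 200
toy <- data.frame(target_runs = runif(n, 80, 240))
toy$result_margin <- 0.25 * toy$target_runs + rnorm(n, sd = 20)
toy$target_runs[1]   <- 400
toy$result_margin[1] <- 300

toy_fit <- lm(result_margin ~ target_runs, data = toy)

# Same rule-of-thumb threshold as in the plot above.
cd        <- cooks.distance(toy_fit)
threshold <- 4 / (nrow(toy) - length(coef(toy_fit)) - 1)
flagged   <- unname(which(cd > threshold))

# Inspect the flagged rows before deciding what to do with them.
print(toy[flagged, ])

# Sensitivity check: compare coefficients with and without the flagged rows.
refit <- lm(result_margin ~ target_runs, data = toy[-flagged, ])
print(rbind(all_rows = coef(toy_fit), without_flagged = coef(refit)))
```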

Overall, these diagnostics suggest that the data needs some cleaning before the model is refit.