my_data <- read.csv('C:/Users/dell/Downloads/Cleaned_Ball_By_Ball.csv')
# Convert categorical variables to factors
my_data$Team_Batting <- as.factor(my_data$Team_Batting)
my_data$Team_Bowling <- as.factor(my_data$Team_Bowling)
my_data$Extra_Type <- as.factor(my_data$Extra_Type)
# Building the linear model
# Let's predict 'Runs_Scored' based on 'Over_id', 'Striker_Batting_Position', and 'Extra_Type'
model <- lm(Runs_Scored ~ Over_id + Striker_Batting_Position + Extra_Type, data = my_data)
# Summary of the model to see the coefficients and statistics
summary(model)
##
## Call:
## lm(formula = Runs_Scored ~ Over_id + Striker_Batting_Position +
## Extra_Type, data = my_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0238 -1.0976 -0.3630 0.0651 6.0881
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.1763065 0.0805970 -2.188 0.0287 *
## Over_id 0.0446378 0.0008946 49.899 < 2e-16 ***
## Striker_Batting_Position -0.0868636 0.0024260 -35.806 < 2e-16 ***
## Extra_TypeByes -0.2248474 0.2825428 -0.796 0.4261
## Extra_Typelegbyes 0.0386980 0.0861635 0.449 0.6533
## Extra_TypeLegbyes -0.0905972 0.1296725 -0.699 0.4848
## Extra_TypeNo Extras 1.3088896 0.0800915 16.342 < 2e-16 ***
## Extra_Typenoballs 1.2623682 0.1028452 12.274 < 2e-16 ***
## Extra_TypeNoballs 1.5256843 0.2617795 5.828 5.62e-09 ***
## Extra_Typepenalty -0.1856202 1.5583803 -0.119 0.9052
## Extra_Typewides 0.0520834 0.0835568 0.623 0.5331
## Extra_TypeWides -0.1147498 0.1102747 -1.041 0.2981
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.556 on 150439 degrees of freedom
## Multiple R-squared: 0.04716, Adjusted R-squared: 0.04709
## F-statistic: 677 on 11 and 150439 DF, p-value: < 2.2e-16
Runs_Scored is modeled as a function of Over_id, Striker_Batting_Position, and Extra_Type.
Residuals: The residuals are the differences between the observed values and the values predicted by the model. The summary gives the minimum, first quartile (25th percentile), median, third quartile (75th percentile), and maximum residual. Here the median is below zero (-0.3630) and the maximum (6.0881) lies about three times as far from zero as the minimum (-2.0238), so the residuals are right-skewed: most observations score slightly less than predicted, and a few score far more.
Coefficients:
Estimate: This is the estimated effect of each variable on the response variable (Runs_Scored). For instance, for every one-unit increase in Over_id, Runs_Scored is expected to increase by 0.0446 runs, holding other variables constant. Each Extra_Type coefficient is a difference from the reference level, which is the one level not listed in the table.
Std. Error: The standard error of the estimate, which measures how much the estimate would vary from sample to sample.
t value: The t-statistic is the coefficient divided by its standard error. It is used to test the null hypothesis that the coefficient equals zero (no effect).
Pr(>|t|): The p-value associated with the t-statistic. A low p-value (typically ≤ 0.05) indicates that you can reject the null hypothesis. In the table, Over_id, Striker_Batting_Position, and three levels of Extra_Type (No Extras, noballs, and Noballs) are significant predictors of Runs_Scored; the remaining Extra_Type levels are not.
Signif. codes: This is a key to the significance levels, with '***' marking p-values below 0.001.
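As a quick check of how the coefficient columns relate, the t value and p-value for Over_id can be recomputed from the printed estimate and standard error. This is a small sketch using the rounded numbers from the table, so the result only matches up to rounding.

```r
# Values copied from the Over_id row of the summary table above
estimate  <- 0.0446378
std_error <- 0.0008946
df_resid  <- 150439  # residual degrees of freedom reported by summary()

# t value = estimate / standard error
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_value), df = df_resid, lower.tail = FALSE)

round(t_value, 1)  # about 49.9, in line with the printed 49.899
p_value < 2e-16    # TRUE
```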
Residual standard error (RSE): This is the estimated standard deviation of the residuals, so it measures how closely the fitted values track the data. In this model, the RSE is 1.556 on 150,439 degrees of freedom, meaning the observed values typically fall about 1.56 runs away from the fitted values.
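The RSE is the square root of the residual sum of squares divided by the residual degrees of freedom. The sketch below shows this on simulated data (the variables `x`, `y`, and `fit` are made up for illustration and are not the cricket data).

```r
set.seed(1)  # simulated data, purely illustrative
n <- 200
x <- runif(n, 1, 20)
y <- 0.5 + 0.05 * x + rnorm(n, sd = 1.5)
fit <- lm(y ~ x)

rss <- sum(residuals(fit)^2)         # residual sum of squares
rse <- sqrt(rss / df.residual(fit))  # divide by n minus the number of coefficients

all.equal(rse, sigma(fit))  # sigma() returns the same quantity
```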
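Multiple R-squared: This is the proportion of the variance in the dependent variable that is explained by the independent variables. Here, it is 0.04716, indicating that approximately 4.7% of the variability in Runs_Scored can be explained by the model. The adjusted R-squared (0.04709) is almost identical because the number of predictors is tiny relative to the number of observations.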
F-statistic: This tests the null hypothesis that all regression coefficients other than the intercept are equal to zero, that is, that the model explains no more than an intercept-only model. The F-statistic is 677 on 11 and 150,439 degrees of freedom, and the very small p-value (< 2.2e-16) means this null hypothesis is rejected: the model as a whole is statistically significant.
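The adjusted R-squared and the F-statistic can both be recovered from the multiple R-squared and the degrees of freedom. The following sketch uses the rounded values printed above, so agreement is only up to rounding.

```r
# Values copied from the summary output above
r2       <- 0.04716
p        <- 11       # number of coefficients excluding the intercept
df_resid <- 150439   # residual degrees of freedom
n        <- df_resid + p + 1  # number of observations used in the fit

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

# Overall F-statistic expressed through R-squared
f_stat <- (r2 / p) / ((1 - r2) / df_resid)

round(adj_r2, 5)  # 0.04709
round(f_stat)     # 677
```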
The model indicates that both Over_id and Striker_Batting_Position are significant predictors of runs scored, as well as several types of extras. However, the low R-squared values suggest that the model does not explain a large portion of the variability in the runs scored. This could be due to the complex nature of cricket where runs scored are influenced by many factors, not all of which are included in the model.
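One data-quality point is visible in the coefficient table: Extra_Type contains levels that differ only in capitalisation (legbyes and Legbyes, noballs and Noballs, wides and Wides), and the model treats each spelling as a separate category. A minimal sketch of one way to merge them, shown on a toy vector rather than the real data:

```r
# Toy vector mimicking the level spellings seen in the coefficient table
extra_type <- c("wides", "Wides", "legbyes", "Legbyes",
                "noballs", "Noballs", "No Extras", "Byes", "penalty")

# Trim whitespace and lower-case the labels before converting to a factor
extra_clean <- factor(tolower(trimws(extra_type)))

nlevels(factor(extra_type))  # 9 distinct spellings
nlevels(extra_clean)         # 6 underlying categories
levels(extra_clean)
```

Applying the same transformation to my_data$Extra_Type before calling lm() would change the coefficients and their standard errors, so the summary above would need to be regenerated.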
# Assuming 'model' is your linear model object from lm()
# Diagnostic plots
par(mfrow=c(2,2)) # Setting up the plotting area for multiple plots
plot(model) # This command produces four plots for diagnostics
## Warning: not plotting observations with leverage one:
## 16404
# Checking for normality of residuals with a QQ plot
qqnorm(model$residuals)
qqline(model$residuals)
# Plotting residuals against fitted values to check for homoscedasticity
plot(model$fitted.values, model$residuals)
abline(h=0, col="red")
# Calculate Variance Inflation Factor (VIF) to check for multicollinearity
library(car)
## Loading required package: carData
vif(model)
## GVIF Df GVIF^(1/(2*Df))
## Over_id 1.600407 1 1.265072
## Striker_Batting_Position 1.603837 1 1.266427
## Extra_Type 1.006109 9 1.000338
# Breusch-Pagan test to check for heteroscedasticity
library(lmtest)
## Warning: package 'lmtest' was built under R version 4.3.2
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.3.2
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
bptest(model)
##
## studentized Breusch-Pagan test
##
## data: model
## BP = 2779, df = 11, p-value < 2.2e-16
# Durbin-Watson test to check for autocorrelation in the residuals
dwtest(model)
##
## Durbin-Watson test
##
## data: model
## DW = 1.9249, p-value < 2.2e-16
## alternative hypothesis: true autocorrelation is greater than 0
Interpreting the diagnostic results:
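Diagnostic Plots: Patterns or systematic structure in the Residuals vs Fitted plot indicate non-linearity or heteroscedasticity. In the Normal Q-Q plot, the residuals fall along a straight line if they are normally distributed. The warning about observation 16404 means that point has a leverage of one; this is likely because it is the only observation in its Extra_Type level, so the model fits it exactly.
Homoscedasticity: The plot of residuals against fitted values should show no clear pattern, such as a funnel shape, if the variance of the residuals is constant (homoscedasticity).
Cook's Distance: Look for points with a Cook's distance that stands out from the rest, which could indicate influential observations.
VIF: Values above 5 or 10 might suggest multicollinearity between the predictors. Here the largest GVIF is about 1.6, so multicollinearity is not a concern.
Breusch-Pagan Test: A significant p-value indicates heteroscedasticity. Here BP = 2779 with p < 2.2e-16, so the assumption of constant residual variance is rejected.
Durbin-Watson Test: Values close to 2 suggest there is no autocorrelation in the residuals, while values departing substantially from 2 indicate autocorrelation. Here DW = 1.9249 is close to 2, yet the p-value is below 2.2e-16: with more than 150,000 observations, even a very small positive autocorrelation (roughly 1 - DW/2 ≈ 0.04) is statistically detectable.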
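Two of the numeric diagnostics can be reproduced by hand. The sketch below first recomputes the GVIF^(1/(2*Df)) column from the vif() table, then computes a Durbin-Watson statistic directly from residuals; the second part uses simulated data (`x`, `y`, `fit` are illustrative names), not the cricket data.

```r
# GVIF^(1/(2*Df)), using values copied from the vif() table above
gvif_over  <- 1.600407^(1 / (2 * 1))  # Over_id: square root of the GVIF
gvif_extra <- 1.006109^(1 / (2 * 9))  # Extra_Type: 9 degrees of freedom
round(c(gvif_over, gvif_extra), 6)

# Durbin-Watson statistic by hand on simulated, independent errors
set.seed(42)
x <- seq_len(300)
y <- 1 + 0.02 * x + rnorm(300)
fit <- lm(y ~ x)

e  <- residuals(fit)
dw <- sum(diff(e)^2) / sum(e^2)  # near 2 when residuals are uncorrelated
dw
```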
For the Over_id coefficient:
The estimate is 0.0446 with a very small p-value (less than 2e-16). This means that for each one-unit increase in the over number, the model predicts an increase of 0.0446 runs scored, holding other variables constant. This is plausible in cricket, since teams tend to score more aggressively as the innings progresses, especially towards the end. The small standard error indicates that this positive relationship is estimated precisely.
This interpretation relies on the model meeting the assumptions of linear regression. The Breusch-Pagan test above already rejects constant variance, so the standard errors and p-values should be treated with some caution. Where diagnostics indicate problems, they can be addressed by transforming variables, considering interaction effects, adding polynomial terms, or using a different type of model.
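As an illustration of one of these remedies, the sketch below adds a quadratic term and compares the two nested models with an F test. The data are simulated with curvature built in on purpose (`over` and `runs` are hypothetical stand-ins), so the result says nothing about whether the cricket data need such a term.

```r
set.seed(7)  # simulated data; the quadratic effect is deliberate
over <- rep(1:20, each = 50)
runs <- 1 + 0.02 * (over - 8)^2 + rnorm(length(over), sd = 1.5)

fit_linear <- lm(runs ~ over)           # straight-line fit
fit_quad   <- lm(runs ~ poly(over, 2))  # adds a quadratic term

# F test of whether the quadratic term improves on the linear fit
cmp <- anova(fit_linear, fit_quad)
cmp
```

A small p-value in the second row of the comparison table would favour the model with the quadratic term.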