my_data <- read.csv('C:/Users/dell/Downloads/Ball_By_Ball.csv')
The “Runs_Scored” column will serve as the response variable. It is an ordered integer giving the number of runs scored on each ball, and it is a key indicator of a team’s performance and of the outcome of the match.
As for the categorical explanatory variable that has high potential to influence the “Runs_Scored,” I will consider “Team_Batting.” The team that is currently batting can significantly impact the runs scored on each ball. Different teams may have different strategies, strengths, and weaknesses, which can influence their scoring patterns. Therefore, “Team_Batting” can be a valuable categorical variable to explore its impact on the runs scored.
To perform an ANOVA test using R, I consider the following null hypothesis (H0) and perform the test:
Null Hypothesis (H0): The mean “Runs_Scored” is the same for all the different categories in the “Team_Batting” variable. In other words, there is no statistically significant difference in the mean runs scored by different teams while batting.
Now, let’s perform the ANOVA test using R and summarize the results:
The ANOVA output will help us to determine whether the choice of team (the categorical variable “Team_Batting”) significantly affects the runs scored (“Runs_Scored”) during the cricket match.
To interpret the results:
If the p-value (Pr(>F)) is less than your chosen significance level, you would reject the null hypothesis. This means that there is a statistically significant difference in the mean runs scored by different teams while batting.
If the p-value is greater than your significance level, you would fail to reject the null hypothesis. This suggests that there is no statistically significant difference in the mean runs scored by different teams.
# We will use the 'aov' function for one-way ANOVA
anova_result <- aov(Runs_Scored ~ Team_Batting, data = my_data)
# Summarize the ANOVA results
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## Team_Batting 20 316 15.78 6.214 <2e-16 ***
## Residuals 150430 382102 2.54
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA results show the following information:
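Df (Degrees of Freedom): There are two degrees-of-freedom components in the analysis: 20 for “Team_Batting” (the number of categories minus one) and 150,430 for “Residuals” (the number of observations minus the number of categories).
Sum Sq (Sum of Squares): The between-group sum of squares for “Team_Batting” is 316, and the residual (within-group) sum of squares (SSE) is 382,102. The total sum of squares is their sum, 382,418.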
Mean Sq (Mean Sum of Squares): This is calculated by dividing the sum of squares by the corresponding degrees of freedom. For “Team_Batting,” it’s 15.78, and for “Residuals,” it’s 2.54.
F value: The F-value is the ratio of the variance explained by the “Team_Batting” variable to the variance unexplained (residual variance). It is calculated as the Mean Sq for “Team_Batting” divided by the Mean Sq for “Residuals.” Using the rounded values shown, 15.78 / 2.54 is about 6.21; R reports 6.214 because it works from the unrounded mean squares.
Pr(>F): This is the p-value associated with the F-value. It is an extremely small value, as indicated by “<2e-16,” which is essentially zero.
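As a check on the table, the F-value and its p-value can be recomputed from the printed sums of squares and degrees of freedom. This is a minimal sketch that hard-codes the rounded figures from the output above, so it reproduces the reported F only approximately; the variable names and the 0.05 significance level are illustrative assumptions.

```r
# Figures copied from the ANOVA table above (rounded as printed)
ss_between <- 316      # Sum Sq for Team_Batting
ss_resid   <- 382102   # Sum Sq for Residuals
df_between <- 20       # Df for Team_Batting
df_resid   <- 150430   # Df for Residuals

# Mean squares are sums of squares divided by their degrees of freedom
ms_between <- ss_between / df_between
ms_resid   <- ss_resid / df_resid

# The F-value is the ratio of the two mean squares
f_value <- ms_between / ms_resid

# The upper tail of the F distribution gives Pr(>F)
p_value <- pf(f_value, df_between, df_resid, lower.tail = FALSE)

# Decision rule at an assumed significance level
alpha <- 0.05
reject_h0 <- p_value < alpha

print(f_value)
print(reject_h0)
```

The recomputed F differs slightly from 6.214 because the sum of squares 316 is itself rounded in the printed table.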
Now, let’s interpret these results:
The degrees of freedom for “Team_Batting” are 20. In a one-way ANOVA this is the number of categories minus one, so the “Team_Batting” column contains 21 distinct values in the data used for this test.
The F-value is 6.214, well above the value of about 1 expected when all group means are equal. With 20 and 150,430 degrees of freedom, this points to a difference in the mean runs scored by different teams while batting.
The extremely small p-value (“<2e-16”) means that, if the null hypothesis were true, an F-value at least this large would almost never be observed.
The “Signif. codes” indicate the level of significance. In this case, the “Pr(>F)” value is so small that it is marked as ’***,’ indicating an extremely high level of significance (p < 0.001).
Based on these results, you would reject the null hypothesis. The data provides strong evidence that the choice of team (the categorical variable “Team_Batting”) has a significant impact on the runs scored (“Runs_Scored”) during the cricket match. In other words, there is a statistically significant difference in the mean runs scored by different teams while batting.
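Statistical significance with roughly 150,000 observations does not by itself say how large the team effect is. One common effect-size summary, eta squared, is the share of the total variation attributable to the grouping variable. The calculation below is a rough sketch that uses the rounded sums of squares from the table, so the result is approximate.

```latex
\eta^2
  = \frac{SS_{\text{Team\_Batting}}}{SS_{\text{Team\_Batting}} + SS_{\text{Residuals}}}
  = \frac{316}{316 + 382102}
  \approx 0.0008
```

On these figures, the batting team accounts for under 0.1% of the ball-by-ball variation in runs scored.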
The ANOVA test results indicate that there is a statistically significant difference in the mean runs scored by different cricket teams while batting. This information is relevant and meaningful for various stakeholders who may be interested in the data:
For example:
1) Team management: For cricket team management, these results suggest that team performance during batting can vary significantly between different teams. This insight can be used to identify areas for improvement and formulate strategies to enhance batting performance. Teams that consistently score higher may want to analyze their successful strategies and maintain their strengths, while teams with lower scores may use this information to focus on weaknesses and make improvements.
2) Coaches and trainers: Coaches and trainers can use this information to tailor their training sessions to address specific areas where teams are underperforming. They can work on improving batting skills and strategies to optimize run-scoring potential.
In summary, the statistically significant difference in mean runs scored by different teams while batting indicates that team selection, strategies, and preparation can have a considerable impact on a team’s performance. It suggests that cricket enthusiasts and stakeholders should consider these variations when assessing team strengths and making decisions related to the sport, such as team management, coaching, and fan engagement.
To find another continuous (or ordered integer) column of data that influence the response variable “Runs_Scored” with a roughly linear relationship, I consider the “Bowler_Wicket” column. This column represents the number of wickets taken by the bowler.
The reasoning behind this choice is that the number of wickets taken by the bowler can significantly impact the performance of the batting team. A bowler who takes more wickets can disrupt the batting order and limit the runs scored by the opposing team. Therefore, there may be a roughly linear relationship between the “Bowler_Wicket” and “Runs_Scored,” where an increase in the number of wickets taken may correspond to fewer runs scored.
Let’s explore this relationship by creating a scatter plot and calculating the correlation coefficient between “Bowler_Wicket” and “Runs_Scored” to determine whether it is roughly linear.
# Create a scatter plot
plot(my_data$Bowler_Wicket, my_data$Runs_Scored,
xlab = "Bowler Wickets", ylab = "Runs Scored",
main = "Scatter Plot of Bowler Wickets vs. Runs Scored")
# Calculate the correlation coefficient
correlation_coefficient <- cor(my_data$Bowler_Wicket, my_data$Runs_Scored)
cat("Correlation Coefficient:", correlation_coefficient, "\n")
## Correlation Coefficient: -0.1651523
The output “Correlation Coefficient: -0.1651523” indicates that the calculated correlation coefficient between “Bowler_Wicket” and “Runs_Scored” is approximately -0.165.
This correlation coefficient is negative, suggesting a weak, negative linear relationship: balls with a higher “Bowler_Wicket” value show a slight tendency toward fewer runs scored by the batting team.
While the relationship is not very strong, it provides some insight into the potential influence of wickets on the runs scored in cricket matches.
# Build a linear regression model
model <- lm(Runs_Scored ~ Bowler_Wicket, data = my_data)
# Summary of the model
summary(model)
##
## Call:
## lm(formula = Runs_Scored ~ Bowler_Wicket, data = my_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2789 -1.2789 -0.2789 0.0000 4.7211
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.278923 0.004147 308.40 <2e-16 ***
## Bowler_Wicket -1.278923 0.019691 -64.95 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.572 on 150449 degrees of freedom
## Multiple R-squared: 0.02728, Adjusted R-squared: 0.02727
## F-statistic: 4219 on 1 and 150449 DF, p-value: < 2.2e-16
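The linear regression results describe the relationship between the predictor variable “Bowler_Wicket” and the response variable “Runs_Scored” (the runs scored in cricket matches). The key figures are an intercept of 1.278923, a “Bowler_Wicket” coefficient of -1.278923 (both with p-values below 2e-16), a residual standard error of 1.572 on 150,449 degrees of freedom, and a multiple R-squared of 0.02728.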
Overall, the results suggest that there is a statistically significant, but weak, negative linear relationship between the number of wickets taken by the bowler and the runs scored in cricket matches. This means that, on average, taking more wickets is associated with a decrease in runs scored, but the predictor “Bowler_Wicket” explains only a small proportion of the variation in runs scored. Other factors not included in the model may also influence runs scored in cricket matches.
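In a simple linear regression with a single predictor, the multiple R-squared equals the square of the Pearson correlation between predictor and response. The short sketch below checks this using only the correlation printed earlier (-0.1651523); the variable names are illustrative.

```r
# Correlation between Bowler_Wicket and Runs_Scored, as printed earlier
r <- -0.1651523

# Squaring it gives the share of variance explained by a one-predictor model
r_squared <- r^2
print(round(r_squared, 5))  # about 0.02728, matching the Multiple R-squared above
```

The agreement confirms that the correlation and the regression describe the same weak association: under 3% of the variation in runs scored.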
To assess the linear regression model, we can perform hypothesis tests, check diagnostic plots, interpret the coefficients, and provide recommendations based on the context of the data. Let’s start with the hypothesis tests:
Hypothesis Tests:
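In a simple linear regression model, the primary hypothesis tests concern the two coefficients:
H0 for the intercept: the intercept is equal to zero.
H0 for “Bowler_Wicket”: the slope is equal to zero, meaning there is no linear relationship between “Bowler_Wicket” and “Runs_Scored.”
These null hypotheses are tested using the t-statistic and p-value reported for each coefficient in the model summary. A p-value below the chosen significance level (for example, 0.05) leads to rejecting the corresponding null hypothesis.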
Diagnostic Plots:
Diagnostic plots help identify issues with the model. Common diagnostic plots include:
1. Residuals vs. Fitted Plot: Checks for non-linearity and heteroscedasticity.
2. Normal Q-Q Plot: Assesses the normality of residuals.
3. Residuals vs. Predictor Plot: Identifies patterns or outliers.
Now, let’s interpret the coefficients:
The coefficient for “Bowler_Wicket” is approximately -1.278923, and it is statistically significant (p-value < 0.05). This means that a one-unit increase in “Bowler_Wicket” is associated with about 1.279 fewer runs scored by the batting team, on average.
The negative coefficient suggests that taking more wickets is associated with a reduction in runs scored. In practical terms, this may imply that stronger bowling performance (more wickets) leads to better control over the batting team’s scoring ability.
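Written out with the estimates from `summary(model)`, the fitted line is shown below. Because the slope is the exact negative of the intercept, the predicted value is 0 when “Bowler_Wicket” equals 1. That pattern would be expected if “Bowler_Wicket” is a 0/1 flag for a wicket falling on that ball. This is an assumption, since the column definition is not shown here; if it holds, the slope is the difference in mean runs between wicket and non-wicket balls rather than an effect per additional wicket.

```latex
\begin{align*}
\widehat{\text{Runs\_Scored}} &= 1.278923 - 1.278923 \times \text{Bowler\_Wicket} \\
\text{Bowler\_Wicket} = 0: \quad \widehat{\text{Runs\_Scored}} &= 1.278923 \\
\text{Bowler\_Wicket} = 1: \quad \widehat{\text{Runs\_Scored}} &= 0
\end{align*}
```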
Recommendations: Based on the model, it might be beneficial for teams to focus on improving their bowling performance by increasing the number of wickets taken. This can help restrict the opposition’s scoring and improve the chances of winning matches.
Here’s how I test the hypotheses, create diagnostic plots, and provide a summary of the results in R:
The results from the hypothesis tests and diagnostic plots will help in evaluating the model’s quality and its suitability for the data at hand.
# Perform hypothesis tests
# Test for the Intercept
summary(model)$coefficients["(Intercept)", "Pr(>|t|)"]
## [1] 0
# Test for Bowler_Wicket
summary(model)$coefficients["Bowler_Wicket", "Pr(>|t|)"]
## [1] 0
# Create diagnostic plots
par(mfrow = c(2, 2))
plot(model)
The p-values for both the Intercept and the “Bowler_Wicket” coefficient are printed as 0. This does not mean the probability is exactly zero; the true values are too small for R to represent as floating-point numbers, which is why the model summary shows them as “<2e-16.”
A p-value this small is expected here because the t-statistics are very large in magnitude (308.40 and -64.95) and the sample contains about 150,000 balls.
This result is strong evidence that the relationship between “Bowler_Wicket” and the runs scored is not zero. It does not mean the relationship itself is strong: the R-squared of 0.02728 shows that “Bowler_Wicket” explains under 3% of the variation in runs scored.
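The printed p-value of 0 can be traced back to the coefficient table. As a rough sketch using the estimate, standard error, and residual degrees of freedom printed by `summary(model)`, the t-statistic is the estimate divided by its standard error, and the two-sided p-value is twice the upper tail of the t distribution. The variable names are illustrative.

```r
# Values copied from the Bowler_Wicket row of the coefficient table
estimate  <- -1.278923
std_error <- 0.019691
df_resid  <- 150449   # residual degrees of freedom

# t-statistic: estimate divided by its standard error
t_stat <- estimate / std_error
print(round(t_stat, 2))  # -64.95, as in the table

# Two-sided p-value: twice the upper-tail probability of |t|
p_value <- 2 * pt(abs(t_stat), df = df_resid, lower.tail = FALSE)
print(p_value)
```

The tail probability for a t-statistic of this size is far smaller than the smallest positive number a double can hold, so the computed p-value underflows.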
To summarize:
The low p-values (< 0.05) for both coefficients indicate that both the Intercept and “Bowler_Wicket” are statistically significant in predicting runs scored. The coefficient for “Bowler_Wicket” is negative, implying that taking more wickets is associated with a decrease in runs scored, and this relationship is statistically significant. Given the significance of the model, it reinforces the idea that improving the bowling performance (measured by the number of wickets taken) can lead to better control over the opposition’s scoring in cricket matches, which may be a valuable insight for teams and strategists.
# Keep only the rows with a recorded striker batting position (drops rows where it is NA)
my_data <- my_data[!is.na(my_data$Striker_Batting_Position), ]
To include another variable in the regression model, I use the “Team_Batting” variable, which was part of the ANOVA test earlier. The reason for including this variable is to examine whether the team’s batting performance, as indicated by “Team_Batting,” has an additional effect on the runs scored beyond the influence of the number of wickets taken by the bowler.
Here’s how I create a multiple linear regression model that includes both “Bowler_Wicket” and “Team_Batting” as predictor variables:
The model now includes two predictor variables: “Bowler_Wicket” and “Team_Batting.” This will help assess how well these variables combined explain the variation in runs scored.
# Build a multiple linear regression model
model_multiple <- lm(Runs_Scored ~ Bowler_Wicket + Team_Batting, data = my_data)
# Summary of the model
summary(model_multiple)
##
## Call:
## lm(formula = Runs_Scored ~ Bowler_Wicket + Team_Batting, data = my_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.3234 -1.2587 -0.2883 0.0041 4.8285
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.238658 0.012655 97.876 < 2e-16 ***
## Bowler_Wicket -1.270161 0.020664 -61.467 < 2e-16 ***
## Team_Batting10 -0.066669 0.024716 -2.697 0.006988 **
## Team_Batting11 0.009123 0.022190 0.411 0.680986
## Team_Batting12 0.049741 0.041408 1.201 0.229661
## Team_Batting13 0.049616 0.038195 1.299 0.193940
## Team_Batting2 0.084786 0.017653 4.803 1.57e-06 ***
## Team_Batting3 0.068704 0.017740 3.873 0.000108 ***
## Team_Batting4 0.053235 0.017694 3.009 0.002624 **
## Team_Batting5 0.023550 0.018354 1.283 0.199472
## Team_Batting6 0.020007 0.017836 1.122 0.261989
## Team_Batting7 0.027074 0.017437 1.553 0.120503
## Team_Batting8 0.027395 0.020771 1.319 0.187209
## Team_Batting9 -0.067190 0.041384 -1.624 0.104471
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.568 on 136576 degrees of freedom
## Multiple R-squared: 0.02743, Adjusted R-squared: 0.02734
## F-statistic: 296.3 on 13 and 136576 DF, p-value: < 2.2e-16
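The coefficient names above (Team_Batting10 through Team_Batting13 listed before Team_Batting2) suggest that R is treating “Team_Batting” as a factor whose levels are sorted as text, with team 1 as the reference level absorbed into the intercept. The sketch below shows that behaviour on made-up team codes, along with one way to pick a different reference level using `relevel()`. The vector `team_codes` is hypothetical and is not taken from the cricket data.

```r
# Hypothetical team codes stored as text
team_codes <- c("1", "2", "10", "3", "1", "10")

# factor() sorts the distinct values as strings, so "10" comes before "2"
team_factor <- factor(team_codes)
print(levels(team_factor))      # "1" "10" "2" "3"

# The first level is the reference in lm(); relevel() moves another level to the front
team_releveled <- relevel(team_factor, ref = "2")
print(levels(team_releveled))   # "2" "1" "10" "3"
```

Each “Team_Batting” coefficient in the output is therefore a difference from the reference team, not a team's own mean.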
I consider adding an interaction between “Bowler_Wicket” and “Team_Batting” to see if the effect of the number of wickets taken by the bowler depends on the team’s batting performance. The interaction term (“Bowler_Wicket * Team_Batting”) will help assess whether the effect of wickets taken by the bowler on runs scored is different for different teams, which can provide insights into the interplay between bowling performance and team-specific batting abilities.
Here is the model with the interaction term:
# Build a model with interaction
model_interaction <- lm(Runs_Scored ~ Bowler_Wicket * Team_Batting, data = my_data)
# Summary of the model
summary(model_interaction)
##
## Call:
## lm(formula = Runs_Scored ~ Bowler_Wicket * Team_Batting, data = my_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.3259 -1.2581 -0.2891 0.0000 4.8334
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.237266 0.012901 95.902 < 2e-16 ***
## Bowler_Wicket -1.237266 0.062733 -19.723 < 2e-16 ***
## Team_Batting10 -0.070342 0.025322 -2.778 0.00547 **
## Team_Batting11 0.009513 0.022680 0.419 0.67490
## Team_Batting12 0.051827 0.042202 1.228 0.21942
## Team_Batting13 0.051944 0.039156 1.327 0.18464
## Team_Batting2 0.088597 0.018045 4.910 9.13e-07 ***
## Team_Batting3 0.071615 0.018114 3.954 7.70e-05 ***
## Team_Batting4 0.055679 0.018099 3.076 0.00210 **
## Team_Batting5 0.024575 0.018764 1.310 0.19030
## Team_Batting6 0.020853 0.018240 1.143 0.25295
## Team_Batting7 0.028259 0.017828 1.585 0.11295
## Team_Batting8 0.028579 0.021265 1.344 0.17898
## Team_Batting9 -0.070710 0.042392 -1.668 0.09532 .
## Bowler_Wicket:Team_Batting10 0.070342 0.116679 0.603 0.54660
## Bowler_Wicket:Team_Batting11 -0.009513 0.109818 -0.087 0.93097
## Bowler_Wicket:Team_Batting12 -0.051827 0.218903 -0.237 0.81285
## Bowler_Wicket:Team_Batting13 -0.051944 0.178117 -0.292 0.77057
## Bowler_Wicket:Team_Batting2 -0.088597 0.087179 -1.016 0.30951
## Bowler_Wicket:Team_Batting3 -0.071615 0.089745 -0.798 0.42488
## Bowler_Wicket:Team_Batting4 -0.055679 0.086150 -0.646 0.51809
## Bowler_Wicket:Team_Batting5 -0.024575 0.090348 -0.272 0.78562
## Bowler_Wicket:Team_Batting6 -0.020853 0.087220 -0.239 0.81104
## Bowler_Wicket:Team_Batting7 -0.028259 0.085658 -0.330 0.74148
## Bowler_Wicket:Team_Batting8 -0.028579 0.099415 -0.287 0.77376
## Bowler_Wicket:Team_Batting9 0.070710 0.195784 0.361 0.71798
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.568 on 136564 degrees of freedom
## Multiple R-squared: 0.02745, Adjusted R-squared: 0.02728
## F-statistic: 154.2 on 25 and 136564 DF, p-value: < 2.2e-16
The multiple linear regression results show the relationship between the response variable “Runs_Scored” and the predictor variables “Bowler_Wicket” and “Team_Batting,” as well as their interaction terms. Here’s an interpretation of the key findings:
Residuals: - The residuals represent the differences between the observed values of “Runs_Scored” and the values predicted by the model. These differences range from approximately -1.3259 to 4.8334. The “Min” value represents the largest negative deviation, and the “Max” value represents the largest positive deviation from the predicted values.
Coefficients:
- Intercept: The intercept (estimated as 1.237266) is the predicted value of “Runs_Scored” for the reference team when “Bowler_Wicket” is zero. The reference team is the one without its own coefficient in the output, which here is team 1.
- Bowler_Wicket: The coefficient for “Bowler_Wicket” (estimated as -1.237266) represents the change in the predicted “Runs_Scored” for a one-unit increase in “Bowler_Wicket” for the reference team.
- Team_Batting: There are multiple coefficients for “Team_Batting,” each representing the difference in predicted “Runs_Scored” between that team and the reference team when “Bowler_Wicket” is zero. Some of these coefficients are statistically significant, while others are not.
- Interaction Terms: Each interaction term between “Bowler_Wicket” and a “Team_Batting” level measures how much the “Bowler_Wicket” effect for that team differs from the effect for the reference team.
Significance:
- The coefficients for teams 2, 3, 4, and 10 are statistically significant with low p-values (< 0.05), indicating that the mean runs scored by these teams differ from those of the reference team.
- None of the interaction terms between “Bowler_Wicket” and “Team_Batting” is statistically significant; the smallest p-value among them is 0.30951.
Residual Standard Error: - The residual standard error (approximately 1.568) measures the spread or variability of the data points around the regression line.
R-squared: - The multiple R-squared value (0.02745) represents the proportion of the variance in “Runs_Scored” that can be explained by the combination of “Bowler_Wicket,” “Team_Batting,” and their interaction terms. In this case, approximately 2.75% of the variation in runs scored can be attributed to these factors.
Adjusted R-squared: - The adjusted R-squared (0.02728) is a modified version of R-squared that adjusts for the number of predictor variables in the model. It is lower than the adjusted R-squared of the model without interactions (0.02734), which indicates that the twelve interaction terms do not improve the fit enough to justify the added parameters.
F-statistic: - The F-statistic (154.2) tests the overall significance of the regression model. In this case, the very low p-value (< 2.2e-16) indicates that the model as a whole is highly significant.
Interpretation:
- The coefficients for “Bowler_Wicket” and the “Team_Batting” variables remain close to those of the earlier model without interactions. As in that model, a few teams differ significantly from the reference team in runs scored.
- None of the interaction terms between “Bowler_Wicket” and “Team_Batting” is statistically significant, suggesting that the association between “Bowler_Wicket” and runs scored does not significantly depend on which team is batting. Each interaction estimate is also the exact negative of the corresponding team coefficient, just as the “Bowler_Wicket” estimate is the exact negative of the intercept, so the fitted value is 0 for every team when “Bowler_Wicket” is 1.
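Reading the interaction rows one at a time involves twelve separate t-tests. A single joint test of whether the interaction terms add anything is available by comparing the additive and interaction models with `anova()`. The sketch below runs on simulated data, not the cricket data; `sim`, `wicket`, `team`, and `runs` are hypothetical names, and the simulated effect sizes are arbitrary.

```r
set.seed(42)
n <- 3000

# Simulated stand-ins: a 0/1 wicket flag and a three-level team factor
sim <- data.frame(
  wicket = rbinom(n, size = 1, prob = 0.05),
  team   = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)

# Runs depend on team and wicket additively (no true interaction is simulated)
sim$runs <- rpois(n, lambda = 1.3 + 0.1 * (sim$team == "B") - 1.0 * sim$wicket)

# Additive model and model with interaction
fit_add <- lm(runs ~ wicket + team, data = sim)
fit_int <- lm(runs ~ wicket * team, data = sim)

# Partial F-test: do the interaction terms jointly improve the fit?
comparison <- anova(fit_add, fit_int)
print(comparison)

# p-value of the joint test sits in the second row
interaction_p <- comparison[["Pr(>F)"]][2]
```

With the models in this report, the equivalent call would be `anova(model_multiple, model_interaction)`, since both were fitted on the same filtered `my_data`.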