library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
data <- read.csv("~/Documents/Rdocs/matches.csv", stringsAsFactors = TRUE)

head(data)
##       id  season       city       date match_type player_of_match
## 1 335982 2007/08  Bangalore 2008-04-18     League     BB McCullum
## 2 335983 2007/08 Chandigarh 2008-04-19     League      MEK Hussey
## 3 335984 2007/08      Delhi 2008-04-19     League     MF Maharoof
## 4 335985 2007/08     Mumbai 2008-04-20     League      MV Boucher
## 5 335986 2007/08    Kolkata 2008-04-20     League       DJ Hussey
## 6 335987 2007/08     Jaipur 2008-04-21     League       SR Watson
##                                        venue                       team1
## 1                      M Chinnaswamy Stadium Royal Challengers Bangalore
## 2 Punjab Cricket Association Stadium, Mohali             Kings XI Punjab
## 3                           Feroz Shah Kotla            Delhi Daredevils
## 4                           Wankhede Stadium              Mumbai Indians
## 5                               Eden Gardens       Kolkata Knight Riders
## 6                     Sawai Mansingh Stadium            Rajasthan Royals
##                         team2                 toss_winner toss_decision
## 1       Kolkata Knight Riders Royal Challengers Bangalore         field
## 2         Chennai Super Kings         Chennai Super Kings           bat
## 3            Rajasthan Royals            Rajasthan Royals           bat
## 4 Royal Challengers Bangalore              Mumbai Indians           bat
## 5             Deccan Chargers             Deccan Chargers           bat
## 6             Kings XI Punjab             Kings XI Punjab           bat
##                        winner  result result_margin target_runs target_overs
## 1       Kolkata Knight Riders    runs           140         223           20
## 2         Chennai Super Kings    runs            33         241           20
## 3            Delhi Daredevils wickets             9         130           20
## 4 Royal Challengers Bangalore wickets             5         166           20
## 5       Kolkata Knight Riders wickets             5         111           20
## 6            Rajasthan Royals wickets             6         167           20
##   super_over method   umpire1        umpire2
## 1          N   <NA> Asad Rauf    RE Koertzen
## 2          N   <NA> MR Benson     SL Shastri
## 3          N   <NA> Aleem Dar GA Pratapkumar
## 4          N   <NA>  SJ Davis      DJ Harper
## 5          N   <NA> BF Bowden    K Hariharan
## 6          N   <NA> Aleem Dar      RB Tiffin

ANOVA Test

Let's use a one-way ANOVA to test whether the result margin differs by city.

# Select the relevant columns and drop rows with missing values
anova_data <- data %>%
  select(result_margin, city) %>%
  na.omit()

anova_model <- aov(result_margin ~ city, data = anova_data)
summary(anova_model)
##              Df Sum Sq Mean Sq F value Pr(>F)
## city         35  20186   576.7    1.25  0.153
## Residuals   992 457610   461.3

Response Variable: The continuous variable of most interest in this dataset is “result_margin”. The result margin reflects how decisively a match was won and potentially the intensity of the match, which makes it relevant to both viewers and team management.

Explanatory Variable: We use “city” as the explanatory variable. It is a categorical variable with many levels (36 cities in the cleaned data), which makes it suitable for a one-way ANOVA. The city could plausibly influence a match, since home conditions help a team that has practised more on that ground (for example, CSK has more wins in matches played in Chennai).
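One point worth checking before modelling: in the head(data) output above, result_margin is counted in runs for some matches and in wickets for others, as recorded in the result column. The sketch below shows one possible way to restrict the analysis to a single unit. It uses a small made-up data frame rather than rows from matches.csv, and the names toy and runs_only are hypothetical.

```r
# Toy data, made up for illustration only (not rows from matches.csv).
toy <- data.frame(
  result        = c("runs", "wickets", "runs", "wickets", "runs", NA),
  result_margin = c(140, 9, 33, 5, 12, NA),
  city          = c("Bangalore", "Delhi", "Chandigarh", "Mumbai", "Bangalore", "Delhi")
)

# Keep only matches won by runs, so every margin is measured in the same unit.
# subset() treats an NA condition as FALSE, so the incomplete row is dropped too.
runs_only <- subset(toy, result == "runs", select = c(result_margin, city))

print(runs_only)
```

The same filter could be applied to data before building anova_data; whether to analyse run margins and wicket margins separately is a modelling choice rather than something the original analysis does.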

The ANOVA test output indicates the following:

Df: Degrees of freedom, which represent the number of independent pieces of information used to estimate a parameter. In this case, there are 35 degrees of freedom for the city factor (36 cities minus 1) and 992 degrees of freedom for the residuals.

Sum Sq: The sum of squares, which measures the variability in the data. For the city factor, the sum of squares is 20186, and for the residuals, it is 457610.

Mean Sq: The mean square, obtained by dividing the sum of squares by the degrees of freedom. It represents the variance explained by the specific factor or residuals. For the city factor, the mean square is 576.7, and for the residuals, it is 461.3.

F value: The F-statistic, which is calculated as the ratio of the mean square for the factor to the mean square for the residuals. It tests whether there is a significant difference between the group means. Here, the F value is 1.25.

Pr(>F): The p-value associated with the F-statistic, which is the probability of observing an F-statistic at least this large if the null hypothesis (no difference between group means) is true. A lower p-value suggests stronger evidence against the null hypothesis. In this case, the p-value is 0.153, which is not small: it is above the conventional 0.05 threshold.
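The columns described above are linked by simple arithmetic. As a minimal check, the sketch below recomputes the mean squares, the F value, and the p-value from the rounded figures printed in the ANOVA table. The variable names are chosen here for illustration, and small rounding differences from the table are expected.

```r
# Values copied from the summary(anova_model) table above (rounded as printed).
ss_city  <- 20186
df_city  <- 35
ss_resid <- 457610
df_resid <- 992

# Mean square = sum of squares / degrees of freedom.
ms_city  <- ss_city / df_city
ms_resid <- ss_resid / df_resid

# F value = mean square of the factor / mean square of the residuals.
f_value <- ms_city / ms_resid

# p-value = upper-tail probability of the F distribution with (35, 992) df.
p_value <- pf(f_value, df_city, df_resid, lower.tail = FALSE)

print(c(ms_city = ms_city, ms_resid = ms_resid, f_value = f_value, p_value = p_value))
```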

Interpretation:

The p-value (0.153) is greater than the conventional significance level of 0.05, so the result is not statistically significant: the data provide only weak evidence against the null hypothesis.

Therefore, we fail to reject the null hypothesis. There is not enough evidence to conclude that the mean result_margin differs across cities.

Home conditions may help a team win a match, but this test does not show that the city of the match affects the result margin.

This result is relevant to team management, players, and the audience: based on this data, the result margin does not appear to depend on the city in which the match is played. Note that failing to reject the null hypothesis is not proof that no difference exists.
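An effect size helps put the non-significant result in context. A common choice for one-way ANOVA is eta-squared, the share of the total sum of squares attributed to the factor. The sketch below computes it from the rounded sums of squares in the table above; it is an added illustration, not part of the original analysis.

```r
# Sums of squares copied from the summary(anova_model) table above.
ss_city  <- 20186
ss_resid <- 457610

# Eta-squared: proportion of total variation in result_margin associated with city.
eta_squared <- ss_city / (ss_city + ss_resid)

print(round(eta_squared, 3))
```

With these figures, city accounts for roughly 4% of the variation in result margin, which fits the conclusion that city is not a useful predictor here.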

Let's visualize the result margin by city.

ggplot(anova_data, aes(x = city, y = result_margin)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = "Result Margin by City", x = "City", y = "Result Margin")

The result margins vary widely within each city, and the boxes overlap heavily. Most cities have a similar median result margin (the line inside each box), so the plot gives no clear basis for linking city to result margin, which is consistent with the ANOVA result.
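A boxplot shows medians, while ANOVA compares means, and some cities may have hosted only a handful of matches. A per-city summary table makes both points easy to check. The sketch below uses a made-up data frame with the same column names as anova_data; the names toy and city_summary are hypothetical.

```r
# Toy data, made up for illustration only (not rows from matches.csv).
toy <- data.frame(
  city          = c("A", "A", "A", "B", "B", "C"),
  result_margin = c(10, 20, 90, 5, 7, 30)
)

# Count, mean, and median of result_margin for each city.
n_by_city      <- tapply(toy$result_margin, toy$city, length)
mean_by_city   <- tapply(toy$result_margin, toy$city, mean)
median_by_city <- tapply(toy$result_margin, toy$city, median)

city_summary <- data.frame(
  city   = names(n_by_city),
  n      = as.vector(n_by_city),
  mean   = as.vector(mean_by_city),
  median = as.vector(median_by_city)
)

print(city_summary)
```

In the toy data, city A has a mean of 40 but a median of 20 because of one large margin, which is the kind of gap between mean and median that a boxplot alone can hide.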

Next, let’s consider a continuous variable that might influence the result margin.

Continuous Explanatory Variable: Let's consider “target_runs”. A higher target puts more pressure on the chasing side, which could lead to a larger result margin.

Now, let’s build a linear regression model on these.

Linear Regression

linear_data <- data %>%
  select(result_margin, target_runs) %>%
  na.omit()

linear_model <- lm(result_margin ~ target_runs, data = linear_data)

summary(linear_model)
## 
## Call:
## lm(formula = result_margin ~ target_runs, data = linear_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.020 -11.980  -5.342   4.951 116.591 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -25.41264    3.08821  -8.229 5.42e-16 ***
## target_runs   0.25738    0.01826  14.096  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.02 on 1074 degrees of freedom
## Multiple R-squared:  0.1561, Adjusted R-squared:  0.1553 
## F-statistic: 198.7 on 1 and 1074 DF,  p-value: < 2.2e-16

The linear regression model output provides the following information:

Residuals: These are the differences between the observed values of the dependent variable (result_margin) and the values predicted by the model. They provide insights into how well the model fits the data.

Coefficients: These represent the estimated coefficients of the linear regression model.

Intercept: The intercept estimate is -25.413. It represents the predicted result margin when the target runs are zero. No real match has a target of zero, so this value is an extrapolation outside the observed data, and its negative sign has no practical meaning.

target runs: The coefficient estimate for target_runs is 0.257. It indicates the average change in the result margin for a one-run increase in target runs.

Significance codes: These indicate the level of significance of each coefficient.

Residual standard error: This represents the standard deviation of the residuals, which provides a measure of the model’s accuracy in predicting the dependent variable.

R-squared: This indicates the proportion of variance in the result margin that is explained by the target runs. In this case, the adjusted R-squared is 0.1553, suggesting that only about 15.5% of the variance in the result margin is explained by the target runs.

F-statistic: This tests the overall significance of the model. The p-value associated with the F-statistic is less than 2.2e-16, indicating that the model is statistically significant.
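In a regression with a single predictor, several of these figures are tied together: the F-statistic equals the square of the slope's t value, and it can also be recovered from R-squared and the residual degrees of freedom. The sketch below checks this using the rounded numbers printed in the summary; the variable names are chosen here for illustration.

```r
# Values copied from the summary(linear_model) output above (rounded as printed).
r_squared <- 0.1561
t_slope   <- 14.096
df_resid  <- 1074

# With one predictor, F = t^2 for the slope coefficient.
f_from_t <- t_slope^2

# Equivalently, F = (R^2 / (1 - R^2)) * residual df when there is one predictor.
f_from_r2 <- (r_squared / (1 - r_squared)) * df_resid

print(c(f_from_t = f_from_t, f_from_r2 = f_from_r2))
```

Both values come out close to the reported F-statistic of 198.7, with small differences due to rounding in the printed output.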

Interpretation: The model explains a small proportion (about 16%) of the variance in the result margin, indicating that target runs alone are not a strong predictor of the result margin.

However, the model is statistically significant (p-value < 2.2e-16), suggesting that there is a significant positive relationship between the two variables.
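To make the difference between "statistically significant" and "strong predictor" concrete, the sketch below plugs two example targets into the fitted equation, using the coefficients and residual standard error printed above. The targets of 150 and 200 are arbitrary values chosen for illustration.

```r
# Coefficients and residual standard error copied from summary(linear_model).
intercept   <- -25.41264
slope       <- 0.25738
residual_se <- 20.02

# Example targets (chosen for illustration only).
targets <- c(150, 200)

# Predicted result margin: intercept + slope * target_runs.
predicted <- intercept + slope * targets

# Expected change in margin for the 50 extra target runs.
margin_change <- diff(predicted)

print(data.frame(target_runs = targets, predicted_margin = round(predicted, 2)))
print(round(margin_change, 2))
```

The predicted margin rises by roughly 13 for 50 extra target runs, while the residual standard error is about 20, so individual matches can fall far from the fitted line even though the slope is clearly non-zero.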

ggplot(data, aes(x = target_runs, y = result_margin)) +
  geom_point(alpha = 0.5, color = "blue") + 
  geom_smooth(method = "lm", se = TRUE, color = "red") + 
  labs(title = "Relationship Between Target runs and Result margin",
       x = "Target runs",y = "Result margins") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 19 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 19 rows containing missing values or values outside the scale range
## (`geom_point()`).

The fitted line shows that matches with higher target runs tend to have larger result margins, although the points are widely scattered around the line.

Conclusion:

City influence: The ANOVA found no significant difference in mean result margin across cities (p = 0.153), so players and team management have little reason to expect larger or smaller margins based on the city alone.

Target runs influence: The relationship between target runs and result margin is statistically significant but weak (R-squared of about 0.16). Higher targets are associated with bigger result margins, so posting a high target may help, but many other factors also affect the margin.

By understanding these relationships, management can make better-informed decisions aimed at winning matches by large margins, which matter in the later stages of the tournament because they raise the team's net run rate.