R Markdown

data <- read.csv('C:/Users/dell/Downloads/Cleaned_Ball_By_Ball.csv')
summary(data)

##     MatcH_id          Over_id         Ball_id        Innings_No   
##  Min.   : 335987   Min.   : 1.00   Min.   :1.000   Min.   :1.000  
##  1st Qu.: 419154   1st Qu.: 5.00   1st Qu.:2.000   1st Qu.:1.000  
##  Median : 548382   Median :10.00   Median :4.000   Median :1.000  
##  Mean   : 636208   Mean   :10.14   Mean   :3.617   Mean   :1.482  
##  3rd Qu.: 829742   3rd Qu.:15.00   3rd Qu.:5.000   3rd Qu.:2.000  
##  Max.   :1082650   Max.   :20.00   Max.   :9.000   Max.   :4.000  
##  Team_Batting       Team_Bowling       Striker_Batting_Position
##  Length:150451      Length:150451      Min.   : 1.000          
##  Class :character   Class :character   1st Qu.: 2.000          
##  Mode  :character   Mode  :character   Median : 3.000          
##                                        Mean   : 3.438          
##                                        3rd Qu.: 5.000          
##                                        Max.   :11.000          
##   Extra_Type         Runs_Scored      Extra_runs          Wides       
##  Length:150451      Min.   :0.000   Min.   :0.00000   Min.   :0.0000  
##  Class :character   1st Qu.:0.000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Mode  :character   Median :1.000   Median :0.00000   Median :0.0000  
##                     Mean   :1.222   Mean   :0.06899   Mean   :0.0375  
##                     3rd Qu.:1.000   3rd Qu.:0.00000   3rd Qu.:0.0000  
##                     Max.   :6.000   Max.   :5.00000   Max.   :5.0000  
##     Legbyes             Byes             Noballs           Penalty       
##  Min.   :0.00000   Min.   :0.000000   Min.   :0.00000   Min.   :0.0e+00  
##  1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.00000   1st Qu.:0.0e+00  
##  Median :0.00000   Median :0.000000   Median :0.00000   Median :0.0e+00  
##  Mean   :0.02223   Mean   :0.004885   Mean   :0.00434   Mean   :3.3e-05  
##  3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:0.00000   3rd Qu.:0.0e+00  
##  Max.   :5.00000   Max.   :4.000000   Max.   :5.00000   Max.   :5.0e+00  
##  Bowler_Extras       Out_type             Caught            Bowled        
##  Min.   :0.00000   Length:150451      Min.   :0.00000   Min.   :0.000000  
##  1st Qu.:0.00000   Class :character   1st Qu.:0.00000   1st Qu.:0.000000  
##  Median :0.00000   Mode  :character   Median :0.00000   Median :0.000000  
##  Mean   :0.04184                      Mean   :0.02907   Mean   :0.009186  
##  3rd Qu.:0.00000                      3rd Qu.:0.00000   3rd Qu.:0.000000  
##  Max.   :5.00000                      Max.   :1.00000   Max.   :1.000000  
##     Run_out              LBW            Retired_hurt         Stumped        
##  Min.   :0.000000   Min.   :0.000000   Min.   :0.00e+00   Min.   :0.000000  
##  1st Qu.:0.000000   1st Qu.:0.000000   1st Qu.:0.00e+00   1st Qu.:0.000000  
##  Median :0.000000   Median :0.000000   Median :0.00e+00   Median :0.000000  
##  Mean   :0.005018   Mean   :0.003024   Mean   :5.98e-05   Mean   :0.001615  
##  3rd Qu.:0.000000   3rd Qu.:0.000000   3rd Qu.:0.00e+00   3rd Qu.:0.000000  
##  Max.   :1.000000   Max.   :1.000000   Max.   :1.00e+00   Max.   :1.000000  
##  caught_and_bowled    hit_wicket       ObstructingFeild  Bowler_Wicket    
##  Min.   :0.000000   Min.   :0.00e+00   Min.   :0.0e+00   Min.   :0.00000  
##  1st Qu.:0.000000   1st Qu.:0.00e+00   1st Qu.:0.0e+00   1st Qu.:0.00000  
##  Median :0.000000   Median :0.00e+00   Median :0.0e+00   Median :0.00000  
##  Mean   :0.001402   Mean   :5.98e-05   Mean   :6.6e-06   Mean   :0.04435  
##  3rd Qu.:0.000000   3rd Qu.:0.00e+00   3rd Qu.:0.0e+00   3rd Qu.:0.00000  
##  Max.   :1.000000   Max.   :1.00e+00   Max.   :1.0e+00   Max.   :1.00000  
##   Match_Date            Season        Striker       Non_Striker   
##  Length:150451      Min.   :2008   Min.   :  1.0   Min.   :  1.0  
##  Class :character   1st Qu.:2010   1st Qu.: 40.0   1st Qu.: 40.0  
##  Mode  :character   Median :2012   Median : 96.0   Median : 96.0  
##                     Mean   :2012   Mean   :136.5   Mean   :135.6  
##                     3rd Qu.:2015   3rd Qu.:208.0   3rd Qu.:208.0  
##                     Max.   :2017   Max.   :497.0   Max.   :497.0  
##      Bowler      Striker_match_SK   StrikerSK     NonStriker_match_SK
##  Min.   :  1.0   Min.   :12694    Min.   :  0.0   Min.   :12694      
##  1st Qu.: 77.0   1st Qu.:16173    1st Qu.: 39.0   1st Qu.:16173      
##  Median :174.0   Median :19672    Median : 95.0   Median :19672      
##  Mean   :194.1   Mean   :19675    Mean   :135.5   Mean   :19675      
##  3rd Qu.:310.0   3rd Qu.:23127    3rd Qu.:207.0   3rd Qu.:23127      
##  Max.   :497.0   Max.   :26685    Max.   :496.0   Max.   :26685      
##  NONStriker_SK   Fielder_match_SK   Fielder_SK      Bowler_match_SK
##  Min.   :  0.0   Min.   :   -1    Min.   : -1.000   Min.   :12697  
##  1st Qu.: 39.0   1st Qu.:   -1    1st Qu.: -1.000   1st Qu.:16175  
##  Median : 95.0   Median :   -1    Median : -1.000   Median :19674  
##  Mean   :134.6   Mean   :  690    Mean   :  4.527   Mean   :19677  
##  3rd Qu.:207.0   3rd Qu.:   -1    3rd Qu.: -1.000   3rd Qu.:23131  
##  Max.   :496.0   Max.   :26680    Max.   :496.000   Max.   :26685  
##    BOWLER_SK     PlayerOut_match_SK BattingTeam_SK   BowlingTeam_SK  
##  Min.   :  0.0   Min.   :   -1.0    Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 76.0   1st Qu.:   -1.0    1st Qu.: 2.000   1st Qu.: 2.000  
##  Median :173.0   Median :   -1.0    Median : 4.000   Median : 4.000  
##  Mean   :193.1   Mean   :  970.3    Mean   : 4.346   Mean   : 4.333  
##  3rd Qu.:309.0   3rd Qu.:   -1.0    3rd Qu.: 6.000   3rd Qu.: 6.000  
##  Max.   :496.0   Max.   :26685.0    Max.   :12.000   Max.   :12.000  
##   Keeper_Catch      Player_out_sk      MatchDateSK      
##  Min.   :0.000000   Min.   : -1.000   Min.   :20080418  
##  1st Qu.:0.000000   1st Qu.:  0.000   1st Qu.:20100411  
##  Median :0.000000   Median :  0.000   Median :20120520  
##  Mean   :0.000432   Mean   :  1.101   Mean   :20125288  
##  3rd Qu.:0.000000   3rd Qu.:  0.000   3rd Qu.:20150420  
##  Max.   :1.000000   Max.   :496.000   Max.   :20170521

Response Variable

Total Runs in an Over: We calculate the total runs scored in each over of each innings. This is a non-negative count, treated here as a numeric response, that includes runs scored off the bat plus any extras. It’s a significant metric in cricket, as it reflects the scoring rate and the effectiveness of the batting team.

Explanatory Variable

Over_id: This is a categorical variable representing the over number in an innings. The over number can influence the total runs scored due to various factors like powerplay restrictions, bowler changes, and the progression of the batting team’s strategy.

Proposed Analysis

The analysis would involve aggregating the total runs scored in each over and then examining how the scoring varies across different overs in an innings. This can provide insights into scoring patterns and strategies employed by teams at different stages of an innings. This approach offers a valuable perspective on the dynamics of a cricket match and can help in understanding how the phase of the game (early, middle, or late overs) influences scoring rates.
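
As a minimal sketch of how over numbers could be mapped to game phases, the snippet below uses `cut()` on the over number. The phase boundaries (overs 1-6, 7-15, 16-20) and the names `over_id`, `phase`, and `phase_counts` are assumptions chosen for illustration; they are not part of the original analysis.

```r
# Hypothetical phase boundaries (an assumption, not from the analysis):
# overs 1-6 = early, 7-15 = middle, 16-20 = late.
over_id <- 1:20

# cut() uses right-closed intervals by default: (0,6], (6,15], (15,20]
phase <- cut(over_id,
             breaks = c(0, 6, 15, 20),
             labels = c("early", "middle", "late"))

# Number of overs that fall into each phase
phase_counts <- table(phase)
print(phase_counts)
```

A phase column built this way could be used as a coarser grouping factor alongside Over_id.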

Hypothesis

Null Hypothesis (H0): The average total runs scored in an over is the same across all overs in an innings. That is, the over number (Over_id) has no effect on the mean total runs scored in that over.

Alternative Hypothesis (H1): The average total runs scored in an over differs for at least one over in an innings. That is, the over number (Over_id) does have an effect on the mean total runs scored in that over.

ANOVA Test in R

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

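# Calculating total runs per over
# Note: the summary() means above suggest Extra_runs may already include Wides,
# Noballs and Legbyes (their means, with Byes and Penalty, sum to about 0.06899).
# If so, this sum counts those extras twice; the output below reflects it as written.
total_runs_per_over <- data %>%
  group_by(MatcH_id, Innings_No, Over_id) %>%
  summarise(Total_Runs = sum(Runs_Scored + Extra_runs + Wides + Noballs + Legbyes), .groups = 'drop')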

# ANOVA Test
anova_result <- aov(Total_Runs ~ as.factor(Over_id), data = total_runs_per_over)
summary(anova_result)

##                       Df Sum Sq Mean Sq F value Pr(>F)    
## as.factor(Over_id)    19  23529  1238.4   57.79 <2e-16 ***
## Residuals          24368 522139    21.4                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
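
One check is worth running before relying on Total_Runs. In the summary() output, the means of Wides (0.0375), Legbyes (0.02223), Byes (0.004885), Noballs (0.00434) and Penalty (3.3e-05) add up to roughly the mean of Extra_runs (0.06899). That suggests, but does not prove, that Extra_runs is already the total of all extras on a ball. The sketch below uses made-up rows (not the real data) to show how this could be verified and how a total that counts each extra once would differ; `balls`, `components`, and the other new names are illustrative.

```r
# Toy ball-by-ball rows (made-up values, for illustration only)
balls <- data.frame(
  Runs_Scored = c(1, 0, 4, 0),
  Extra_runs  = c(0, 1, 0, 2),  # assumed to already total all extras on the ball
  Wides       = c(0, 1, 0, 0),
  Noballs     = c(0, 0, 0, 0),
  Legbyes     = c(0, 0, 0, 2),
  Byes        = c(0, 0, 0, 0),
  Penalty     = c(0, 0, 0, 0)
)

# Does Extra_runs equal the sum of its components on every row?
components <- with(balls, Wides + Noballs + Legbyes + Byes + Penalty)
extras_are_totals <- all(balls$Extra_runs == components)

# Total as computed in the analysis versus a total counting each extra once
total_as_written   <- with(balls, sum(Runs_Scored + Extra_runs + Wides + Noballs + Legbyes))
total_counted_once <- with(balls, sum(Runs_Scored + Extra_runs))

print(c(as_written = total_as_written, counted_once = total_counted_once))
```

If the same check returns TRUE on the real data, `sum(Runs_Scored + Extra_runs)` would be the safer definition of Total_Runs. The ANOVA and regression figures in this report were produced with the formula as originally written.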

Interpretation and results

Df (Degrees of Freedom):

as.factor(Over_id): Df = 19, indicating 20 levels in Over_id, which is consistent with a 20-over (T20) innings. Residuals: Df = 24368, so the test is based on 24,388 over-level observations.

Since the p-value (< 2e-16) is far below 0.05, we reject the null hypothesis (H0). There is statistically significant evidence that the average total runs scored in an over differs among overs in an innings. The large F value (57.79) indicates that the variation in mean runs between overs is large relative to the variation within overs, so the differences are unlikely to be due to random chance. Plausible explanations include batting and bowling strategies changing as the innings progresses, although the ANOVA itself does not identify the cause or show which overs differ.

In summary, the ANOVA test shows that the over number is a statistically significant factor in the total runs scored in an over during a cricket match.
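
Statistical significance does not by itself say how large the effect is. As a rough sketch using only the sums of squares and degrees of freedom reported in the ANOVA table, eta-squared (the share of the variance in Total_Runs associated with the over number) can be computed by hand, and the F value can be reproduced as a consistency check. The variable names are illustrative.

```r
# Values copied from the ANOVA table above
ss_over     <- 23529   # Sum Sq for as.factor(Over_id)
ss_residual <- 522139  # Sum Sq for Residuals
df_over     <- 19
df_residual <- 24368

# Eta-squared: between-group sum of squares over total sum of squares
eta_squared <- ss_over / (ss_over + ss_residual)

# F statistic: mean square between overs over mean square within overs
f_value <- (ss_over / df_over) / (ss_residual / df_residual)

print(round(eta_squared, 3))
print(round(f_value, 2))
```

On these figures eta-squared is about 0.04: the over number has a clearly detectable effect given more than 24,000 observations, but it accounts for only a small share of the variation in runs per over.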

Real-world application of the above result

The results from the ANOVA test, showing a significant effect of the over number (Over_id) on the total runs scored in an over, offer several actionable insights for different stakeholders in the context of cricket. These include cricket teams and coaches, sports analysts, and even bettors or fantasy cricket players. Here’s what it might mean for them:

Cricket Teams and Coaches:

Strategic Planning: Understanding how scoring rates vary by over can help teams plan their batting and bowling strategies more effectively. For instance, identifying overs where scoring is typically high might influence decisions on when to deploy key bowlers or when batsmen should accelerate scoring.

Player Roles: Teams can tailor the roles of players based on these insights. Aggressive batsmen might be better utilized in overs where scoring is typically higher, while more economical bowlers could be saved for overs where batsmen typically score less.

Sports Analysts:

Match Analysis: The insight that certain overs have different scoring patterns can enrich match analyses. Analysts can delve deeper into what happens in these overs, whether it’s due to specific bowlers, batting strategies, or field settings.

Player Evaluation: This data can be used to evaluate players in more nuanced ways, like assessing a bowler’s effectiveness in specific overs or a batsman’s ability to capitalize in high-scoring overs.

Bettors and Fantasy Cricket Players:

Betting Strategies: Bettors can use this information to make informed bets, such as predicting total runs in a match or during specific phases of the game.

Fantasy Team Selection: Fantasy cricket players might select players who are likely to perform better in high-scoring overs, maximizing their points and effectiveness in fantasy leagues.

Visualization of results

To visualize how the total runs scored in an over vary with the over number (Over_id), a boxplot is an effective choice. A boxplot displays the distribution of total runs for each over, highlighting any variations or trends. It’s particularly useful for showing the range, median, and any potential outliers in the data.

library(ggplot2)


# Create a boxplot
ggplot(total_runs_per_over, aes(x = as.factor(Over_id), y = Total_Runs)) +
  geom_boxplot() +
  labs(title = "Total Runs Scored per Over Across Different Overs",
       x = "Over Number",
       y = "Total Runs") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))  # Rotate x-axis labels if needed

Analysis and results

X-axis (Over Number): Each box represents an over in the innings, from the 1st to the 20th over.

Y-axis (Total Runs): This shows the total number of runs scored in each over.

Boxplot Elements:

The bottom and top of the box represent the first quartile (25th percentile) and third quartile (75th percentile), respectively. This indicates the range in which the middle 50% of the data lies for each over. The line within the box indicates the median runs scored for each over, which is the 50th percentile of the data.

The whiskers (lines extending from the top and bottom of each box) extend to the highest value within 1.5 times the interquartile range above the third quartile and to the lowest value within 1.5 times the interquartile range below the first quartile. Points beyond the whiskers are plotted individually as outliers: overs in which the runs scored were unusually high or low compared with the other observations for that over number.
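
The quartile and whisker rules can be worked through on a small example. The sketch below uses a made-up vector of runs (not taken from the dataset) and base R's default quantile definition; the variable names are illustrative.

```r
# Made-up runs for ten overs (illustration only)
runs <- c(2, 4, 5, 6, 7, 8, 9, 10, 12, 30)

# First quartile, median, third quartile
q <- quantile(runs, probs = c(0.25, 0.5, 0.75), names = FALSE)
iqr <- q[3] - q[1]

# Fences at 1.5 * IQR beyond the quartiles
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[3] + 1.5 * iqr

# Whiskers stop at the most extreme observations inside the fences
lower_whisker <- min(runs[runs >= lower_fence])
upper_whisker <- max(runs[runs <= upper_fence])

# Anything outside the fences is drawn as an individual outlier point
outliers <- runs[runs < lower_fence | runs > upper_fence]

print(c(lower = lower_whisker, upper = upper_whisker))
print(outliers)
```

Here the whiskers run from 2 to 12 and the 30-run over is the only outlier, even though it is a genuine (if rare) score rather than an error.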

Variability and Trends:

We can observe that the median runs per over seem to be fairly consistent across different overs, with a slight upward trend in later overs (which may represent the end-of-innings push for runs). The number of outliers (points outside the whiskers) tends to increase in the later overs, suggesting greater variability in runs scored as the game progresses. This might be due to teams either accelerating their scoring rate or losing wickets in an attempt to score quickly.

Strategic Insights:

Teams could potentially exploit this information by planning their strategic moves, such as when to take powerplays, deploy certain bowlers, or accelerate batting. This visualization is useful for cricket analysts, coaches, and players to understand and strategize around the scoring patterns throughout an innings. The data can also be indicative of when teams tend to take more risks or when bowlers are under more pressure.

Building a linear regression model

library(dplyr)

average_runs_per_over <- total_runs_per_over %>%
  group_by(Over_id) %>%
  summarise(Avg_Total_Runs = mean(Total_Runs), .groups = 'drop')

# Build the linear regression model
model <- lm(Avg_Total_Runs ~ Over_id, data = average_runs_per_over)

# Summary of the model to get the coefficients and statistics
summary(model)

## 
## Call:
## lm(formula = Avg_Total_Runs ~ Over_id, data = average_runs_per_over)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8553 -0.5488 -0.1694  0.6348  0.9352 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.98674    0.31209  22.387 1.36e-14 ***
## Over_id      0.13596    0.02605   5.219 5.80e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6718 on 18 degrees of freedom
## Multiple R-squared:  0.6021, Adjusted R-squared:   0.58 
## F-statistic: 27.23 on 1 and 18 DF,  p-value: 5.799e-05

Visualization of the linear regression model

ggplot(average_runs_per_over, aes(x = Over_id, y = Avg_Total_Runs)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(title = "Linear Regression of Average Total Runs on Over Number",
       x = "Over Number",
       y = "Average Total Runs") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

Conclusion

The linear regression model indicates a significant positive relationship between the over number and the average total runs scored (slope = 0.136, p = 5.8e-05). On average, each later over adds about 0.14 runs to the expected runs per over. Because the model is fitted to the 20 per-over averages, it predicts the average runs for a given over number rather than the runs in any individual over. With R-squared = 0.60 and a residual standard error of 0.67 runs, a straight line captures only part of the pattern, and the averages for some overs deviate noticeably from the fitted line.
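
As a small illustration of what the fitted line implies, the sketch below plugs the reported coefficients into the regression equation. The helper `predict_avg_runs` is a hypothetical name introduced here; with the fitted `model` object available, `predict(model, newdata = data.frame(Over_id = c(1, 10, 20)))` would be the usual route.

```r
# Coefficients copied from the model summary above
intercept <- 6.98674
slope     <- 0.13596

# Hypothetical helper: fitted average runs for a given over number
predict_avg_runs <- function(over_id) {
  intercept + slope * over_id
}

overs <- c(1, 10, 20)
predicted <- predict_avg_runs(overs)

print(round(predicted, 2))
```

These values are fitted averages per over number, roughly 7.1 runs for the first over and 9.7 for the twentieth; individual overs vary far more than the line suggests.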

Final Data dive

Sai Dheeraj Kanaparthi

2023-11-29