data <- read.csv('C:/Users/dell/Downloads/Cleaned_Ball_By_Ball.csv')
summary(data)
## MatcH_id Over_id Ball_id Innings_No
## Min. : 335987 Min. : 1.00 Min. :1.000 Min. :1.000
## 1st Qu.: 419154 1st Qu.: 5.00 1st Qu.:2.000 1st Qu.:1.000
## Median : 548382 Median :10.00 Median :4.000 Median :1.000
## Mean : 636208 Mean :10.14 Mean :3.617 Mean :1.482
## 3rd Qu.: 829742 3rd Qu.:15.00 3rd Qu.:5.000 3rd Qu.:2.000
## Max. :1082650 Max. :20.00 Max. :9.000 Max. :4.000
## Team_Batting Team_Bowling Striker_Batting_Position
## Length:150451 Length:150451 Min. : 1.000
## Class :character Class :character 1st Qu.: 2.000
## Mode :character Mode :character Median : 3.000
## Mean : 3.438
## 3rd Qu.: 5.000
## Max. :11.000
## Extra_Type Runs_Scored Extra_runs Wides
## Length:150451 Min. :0.000 Min. :0.00000 Min. :0.0000
## Class :character 1st Qu.:0.000 1st Qu.:0.00000 1st Qu.:0.0000
## Mode :character Median :1.000 Median :0.00000 Median :0.0000
## Mean :1.222 Mean :0.06899 Mean :0.0375
## 3rd Qu.:1.000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :6.000 Max. :5.00000 Max. :5.0000
## Legbyes Byes Noballs Penalty
## Min. :0.00000 Min. :0.000000 Min. :0.00000 Min. :0.0e+00
## 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.0e+00
## Median :0.00000 Median :0.000000 Median :0.00000 Median :0.0e+00
## Mean :0.02223 Mean :0.004885 Mean :0.00434 Mean :3.3e-05
## 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:0.0e+00
## Max. :5.00000 Max. :4.000000 Max. :5.00000 Max. :5.0e+00
## Bowler_Extras Out_type Caught Bowled
## Min. :0.00000 Length:150451 Min. :0.00000 Min. :0.000000
## 1st Qu.:0.00000 Class :character 1st Qu.:0.00000 1st Qu.:0.000000
## Median :0.00000 Mode :character Median :0.00000 Median :0.000000
## Mean :0.04184 Mean :0.02907 Mean :0.009186
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :5.00000 Max. :1.00000 Max. :1.000000
## Run_out LBW Retired_hurt Stumped
## Min. :0.000000 Min. :0.000000 Min. :0.00e+00 Min. :0.000000
## 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.00e+00 1st Qu.:0.000000
## Median :0.000000 Median :0.000000 Median :0.00e+00 Median :0.000000
## Mean :0.005018 Mean :0.003024 Mean :5.98e-05 Mean :0.001615
## 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.00e+00 3rd Qu.:0.000000
## Max. :1.000000 Max. :1.000000 Max. :1.00e+00 Max. :1.000000
## caught_and_bowled hit_wicket ObstructingFeild Bowler_Wicket
## Min. :0.000000 Min. :0.00e+00 Min. :0.0e+00 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.00e+00 1st Qu.:0.0e+00 1st Qu.:0.00000
## Median :0.000000 Median :0.00e+00 Median :0.0e+00 Median :0.00000
## Mean :0.001402 Mean :5.98e-05 Mean :6.6e-06 Mean :0.04435
## 3rd Qu.:0.000000 3rd Qu.:0.00e+00 3rd Qu.:0.0e+00 3rd Qu.:0.00000
## Max. :1.000000 Max. :1.00e+00 Max. :1.0e+00 Max. :1.00000
## Match_Date Season Striker Non_Striker
## Length:150451 Min. :2008 Min. : 1.0 Min. : 1.0
## Class :character 1st Qu.:2010 1st Qu.: 40.0 1st Qu.: 40.0
## Mode :character Median :2012 Median : 96.0 Median : 96.0
## Mean :2012 Mean :136.5 Mean :135.6
## 3rd Qu.:2015 3rd Qu.:208.0 3rd Qu.:208.0
## Max. :2017 Max. :497.0 Max. :497.0
## Bowler Striker_match_SK StrikerSK NonStriker_match_SK
## Min. : 1.0 Min. :12694 Min. : 0.0 Min. :12694
## 1st Qu.: 77.0 1st Qu.:16173 1st Qu.: 39.0 1st Qu.:16173
## Median :174.0 Median :19672 Median : 95.0 Median :19672
## Mean :194.1 Mean :19675 Mean :135.5 Mean :19675
## 3rd Qu.:310.0 3rd Qu.:23127 3rd Qu.:207.0 3rd Qu.:23127
## Max. :497.0 Max. :26685 Max. :496.0 Max. :26685
## NONStriker_SK Fielder_match_SK Fielder_SK Bowler_match_SK
## Min. : 0.0 Min. : -1 Min. : -1.000 Min. :12697
## 1st Qu.: 39.0 1st Qu.: -1 1st Qu.: -1.000 1st Qu.:16175
## Median : 95.0 Median : -1 Median : -1.000 Median :19674
## Mean :134.6 Mean : 690 Mean : 4.527 Mean :19677
## 3rd Qu.:207.0 3rd Qu.: -1 3rd Qu.: -1.000 3rd Qu.:23131
## Max. :496.0 Max. :26680 Max. :496.000 Max. :26685
## BOWLER_SK PlayerOut_match_SK BattingTeam_SK BowlingTeam_SK
## Min. : 0.0 Min. : -1.0 Min. : 0.000 Min. : 0.000
## 1st Qu.: 76.0 1st Qu.: -1.0 1st Qu.: 2.000 1st Qu.: 2.000
## Median :173.0 Median : -1.0 Median : 4.000 Median : 4.000
## Mean :193.1 Mean : 970.3 Mean : 4.346 Mean : 4.333
## 3rd Qu.:309.0 3rd Qu.: -1.0 3rd Qu.: 6.000 3rd Qu.: 6.000
## Max. :496.0 Max. :26685.0 Max. :12.000 Max. :12.000
## Keeper_Catch Player_out_sk MatchDateSK
## Min. :0.000000 Min. : -1.000 Min. :20080418
## 1st Qu.:0.000000 1st Qu.: 0.000 1st Qu.:20100411
## Median :0.000000 Median : 0.000 Median :20120520
## Mean :0.000432 Mean : 1.101 Mean :20125288
## 3rd Qu.:0.000000 3rd Qu.: 0.000 3rd Qu.:20150420
## Max. :1.000000 Max. :496.000 Max. :20170521
Total Runs in an Over: We can calculate the total runs scored in each over. This is a numeric variable (a count, treated as continuous in the analysis below) that includes runs scored from the bat plus any extras. It is a key metric in cricket because it reflects the scoring rate and the effectiveness of the batting team.
Over_id: This is a categorical variable representing the over number in an innings. The over number can influence the total runs scored due to various factors like powerplay restrictions, bowler changes, and the progression of the batting team’s strategy.
The analysis would involve aggregating the total runs scored in each over and then examining how the scoring varies across different overs in an innings. This can provide insights into scoring patterns and strategies employed by teams at different stages of an innings. This approach offers a valuable perspective on the dynamics of a cricket match and can help in understanding how the phase of the game (early, middle, or late overs) influences scoring rates.
Null Hypothesis (H0): The average total runs scored in an over is the same across all overs in an innings. This means that the over number (Over_id) has no significant effect on the total runs scored in that over.
Alternative Hypothesis (H1): The average total runs scored in an over differs among overs in an innings. This implies that the over number (Over_id) does have a significant effect on the total runs scored in that over.
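The aggregation described above can be sketched on a small made-up data frame before it is applied to the full dataset. The six deliveries below are invented for illustration, and the sketch uses base R's `aggregate()` rather than dplyr so that it runs without extra packages. It also checks one assumption against the summary output: the mean of `Extra_runs` (0.06899) matches the sum of the means of `Wides`, `Legbyes`, `Byes`, `Noballs` and `Penalty`, which suggests that `Extra_runs` already contains those components and that adding them again would count them twice.

```r
# Means copied from the summary(data) output above
component_means <- c(Wides = 0.0375, Legbyes = 0.02223, Byes = 0.004885,
                     Noballs = 0.00434, Penalty = 3.3e-05)
extra_runs_mean <- 0.06899
sum(component_means)  # compare with extra_runs_mean

# Toy ball-by-ball data (invented values, same column names as the dataset)
balls <- data.frame(
  MatcH_id    = c(1, 1, 1, 1, 1, 1),
  Innings_No  = c(1, 1, 1, 1, 1, 1),
  Over_id     = c(1, 1, 1, 2, 2, 2),
  Runs_Scored = c(0, 4, 1, 0, 6, 0),
  Wides       = c(1, 0, 0, 0, 0, 0),
  Legbyes     = c(0, 0, 0, 2, 0, 0),
  Noballs     = c(0, 0, 0, 0, 0, 0)
)
# Assumption: Extra_runs is the sum of the individual extras columns
balls$Extra_runs <- balls$Wides + balls$Legbyes + balls$Noballs

# Runs off the bat plus extras, counted once per delivery
balls$Ball_Total <- balls$Runs_Scored + balls$Extra_runs
# The same total with the extras columns added a second time
balls$Double_Counted <- balls$Ball_Total + balls$Wides + balls$Noballs + balls$Legbyes

# Sum both versions within each match, innings and over
per_over <- aggregate(cbind(Ball_Total, Double_Counted) ~ MatcH_id + Innings_No + Over_id,
                      data = balls, FUN = sum)
per_over
```

If the assumption holds for the real data, the gap between the two columns shows how much each over total is inflated when the extras are added twice.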
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
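# Calculating total runs per over
# Note: in the summary above, mean(Extra_runs) equals the summed means of Wides, Legbyes,
# Byes, Noballs and Penalty, so Extra_runs probably already includes them. If so, the extra
# terms below are counted twice and Runs_Scored + Extra_runs is the per-ball total.
total_runs_per_over <- data %>%
  group_by(MatcH_id, Innings_No, Over_id) %>%
  summarise(Total_Runs = sum(Runs_Scored + Extra_runs + Wides + Noballs + Legbyes), .groups = 'drop')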
# ANOVA Test
anova_result <- aov(Total_Runs ~ as.factor(Over_id), data = total_runs_per_over)
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Over_id) 19 23529 1238.4 57.79 <2e-16 ***
## Residuals 24368 522139 21.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
1) Df (Degrees of Freedom):
as.factor(Over_id): Df = 19, indicating 20 different levels in Over_id, one for each of the 20 overs of a T20 innings (Over_id ranges from 1 to 20 in the summary above).
Since the p-value (< 2e-16) is far below 0.05, we reject the null hypothesis (H0): there is statistically significant evidence that the average total runs scored in an over differs among overs in an innings. The F value of 57.79 on 19 and 24,368 degrees of freedom indicates that differences of this size between over means are very unlikely to arise by chance alone. The test shows an association rather than a cause; plausible explanations include batting and bowling strategies changing as the innings progresses.
In summary, the ANOVA test shows that the over number is a significant factor associated with the total runs scored in an over during a cricket match.
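The entries of the ANOVA table can be reproduced by hand from its sums of squares and degrees of freedom, which makes it easier to see where the F value comes from. The sketch below uses only the numbers printed in the table. Eta squared is not part of the original output; it is added here as a common effect-size measure computed from the same sums of squares.

```r
# Values copied from the ANOVA table above
ss_over  <- 23529     # Sum Sq for as.factor(Over_id)
ss_resid <- 522139    # Sum Sq for Residuals
df_over  <- 19        # 20 overs - 1
df_resid <- 24368     # residual degrees of freedom

n_obs    <- df_over + df_resid + 1         # number of over-level observations
ms_over  <- ss_over / df_over              # Mean Sq between overs
ms_resid <- ss_resid / df_resid            # Mean Sq within overs
f_value  <- ms_over / ms_resid             # F value
p_value  <- pf(f_value, df_over, df_resid, lower.tail = FALSE)
eta_sq   <- ss_over / (ss_over + ss_resid) # share of variance attributed to over number

round(c(n_obs = n_obs, F = f_value, eta_sq = eta_sq), 4)
```

The F value matches the table, while eta squared comes out at roughly 0.04. The over number therefore accounts for only about 4% of the variation in over totals: the effect is statistically clear with more than 24,000 overs, but most of the variation lies within overs of the same number.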
The results from the ANOVA test, showing a significant effect of the over number (Over_id) on the total runs scored in an over, offer several actionable insights for different stakeholders in the context of cricket. These include cricket teams and coaches, sports analysts, and even bettors or fantasy cricket players. Here’s what it might mean for them:
Cricket Teams and Coaches:
Strategic Planning: Understanding how scoring rates vary by over can help teams plan their batting and bowling strategies more effectively. For instance, identifying overs where scoring is typically high might influence decisions on when to deploy key bowlers or when batsmen should accelerate scoring.
Player Roles: Teams can tailor the roles of players based on these insights. Aggressive batsmen might be better utilized in overs where scoring is typically higher, while more economical bowlers could be saved for overs where batsmen typically score less.
Sports Analysts:
Match Analysis: The insight that certain overs have different scoring patterns can enrich match analyses. Analysts can delve deeper into what happens in these overs: whether it is due to specific bowlers, batting strategies, or field settings.
Player Evaluation: This data can be used to evaluate players in more nuanced ways, like assessing a bowler’s effectiveness in specific overs or a batsman’s ability to capitalize in high-scoring overs.
Bettors and Fantasy Cricket Players:
Betting Strategies: Bettors can use this information to make informed bets, such as predicting total runs in a match or during specific phases of the game.
Fantasy Team Selection: Fantasy cricket players might select players who are likely to perform better in high-scoring overs, maximizing their points and effectiveness in fantasy leagues.
To visualize how the total runs scored in an over vary with the over number (Over_id), a boxplot is an effective choice. A boxplot displays the distribution of total runs for each over, highlighting any variations or trends. It is particularly useful for showing the range, median, and any potential outliers in the data.
library(ggplot2)
# Create a boxplot
ggplot(total_runs_per_over, aes(x = as.factor(Over_id), y = Total_Runs)) +
  geom_boxplot() +
  labs(title = "Total Runs Scored per Over Across Different Overs",
       x = "Over Number",
       y = "Total Runs") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) # Rotate x-axis labels if needed
## Analysis and results:
X-axis (Over Number): Each box represents an over in the innings, from the 1st to the 20th over.
Y-axis (Total Runs): This shows the total number of runs scored in each over.
Boxplot Elements:
The bottom and top of the box represent the first quartile (25th percentile) and third quartile (75th percentile), respectively. This indicates the range in which the middle 50% of the data lies for each over. The line within the box indicates the median runs scored for each over, which is the 50th percentile of the data.
The whiskers (lines extending from the top and bottom of each box) extend to the highest value within 1.5 times the interquartile range above the third quartile and to the lowest value within 1.5 times the interquartile range below the first quartile. Outliers are drawn as individual points beyond the whiskers. These are over totals that are unusually high or low compared to the rest of the data for that over.
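The boxplot elements described above can be computed directly for a single over. The sketch below uses ten invented over totals and base R's `quantile()` with its default settings, which is assumed here to approximate how the plot derives its quartiles; the variable names are illustrative only.

```r
# Toy over totals for one over number (invented values)
runs <- c(2, 4, 5, 6, 7, 8, 9, 10, 12, 26)

# First quartile, median and third quartile (the box and its middle line)
q   <- quantile(runs, probs = c(0.25, 0.5, 0.75), names = FALSE)
iqr <- q[3] - q[1]

# Fences: 1.5 * IQR beyond the quartiles
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[3] + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences
inside   <- runs[runs >= lower_fence & runs <= upper_fence]
whiskers <- range(inside)

# Anything outside the fences is drawn as an individual point
outliers <- runs[runs < lower_fence | runs > upper_fence]

list(quartiles = q, whiskers = whiskers, outliers = outliers)
```

In this toy example only the 26-run over falls outside the fences, so it would appear as a single point above the upper whisker.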
We can observe that the median runs per over seem to be fairly consistent across different overs, with a slight upward trend in later overs (which may represent the end-of-innings push for runs). The number of outliers (points outside the whiskers) tends to increase in the later overs, suggesting greater variability in runs scored as the game progresses. This might be due to teams either accelerating their scoring rate or losing wickets in an attempt to score quickly.
Strategic Insights:
Teams could potentially exploit this information when planning their strategic moves, such as how to approach the powerplay overs, when to deploy certain bowlers, or when to accelerate batting. This visualization is useful for cricket analysts, coaches, and players to understand and strategize around the scoring patterns throughout an innings. The data can also indicate when teams tend to take more risks or when bowlers are under more pressure.
library(dplyr)
average_runs_per_over <- total_runs_per_over %>%
  group_by(Over_id) %>%
  summarise(Avg_Total_Runs = mean(Total_Runs), .groups = 'drop')
# Build the linear regression model
model <- lm(Avg_Total_Runs ~ Over_id, data = average_runs_per_over)
# Summary of the model to get the coefficients and statistics
summary(model)
##
## Call:
## lm(formula = Avg_Total_Runs ~ Over_id, data = average_runs_per_over)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8553 -0.5488 -0.1694 0.6348 0.9352
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.98674 0.31209 22.387 1.36e-14 ***
## Over_id 0.13596 0.02605 5.219 5.80e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6718 on 18 degrees of freedom
## Multiple R-squared: 0.6021, Adjusted R-squared: 0.58
## F-statistic: 27.23 on 1 and 18 DF, p-value: 5.799e-05
ggplot(average_runs_per_over, aes(x = Over_id, y = Avg_Total_Runs)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(title = "Linear Regression of Average Total Runs on Over Number",
       x = "Over Number",
       y = "Average Total Runs") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Conclusion
The linear regression model indicates a significant positive relationship between the over number and the average total runs scored. As the innings progresses, the average runs scored per over increases by about 0.14 runs for each additional over (slope 0.136, p = 5.8e-05), and the over number explains about 60% of the variation in the 20 over-level averages (R-squared 0.6021). Because the model is fitted to averages, it predicts the average total for a given over number, not the outcome of an individual over. The residuals (up to about 0.9 runs in either direction) show that the relationship is not perfectly linear, so some overs deviate from the fitted line.
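As a worked example of using the fitted line, the sketch below plugs the coefficients printed in the model summary into the regression equation. It does not refit the model; the variable names are illustrative and the coefficients are copied from the output above.

```r
# Coefficients copied from the summary(model) output above
intercept <- 6.98674
slope     <- 0.13596

overs <- 1:20
predicted_avg <- intercept + slope * overs

# Fitted average runs for the first and last over
predicted_avg[c(1, 20)]

# Total change implied by the fitted line between over 1 and over 20
predicted_avg[20] - predicted_avg[1]
```

The fitted line rises from about 7.1 runs in over 1 to about 9.7 runs in over 20, a difference of roughly 2.6 runs. These are predictions of average over totals; individual overs vary far more, since the residual mean square of 21.4 in the ANOVA table corresponds to a standard deviation of about 4.6 runs.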