my_data <- read.csv('C:/Users/dell/Downloads/Ball_By_Ball.csv')
summary(my_data)
## MatcH_id Over_id Ball_id Innings_No
## Min. : 335987 Min. : 1.00 Min. :1.000 Min. :1.000
## 1st Qu.: 419154 1st Qu.: 5.00 1st Qu.:2.000 1st Qu.:1.000
## Median : 548382 Median :10.00 Median :4.000 Median :1.000
## Mean : 636208 Mean :10.14 Mean :3.617 Mean :1.482
## 3rd Qu.: 829742 3rd Qu.:15.00 3rd Qu.:5.000 3rd Qu.:2.000
## Max. :1082650 Max. :20.00 Max. :9.000 Max. :4.000
##
## Team_Batting Team_Bowling Striker_Batting_Position
## Length:150451 Length:150451 Min. : 1.000
## Class :character Class :character 1st Qu.: 2.000
## Mode :character Mode :character Median : 3.000
## Mean : 3.584
## 3rd Qu.: 5.000
## Max. :11.000
## NA's :13861
## Extra_Type Runs_Scored Extra_runs Wides
## Length:150451 Min. :0.000 Min. :0.00000 Min. :0.0000
## Class :character 1st Qu.:0.000 1st Qu.:0.00000 1st Qu.:0.0000
## Mode :character Median :1.000 Median :0.00000 Median :0.0000
## Mean :1.222 Mean :0.06899 Mean :0.0375
## 3rd Qu.:1.000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :6.000 Max. :5.00000 Max. :5.0000
##
## Legbyes Byes Noballs Penalty
## Min. :0.00000 Min. :0.000000 Min. :0.00000 Min. :0.0e+00
## 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.0e+00
## Median :0.00000 Median :0.000000 Median :0.00000 Median :0.0e+00
## Mean :0.02223 Mean :0.004885 Mean :0.00434 Mean :3.3e-05
## 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:0.0e+00
## Max. :5.00000 Max. :4.000000 Max. :5.00000 Max. :5.0e+00
##
## Bowler_Extras Out_type Caught Bowled
## Min. :0.00000 Length:150451 Min. :0.00000 Min. :0.000000
## 1st Qu.:0.00000 Class :character 1st Qu.:0.00000 1st Qu.:0.000000
## Median :0.00000 Mode :character Median :0.00000 Median :0.000000
## Mean :0.04184 Mean :0.02907 Mean :0.009186
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :5.00000 Max. :1.00000 Max. :1.000000
##
## Run_out LBW Retired_hurt Stumped
## Min. :0.000000 Min. :0.000000 Min. :0.00e+00 Min. :0.000000
## 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.00e+00 1st Qu.:0.000000
## Median :0.000000 Median :0.000000 Median :0.00e+00 Median :0.000000
## Mean :0.005018 Mean :0.003024 Mean :5.98e-05 Mean :0.001615
## 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.00e+00 3rd Qu.:0.000000
## Max. :1.000000 Max. :1.000000 Max. :1.00e+00 Max. :1.000000
##
## caught_and_bowled hit_wicket ObstructingFeild Bowler_Wicket
## Min. :0.000000 Min. :0.00e+00 Min. :0.0e+00 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.00e+00 1st Qu.:0.0e+00 1st Qu.:0.00000
## Median :0.000000 Median :0.00e+00 Median :0.0e+00 Median :0.00000
## Mean :0.001402 Mean :5.98e-05 Mean :6.6e-06 Mean :0.04435
## 3rd Qu.:0.000000 3rd Qu.:0.00e+00 3rd Qu.:0.0e+00 3rd Qu.:0.00000
## Max. :1.000000 Max. :1.00e+00 Max. :1.0e+00 Max. :1.00000
##
## Match_Date Season Striker Non_Striker
## Length:150451 Min. :2008 Min. : 1.0 Min. : 1.0
## Class :character 1st Qu.:2010 1st Qu.: 40.0 1st Qu.: 40.0
## Mode :character Median :2012 Median : 96.0 Median : 96.0
## Mean :2012 Mean :136.5 Mean :135.6
## 3rd Qu.:2015 3rd Qu.:208.0 3rd Qu.:208.0
## Max. :2017 Max. :497.0 Max. :497.0
##
## Bowler Player_Out Fielders Striker_match_SK
## Min. : 1.0 Min. : 1.0 Min. : 1.0 Min. :12694
## 1st Qu.: 77.0 1st Qu.: 41.0 1st Qu.: 47.0 1st Qu.:16173
## Median :174.0 Median :107.0 Median :111.0 Median :19672
## Mean :194.1 Mean :148.6 Mean :155.4 Mean :19675
## 3rd Qu.:310.0 3rd Qu.:236.0 3rd Qu.:237.5 3rd Qu.:23127
## Max. :497.0 Max. :497.0 Max. :497.0 Max. :26685
## NA's :143013 NA's :145100
## StrikerSK NonStriker_match_SK NONStriker_SK Fielder_match_SK
## Min. : 0.0 Min. :12694 Min. : 0.0 Min. : -1
## 1st Qu.: 39.0 1st Qu.:16173 1st Qu.: 39.0 1st Qu.: -1
## Median : 95.0 Median :19672 Median : 95.0 Median : -1
## Mean :135.5 Mean :19675 Mean :134.6 Mean : 690
## 3rd Qu.:207.0 3rd Qu.:23127 3rd Qu.:207.0 3rd Qu.: -1
## Max. :496.0 Max. :26685 Max. :496.0 Max. :26680
##
## Fielder_SK Bowler_match_SK BOWLER_SK PlayerOut_match_SK
## Min. : -1.000 Min. :12697 Min. : 0.0 Min. : -1.0
## 1st Qu.: -1.000 1st Qu.:16175 1st Qu.: 76.0 1st Qu.: -1.0
## Median : -1.000 Median :19674 Median :173.0 Median : -1.0
## Mean : 4.527 Mean :19677 Mean :193.1 Mean : 970.3
## 3rd Qu.: -1.000 3rd Qu.:23131 3rd Qu.:309.0 3rd Qu.: -1.0
## Max. :496.000 Max. :26685 Max. :496.0 Max. :26685.0
##
## BattingTeam_SK BowlingTeam_SK Keeper_Catch Player_out_sk
## Min. : 0.000 Min. : 0.000 Min. :0.000000 Min. : -1.000
## 1st Qu.: 2.000 1st Qu.: 2.000 1st Qu.:0.000000 1st Qu.: 0.000
## Median : 4.000 Median : 4.000 Median :0.000000 Median : 0.000
## Mean : 4.346 Mean : 4.333 Mean :0.000432 Mean : 1.101
## 3rd Qu.: 6.000 3rd Qu.: 6.000 3rd Qu.:0.000000 3rd Qu.: 0.000
## Max. :12.000 Max. :12.000 Max. :1.000000 Max. :496.000
##
## MatchDateSK
## Min. :20080418
## 1st Qu.:20100411
## Median :20120520
## Mean :20125288
## 3rd Qu.:20150420
## Max. :20170521
##
Hypothesis 1: “Is there a significant difference in the mean number of extras (wides, leg byes, byes, and no-balls) scored by different teams in the same match?”
Null Hypothesis (H0): There is no significant difference in the mean number of extras scored by different teams in the same match.
Alternative Hypothesis (H1): There is a significant difference in the mean number of extras scored by different teams in the same match.
Alpha (α) is the probability of making a Type I error (rejecting the null hypothesis when it is true). Let α = 0.05.
Power (1 - β) is the probability of correctly rejecting a false null hypothesis (detecting a true effect). Let power = 0.8.
Minimum meaningful effect size: I base the minimum effect size for this hypothesis on a difference of at least 2 extras conceded per over between the teams.
In cricket, a wide or no-ball adds a 1-run penalty to the batting team’s total, charged against the bowling team; byes and leg byes add the runs the batters complete.
A difference of 2 extras per over would mean the two teams concede extras at very different rates while bowling.
In a typical innings of 20 overs, that is 2 × 20 = 40 more extras conceded by Team B than by Team A.
With 10 wickets in a typical innings, this equates to around 4 more extras conceded per wicket by Team B compared to Team A.
So for this hypothesis, the minimum meaningful effect size is a difference of 4 extras conceded per wicket between the teams, which reflects a substantial gap of at least 2 extras per over in bowling accuracy.
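The effect-size arithmetic above can be written out as a short sketch. The 20 overs and 10 wickets are the "typical innings" figures assumed in the text, and the per-ball conversion (6 legal balls per over) is an added assumption that puts the threshold on the same scale as the ball-by-ball rows.

```r
# "Typical" T20 innings figures taken from the text (assumptions, not data)
extras_per_over_diff <- 2    # minimum meaningful difference, extras per over
overs_per_innings    <- 20
wickets_per_innings  <- 10

# Difference accumulated over a whole innings, and expressed per wicket
extras_diff_per_innings <- extras_per_over_diff * overs_per_innings
extras_diff_per_wicket  <- extras_diff_per_innings / wickets_per_innings

# Same threshold on a per-ball scale (assumes 6 legal balls per over)
extras_diff_per_ball <- extras_per_over_diff / 6

cat("Per innings:", extras_diff_per_innings, "\n")
cat("Per wicket: ", extras_diff_per_wicket, "\n")
cat("Per ball:   ", round(extras_diff_per_ball, 3), "\n")
```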
alpha <- 0.05
# Subset the columns used in this test
subset_data <- my_data[, c("Runs_Scored", "Striker_Batting_Position", "Innings_No")]
# Group runs per ball by the striker's batting position
grouped_data <- split(subset_data$Runs_Scored, subset_data$Striker_Batting_Position)
# Welch two-sample t-test on runs per ball for batting positions 1 and 2
# (note: this compares runs by batting position, not extras by team)
t_test_result <- t.test(grouped_data$`1`, grouped_data$`2`)
print(t_test_result)
##
## Welch Two Sample t-test
##
## data: grouped_data$`1` and grouped_data$`2`
## t = -0.32117, df = 51025, p-value = 0.7481
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.03291616 0.02364746
## sample estimates:
## mean of x mean of y
## 1.216989 1.221623
# Checks the p-value to determine statistical significance
if (t_test_result$p.value < alpha) {
  cat("Reject the null hypothesis: There is a significant difference in mean runs scored.")
} else {
  cat("Fail to reject the null hypothesis: There is no significant difference in mean runs scored.")
}
## Fail to reject the null hypothesis: There is no significant difference in mean runs scored.
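A test aimed more directly at Hypothesis 1 would total the extras for each innings of a match and then compare the two innings of the same match as a pair. The following is only a sketch under stated assumptions: `toy` is a hypothetical, made-up stand-in for `my_data` that reuses the column names `MatcH_id`, `Innings_No`, and `Extra_runs` from the summary output, and the paired t-test is one reasonable choice rather than the analysis actually run here.

```r
# Hypothetical toy data standing in for my_data (all values are made up)
toy <- data.frame(
  MatcH_id   = rep(1:4, each = 6),
  Innings_No = rep(rep(1:2, each = 3), times = 4),
  Extra_runs = c(0, 1, 0,  2, 0, 1,
                 1, 0, 0,  0, 0, 1,
                 0, 0, 2,  1, 1, 1,
                 0, 1, 1,  0, 0, 0)
)

# Total extras per match and innings (each innings is one batting team)
extras_totals <- aggregate(Extra_runs ~ MatcH_id + Innings_No,
                           data = toy, FUN = sum)

# One row per match, one column per innings
wide <- reshape(extras_totals, idvar = "MatcH_id",
                timevar = "Innings_No", direction = "wide")

# Paired t-test: the two innings of the same match form a pair
paired_result <- t.test(wide$Extra_runs.1, wide$Extra_runs.2, paired = TRUE)
print(paired_result)
```

With the real data, `toy` would be replaced by `my_data`, restricted to innings 1 and 2.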
Hypothesis 2: “Is there a significant difference in the mean runs scored by players in different innings (1st innings and 2nd innings) of the match?”
Null Hypothesis (H0): There is no significant difference in the mean runs scored by players in the 1st innings and 2nd innings of the match.
Alternative Hypothesis (H1): There is a significant difference in the mean runs scored by players in the 1st innings and 2nd innings of the match.
Alpha (α) is the probability of making a Type I error (rejecting the null hypothesis when it is true). Let α = 0.05.
Power (1 - β) is the probability of correctly rejecting a false null hypothesis (detecting a true effect). Let power = 0.8.
Minimum meaningful effect size: For a typical innings of 20 overs, I base the minimum effect size on a difference of at least 2 runs per over.
In cricket, an over is a set of 6 legal balls bowled by one bowler.
A difference of 2 runs per over means 2 × 20 = 40 more runs scored on average in one innings than in the other.
In a typical innings of 20 overs, for example, one innings at 5 runs per over totals 100 runs and the other at 7 runs per over totals 140 runs.
The difference in total runs between the innings is 140 - 100 = 40 runs.
With about 5 wickets in a 20-over innings, this equates to around 8 more runs scored per wicket in the 2nd innings compared to the 1st.
So with a typical 20-over innings, the minimum meaningful effect size for this test is a difference of 8 runs per wicket between the 1st and 2nd innings. This effect size reflects an average increase of at least 2 runs scored per over between the two innings in a 20-over match.
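The chosen α, power, and minimum effect size can be combined into a rough sample-size check with `power.t.test()` from base R. This is a sketch, not part of the original analysis: the per-ball conversion assumes 6 legal balls per over, and `assumed_sd` is a placeholder value that should be replaced with `sd(my_data$Runs_Scored)`.

```r
# Minimum meaningful effect from the text, converted to a per-ball scale
runs_per_over_diff <- 2
delta_per_ball <- runs_per_over_diff / 6   # about 0.33 runs per ball

# ASSUMPTION: placeholder standard deviation of runs per ball;
# replace with sd(my_data$Runs_Scored) when the data is loaded
assumed_sd <- 1.6

# Balls needed per innings group to detect delta_per_ball
plan <- power.t.test(delta = delta_per_ball, sd = assumed_sd,
                     sig.level = 0.05, power = 0.8,
                     type = "two.sample", alternative = "two.sided")
print(plan)

n_per_group <- ceiling(plan$n)
cat("Balls needed per group:", n_per_group, "\n")
```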
alpha <- 0.05
# Subset the columns used in this test
subset_data <- my_data[, c("Innings_No", "Runs_Scored")]
# Separate runs per ball for the 1st and 2nd innings
inning1_data <- subset_data[subset_data$Innings_No == 1, "Runs_Scored"]
inning2_data <- subset_data[subset_data$Innings_No == 2, "Runs_Scored"]
# Welch two-sample t-test (the default for t.test)
t_test_result <- t.test(inning1_data, inning2_data)
print(t_test_result)
##
## Welch Two Sample t-test
##
## data: inning1_data and inning2_data
## t = 2.6118, df = 149616, p-value = 0.009009
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.005360813 0.037601904
## sample estimates:
## mean of x mean of y
## 1.232108 1.210627
# Check the p-value to determine statistical significance
if (t_test_result$p.value < alpha) {
  cat("Reject the null hypothesis: There is a significant difference in mean runs scored.")
} else {
  cat("Fail to reject the null hypothesis: There is no significant difference in mean runs scored.")
}
## Reject the null hypothesis: There is a significant difference in mean runs scored.
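Because each row of the data is one ball, the means in this output are runs per ball, so it may help to put the observed difference next to the minimum meaningful effect. The sketch below copies the numbers from the printed t-test output and assumes the 2 runs per over threshold converts to a per-ball value by dividing by 6.

```r
# Values copied from the t-test output above
mean_inning1 <- 1.232108
mean_inning2 <- 1.210627
ci <- c(0.005360813, 0.037601904)

# Observed difference in runs per ball, and the assumed minimum effect
observed_diff <- mean_inning1 - mean_inning2
min_effect    <- 2 / 6   # 2 runs per over expressed per ball

cat("Observed difference per ball:", round(observed_diff, 4), "\n")
cat("Observed difference per over:", round(observed_diff * 6, 3), "\n")
cat("CI upper bound below minimum effect:", ci[2] < min_effect, "\n")
```

Under these assumptions the difference is statistically significant but well below the stated minimum meaningful effect, which is common with samples of this size.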
Null Hypothesis (H0): There is no significant difference in the mean number of extras scored by different teams in the same match.
Fisher’s exact test is not an appropriate test for comparing the mean number of extras scored by different teams in the same match. It assesses the independence or association between two categorical variables, typically in a contingency table, and is most often used when the sample size is small. My hypothesis instead involves a numeric outcome (the number of extras scored).
In summary, Fisher’s exact test is a valuable tool for categorical data but is not suitable for comparing group means, as my hypothesis requires. An independent samples t-test or a non-parametric equivalent is more appropriate here.

Null Hypothesis (H0): There is no significant difference in the mean runs scored by players in the 1st innings and 2nd innings of the match.
Fisher’s exact test is used for contingency tables of categorical data. My hypothesis compares the mean runs scored in different innings, which is a numeric variable, so an independent samples t-test is more appropriate.
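For contrast, here is the kind of question Fisher’s exact test does suit: whether a categorical outcome (a wicket falling on a ball or not) is associated with a categorical group (the innings). The counts below are hypothetical and made up for illustration; with the real data they could be tabulated from a 0/1 column such as `Bowler_Wicket` against `Innings_No`.

```r
# Hypothetical 2x2 contingency table (made-up counts, not from my_data):
# rows are innings, columns are whether a wicket fell on the ball
wicket_table <- matrix(c(3, 17,
                         8, 12),
                       nrow = 2, byrow = TRUE,
                       dimnames = list(Innings = c("1st", "2nd"),
                                       Wicket  = c("Yes", "No")))
print(wicket_table)

# Fisher's exact test of independence between innings and wicket outcome
fisher_result <- fisher.test(wicket_table)
print(fisher_result)
```

Both variables here are categories, which is what separates this setting from the comparison of means in the two hypotheses.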
# Visualisation for Hypothesis 2 (t-test): runs scored per ball by innings
# Create a side-by-side boxplot
boxplot(inning1_data, inning2_data, names = c("1st Innings", "2nd Innings"),
        col = c("lightblue", "lightgreen"), main = "Runs Scored in 1st and 2nd Innings",
        ylab = "Runs Scored")
# Mark the comparison with an asterisk if the test is significant
if (t_test_result$p.value < 0.05) {
  text(1.5, max(inning1_data, inning2_data, na.rm = TRUE), "*", cex = 2)
}
# Visualisation of the t-test p-value
# (t_test_result holds the Hypothesis 2 innings test at this point)
bar_mid <- barplot(c(t_test_result$p.value, 1 - t_test_result$p.value),
                   names.arg = c("p-value", "1 - p-value"),
                   col = c("lightblue", "lightgreen"),
                   main = "T-Test p-value for Runs Scored by Innings",
                   ylab = "Probability")
# Mark the p-value bar with an asterisk if the test is significant
if (t_test_result$p.value < 0.05) {
  text(bar_mid[1], t_test_result$p.value + 0.02, "*", cex = 2)
}
No visualisation of Fisher’s exact test is given for my hypotheses, because Fisher’s exact test is not a suitable test for them; a t-test is the better way to explore and test my hypotheses.