week_7

Reading the ball-by-ball data set

my_data <- read.csv('C:/Users/dell/Downloads/Ball_By_Ball.csv')
summary(my_data)

##     MatcH_id          Over_id         Ball_id        Innings_No   
##  Min.   : 335987   Min.   : 1.00   Min.   :1.000   Min.   :1.000  
##  1st Qu.: 419154   1st Qu.: 5.00   1st Qu.:2.000   1st Qu.:1.000  
##  Median : 548382   Median :10.00   Median :4.000   Median :1.000  
##  Mean   : 636208   Mean   :10.14   Mean   :3.617   Mean   :1.482  
##  3rd Qu.: 829742   3rd Qu.:15.00   3rd Qu.:5.000   3rd Qu.:2.000  
##  Max.   :1082650   Max.   :20.00   Max.   :9.000   Max.   :4.000  
##                                                                   
##  Team_Batting       Team_Bowling       Striker_Batting_Position
##  Length:150451      Length:150451      Min.   : 1.000          
##  Class :character   Class :character   1st Qu.: 2.000          
##  Mode  :character   Mode  :character   Median : 3.000          
##                                        Mean   : 3.584          
##                                        3rd Qu.: 5.000          
##                                        Max.   :11.000          
##                                        NA's   :13861           
##   Extra_Type         Runs_Scored      Extra_runs          Wides       
##  Length:150451      Min.   :0.000   Min.   :0.00000   Min.   :0.0000  
##  Class :character   1st Qu.:0.000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Mode  :character   Median :1.000   Median :0.00000   Median :0.0000  
##                     Mean   :1.222   Mean   :0.06899   Mean   :0.0375  
##                     3rd Qu.:1.000   3rd Qu.:0.00000   3rd Qu.:0.0000  
##                     Max.   :6.000   Max.   :5.00000   Max.   :5.0000  
##                                                                       
##     Legbyes             Byes             Noballs           Penalty       
##  Min.   :0.00000   Min.   :0.000000   Min.   :0.00000   Min.   :0.0e+00  
##  1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.00000   1st Qu.:0.0e+00  
##  Median :0.00000   Median :0.000000   Median :0.00000   Median :0.0e+00  
##  Mean   :0.02223   Mean   :0.004885   Mean   :0.00434   Mean   :3.3e-05  
##  3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:0.00000   3rd Qu.:0.0e+00  
##  Max.   :5.00000   Max.   :4.000000   Max.   :5.00000   Max.   :5.0e+00  
##                                                                          
##  Bowler_Extras       Out_type             Caught            Bowled        
##  Min.   :0.00000   Length:150451      Min.   :0.00000   Min.   :0.000000  
##  1st Qu.:0.00000   Class :character   1st Qu.:0.00000   1st Qu.:0.000000  
##  Median :0.00000   Mode  :character   Median :0.00000   Median :0.000000  
##  Mean   :0.04184                      Mean   :0.02907   Mean   :0.009186  
##  3rd Qu.:0.00000                      3rd Qu.:0.00000   3rd Qu.:0.000000  
##  Max.   :5.00000                      Max.   :1.00000   Max.   :1.000000  
##                                                                           
##     Run_out              LBW            Retired_hurt         Stumped        
##  Min.   :0.000000   Min.   :0.000000   Min.   :0.00e+00   Min.   :0.000000  
##  1st Qu.:0.000000   1st Qu.:0.000000   1st Qu.:0.00e+00   1st Qu.:0.000000  
##  Median :0.000000   Median :0.000000   Median :0.00e+00   Median :0.000000  
##  Mean   :0.005018   Mean   :0.003024   Mean   :5.98e-05   Mean   :0.001615  
##  3rd Qu.:0.000000   3rd Qu.:0.000000   3rd Qu.:0.00e+00   3rd Qu.:0.000000  
##  Max.   :1.000000   Max.   :1.000000   Max.   :1.00e+00   Max.   :1.000000  
##                                                                             
##  caught_and_bowled    hit_wicket       ObstructingFeild  Bowler_Wicket    
##  Min.   :0.000000   Min.   :0.00e+00   Min.   :0.0e+00   Min.   :0.00000  
##  1st Qu.:0.000000   1st Qu.:0.00e+00   1st Qu.:0.0e+00   1st Qu.:0.00000  
##  Median :0.000000   Median :0.00e+00   Median :0.0e+00   Median :0.00000  
##  Mean   :0.001402   Mean   :5.98e-05   Mean   :6.6e-06   Mean   :0.04435  
##  3rd Qu.:0.000000   3rd Qu.:0.00e+00   3rd Qu.:0.0e+00   3rd Qu.:0.00000  
##  Max.   :1.000000   Max.   :1.00e+00   Max.   :1.0e+00   Max.   :1.00000  
##                                                                           
##   Match_Date            Season        Striker       Non_Striker   
##  Length:150451      Min.   :2008   Min.   :  1.0   Min.   :  1.0  
##  Class :character   1st Qu.:2010   1st Qu.: 40.0   1st Qu.: 40.0  
##  Mode  :character   Median :2012   Median : 96.0   Median : 96.0  
##                     Mean   :2012   Mean   :136.5   Mean   :135.6  
##                     3rd Qu.:2015   3rd Qu.:208.0   3rd Qu.:208.0  
##                     Max.   :2017   Max.   :497.0   Max.   :497.0  
##                                                                   
##      Bowler        Player_Out        Fielders      Striker_match_SK
##  Min.   :  1.0   Min.   :  1.0    Min.   :  1.0    Min.   :12694   
##  1st Qu.: 77.0   1st Qu.: 41.0    1st Qu.: 47.0    1st Qu.:16173   
##  Median :174.0   Median :107.0    Median :111.0    Median :19672   
##  Mean   :194.1   Mean   :148.6    Mean   :155.4    Mean   :19675   
##  3rd Qu.:310.0   3rd Qu.:236.0    3rd Qu.:237.5    3rd Qu.:23127   
##  Max.   :497.0   Max.   :497.0    Max.   :497.0    Max.   :26685   
##                  NA's   :143013   NA's   :145100                   
##    StrikerSK     NonStriker_match_SK NONStriker_SK   Fielder_match_SK
##  Min.   :  0.0   Min.   :12694       Min.   :  0.0   Min.   :   -1   
##  1st Qu.: 39.0   1st Qu.:16173       1st Qu.: 39.0   1st Qu.:   -1   
##  Median : 95.0   Median :19672       Median : 95.0   Median :   -1   
##  Mean   :135.5   Mean   :19675       Mean   :134.6   Mean   :  690   
##  3rd Qu.:207.0   3rd Qu.:23127       3rd Qu.:207.0   3rd Qu.:   -1   
##  Max.   :496.0   Max.   :26685       Max.   :496.0   Max.   :26680   
##                                                                      
##    Fielder_SK      Bowler_match_SK   BOWLER_SK     PlayerOut_match_SK
##  Min.   : -1.000   Min.   :12697   Min.   :  0.0   Min.   :   -1.0   
##  1st Qu.: -1.000   1st Qu.:16175   1st Qu.: 76.0   1st Qu.:   -1.0   
##  Median : -1.000   Median :19674   Median :173.0   Median :   -1.0   
##  Mean   :  4.527   Mean   :19677   Mean   :193.1   Mean   :  970.3   
##  3rd Qu.: -1.000   3rd Qu.:23131   3rd Qu.:309.0   3rd Qu.:   -1.0   
##  Max.   :496.000   Max.   :26685   Max.   :496.0   Max.   :26685.0   
##                                                                      
##  BattingTeam_SK   BowlingTeam_SK    Keeper_Catch      Player_out_sk    
##  Min.   : 0.000   Min.   : 0.000   Min.   :0.000000   Min.   : -1.000  
##  1st Qu.: 2.000   1st Qu.: 2.000   1st Qu.:0.000000   1st Qu.:  0.000  
##  Median : 4.000   Median : 4.000   Median :0.000000   Median :  0.000  
##  Mean   : 4.346   Mean   : 4.333   Mean   :0.000432   Mean   :  1.101  
##  3rd Qu.: 6.000   3rd Qu.: 6.000   3rd Qu.:0.000000   3rd Qu.:  0.000  
##  Max.   :12.000   Max.   :12.000   Max.   :1.000000   Max.   :496.000  
##                                                                        
##   MatchDateSK      
##  Min.   :20080418  
##  1st Qu.:20100411  
##  Median :20120520  
##  Mean   :20125288  
##  3rd Qu.:20150420  
##  Max.   :20170521  
##

Neyman-Pearson hypothesis test

Hypothesis 1: “Is there a significant difference in the mean number of extras (wides, legbyes, byes, and no-balls) scored by different teams in the same match?”

Null Hypothesis (H0): There is no significant difference in the mean number of extras scored by different teams in the same match.

Alternative Hypothesis (H1): There is a significant difference in the mean number of extras scored by different teams in the same match.

Alpha (α) is the probability of making a Type I error (rejecting the null hypothesis when it is true). Let α = 0.05.

Power (1 - β) is the probability of correctly rejecting a false null hypothesis (detecting a true effect). Let power = 0.8.

Minimum meaningful effect size: I base this on a difference of at least 2 extras per over between the teams, as follows.

In cricket, a wide or no-ball adds a 1-run penalty to the batting team's total against the bowling team; byes and leg byes also count as extras.
A difference of 2 extras per over would mean the two teams concede extras at very different rates while bowling.
In a typical innings of 20 overs:
- Team A concedes 1 extra per over, so 20 extras in the innings
- Team B concedes 3 extras per over, so 60 extras in the innings
- The difference in extras between the teams is 60 - 20 = 40
With 10 wickets in a typical innings, this equates to 40 / 10 = 4 more extras conceded per wicket by Team B compared to Team A.

So for this hypothesis, the minimum meaningful effect size is a difference of 4 extras conceded per wicket between the teams. This corresponds to a substantial gap of at least 2 extras per over in bowling accuracy between the two sides.
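The α, power, and minimum effect size above are enough to estimate how many observations each group would need. Below is a minimal sketch using `power.t.test` from base R's stats package. The standard deviation `assumed_sd` is a placeholder chosen for illustration, not a value estimated from the data set.

```r
# Neyman-Pearson design parameters taken from the text above
alpha <- 0.05          # Type I error rate
target_power <- 0.8    # 1 - beta
min_effect <- 2        # minimum meaningful difference: 2 extras per over

# ASSUMPTION: hypothetical standard deviation of extras per over.
# Replace with an estimate from the data before relying on the result.
assumed_sd <- 1.5

# Number of overs needed per team for a two-sided, two-sample t-test
design <- power.t.test(delta = min_effect, sd = assumed_sd,
                       sig.level = alpha, power = target_power,
                       type = "two.sample", alternative = "two.sided")
n_per_group <- ceiling(design$n)
print(n_per_group)
```

A smaller assumed effect or a larger assumed spread would raise the required sample size, so the result is only as good as the `assumed_sd` guess.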

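alpha <- 0.05  # significance level chosen above
# Keep only the columns used in this test
subset_data <- my_data[, c("Runs_Scored", "Striker_Batting_Position", "Innings_No")]

# Group runs per ball by the striker's batting position
# (split() drops rows where the batting position is NA)
grouped_data <- split(subset_data$Runs_Scored, subset_data$Striker_Batting_Position)

# Welch two-sample t-test on runs per ball for batting positions 1 and 2.
# Note: this compares Runs_Scored by batting position, not extras by team.
t_test_result <- t.test(grouped_data$`1`, grouped_data$`2`)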
print(t_test_result)

## 
##  Welch Two Sample t-test
## 
## data:  grouped_data$`1` and grouped_data$`2`
## t = -0.32117, df = 51025, p-value = 0.7481
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.03291616  0.02364746
## sample estimates:
## mean of x mean of y 
##  1.216989  1.221623

# Checks the p-value to determine statistical significance
if (t_test_result$p.value < alpha) {
  cat("Reject the null hypothesis: There is a significant difference in mean runs scored.")
} else {
  cat("Fail to reject the null hypothesis: There is no significant difference in mean runs scored.")
}

## Fail to reject the null hypothesis: There is no significant difference in mean runs scored.
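Hypothesis 1 is stated in terms of extras per team within a match, so one way to test it directly is to total the extras in each innings of each match and compare the two innings as a pair. The sketch below shows that set-up on synthetic data. The column names `MatcH_id`, `Innings_No`, and `Extra_runs` follow the summary above, but `toy` and all of its values are made up, so the printed result says nothing about the real data set.

```r
set.seed(1)

# SYNTHETIC stand-in for my_data: 30 matches, 2 innings, 120 balls each.
# The values are simulated for illustration only.
toy <- expand.grid(Ball = 1:120, Innings_No = 1:2, MatcH_id = 1:30)
toy$Extra_runs <- rbinom(nrow(toy), size = 1, prob = 0.05)

# Total extras in each innings of each match
extras <- aggregate(Extra_runs ~ MatcH_id + Innings_No, data = toy, FUN = sum)

# One row per match, one column per innings, so the two teams are paired
wide <- reshape(extras, idvar = "MatcH_id", timevar = "Innings_No",
                direction = "wide")

# Paired t-test: both innings come from the same match
paired_result <- t.test(wide$Extra_runs.1, wide$Extra_runs.2, paired = TRUE)
print(paired_result)
```

A paired test is used here because the two innings of one match share the same pitch and conditions. Whether that is the right design for the real data is a judgment call.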

Hypothesis 2:

“Is there a significant difference in the mean runs scored by players in different innings (1st inning and 2nd inning) of the match?”

Null Hypothesis (H0): There is no significant difference in the mean runs scored by players in the 1st inning and 2nd inning of the match.

Alternative Hypothesis (H1): There is a significant difference in the mean runs scored by players in the 1st inning and 2nd inning of the match.

Alpha (α) is the probability of making a Type I error (rejecting the null hypothesis when it is true). Let α = 0.05.

Power (1 - β) is the probability of correctly rejecting a false null hypothesis (detecting a true effect). Let power = 0.8.

Minimum meaningful effect size: with a typical innings of 20 overs, I base this on a difference of at least 2 runs per over, as follows.

In cricket, overs are sets of 6 balls bowled by one bowler.
A difference of 2 runs per over would mean at least 40 more runs scored on average in one 20-over innings compared to the other.
In a typical innings of 20 overs:
- In the 1st innings, if the mean runs per over is 5, the total runs would be 20 * 5 = 100
- In the 2nd innings, to have at least 2 more runs per over, the mean would have to be 7 per over
- So the total runs in the 2nd innings would be 20 * 7 = 140
The difference in total runs between the innings is 140 - 100 = 40 runs.
With about 5 wickets in a 20-over innings, this equates to 40 / 5 = 8 more runs scored per wicket in the 2nd innings compared to the 1st.

So with a typical 20-over innings, the minimum meaningful effect size for this test is a difference of 8 runs per wicket between the 1st and 2nd innings. This corresponds to an average difference of at least 2 runs per over between the two innings.

alpha <- 0.05
# Subsetting the data for analysis
subset_data <- my_data[, c("Innings_No", "Runs_Scored")]

# Separate data for the 1st inning and 2nd inning
inning1_data <- subset_data[subset_data$Innings_No == 1, "Runs_Scored"]
inning2_data <- subset_data[subset_data$Innings_No == 2, "Runs_Scored"]

# Perform independent samples t-test
t_test_result <- t.test(inning1_data, inning2_data)
print(t_test_result)

## 
##  Welch Two Sample t-test
## 
## data:  inning1_data and inning2_data
## t = 2.6118, df = 149616, p-value = 0.009009
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.005360813 0.037601904
## sample estimates:
## mean of x mean of y 
##  1.232108  1.210627

# Check the p-value to determine statistical significance
if (t_test_result$p.value < alpha) {
  cat("Reject the null hypothesis: There is a significant difference in mean runs scored.")
} else {
  cat("Fail to reject the null hypothesis: There is no significant difference in mean runs scored.")
}

## Reject the null hypothesis: There is a significant difference in mean runs scored.
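A statistically significant result is not necessarily a practically meaningful one, so it helps to compare the observed difference with the minimum effect size of 2 runs per over set for this hypothesis. The sketch below does that arithmetic using the two sample means printed in the t-test output. It assumes 6 balls per over and ignores extra deliveries such as wides, which is a simplification on my part.

```r
# Sample means copied from the t-test output above (runs per ball)
mean_inning1 <- 1.232108
mean_inning2 <- 1.210627

# Minimum meaningful effect from the text: 2 runs per over
min_effect_per_over <- 2

# ASSUMPTION: exactly 6 balls per over and 20 overs per innings
balls_per_over <- 6
overs_per_innings <- 20

diff_per_ball <- mean_inning1 - mean_inning2
diff_per_over <- diff_per_ball * balls_per_over
diff_per_innings <- diff_per_over * overs_per_innings

cat("Difference per over:   ", round(diff_per_over, 3), "\n")
cat("Difference per innings:", round(diff_per_innings, 2), "\n")
cat("Meets minimum effect size:", abs(diff_per_over) >= min_effect_per_over, "\n")
```

Under these assumptions the observed gap is roughly 0.13 runs per over, or about 2.6 runs per innings, far below the 2 runs per over threshold. The very large sample size is the likely reason such a small difference still produces a small p-value.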

Fisher’s exact test (perform a Fisher’s-style test for significance and interpret the p-value)

Null Hypothesis (H0): There is no significant difference in the mean number of extras scored by different teams in the same match.

For comparing the mean number of extras scored by different teams in the same match, Fisher’s exact test is not the most appropriate statistical test. It is designed to assess the independence or association between two categorical variables and is commonly used when the sample size is small. My hypothesis instead involves numeric data (the number of extras scored).

In summary, Fisher’s exact test is a valuable tool for specific types of categorical data analysis, but it is not suitable for comparing means or assessing differences in averages between groups, as my hypothesis requires. An independent samples t-test or a non-parametric equivalent is more appropriate here.

Null Hypothesis (H0): There is no significant difference in the mean runs scored by players in the 1st inning and 2nd inning of the match.

Fisher’s test is typically used for contingency tables, where we are analyzing categorical data. In my hypothesis I am comparing the mean runs scored in different innings, which is a numeric variable. For this type of analysis, an independent samples t-test is more appropriate.
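To illustrate the kind of input Fisher’s exact test expects, the sketch below runs `fisher.test` on a 2x2 contingency table. The counts are invented for illustration and are not taken from Ball_By_Ball.csv. The point is only that the test works on counts in categories, not on means.

```r
# MADE-UP counts for illustration only (not from the data set):
# rows = innings, columns = whether a delivery produced an extra
toy_table <- matrix(c(12, 88,
                       7, 93),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(Innings = c("1st", "2nd"),
                                    Extra = c("yes", "no")))

# Fisher's exact test of independence between the two categorical variables
fisher_result <- fisher.test(toy_table)
print(fisher_result$p.value)
```

Applying this to my hypotheses would mean first turning runs or extras into categories (for example, extra versus no extra), which changes the question from a difference in means to a difference in proportions.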

Visualisation of statistical tests for the Introduced null hypotheses

# Visualisation for Hypothesis 2 (t-test on runs by innings):
# Side-by-side boxplot of runs per ball in each innings
boxplot(inning1_data, inning2_data, names = c("1st Inning", "2nd Inning"),
        col = c("lightblue", "lightgreen"), main = "Runs Scored in 1st and 2nd Innings",
        ylab = "Runs Scored")

# Mark the plot with a star if the innings t-test is significant
if (t_test_result$p.value < 0.05) {
  text(1.5, max(c(inning1_data, inning2_data), na.rm = TRUE), "*", cex = 2)
}

# Visualisation for Hypothesis 1 (t-test on runs by batting position):
# t_test_result now holds the innings test, so rerun the Hypothesis 1 test
t_test_h1 <- t.test(grouped_data$`1`, grouped_data$`2`)
bar_x <- barplot(c(t_test_h1$p.value, 1 - t_test_h1$p.value),
                 names.arg = c("p-value", "1 - p-value"),
                 col = c("lightblue", "lightgreen"),
                 main = "T-Test Result: Runs Scored by Batting Position 1 vs 2",
                 ylab = "Probability")

# Mark the p-value bar with a star if the test is significant
if (t_test_h1$p.value < 0.05) {
  text(bar_x[1], t_test_h1$p.value + 0.02, "*", cex = 2)
}
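Because runs per ball take only a few discrete values, a plot of the group means with their 95% confidence intervals can show the size of a difference more directly than a boxplot. The sketch below is one way to draw that with base graphics. `toy_inning1` and `toy_inning2` are simulated stand-ins for `inning1_data` and `inning2_data`, and the sampling probabilities are invented.

```r
set.seed(42)

# SYNTHETIC runs-per-ball samples; the probabilities are made up
run_values <- 0:6
run_probs <- c(0.35, 0.40, 0.08, 0.01, 0.10, 0.00, 0.06)
toy_inning1 <- sample(run_values, 500, replace = TRUE, prob = run_probs)
toy_inning2 <- sample(run_values, 500, replace = TRUE, prob = run_probs)

groups <- list("1st Inning" = toy_inning1, "2nd Inning" = toy_inning2)
means <- sapply(groups, mean)

# One-sample t-based 95% CI per group; rows are lower and upper bounds
ci <- sapply(groups, function(x) t.test(x)$conf.int)

plot(seq_along(means), means, xaxt = "n", pch = 19,
     xlim = c(0.5, 2.5), ylim = range(ci),
     xlab = "", ylab = "Mean runs per ball",
     main = "Mean Runs per Ball with 95% Confidence Intervals")
axis(1, at = seq_along(means), labels = names(means))

# Error bars drawn as double-headed flat arrows
arrows(seq_along(means), ci[1, ], seq_along(means), ci[2, ],
       angle = 90, code = 3, length = 0.1)
```

Swapping the simulated vectors for the real innings data would give the actual intervals.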

A visualisation of Fisher’s exact test cannot be produced for my hypotheses, because the test is not suitable for them. A t-test is the better way to explore and test my hypotheses.