R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

my_data <- read.csv('C:/Users/dell/Downloads/Ball_By_Ball.csv')
summary(my_data)
##     MatcH_id          Over_id         Ball_id        Innings_No   
##  Min.   : 335987   Min.   : 1.00   Min.   :1.000   Min.   :1.000  
##  1st Qu.: 419154   1st Qu.: 5.00   1st Qu.:2.000   1st Qu.:1.000  
##  Median : 548382   Median :10.00   Median :4.000   Median :1.000  
##  Mean   : 636208   Mean   :10.14   Mean   :3.617   Mean   :1.482  
##  3rd Qu.: 829742   3rd Qu.:15.00   3rd Qu.:5.000   3rd Qu.:2.000  
##  Max.   :1082650   Max.   :20.00   Max.   :9.000   Max.   :4.000  
##                                                                   
##  Team_Batting       Team_Bowling       Striker_Batting_Position
##  Length:150451      Length:150451      Min.   : 1.000          
##  Class :character   Class :character   1st Qu.: 2.000          
##  Mode  :character   Mode  :character   Median : 3.000          
##                                        Mean   : 3.584          
##                                        3rd Qu.: 5.000          
##                                        Max.   :11.000          
##                                        NA's   :13861           
##   Extra_Type         Runs_Scored      Extra_runs          Wides       
##  Length:150451      Min.   :0.000   Min.   :0.00000   Min.   :0.0000  
##  Class :character   1st Qu.:0.000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Mode  :character   Median :1.000   Median :0.00000   Median :0.0000  
##                     Mean   :1.222   Mean   :0.06899   Mean   :0.0375  
##                     3rd Qu.:1.000   3rd Qu.:0.00000   3rd Qu.:0.0000  
##                     Max.   :6.000   Max.   :5.00000   Max.   :5.0000  
##                                                                       
##     Legbyes             Byes             Noballs           Penalty       
##  Min.   :0.00000   Min.   :0.000000   Min.   :0.00000   Min.   :0.0e+00  
##  1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.00000   1st Qu.:0.0e+00  
##  Median :0.00000   Median :0.000000   Median :0.00000   Median :0.0e+00  
##  Mean   :0.02223   Mean   :0.004885   Mean   :0.00434   Mean   :3.3e-05  
##  3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:0.00000   3rd Qu.:0.0e+00  
##  Max.   :5.00000   Max.   :4.000000   Max.   :5.00000   Max.   :5.0e+00  
##                                                                          
##  Bowler_Extras       Out_type             Caught            Bowled        
##  Min.   :0.00000   Length:150451      Min.   :0.00000   Min.   :0.000000  
##  1st Qu.:0.00000   Class :character   1st Qu.:0.00000   1st Qu.:0.000000  
##  Median :0.00000   Mode  :character   Median :0.00000   Median :0.000000  
##  Mean   :0.04184                      Mean   :0.02907   Mean   :0.009186  
##  3rd Qu.:0.00000                      3rd Qu.:0.00000   3rd Qu.:0.000000  
##  Max.   :5.00000                      Max.   :1.00000   Max.   :1.000000  
##                                                                           
##     Run_out              LBW            Retired_hurt         Stumped        
##  Min.   :0.000000   Min.   :0.000000   Min.   :0.00e+00   Min.   :0.000000  
##  1st Qu.:0.000000   1st Qu.:0.000000   1st Qu.:0.00e+00   1st Qu.:0.000000  
##  Median :0.000000   Median :0.000000   Median :0.00e+00   Median :0.000000  
##  Mean   :0.005018   Mean   :0.003024   Mean   :5.98e-05   Mean   :0.001615  
##  3rd Qu.:0.000000   3rd Qu.:0.000000   3rd Qu.:0.00e+00   3rd Qu.:0.000000  
##  Max.   :1.000000   Max.   :1.000000   Max.   :1.00e+00   Max.   :1.000000  
##                                                                             
##  caught_and_bowled    hit_wicket       ObstructingFeild  Bowler_Wicket    
##  Min.   :0.000000   Min.   :0.00e+00   Min.   :0.0e+00   Min.   :0.00000  
##  1st Qu.:0.000000   1st Qu.:0.00e+00   1st Qu.:0.0e+00   1st Qu.:0.00000  
##  Median :0.000000   Median :0.00e+00   Median :0.0e+00   Median :0.00000  
##  Mean   :0.001402   Mean   :5.98e-05   Mean   :6.6e-06   Mean   :0.04435  
##  3rd Qu.:0.000000   3rd Qu.:0.00e+00   3rd Qu.:0.0e+00   3rd Qu.:0.00000  
##  Max.   :1.000000   Max.   :1.00e+00   Max.   :1.0e+00   Max.   :1.00000  
##                                                                           
##   Match_Date            Season        Striker       Non_Striker   
##  Length:150451      Min.   :2008   Min.   :  1.0   Min.   :  1.0  
##  Class :character   1st Qu.:2010   1st Qu.: 40.0   1st Qu.: 40.0  
##  Mode  :character   Median :2012   Median : 96.0   Median : 96.0  
##                     Mean   :2012   Mean   :136.5   Mean   :135.6  
##                     3rd Qu.:2015   3rd Qu.:208.0   3rd Qu.:208.0  
##                     Max.   :2017   Max.   :497.0   Max.   :497.0  
##                                                                   
##      Bowler        Player_Out        Fielders      Striker_match_SK
##  Min.   :  1.0   Min.   :  1.0    Min.   :  1.0    Min.   :12694   
##  1st Qu.: 77.0   1st Qu.: 41.0    1st Qu.: 47.0    1st Qu.:16173   
##  Median :174.0   Median :107.0    Median :111.0    Median :19672   
##  Mean   :194.1   Mean   :148.6    Mean   :155.4    Mean   :19675   
##  3rd Qu.:310.0   3rd Qu.:236.0    3rd Qu.:237.5    3rd Qu.:23127   
##  Max.   :497.0   Max.   :497.0    Max.   :497.0    Max.   :26685   
##                  NA's   :143013   NA's   :145100                   
##    StrikerSK     NonStriker_match_SK NONStriker_SK   Fielder_match_SK
##  Min.   :  0.0   Min.   :12694       Min.   :  0.0   Min.   :   -1   
##  1st Qu.: 39.0   1st Qu.:16173       1st Qu.: 39.0   1st Qu.:   -1   
##  Median : 95.0   Median :19672       Median : 95.0   Median :   -1   
##  Mean   :135.5   Mean   :19675       Mean   :134.6   Mean   :  690   
##  3rd Qu.:207.0   3rd Qu.:23127       3rd Qu.:207.0   3rd Qu.:   -1   
##  Max.   :496.0   Max.   :26685       Max.   :496.0   Max.   :26680   
##                                                                      
##    Fielder_SK      Bowler_match_SK   BOWLER_SK     PlayerOut_match_SK
##  Min.   : -1.000   Min.   :12697   Min.   :  0.0   Min.   :   -1.0   
##  1st Qu.: -1.000   1st Qu.:16175   1st Qu.: 76.0   1st Qu.:   -1.0   
##  Median : -1.000   Median :19674   Median :173.0   Median :   -1.0   
##  Mean   :  4.527   Mean   :19677   Mean   :193.1   Mean   :  970.3   
##  3rd Qu.: -1.000   3rd Qu.:23131   3rd Qu.:309.0   3rd Qu.:   -1.0   
##  Max.   :496.000   Max.   :26685   Max.   :496.0   Max.   :26685.0   
##                                                                      
##  BattingTeam_SK   BowlingTeam_SK    Keeper_Catch      Player_out_sk    
##  Min.   : 0.000   Min.   : 0.000   Min.   :0.000000   Min.   : -1.000  
##  1st Qu.: 2.000   1st Qu.: 2.000   1st Qu.:0.000000   1st Qu.:  0.000  
##  Median : 4.000   Median : 4.000   Median :0.000000   Median :  0.000  
##  Mean   : 4.346   Mean   : 4.333   Mean   :0.000432   Mean   :  1.101  
##  3rd Qu.: 6.000   3rd Qu.: 6.000   3rd Qu.:0.000000   3rd Qu.:  0.000  
##  Max.   :12.000   Max.   :12.000   Max.   :1.000000   Max.   :496.000  
##                                                                        
##   MatchDateSK      
##  Min.   :20080418  
##  1st Qu.:20100411  
##  Median :20120520  
##  Mean   :20125288  
##  3rd Qu.:20150420  
##  Max.   :20170521  
## 

Building three sets of variable combinations

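Set 1: Calculate the “Total_Runs_Per_Innings” column by summing “Runs_Scored”, “Extra_runs”, “Wides”, “Legbyes”, “Byes” and “Noballs”.

Set 2: Calculate the “Bowling_Economy” column by summing “Runs_Scored”, “Extra_runs”, “Noballs”, “Wides”, “Legbyes” and “Byes”, then dividing by “Over_id / 6”.

Set 3: Calculate the “Total_Runs_Per_Over” column from the same sum as Set 1 (“Runs_Scored”, “Extra_runs”, “Wides”, “Legbyes”, “Byes” and “Noballs”), so it is numerically identical to “Total_Runs_Per_Innings”.

Each row of the dataset is a single delivery, so all three columns are per-delivery values despite their names. The column means in the summary above also suggest that “Extra_runs” already equals the sum of “Wides”, “Legbyes”, “Byes”, “Noballs” and “Penalty” (0.0375 + 0.02223 + 0.004885 + 0.00434 + 0.000033 ≈ 0.06899), in which case these formulas count most extras twice.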

# Set 1: Calculate Total Runs Per Innings
# Create the column first so that it exists before it is referenced
my_data$Total_Runs_Per_Innings <- my_data$Runs_Scored + my_data$Extra_runs + my_data$Wides + my_data$Legbyes + my_data$Byes + my_data$Noballs

response_variable_set1 <- my_data$Total_Runs_Per_Innings  # Response variable for Set 1
# Keep the explanatory variables as separate columns instead of one concatenated vector
explanatory_variables_set1 <- my_data[, c("Runs_Scored", "Extra_runs", "Wides", "Legbyes", "Byes", "Noballs")]

# Set 2: Calculate Bowling Economy
# Create the column first so that it exists before it is referenced
my_data$Bowling_Economy <- (my_data$Runs_Scored + my_data$Extra_runs + my_data$Noballs + my_data$Wides + my_data$Legbyes + my_data$Byes) / (my_data$Over_id / 6)

response_variable_set2 <- my_data$Bowling_Economy  # Response variable for Set 2
# Keep the explanatory variables as separate columns instead of one concatenated vector
explanatory_variables_set2 <- my_data[, c("Runs_Scored", "Extra_runs", "Noballs", "Wides", "Legbyes", "Byes")]

# Set 3: Calculate Total Runs Per Over
# Create the column first so that it exists before it is referenced
my_data$Total_Runs_Per_Over <- my_data$Runs_Scored + my_data$Extra_runs + my_data$Wides + my_data$Legbyes + my_data$Byes + my_data$Noballs

response_variable_set3 <- my_data$Total_Runs_Per_Over  # Response variable for Set 3
# Keep the explanatory variables as separate columns instead of one concatenated vector
explanatory_variables_set3 <- my_data[, c("Runs_Scored", "Extra_runs", "Wides", "Legbyes", "Byes", "Noballs", "Over_id")]

# Display the first few rows of the updated dataset
head(my_data)
##   MatcH_id Over_id Ball_id Innings_No Team_Batting Team_Bowling
## 1   598028      15       6          1            5            2
## 2   598028      14       1          1            5            2
## 3   598028      14       2          1            5            2
## 4   598028      14       3          1            5            2
## 5   598028      14       4          1            5            2
## 6   598028      14       5          1            5            2
##   Striker_Batting_Position Extra_Type Runs_Scored Extra_runs Wides Legbyes Byes
## 1                        6  No Extras           4          0     0       0    0
## 2                        5  No Extras           1          0     0       0    0
## 3                        3  No Extras           1          0     0       0    0
## 4                        5  No Extras           1          0     0       0    0
## 5                        3  No Extras           0          0     0       0    0
## 6                        3  No Extras           4          0     0       0    0
##   Noballs Penalty Bowler_Extras       Out_type Caught Bowled Run_out LBW
## 1       0       0             0 Not Applicable      0      0       0   0
## 2       0       0             0 Not Applicable      0      0       0   0
## 3       0       0             0 Not Applicable      0      0       0   0
## 4       0       0             0 Not Applicable      0      0       0   0
## 5       0       0             0 Not Applicable      0      0       0   0
## 6       0       0             0 Not Applicable      0      0       0   0
##   Retired_hurt Stumped caught_and_bowled hit_wicket ObstructingFeild
## 1            0       0                 0          0                0
## 2            0       0                 0          0                0
## 3            0       0                 0          0                0
## 4            0       0                 0          0                0
## 5            0       0                 0          0                0
## 6            0       0                 0          0                0
##   Bowler_Wicket Match_Date Season Striker Non_Striker Bowler Player_Out
## 1             0  4/20/2013   2013     277         104     83         NA
## 2             0  4/20/2013   2013     104           6    346         NA
## 3             0  4/20/2013   2013       6         104    346         NA
## 4             0  4/20/2013   2013     104           6    346         NA
## 5             0  4/20/2013   2013       6         104    346         NA
## 6             0  4/20/2013   2013       6         104    346         NA
##   Fielders Striker_match_SK StrikerSK NonStriker_match_SK NONStriker_SK
## 1       NA            20336       276               20333           103
## 2       NA            20333       103               20328             5
## 3       NA            20328         5               20333           103
## 4       NA            20333       103               20328             5
## 5       NA            20328         5               20333           103
## 6       NA            20328         5               20333           103
##   Fielder_match_SK Fielder_SK Bowler_match_SK BOWLER_SK PlayerOut_match_SK
## 1               -1         -1           20343        82                 -1
## 2               -1         -1           20348       345                 -1
## 3               -1         -1           20348       345                 -1
## 4               -1         -1           20348       345                 -1
## 5               -1         -1           20348       345                 -1
## 6               -1         -1           20348       345                 -1
##   BattingTeam_SK BowlingTeam_SK Keeper_Catch Player_out_sk MatchDateSK
## 1              4              1            0             0    20130420
## 2              4              1            0             0    20130420
## 3              4              1            0             0    20130420
## 4              4              1            0             0    20130420
## 5              4              1            0             0    20130420
## 6              4              1            0             0    20130420
##   Total_Runs_Per_Innings Bowling_Economy Total_Runs_Per_Over
## 1                      4       1.6000000                   4
## 2                      1       0.4285714                   1
## 3                      1       0.4285714                   1
## 4                      1       0.4285714                   1
## 5                      0       0.0000000                   0
## 6                      4       1.7142857                   4
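Each row of the dataset is one delivery, so the three new columns hold per-delivery values. If actual per-innings or per-over totals were wanted, one possible approach is to sum the deliveries within each group. The following is only a minimal sketch on an invented toy data frame: `Ball_Total` is a hypothetical column name, and the sketch assumes that `Extra_runs` already includes every kind of extra.

```r
# Toy ball-by-ball data; the values are invented for illustration only.
toy <- data.frame(
  MatcH_id    = c(1, 1, 1, 1, 1, 1),
  Innings_No  = c(1, 1, 1, 2, 2, 2),
  Over_id     = c(1, 1, 2, 1, 1, 2),
  Runs_Scored = c(4, 1, 0, 6, 0, 2),
  Extra_runs  = c(0, 1, 0, 0, 0, 1)
)

# Runs added to the team total by each delivery
# (assumes Extra_runs already covers wides, no-balls, byes and leg byes).
toy$Ball_Total <- toy$Runs_Scored + toy$Extra_runs

# Sum the deliveries within each match and innings.
runs_per_innings <- aggregate(Ball_Total ~ MatcH_id + Innings_No,
                              data = toy, FUN = sum)

# Sum the deliveries within each over of each innings.
runs_per_over <- aggregate(Ball_Total ~ MatcH_id + Innings_No + Over_id,
                           data = toy, FUN = sum)

print(runs_per_innings)
print(runs_per_over)
```

Applied to `my_data`, the same `aggregate()` calls would return one row per innings or per over instead of one row per delivery.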

Plotting the visualizations for the three sets of variable combinations

Question: Plot a visualization for each response-explanatory relationship, and draw some conclusions based on the plot.

Set 1 plotting:

# Load the necessary library for plotting
library(ggplot2)

# Each ggplot is printed on its own; par(mfrow) only affects base graphics, so it is not used here

# Set 1: Total Runs Per Innings
plot1 <- ggplot(data = my_data, aes(x = Total_Runs_Per_Innings, y = Runs_Scored)) +
  geom_point() +
  xlab("Total Runs Per Innings") +
  ylab("Runs Scored") +
  ggtitle("Set 1: Total Runs Per Innings vs. Runs Scored")

# Display the plot
print(plot1)

Set 1: Total Runs Per Innings vs. Runs Scored

Observations: The plot shows a strong positive linear relationship between the total runs (as computed, a per-delivery value) and the runs scored. Because the total is the runs scored plus non-negative extras, every point lies on or below the line y = x: deliveries without extras fall exactly on the line, and the points below it are deliveries on which extras were conceded.

Set 2 plotting:

library(ggplot2)

# Set 2: Bowling Economy
plot2 <- ggplot(data = my_data, aes(x = Bowling_Economy, y = Runs_Scored)) +
  geom_point() +
  xlab("Bowling Economy") +
  ylab("Runs Scored") +
  ggtitle("Set 2: Bowling Economy vs. Runs Scored")


# Display the plot
print(plot2)

Set 2: Bowling Economy vs. Runs Scored

Observations: The plot indicates a more scattered relationship between bowling economy and runs scored. There is a general trend of lower runs scored with lower bowling economy, but because the per-delivery total is divided by Over_id / 6, the same number of runs maps to a lower economy value in later overs, so high runs scored also appear at low bowling economy values.

Set 3 plotting:

library(ggplot2)

# Set 3: Total Runs Per Over
plot3 <- ggplot(data = my_data, aes(x = Total_Runs_Per_Over, y = Runs_Scored)) +
  geom_point() +
  xlab("Total Runs Per Over") +
  ylab("Runs Scored") +
  ggtitle("Set 3: Total Runs Per Over vs. Runs Scored")

# Display the plot

print(plot3)

Set 3: Total Runs Per Over vs. Runs Scored

Observations: This plot is identical to the Set 1 plot apart from its labels, because “Total_Runs_Per_Over” is computed from the same formula as “Total_Runs_Per_Innings”. It shows the same strong positive linear relationship, with deliveries that include extras falling below the line y = x.
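Both axes in these plots take only a few distinct integer values, so each visible point can stand for many deliveries. A cross-tabulation is one way to see how many deliveries share each position. The vectors below are invented toy values, not taken from the dataset, and the sketch is only meant to illustrate the idea.

```r
# Toy per-delivery values; invented for illustration only.
runs_scored <- c(0, 1, 1, 4, 0, 6, 1, 0, 0, 1)
extras      <- c(0, 0, 1, 0, 0, 0, 0, 2, 0, 0)
total       <- runs_scored + extras

# Cross-tabulate the two variables: each cell is the number of deliveries
# that would be drawn at the same position in a scatter plot.
counts <- table(Total = total, Runs_Scored = runs_scored)

print(counts)
```

Cells where the row and column labels match are deliveries without extras. Cells where the total exceeds the runs scored are deliveries with extras.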

Calculating the appropriate correlation coefficient for each of these combinations

The correlation coefficient ranges from -1 to 1, where:

r = 1: Perfect positive correlation

r = -1: Perfect negative correlation

r = 0: No correlation
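To make the definition concrete, the sketch below computes Pearson's r by hand on two invented toy vectors and compares it with the built-in `cor()` function. It is an illustration of the formula under those assumptions, not part of the analysis itself.

```r
# Toy vectors; invented for illustration only.
x <- c(0, 1, 1, 4, 6, 2)
y <- c(0, 1, 0, 4, 6, 1)

# Pearson's r: sum of cross-deviations divided by the square root of the
# product of the sums of squared deviations (the n - 1 factors cancel).
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# Built-in equivalent.
r_builtin <- cor(x, y, method = "pearson")

print(c(manual = r_manual, builtin = r_builtin))
```

The two values agree up to floating-point rounding.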

# Calculate correlation coefficients
cor1 <- cor(my_data$Total_Runs_Per_Innings, my_data$Runs_Scored, method = "pearson")
cor2 <- cor(my_data$Bowling_Economy, my_data$Runs_Scored, method = "pearson")
cor3 <- cor(my_data$Total_Runs_Per_Over, my_data$Runs_Scored, method = "pearson")

# Print the correlation coefficients
cat("Correlation coefficient (Set 1):", cor1, "\n")
## Correlation coefficient (Set 1): 0.9077391
cat("Correlation coefficient (Set 2):", cor2, "\n")
## Correlation coefficient (Set 2): 0.4677913
cat("Correlation coefficient (Set 3):", cor3, "\n")
## Correlation coefficient (Set 3): 0.9077391

Interpreting the correlation coefficients based on the visualizations:

Set 1: Total Runs Per Innings vs. Runs Scored

Visualization: The scatter plot showed a positive linear relationship.

Correlation Coefficient: r = 0.9077, a strong positive correlation.

Interpretation: The coefficient is close to 1, which makes sense given the plot: as the total runs per innings increases, runs scored also increases. It is below 1 only because extras add to the total without adding to runs scored.

Set 2: Bowling Economy vs. Runs Scored

Visualization: The scatter plot showed a scattered relationship with a general increasing trend.

Correlation Coefficient: r = 0.4678, a moderate positive correlation.

Interpretation: The positive sign makes sense: deliveries with lower bowling economy (better for the bowler) tend to have fewer runs scored. The coefficient is well below 1 because the relationship is not close to linear; dividing by Over_id / 6 spreads deliveries with the same runs across a range of economy values.

Set 3: Total Runs Per Over vs. Runs Scored

Visualization: The scatter plot showed a positive linear relationship.

Correlation Coefficient: r = 0.9077, identical to Set 1 because the two response columns are computed from the same formula.

Interpretation: The coefficient is close to 1, which makes sense, indicating that as the total runs per over increases, runs scored also increases.

Build a confidence interval for each of the response variables. Provide a detailed conclusion of the response variable (i.e., the population) based on your confidence interval.

We’ll use a common confidence level of 95%, which corresponds to a significance level of α = 0.05. We’ll interpret the confidence intervals to make conclusions about the population.
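As a rough illustration of what such an interval contains, the sketch below builds a 95% t-interval for a mean by hand (sample mean plus or minus the critical t value times the standard error) and checks it against `t.test()`. The sample is an invented toy vector, not data from this report.

```r
# Toy sample; invented for illustration only.
x <- c(0, 1, 1, 4, 0, 6, 1, 0, 2, 1)

n      <- length(x)
x_bar  <- mean(x)
se     <- sd(x) / sqrt(n)                # standard error of the mean
t_crit <- qt(1 - 0.05 / 2, df = n - 1)   # two-sided critical value for alpha = 0.05

# Interval: sample mean plus or minus critical value times standard error.
ci_manual <- c(x_bar - t_crit * se, x_bar + t_crit * se)

# t.test() reports the same interval in its conf.int element.
ci_builtin <- as.numeric(t.test(x, conf.level = 0.95)$conf.int)

print(rbind(manual = ci_manual, builtin = ci_builtin))
```

The standard error shrinks with the square root of the sample size, so intervals computed on very large samples are narrow.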

Below are the confidence intervals and their interpretations for each response variable:

Response Variable 1: Total Runs Per Innings

# Confidence interval for Total Runs Per Innings
ci_total_runs_per_innings <- t.test(my_data$Total_Runs_Per_Innings, conf.level = 0.95)

# Print the confidence interval
cat("Confidence Interval for Total Runs Per Innings:", ci_total_runs_per_innings$conf.int, "\n")
## Confidence Interval for Total Runs Per Innings: 1.351826 1.368475

Interpretation (Response Variable 1):

Based on the confidence interval for “Total Runs Per Innings”, we can be 95% confident that the true population mean lies between 1.352 and 1.368. Because the column is computed for each delivery, this is the mean number of runs per delivery under the Set 1 formula, not the mean total for a whole innings.

Response Variable 2: Bowling Economy

# Confidence interval for Bowling Economy
ci_bowling_economy <- t.test(my_data$Bowling_Economy, conf.level = 0.95)

# Print the confidence interval
cat("Confidence Interval for Bowling Economy:", ci_bowling_economy$conf.int, "\n")
## Confidence Interval for Bowling Economy: 1.381211 1.413101

Interpretation (Response Variable 2):

Based on the confidence interval for “Bowling Economy”, we can be 95% confident that the true population mean lies between 1.381 and 1.413. This is the mean of the per-delivery value defined in Set 2 (runs on the delivery divided by Over_id / 6), so it should not be read as a conventional economy rate in runs per over.

Response Variable 3: Total Runs Per Over

# Confidence interval for Total Runs Per Over
ci_total_runs_per_over <- t.test(my_data$Total_Runs_Per_Over, conf.level = 0.95)

# Print the confidence interval
cat("Confidence Interval for Total Runs Per Over:", ci_total_runs_per_over$conf.int, "\n")
## Confidence Interval for Total Runs Per Over: 1.351826 1.368475

Interpretation (Response Variable 3):

Based on the confidence interval for “Total Runs Per Over”, we can be 95% confident that the true population mean lies between 1.352 and 1.368. The interval is identical to that of Response Variable 1 because both columns are computed from the same formula, and it likewise describes runs per delivery rather than runs per over.

In summary, confidence intervals provide a range within which we can be confident the population means of these response variables lie. This allows us to make inferences about the population based on the sample data.