This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
my_data <- read.csv('C:/Users/dell/Downloads/Ball_By_Ball.csv')
summary(my_data)
## MatcH_id Over_id Ball_id Innings_No
## Min. : 335987 Min. : 1.00 Min. :1.000 Min. :1.000
## 1st Qu.: 419154 1st Qu.: 5.00 1st Qu.:2.000 1st Qu.:1.000
## Median : 548382 Median :10.00 Median :4.000 Median :1.000
## Mean : 636208 Mean :10.14 Mean :3.617 Mean :1.482
## 3rd Qu.: 829742 3rd Qu.:15.00 3rd Qu.:5.000 3rd Qu.:2.000
## Max. :1082650 Max. :20.00 Max. :9.000 Max. :4.000
##
## Team_Batting Team_Bowling Striker_Batting_Position
## Length:150451 Length:150451 Min. : 1.000
## Class :character Class :character 1st Qu.: 2.000
## Mode :character Mode :character Median : 3.000
## Mean : 3.584
## 3rd Qu.: 5.000
## Max. :11.000
## NA's :13861
## Extra_Type Runs_Scored Extra_runs Wides
## Length:150451 Min. :0.000 Min. :0.00000 Min. :0.0000
## Class :character 1st Qu.:0.000 1st Qu.:0.00000 1st Qu.:0.0000
## Mode :character Median :1.000 Median :0.00000 Median :0.0000
## Mean :1.222 Mean :0.06899 Mean :0.0375
## 3rd Qu.:1.000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :6.000 Max. :5.00000 Max. :5.0000
##
## Legbyes Byes Noballs Penalty
## Min. :0.00000 Min. :0.000000 Min. :0.00000 Min. :0.0e+00
## 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.0e+00
## Median :0.00000 Median :0.000000 Median :0.00000 Median :0.0e+00
## Mean :0.02223 Mean :0.004885 Mean :0.00434 Mean :3.3e-05
## 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:0.0e+00
## Max. :5.00000 Max. :4.000000 Max. :5.00000 Max. :5.0e+00
##
## Bowler_Extras Out_type Caught Bowled
## Min. :0.00000 Length:150451 Min. :0.00000 Min. :0.000000
## 1st Qu.:0.00000 Class :character 1st Qu.:0.00000 1st Qu.:0.000000
## Median :0.00000 Mode :character Median :0.00000 Median :0.000000
## Mean :0.04184 Mean :0.02907 Mean :0.009186
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :5.00000 Max. :1.00000 Max. :1.000000
##
## Run_out LBW Retired_hurt Stumped
## Min. :0.000000 Min. :0.000000 Min. :0.00e+00 Min. :0.000000
## 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.00e+00 1st Qu.:0.000000
## Median :0.000000 Median :0.000000 Median :0.00e+00 Median :0.000000
## Mean :0.005018 Mean :0.003024 Mean :5.98e-05 Mean :0.001615
## 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.00e+00 3rd Qu.:0.000000
## Max. :1.000000 Max. :1.000000 Max. :1.00e+00 Max. :1.000000
##
## caught_and_bowled hit_wicket ObstructingFeild Bowler_Wicket
## Min. :0.000000 Min. :0.00e+00 Min. :0.0e+00 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.00e+00 1st Qu.:0.0e+00 1st Qu.:0.00000
## Median :0.000000 Median :0.00e+00 Median :0.0e+00 Median :0.00000
## Mean :0.001402 Mean :5.98e-05 Mean :6.6e-06 Mean :0.04435
## 3rd Qu.:0.000000 3rd Qu.:0.00e+00 3rd Qu.:0.0e+00 3rd Qu.:0.00000
## Max. :1.000000 Max. :1.00e+00 Max. :1.0e+00 Max. :1.00000
##
## Match_Date Season Striker Non_Striker
## Length:150451 Min. :2008 Min. : 1.0 Min. : 1.0
## Class :character 1st Qu.:2010 1st Qu.: 40.0 1st Qu.: 40.0
## Mode :character Median :2012 Median : 96.0 Median : 96.0
## Mean :2012 Mean :136.5 Mean :135.6
## 3rd Qu.:2015 3rd Qu.:208.0 3rd Qu.:208.0
## Max. :2017 Max. :497.0 Max. :497.0
##
## Bowler Player_Out Fielders Striker_match_SK
## Min. : 1.0 Min. : 1.0 Min. : 1.0 Min. :12694
## 1st Qu.: 77.0 1st Qu.: 41.0 1st Qu.: 47.0 1st Qu.:16173
## Median :174.0 Median :107.0 Median :111.0 Median :19672
## Mean :194.1 Mean :148.6 Mean :155.4 Mean :19675
## 3rd Qu.:310.0 3rd Qu.:236.0 3rd Qu.:237.5 3rd Qu.:23127
## Max. :497.0 Max. :497.0 Max. :497.0 Max. :26685
## NA's :143013 NA's :145100
## StrikerSK NonStriker_match_SK NONStriker_SK Fielder_match_SK
## Min. : 0.0 Min. :12694 Min. : 0.0 Min. : -1
## 1st Qu.: 39.0 1st Qu.:16173 1st Qu.: 39.0 1st Qu.: -1
## Median : 95.0 Median :19672 Median : 95.0 Median : -1
## Mean :135.5 Mean :19675 Mean :134.6 Mean : 690
## 3rd Qu.:207.0 3rd Qu.:23127 3rd Qu.:207.0 3rd Qu.: -1
## Max. :496.0 Max. :26685 Max. :496.0 Max. :26680
##
## Fielder_SK Bowler_match_SK BOWLER_SK PlayerOut_match_SK
## Min. : -1.000 Min. :12697 Min. : 0.0 Min. : -1.0
## 1st Qu.: -1.000 1st Qu.:16175 1st Qu.: 76.0 1st Qu.: -1.0
## Median : -1.000 Median :19674 Median :173.0 Median : -1.0
## Mean : 4.527 Mean :19677 Mean :193.1 Mean : 970.3
## 3rd Qu.: -1.000 3rd Qu.:23131 3rd Qu.:309.0 3rd Qu.: -1.0
## Max. :496.000 Max. :26685 Max. :496.0 Max. :26685.0
##
## BattingTeam_SK BowlingTeam_SK Keeper_Catch Player_out_sk
## Min. : 0.000 Min. : 0.000 Min. :0.000000 Min. : -1.000
## 1st Qu.: 2.000 1st Qu.: 2.000 1st Qu.:0.000000 1st Qu.: 0.000
## Median : 4.000 Median : 4.000 Median :0.000000 Median : 0.000
## Mean : 4.346 Mean : 4.333 Mean :0.000432 Mean : 1.101
## 3rd Qu.: 6.000 3rd Qu.: 6.000 3rd Qu.:0.000000 3rd Qu.: 0.000
## Max. :12.000 Max. :12.000 Max. :1.000000 Max. :496.000
##
## MatchDateSK
## Min. :20080418
## 1st Qu.:20100411
## Median :20120520
## Mean :20125288
## 3rd Qu.:20150420
## Max. :20170521
##
Set 1: Calculate the “Total_Runs_Per_Innings” column by summing “Runs_Scored”, “Extra_runs”, “Wides”, “Legbyes”, “Byes”, and “Noballs” for each delivery.
Set 2: Calculate the “Bowling_Economy” column by taking the same six-column sum and dividing it by “my_data$Over_id / 6”.
Set 3: Calculate the “Total_Runs_Per_Over” column with the same six-column sum as Set 1, so the two columns hold identical values. All three columns are per-delivery values; no rows are aggregated by innings or by over.
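The summary output above gives a hint worth checking before summing the extras columns: the means of Wides, Legbyes, Byes, Noballs and Penalty (0.0375 + 0.02223 + 0.004885 + 0.00434 + 0.000033) add up to about 0.06899, which is the mean of Extra_runs. That suggests, but does not prove, that Extra_runs already contains the component columns, in which case adding both would count extras twice. A minimal sketch of a row-level check follows; the `toy` data frame and its values are hypothetical stand-ins for the real file.

```r
# Hypothetical toy deliveries (not the real Ball_By_Ball.csv)
toy <- data.frame(
  Runs_Scored = c(4, 0, 0, 1),
  Extra_runs  = c(0, 1, 2, 0),
  Wides       = c(0, 1, 0, 0),
  Legbyes     = c(0, 0, 2, 0),
  Byes        = 0,
  Noballs     = 0,
  Penalty     = 0
)

# Rebuild the extras from their components and compare row by row
components <- c("Wides", "Legbyes", "Byes", "Noballs", "Penalty")
extras_from_parts <- rowSums(toy[, components])
already_included <- all(toy$Extra_runs == extras_from_parts)

# If the check passes, extras should be added only once
total_once  <- toy$Runs_Scored + toy$Extra_runs
total_twice <- toy$Runs_Scored + toy$Extra_runs +
  rowSums(toy[, c("Wides", "Legbyes", "Byes", "Noballs")])

print(already_included)
print(data.frame(total_once, total_twice))
```

Running the same comparison on `my_data` would show whether the assumption holds for the real columns.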
# Set 1: Calculate Total Runs Per Innings
# Create the column first; reading it before it exists would return NULL
my_data$Total_Runs_Per_Innings <- my_data$Runs_Scored + my_data$Extra_runs + my_data$Wides + my_data$Legbyes + my_data$Byes + my_data$Noballs
response_variable_set1 <- my_data$Total_Runs_Per_Innings # Response variable for Set 1
# Keep the explanatory variables as separate columns instead of one concatenated vector
explanatory_variables_set1 <- my_data[, c("Runs_Scored", "Extra_runs", "Wides", "Legbyes", "Byes", "Noballs")] # Explanatory variables for Set 1
# Set 2: Calculate Bowling Economy
# Create the column first; reading it before it exists would return NULL
my_data$Bowling_Economy <- (my_data$Runs_Scored + my_data$Extra_runs + my_data$Noballs + my_data$Wides + my_data$Legbyes + my_data$Byes) / (my_data$Over_id / 6)
response_variable_set2 <- my_data$Bowling_Economy # Response variable for Set 2
# Keep the explanatory variables as separate columns instead of one concatenated vector
explanatory_variables_set2 <- my_data[, c("Runs_Scored", "Extra_runs", "Noballs", "Wides", "Legbyes", "Byes")] # Explanatory variables for Set 2
# Set 3: Calculate Total Runs Per Over
# Create the column first (same formula as Set 1); reading it before it exists would return NULL
my_data$Total_Runs_Per_Over <- my_data$Runs_Scored + my_data$Extra_runs + my_data$Wides + my_data$Legbyes + my_data$Byes + my_data$Noballs
response_variable_set3 <- my_data$Total_Runs_Per_Over # Response variable for Set 3
# Keep the explanatory variables as separate columns instead of one concatenated vector
explanatory_variables_set3 <- my_data[, c("Runs_Scored", "Extra_runs", "Wides", "Legbyes", "Byes", "Noballs", "Over_id")] # Explanatory variables for Set 3
# Display the first few rows of the updated dataset
head(my_data)
## MatcH_id Over_id Ball_id Innings_No Team_Batting Team_Bowling
## 1 598028 15 6 1 5 2
## 2 598028 14 1 1 5 2
## 3 598028 14 2 1 5 2
## 4 598028 14 3 1 5 2
## 5 598028 14 4 1 5 2
## 6 598028 14 5 1 5 2
## Striker_Batting_Position Extra_Type Runs_Scored Extra_runs Wides Legbyes Byes
## 1 6 No Extras 4 0 0 0 0
## 2 5 No Extras 1 0 0 0 0
## 3 3 No Extras 1 0 0 0 0
## 4 5 No Extras 1 0 0 0 0
## 5 3 No Extras 0 0 0 0 0
## 6 3 No Extras 4 0 0 0 0
## Noballs Penalty Bowler_Extras Out_type Caught Bowled Run_out LBW
## 1 0 0 0 Not Applicable 0 0 0 0
## 2 0 0 0 Not Applicable 0 0 0 0
## 3 0 0 0 Not Applicable 0 0 0 0
## 4 0 0 0 Not Applicable 0 0 0 0
## 5 0 0 0 Not Applicable 0 0 0 0
## 6 0 0 0 Not Applicable 0 0 0 0
## Retired_hurt Stumped caught_and_bowled hit_wicket ObstructingFeild
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 0
## 6 0 0 0 0 0
## Bowler_Wicket Match_Date Season Striker Non_Striker Bowler Player_Out
## 1 0 4/20/2013 2013 277 104 83 NA
## 2 0 4/20/2013 2013 104 6 346 NA
## 3 0 4/20/2013 2013 6 104 346 NA
## 4 0 4/20/2013 2013 104 6 346 NA
## 5 0 4/20/2013 2013 6 104 346 NA
## 6 0 4/20/2013 2013 6 104 346 NA
## Fielders Striker_match_SK StrikerSK NonStriker_match_SK NONStriker_SK
## 1 NA 20336 276 20333 103
## 2 NA 20333 103 20328 5
## 3 NA 20328 5 20333 103
## 4 NA 20333 103 20328 5
## 5 NA 20328 5 20333 103
## 6 NA 20328 5 20333 103
## Fielder_match_SK Fielder_SK Bowler_match_SK BOWLER_SK PlayerOut_match_SK
## 1 -1 -1 20343 82 -1
## 2 -1 -1 20348 345 -1
## 3 -1 -1 20348 345 -1
## 4 -1 -1 20348 345 -1
## 5 -1 -1 20348 345 -1
## 6 -1 -1 20348 345 -1
## BattingTeam_SK BowlingTeam_SK Keeper_Catch Player_out_sk MatchDateSK
## 1 4 1 0 0 20130420
## 2 4 1 0 0 20130420
## 3 4 1 0 0 20130420
## 4 4 1 0 0 20130420
## 5 4 1 0 0 20130420
## 6 4 1 0 0 20130420
## Total_Runs_Per_Innings Bowling_Economy Total_Runs_Per_Over
## 1 4 1.6000000 4
## 2 1 0.4285714 1
## 3 1 0.4285714 1
## 4 1 0.4285714 1
## 5 0 0.0000000 0
## 6 4 1.7142857 4
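The three new columns in the output above hold one value per delivery. If actual innings totals or a conventional economy rate are wanted, the rows need to be grouped first. The sketch below shows one way to do that with base R `aggregate()`. It runs on a hypothetical `toy` data frame, and it assumes that Extra_runs holds all extras for a delivery and that only wides and no-balls are charged to the bowler and excluded from legal deliveries; none of these assumptions has been checked against the real file.

```r
# Hypothetical toy deliveries (not the real file): one innings, two bowlers
toy <- data.frame(
  MatcH_id    = 1,
  Innings_No  = 1,
  Bowler      = c(rep(10, 7), rep(20, 6)),
  Runs_Scored = c(1, 0, 4, 0, 0, 2, 1, 0, 0, 6, 1, 0, 1),
  Extra_runs  = c(0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
  Wides       = c(0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
  Noballs     = 0
)

# Innings total: runs off the bat plus extras, summed over every delivery
toy$Ball_Total <- toy$Runs_Scored + toy$Extra_runs
innings_totals <- aggregate(Ball_Total ~ MatcH_id + Innings_No,
                            data = toy, FUN = sum)

# Economy rate: runs charged to the bowler per six legal deliveries
toy$Conceded <- toy$Runs_Scored + toy$Wides + toy$Noballs
toy$Legal    <- as.integer(toy$Wides == 0 & toy$Noballs == 0)
per_bowler   <- aggregate(cbind(Conceded, Legal) ~ Bowler,
                          data = toy, FUN = sum)
per_bowler$Economy <- per_bowler$Conceded / (per_bowler$Legal / 6)

print(innings_totals)
print(per_bowler)
```

Grouping by `MatcH_id`, `Innings_No` and `Over_id` in the same way would give runs per over.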
## Plotting the visualizations for the three sets of variable combinations
Question: Plot a visualization for each response-explanatory relationship, and draw some conclusions based on the plot.
# Load the necessary library for plotting
library(ggplot2)
# par(mfrow = ...) only affects base graphics, not ggplot2, so each plot is printed on its own
# Set 1: Total Runs Per Innings
plot1 <- ggplot(data = my_data, aes(x = Total_Runs_Per_Innings, y = Runs_Scored)) +
  geom_point() +
  xlab("Total Runs Per Innings") +
  ylab("Runs Scored") +
  ggtitle("Set 1: Total Runs Per Innings vs. Runs Scored")
# Display the plot
print(plot1)
Set 1: Total Runs Per Innings vs. Runs Scored
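Observations: The plot shows a strong positive linear relationship between total runs per innings and runs scored. This is expected by construction, because runs scored is one of the terms summed into the total. Most deliveries lie on the diagonal, where the two values are equal. The points off the diagonal are deliveries with extras, where the total exceeds the runs scored; runs scored can never be higher than the total.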
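Both axes of this plot take only a handful of integer values, so the 150,451 deliveries are drawn on top of each other and the scatter plot cannot show how many rows sit at each point. A small sketch of counting deliveries per combination with base R `table()` is shown below; the `toy` data frame and the `Ball_Total` column name are hypothetical.

```r
# Hypothetical toy deliveries (not the real file)
toy <- data.frame(
  Runs_Scored = c(0, 0, 0, 1, 1, 4, 0, 1),
  Ball_Total  = c(0, 0, 1, 1, 1, 4, 0, 2)
)

# Count how many deliveries share each (total, runs scored) combination
counts <- as.data.frame(
  table(Ball_Total = toy$Ball_Total, Runs_Scored = toy$Runs_Scored)
)

# Drop combinations that never occur
counts <- counts[counts$Freq > 0, ]
print(counts)
```

The resulting counts could then be mapped to point size or colour so that the diagonal and the off-diagonal points are weighted by how often they occur.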
library(ggplot2)
# Set 2: Bowling Economy
plot2 <- ggplot(data = my_data, aes(x = Bowling_Economy, y = Runs_Scored)) +
  geom_point() +
  xlab("Bowling Economy") +
  ylab("Runs Scored") +
  ggtitle("Set 2: Bowling Economy vs. Runs Scored")
# Display the plot
print(plot2)
Set 2: Bowling Economy vs. Runs Scored
Observations: The plot shows a more scattered relationship between bowling economy and runs scored. Deliveries with more runs scored tend to have a higher bowling economy value, but because each value is divided by Over_id / 6, the same number of runs gives a large value in early overs and a small one in late overs, which spreads the points out.
library(ggplot2)
# Set 3: Total Runs Per Over
plot3 <- ggplot(data = my_data, aes(x = Total_Runs_Per_Over, y = Runs_Scored)) +
  geom_point() +
  xlab("Total Runs Per Over") +
  ylab("Runs Scored") +
  ggtitle("Set 3: Total Runs Per Over vs. Runs Scored")
# Display the plot
print(plot3)
Set 3: Total Runs Per Over vs. Runs Scored
Observations: This plot matches the Set 1 plot point for point, because Total_Runs_Per_Over is computed with the same formula as Total_Runs_Per_Innings. It shows the same strong positive linear relationship, with deliveries that include extras lying off the diagonal.
The Pearson correlation coefficient r ranges from -1 to 1, where:
r = 1: Perfect positive linear correlation
r = -1: Perfect negative linear correlation
r = 0: No linear correlation
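To make the definition concrete, the coefficient can be computed by hand as the sum of the products of deviations from the means, divided by the square root of the product of the summed squared deviations. The sketch below does this for two small made-up vectors and compares the result with `cor()`.

```r
# Made-up illustrative vectors (not taken from the dataset)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)

# Deviations from the means
dx <- x - mean(x)
dy <- y - mean(y)

# Pearson r: co-deviation divided by the product of the deviation magnitudes
r_manual  <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r_builtin <- cor(x, y, method = "pearson")

print(c(manual = r_manual, builtin = r_builtin))
```

Both values come out as 0.8 for these vectors.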
# Calculate correlation coefficients
cor1 <- cor(my_data$Total_Runs_Per_Innings, my_data$Runs_Scored, method = "pearson")
cor2 <- cor(my_data$Bowling_Economy, my_data$Runs_Scored, method = "pearson")
cor3 <- cor(my_data$Total_Runs_Per_Over, my_data$Runs_Scored, method = "pearson")
# Print the correlation coefficients
cat("Correlation coefficient (Set 1):", cor1, "\n")
## Correlation coefficient (Set 1): 0.9077391
cat("Correlation coefficient (Set 2):", cor2, "\n")
## Correlation coefficient (Set 2): 0.4677913
cat("Correlation coefficient (Set 3):", cor3, "\n")
## Correlation coefficient (Set 3): 0.9077391
Set 1: Total Runs Per Innings vs. Runs Scored
Visualization: The scatter plot showed a positive linear relationship.
Correlation Coefficient: r = 0.908, a strong positive correlation.
Interpretation: The value makes sense and matches the plot: as the total runs per innings increases, runs scored also increases. It is close to 1 because runs scored is itself part of the total, and it falls short of 1 only because some deliveries include extras.
Set 2: Bowling Economy vs. Runs Scored
Visualization: The scatter plot showed a scattered relationship with a general increasing trend.
Correlation Coefficient: r = 0.468, a moderate positive correlation.
Interpretation: The value makes sense: higher bowling economy values go with more runs scored, so the coefficient is positive rather than negative. It is well below 1 because dividing by Over_id / 6 makes the relationship far from perfectly linear.
Set 3: Total Runs Per Over vs. Runs Scored
Visualization: The scatter plot showed a positive linear relationship.
Correlation Coefficient: r = 0.908, identical to Set 1 because the two response columns are computed with the same formula.
Interpretation: The value makes sense for the same reason as in Set 1: as the total runs per over increases, runs scored also increases, and runs scored is part of the total.
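A correlation close to 1 is largely built in when the response is a sum that includes the explanatory variable. The simulation sketch below illustrates this with made-up probabilities for runs and extras (these are assumptions for illustration, not estimates from the dataset), and also shows that two responses computed with the same formula give the same coefficient.

```r
set.seed(42)
n <- 1000

# Made-up per-delivery runs and occasional one-run extras
runs   <- sample(c(0, 1, 2, 4, 6), n, replace = TRUE,
                 prob = c(0.40, 0.35, 0.10, 0.10, 0.05))
extras <- rbinom(n, size = 1, prob = 0.05)

# Two responses built with the same formula, mirroring Set 1 and Set 3
total_a <- runs + extras
total_b <- runs + extras

r_a <- cor(total_a, runs)
r_b <- cor(total_b, runs)

print(c(r_a = r_a, r_b = r_b))
```

Because the extras are rare and small compared with the spread of the runs, the coefficient stays close to 1 regardless of any cricketing relationship.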
We’ll use a common confidence level of 95%, which corresponds to a significance level of α = 0.05. We’ll interpret the confidence intervals to make conclusions about the population.
Below are the confidence intervals and their interpretations for each response variable:
# Confidence interval for Total Runs Per Innings
ci_total_runs_per_innings <- t.test(my_data$Total_Runs_Per_Innings, conf.level = 0.95)
# Print the confidence interval
cat("Confidence Interval for Total Runs Per Innings:", ci_total_runs_per_innings$conf.int, "\n")
## Confidence Interval for Total Runs Per Innings: 1.351826 1.368475
Interpretation (Response Variable 1):
Based on the confidence interval for the “Total Runs Per Innings,” we can be 95% confident that the true population mean of this per-delivery value lies between 1.352 and 1.368. The interval describes the mean only; individual deliveries often fall outside it.
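The interval reported by `t.test()` is the sample mean plus or minus a t critical value times the standard error. A minimal sketch on a small made-up vector, comparing the hand calculation with the built-in result:

```r
# Made-up per-delivery totals (not taken from the dataset)
x <- c(0, 1, 1, 4, 0, 2, 6, 1, 0, 1)
n <- length(x)

# Standard error of the mean and the two-sided 95% critical value
se     <- sd(x) / sqrt(n)
t_crit <- qt(0.975, df = n - 1)

# Mean plus or minus the margin of error
ci_manual  <- mean(x) + c(-1, 1) * t_crit * se
ci_builtin <- as.numeric(t.test(x, conf.level = 0.95)$conf.int)

print(rbind(manual = ci_manual, builtin = ci_builtin))
```

With 150,451 rows the standard error is very small, which is why the intervals in this report are so narrow.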
# Confidence interval for Bowling Economy
ci_bowling_economy <- t.test(my_data$Bowling_Economy, conf.level = 0.95)
# Print the confidence interval
cat("Confidence Interval for Bowling Economy:", ci_bowling_economy$conf.int, "\n")
## Confidence Interval for Bowling Economy: 1.381211 1.413101
Interpretation (Response Variable 2):
Based on the confidence interval for “Bowling Economy,” we can be 95% confident that the true population mean of this per-delivery bowling economy value lies between 1.381 and 1.413. The interval describes the mean only; individual deliveries often fall outside it.
# Confidence interval for Total Runs Per Over
ci_total_runs_per_over <- t.test(my_data$Total_Runs_Per_Over, conf.level = 0.95)
# Print the confidence interval
cat("Confidence Interval for Total Runs Per Over:", ci_total_runs_per_over$conf.int, "\n")
## Confidence Interval for Total Runs Per Over: 1.351826 1.368475
Interpretation (Response Variable 3): Based on the confidence interval for “Total Runs Per Over,” we can be 95% confident that the true population mean of this per-delivery value lies between 1.352 and 1.368. The interval is identical to the one for Response Variable 1 because the two columns hold the same values.
In summary, confidence intervals provide a range within which we can be confident the population means of these response variables lie. This allows us to make inferences about the population based on the sample data.