library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# dplyr, ggplot2, purrr, and stats are already attached above (via tidyverse / base R),
# so only ggthemes and pwr need to be loaded separately
library(ggthemes)
library(pwr)
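The attach message above notes that dplyr::filter() and dplyr::lag() mask their stats counterparts and points to the conflicted package. A minimal, optional sketch of that suggestion; the two preference calls below are illustrative choices, not required for the rest of the analysis.
# Optional: turn silent masking into explicit, declared preferences
library(conflicted)
conflict_prefer("filter", "dplyr")  # always use dplyr::filter()
conflict_prefer("lag", "dplyr")     # always use dplyr::lag()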
data <- read.csv("~/Documents/Rdocs/matches.csv", stringsAsFactors = TRUE)
head(data)
Before performing the hypothesis test, let's do a sample size calculation under the Neyman-Pearson framework.
# Quick look at the centre and spread of target scores by host city
data |>
  group_by(city) |>
  summarize(sd = sd(target_runs),
            mean = mean(target_runs))
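If target_runs contains missing values (for example, matches with no result), sd() and mean() will return NA for the affected cities. A small sketch of the same summary with na.rm = TRUE, assuming we simply want to ignore those rows:
# Same summary, ignoring missing target scores
data |>
  group_by(city) |>
  summarize(sd = sd(target_runs, na.rm = TRUE),
            mean = mean(target_runs, na.rm = TRUE))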
# Split matches by how they were decided (won by runs vs won by wickets)
Won_by_runs <- data %>% filter(result == "runs")
Won_by_wkts <- data %>% filter(result == "wickets")
# Calculate effect size based on a small effect (Cohen's d = 0.2)
d <- 0.2
# Power analysis
pwr_result <- pwr.t.test(d = d, power = 0.80, sig.level = 0.05, type = "two.sample", alternative = "two.sided")
required_n <- ceiling(pwr_result$n)
cat("The required sample size per group is", required_n, "\n")
## The required sample size per group is 394
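To see how sensitive this requirement is to the assumed effect size, we can map pwr.t.test() over a few values of d (a quick sketch using purrr, which is already loaded; the chosen d values are illustrative):
# Required sample size per group for a range of assumed effect sizes
d_values <- c(0.2, 0.3, 0.5, 0.8)
n_per_group <- map_dbl(d_values, ~ ceiling(pwr.t.test(d = .x, power = 0.80, sig.level = 0.05,
                                                      type = "two.sample", alternative = "two.sided")$n))
tibble(d = d_values, n_per_group = n_per_group)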
With d = 0.2, the required sample size is roughly 400 per group. The dataset has around 1,100 rows, so we have enough data to perform the hypothesis test. We can use a chi-squared test to check whether teams that win the toss also win the match more often than chance would suggest.
Significance level of 0.05: this is the standard threshold in most scientific studies and controls the risk of a Type I error (false positive); it corresponds to a 95% confidence level.
Power of 0.80: 80% power is standard, giving a reasonable chance of detecting a true effect without requiring an excessively large sample.
Effect size of 0.2: a small effect size (Cohen's d = 0.2) is used, matching the power calculation above; this conservative choice makes the test sensitive to even modest differences in outcomes, at the cost of a larger required sample.
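Since the first test below is a chi-squared test rather than a t-test, a power calculation based on Cohen's w is arguably more directly applicable. A sketch with a small effect size (w = 0.1, an assumed value) and one degree of freedom; it returns the required total sample size N rather than a per-group n:
# Power analysis for a chi-squared test with df = 1 and a small effect size
pwr.chisq.test(w = 0.1, df = 1, sig.level = 0.05, power = 0.80)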
Research question: Do teams that win the toss have a statistically significant advantage in winning the match?
Null Hypothesis: Winning the toss has no effect on a team's chance of winning the match.
Alternative Hypothesis: The chance of winning the match is higher for the team that wins the toss.
# Clean data to remove missing values
data_clean <- data |> filter(!is.na(winner) & !is.na(toss_winner))
# Create a new column indicating whether the toss winner also won the match
data_clean <- data_clean %>%
mutate(toss_and_match_winner = ifelse(winner == toss_winner, 'Yes', 'No'))
# Chi-squared goodness-of-fit test: do toss winners win the match more (or less) often than 50% of the time?
toss_match_table <- table(data_clean$toss_and_match_winner)
chi_test_result <- chisq.test(toss_match_table)
print(chi_test_result)
##
## Chi-squared test for given probabilities
##
## data: toss_match_table
## X-squared = 0.29725, df = 1, p-value = 0.5856
The p-value from the chi-squared test (0.5856) is greater than the significance level of 0.05, so we do not reject the null hypothesis: winning the toss does not give a statistically significant advantage in winning the match. Let's visualize the result to understand it better.
ggplot(data_clean, aes(x = toss_and_match_winner, fill = toss_and_match_winner)) +
geom_bar() +
labs(title = "Winning the Toss and Winning the Match",
x = "Did the Toss Winner Also Win the Match?",
y = "Number of Matches") +
theme_minimal()
The visualization confirms how balanced the result is. The bar plot shows the number of matches won by the team that won the toss versus the number of matches where the toss winner lost, and the two bars are nearly equal in height. This agrees with the test: there is no statistically significant advantage in winning the toss.
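As a cross-check on the chi-squared result, a two-sided proportion test asks the equivalent question on the same counts: do toss winners win significantly more (or less) than 50% of matches? A sketch reusing the cleaned data from above:
# Proportion test: is the share of 'Yes' outcomes different from 0.5?
prop.test(x = sum(data_clean$toss_and_match_winner == "Yes"),
          n = nrow(data_clean), p = 0.5)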
Null Hypothesis: The average result margin is the same whether the toss winner chooses to bat first or field first.
Alternative Hypothesis: The average result margin differs between teams choosing to bat first and teams choosing to field first.
Let's perform a two-sample t-test in Fisher's significance-testing framework, using the standard significance level of 0.05.
# Clean data to remove missing values
data_margin <- data_clean %>% filter(!is.na(result_margin))
# Create two separate groups of result margins based on the toss decision
bat_first <- data_margin %>% filter(toss_decision == 'bat') %>% pull(result_margin)
field_first <- data_margin %>% filter(toss_decision == 'field') %>% pull(result_margin)
# Pooled two-sample t-test (var.equal = TRUE assumes equal variances in the two groups)
t_test_result <- t.test(bat_first, field_first, var.equal = TRUE)
# Print test result
print(t_test_result)
##
## Two Sample t-test
##
## data: bat_first and field_first
## t = -0.54765, df = 1074, p-value = 0.584
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.480792 1.961765
## sample estimates:
## mean of x mean of y
## 16.77083 17.53035
The t-test gives a p-value of 0.584, well above 0.05, so we fail to reject the null hypothesis and conclude that there is no statistically significant difference in result margins between the two toss decisions.
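The pooled test above assumes equal variances in the two groups. As a robustness check, Welch's t-test (the default of t.test()) relaxes that assumption, and the observed standardized difference can be compared against the d = 0.2 used in the power calculation; the pooled-SD formula below is the usual Cohen's d for two independent samples.
# Welch's t-test: does not assume equal variances
t.test(bat_first, field_first)
# Observed Cohen's d for the two groups
pooled_sd <- sqrt(((length(bat_first) - 1) * var(bat_first) +
                   (length(field_first) - 1) * var(field_first)) /
                  (length(bat_first) + length(field_first) - 2))
(mean(bat_first) - mean(field_first)) / pooled_sd
A boxplot of the two distributions makes the comparison easier to see.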
# Visualize the distributions
ggplot(data_margin, aes(x = toss_decision, y = result_margin, fill = toss_decision)) +
geom_boxplot() +
labs(title = "Result Margin by Toss Decision",
x = "Toss Decision",
y = "Result Margin") +
theme_minimal()
The boxplot shows only a small difference in result margins between the two toss decisions: teams that chose to field first appear to win by slightly larger margins on average, but the two distributions overlap heavily and the groups are of unequal size. This minor visual difference is consistent with the non-significant t-test, so we fail to reject the null hypothesis: the result margin does not depend on whether the toss winner chose to bat or field first.
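One caveat worth noting: result_margin mixes units, since wins by runs are measured in runs while wins by wickets are measured in wickets. The Won_by_runs subset created earlier keeps the units comparable, so here is a sketch of the same comparison restricted to matches decided by runs (assuming enough non-missing rows remain in both toss-decision groups):
# Repeat the comparison on runs-only wins, so margins share a single unit
runs_only <- Won_by_runs %>% filter(!is.na(result_margin), !is.na(toss_decision))
t.test(result_margin ~ toss_decision, data = runs_only)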