This Data Dive explores the IPL Player Performance Dataset through ANOVA testing and regression modeling.
One-way ANOVA: to compare average team innings scores across venues.
Simple linear regression: to model runs as a function of boundary hitting.
library(tidyverse)  # read_csv(), dplyr verbs, ggplot2
library(lubridate)  # year()
ipl_raw <- read_csv("C:/mayangup/SP26/ipl-data_Dataset 1.csv")
## Rows: 24044 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): player, team, match_outcome, opposition_team, venue
## dbl (16): match_id, runs, balls_faced, fours, sixes, wickets, overs_bowled,...
## date (1): date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Note (data preparation): The dataset includes only 5 matches from 2025, so that season is incomplete and would distort the calculations. To avoid this, all rows from 2025 were filtered out, and the rest of the analysis uses complete seasons only.
IPL <- ipl_raw |>
  mutate(
    date = as.Date(date),
    season = year(date)
  ) |>
  filter(season < 2025)
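A quick way to confirm which seasons are incomplete is to count distinct matches per season. The sketch below is only an illustration: it uses a small invented table (the match IDs and dates are placeholders, not rows from the dataset) and extracts the year with format() so that it runs without lubridate.

```r
library(dplyr)

# Toy stand-in for the player-level data: one row per player per match.
# The match IDs and dates are invented for illustration only.
toy <- tibble(
  match_id = c(1, 1, 2, 2, 3, 3, 4, 4),
  date     = as.Date(c("2023-04-01", "2023-04-01", "2023-05-10", "2023-05-10",
                       "2024-04-02", "2024-04-02", "2025-03-22", "2025-03-22"))
)

# Count distinct matches per season; a season with far fewer matches
# than its neighbours is a candidate for exclusion.
matches_per_season <- toy |>
  mutate(season = as.integer(format(date, "%Y"))) |>
  distinct(season, match_id) |>
  count(season, name = "n_matches")

print(matches_per_season)
```

Applied to the real data, the same pipeline on ipl_raw would show the match count for each season side by side.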
Response variable (continuous):
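Runs per innings are calculated by summing player runs within each match_id and team. Because these totals are sums of individual players' runs, they may not match official scorecard totals that include extras.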
Explanatory variable (categorical):
Venue names are first mapped to cities to reduce duplication and standardize locations.
The goal of this test is to determine whether the average innings score (runs) differs across major IPL cities.
Null Hypothesis (H₀): The mean team innings runs are equal across all selected venue cities.
\[ \LARGE \mu_{\text{innings_runs, City}_1} = \mu_{\text{innings_runs, City}_2} = \dots = \mu_{\text{innings_runs, City}_k} \]
Alternative Hypothesis (H₁): At least one venue city has a different mean of team innings runs.
\[\LARGE \mu_{\text{innings_runs, City}_j} \neq \mu_{\text{innings_runs, City}_m} \quad \text{for at least one pair } (j, m) \]
#Creating the venue → city mapping
venue_city_map <- c(
"M Chinnaswamy Stadium" = "Bengaluru",
"M.Chinnaswamy Stadium" = "Bengaluru",
"M Chinnaswamy Stadium, Bengaluru" = "Bengaluru",
"Dr DY Patil Sports Academy, Mumbai" = "Mumbai",
"Dr DY Patil Sports Academy" = "Mumbai",
"Eden Gardens, Kolkata" = "Kolkata",
"Eden Gardens" = "Kolkata",
"Wankhede Stadium, Mumbai" = "Mumbai",
"Wankhede Stadium" = "Mumbai",
"Rajiv Gandhi International Stadium, Uppal" = "Hyderabad",
"Rajiv Gandhi International Stadium, Uppal, Hyderabad" = "Hyderabad",
"Rajiv Gandhi International Stadium" = "Hyderabad",
"Feroz Shah Kotla" = "Delhi",
"Arun Jaitley Stadium" = "Delhi",
"Arun Jaitley Stadium, Delhi" = "Delhi",
"Dubai International Cricket Stadium" = "Other",
"Sheikh Zayed Stadium" = "Other",
"Sharjah Cricket Stadium" = "Other",
"Zayed Cricket Stadium, Abu Dhabi" = "Other",
"SuperSport Park" = "Other",
"Kingsmead" = "Other",
"St George's Park" = "Other",
"Newlands" = "Other",
"Buffalo Park" = "Other",
"OUTsurance Oval" = "Other",
"New Wanderers Stadium" = "Other",
"De Beers Diamond Oval" = "Other",
"MA Chidambaram Stadium, Chepauk" = "Chennai",
"MA Chidambaram Stadium, Chepauk, Chennai" = "Chennai",
"MA Chidambaram Stadium" = "Chennai",
"Brabourne Stadium" = "Mumbai",
"Brabourne Stadium, Mumbai" = "Mumbai",
"Narendra Modi Stadium, Ahmedabad" = "Ahmedabad",
"Sardar Patel Stadium, Motera" = "Ahmedabad",
"Himachal Pradesh Cricket Association Stadium" = "Dharamshala",
"Himachal Pradesh Cricket Association Stadium, Dharamsala" = "Dharamshala",
"Punjab Cricket Association Stadium, Mohali" = "Mohali",
"Punjab Cricket Association IS Bindra Stadium" = "Mohali",
"Punjab Cricket Association IS Bindra Stadium, Mohali" = "Mohali",
"Punjab Cricket Association IS Bindra Stadium, Mohali, Chandigarh" = "Mohali",
"Dr. Y.S. Rajasekhara Reddy ACA-VDCA Cricket Stadium" = "Vizag",
"Dr. Y.S. Rajasekhara Reddy ACA-VDCA Cricket Stadium, Visakhapatnam" = "Vizag",
"Maharashtra Cricket Association Stadium" = "Pune",
"Maharashtra Cricket Association Stadium, Pune" = "Pune",
"Sawai Mansingh Stadium, Jaipur" = "Jaipur",
"Sawai Mansingh Stadium" = "Jaipur",
"Barabati Stadium" = "Cuttack",
"Green Park" = "Kanpur",
"Holkar Cricket Stadium" = "Indore",
"JSCA International Stadium Complex" = "Ranchi",
"Barsapara Cricket Stadium, Guwahati" = "Guwahati",
"Nehru Stadium" = "Kochi",
"Saurashtra Cricket Association Stadium" = "Rajkot",
"Subrata Roy Sahara Stadium" = "Pune",
"Shaheed Veer Narayan Singh International Stadium" = "Raipur",
"Vidarbha Cricket Association Stadium, Jamtha" = "Nagpur",
"Maharaja Yadavindra Singh International Cricket Stadium, Mullanpur" = "Chandigarh"
)
# Apply the mapping; as.character() drops the names carried over from the lookup vector.
# Venues missing from venue_city_map become NA.
IPL <- IPL |>
  mutate(
    venue_city = as.character(venue_city_map[venue])
  )
# Filter out neutral venues
IPL <- IPL |>
  filter(venue_city != "Other")
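One thing worth checking with a lookup-vector approach: a venue that is missing from the map gets an NA city, and a `!= "Other"` comparison on NA is itself NA, so dplyr's filter() drops those rows without warning. The base R sketch below shows the behaviour on invented venue names (toy_map and toy_venue are hypothetical, not taken from the dataset), using which() to mimic how filter() treats NA.

```r
# Toy lookup and venue column; the names are placeholders, not from the dataset.
toy_map   <- c("Ground A" = "City 1", "Ground B" = "Other")
toy_venue <- c("Ground A", "Ground B", "Ground C", "Ground A")

# Indexing a named vector by a name it does not contain returns NA.
toy_city <- unname(toy_map[toy_venue])

# Venues that have no entry in the lookup.
unmapped <- unique(toy_venue[is.na(toy_city)])
print(unmapped)

# which() keeps only positions where the comparison is TRUE, so both
# "Other" rows and unmapped (NA) rows are removed.
kept <- toy_venue[which(toy_city != "Other")]
print(kept)
```

Running the equivalent check on the real venue column before filtering would list any stadium spellings the map does not yet cover.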
# Create innings-level totals
innings_totals <- IPL |>
  group_by(match_id, team, venue_city) |>
  summarise(
    innings_runs = sum(runs, na.rm = TRUE),
    .groups = "drop"
  )
head(innings_totals)
## # A tibble: 6 × 4
## match_id team venue_city innings_runs
## <dbl> <chr> <chr> <dbl>
## 1 335982 Kolkata Knight Riders Bengaluru 205
## 2 335982 Royal Challengers Bangalore Bengaluru 63
## 3 335983 Chennai Super Kings Mohali 234
## 4 335983 Kings XI Punjab Mohali 196
## 5 335984 Delhi Daredevils Delhi 122
## 6 335984 Rajasthan Royals Delhi 122
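Since each match should contribute two innings, a simple sanity check on the aggregation is to look for matches with any other count. This is a sketch on an invented table that reuses the column names of innings_totals; the values are made up.

```r
library(dplyr)

# Toy innings-level table with the same column names as innings_totals;
# the values are invented for illustration.
toy_innings <- tibble(
  match_id     = c(101, 101, 102, 102, 103),
  team         = c("A", "B", "A", "C", "B"),
  innings_runs = c(150, 142, 171, 168, 120)
)

# Matches that do not have exactly two innings rows.
odd_matches <- toy_innings |>
  count(match_id, name = "n_innings") |>
  filter(n_innings != 2)

print(odd_matches)
```

An empty result on the real table would suggest the grouping by match_id and team produced one row per innings as intended.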
# Count number of innings per city
city_counts <- innings_totals |>
  count(venue_city, name = "n_innings") |>
  arrange(desc(n_innings))
city_counts
## # A tibble: 22 × 2
## venue_city n_innings
## <chr> <int>
## 1 Mumbai 364
## 2 Bengaluru 188
## 3 Kolkata 186
## 4 Delhi 180
## 5 Chennai 170
## 6 Hyderabad 154
## 7 Mohali 122
## 8 Jaipur 114
## 9 Pune 102
## 10 Ahmedabad 72
## # ℹ 12 more rows
IPL |> select(venue, venue_city) |> head(20)
## # A tibble: 20 × 2
## venue venue_city
## <chr> <chr>
## 1 M Chinnaswamy Stadium Bengaluru
## 2 M Chinnaswamy Stadium Bengaluru
## 3 M Chinnaswamy Stadium Bengaluru
## 4 Dr DY Patil Sports Academy, Mumbai Mumbai
## 5 Eden Gardens, Kolkata Kolkata
## 6 M Chinnaswamy Stadium Bengaluru
## 7 M Chinnaswamy Stadium Bengaluru
## 8 Wankhede Stadium, Mumbai Mumbai
## 9 Wankhede Stadium Mumbai
## 10 Wankhede Stadium Mumbai
## 11 Rajiv Gandhi International Stadium, Uppal Hyderabad
## 12 Feroz Shah Kotla Delhi
## 13 Wankhede Stadium, Mumbai Mumbai
## 14 Rajiv Gandhi International Stadium, Uppal Hyderabad
## 15 Arun Jaitley Stadium Delhi
## 16 Rajiv Gandhi International Stadium, Uppal Hyderabad
## 17 MA Chidambaram Stadium, Chepauk Chennai
## 18 Brabourne Stadium Mumbai
## 19 Wankhede Stadium Mumbai
## 20 MA Chidambaram Stadium, Chepauk, Chennai Chennai
# Select top 10 cities by number of innings
top_cities <- city_counts |>
  slice_max(n_innings, n = 10) |>
  pull(venue_city)
top_cities
## [1] "Mumbai" "Bengaluru" "Kolkata" "Delhi" "Chennai" "Hyderabad"
## [7] "Mohali" "Jaipur" "Pune" "Ahmedabad"
# Filter innings data to only top cities
innings_top <- innings_totals |>
  filter(venue_city %in% top_cities)
head(innings_top)
## # A tibble: 6 × 4
## match_id team venue_city innings_runs
## <dbl> <chr> <chr> <dbl>
## 1 335982 Kolkata Knight Riders Bengaluru 205
## 2 335982 Royal Challengers Bangalore Bengaluru 63
## 3 335983 Chennai Super Kings Mohali 234
## 4 335983 Kings XI Punjab Mohali 196
## 5 335984 Delhi Daredevils Delhi 122
## 6 335984 Rajasthan Royals Delhi 122
To prepare the data for the ANOVA test, I first mapped each stadium in the dataset to its corresponding city using a custom venue_city_map, so that all venues were consistently categorized at the city level, and then removed neutral venues labeled “Other”. Next, I aggregated the data by match_id, team, and venue_city, summing the runs scored in each innings. Finally, I selected the ten cities with the most innings for the ANOVA.
A one-way ANOVA model is used to test whether the mean team innings runs differ across venue cities. The response variable is \(innings\_runs\), and the predictor is the categorical variable \(venue\_city\). The ANOVA table reports the F-statistic and the corresponding p-value.
A statistically significant p-value (typically \(p < 0.05\)) would indicate that at least one city has a different mean innings total from the others, providing evidence against the null hypothesis of equal means across all venue cities.
# Fit the one-way ANOVA model
anova_model <- aov(innings_runs ~ venue_city, data = innings_top)
summary(anova_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## venue_city 9 17428 1936 1.799 0.0638 .
## Residuals 1642 1767044 1076
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ANOVA result: \(F = 1.799\) and \(p = 0.0638\)
Since the p‑value is greater than the \(0.05\) significance level, we fail to reject the null hypothesis that all ten cities have the same mean innings runs. Although the city averages differ slightly, these differences are small relative to the large match‑to‑match variability within each venue city (a residual mean square of 1076, or roughly 33 runs in standard-deviation terms). The result is borderline, since it would be significant at the \(0.10\) level, so the analysis does not provide strong evidence that venue city affects innings runs, but it does not rule out a modest effect either.
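The F-statistic and p-value can be reproduced by hand from the sums of squares and degrees of freedom in the ANOVA table, which also makes it easy to compute an effect size. The sketch below uses only the rounded values printed in the table, so the results match the output up to rounding; eta squared is an addition here, not something the original output reports.

```r
# Values copied from the ANOVA table above.
ss_between <- 17428     # Sum Sq for venue_city
ss_within  <- 1767044   # Sum Sq for Residuals
df_between <- 9
df_within  <- 1642

# Mean squares are sums of squares divided by their degrees of freedom.
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# F is the ratio of between-city to within-city mean squares.
f_stat  <- ms_between / ms_within
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

# Eta squared: share of total variation in innings runs attributable to city.
eta_sq <- ss_between / (ss_between + ss_within)

print(round(c(F = f_stat, p = p_value, eta_sq = eta_sq), 4))
```

Eta squared comes out at about 0.01, meaning venue city accounts for roughly 1% of the variation in innings runs, which is consistent with the small between-city differences described above.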
innings_top |>
  ggplot() +
  geom_boxplot(
    mapping = aes(
      x = venue_city,
      y = innings_runs,
      fill = venue_city
    ),
    outlier.alpha = 0.4
  ) +
  labs(
    title = "Distribution of Innings Runs Across 10 IPL Venue Cities",
    x = "Venue City",
    y = "Innings Runs"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    legend.position = "none",
    axis.text.x = element_text(angle = 45, hjust = 1)
  )
The boxplot shows that the distribution of innings runs is broadly similar across all ten IPL venue cities. While the medians vary slightly from city to city, the overall spread, interquartile ranges, and presence of outliers are largely comparable. No city displays a distinctly higher or lower scoring pattern, and the heavy overlap in distributions indicates that the differences between cities are small relative to the large match‑to‑match variability within each venue. This visual pattern aligns with the ANOVA result, reinforcing that cities differ a little, but not enough for the differences to be statistically detectable.
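The visual comparison can be backed up with a per-city summary table of sample size, mean, median, and standard deviation. The sketch below runs on simulated data (three hypothetical cities drawn from the same distribution) that only borrows the column names of innings_top, so the numbers it prints say nothing about the real venues.

```r
library(dplyr)

set.seed(1)

# Simulated stand-in for innings_top: three made-up cities with the same true mean.
sim <- tibble(
  venue_city   = rep(c("City A", "City B", "City C"), each = 50),
  innings_runs = rnorm(150, mean = 150, sd = 30)
)

# Per-city sample size, centre, and spread, ordered by mean innings runs.
city_summary <- sim |>
  group_by(venue_city) |>
  summarise(
    n      = n(),
    mean   = mean(innings_runs),
    median = median(innings_runs),
    sd     = sd(innings_runs),
    .groups = "drop"
  ) |>
  arrange(desc(mean))

print(city_summary)
```

Swapping sim for innings_top would give the actual city means and standard deviations that the boxplot displays graphically.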
Further question: Would the results change if we analyzed only recent IPL seasons?
To explore the relationship between batting performance and boundary‑hitting, I fit a simple linear regression model with \(runs\) as the response variable and \(boundaries\) ( \(fours\) + \(sixes\)) as the predictor.
# create boundaries
IPL <- IPL |>
  mutate(boundaries = fours + sixes)
# fit the regression model
model_runs <- lm(runs ~ boundaries, data = IPL)
summary(model_runs)
##
## Call:
## lm(formula = runs ~ boundaries, data = IPL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.035 -1.786 -1.786 1.339 35.214
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.78571 0.04310 41.44 <2e-16 ***
## boundaries 6.62495 0.01242 533.46 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.169 on 19949 degrees of freedom
## Multiple R-squared: 0.9345, Adjusted R-squared: 0.9345
## F-statistic: 2.846e+05 on 1 and 19949 DF, p-value: < 2.2e-16
1. Intercept
The intercept represents the expected value of the response variable when the predictor is zero. From model_runs, a player with \(0\) boundaries is predicted to score approximately \(1.79\) runs, based on the estimate \(\hat{\beta}_0 = 1.7857\).
2. Slope (Coefficient of the Predictor)
The slope quantifies how much the response variable changes for a one‑unit increase in the predictor. Here, each additional boundary is associated with an estimated increase of \(6.62\) runs, given by \(\hat{\beta}_1 = 6.62495\).
3. P‑value
The extremely small value \(p < 2 \times 10^{-16}\) indicates that boundaries are a highly significant predictor of runs.
4. Coefficient of Determination \((R^2)\)
The coefficient of determination measures the proportion of variance in the response explained by the model. Here, \(R^2 = 0.9345\) means that \(93.45\%\) of the variation in runs is explained by boundaries.
5. Residual Standard Error (RSE)
The residual standard error estimates the typical size of prediction errors. An RSE of \(5.169\) indicates that the model’s predicted runs typically differ from actual runs by about \(5\) runs.
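The quantities in the list above are linked by simple arithmetic, which can be checked from the printed coefficients. The sketch below uses the rounded values from the summary(model_runs) output, so the t-statistic differs slightly from the reported 533.46; the innings with 5 boundaries is a hypothetical example.

```r
# Values copied from the summary(model_runs) output above.
b0    <- 1.78571   # intercept estimate
b1    <- 6.62495   # slope estimate
se_b1 <- 0.01242   # standard error of the slope

# t statistic for the slope: estimate divided by its standard error.
t_slope <- b1 / se_b1

# With a single predictor, the overall F statistic equals the square of that t.
f_from_t <- t_slope^2

# Predicted runs for a hypothetical innings with 5 boundaries.
pred_5 <- b0 + b1 * 5

print(round(c(t = t_slope, F = f_from_t, pred_5 = pred_5), 2))
```

The squared t-statistic lands near the reported F-statistic of 2.846e+05, and the prediction for 5 boundaries is about 34.9 runs.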
The regression results show a strong linear relationship between boundary‑hitting and total runs scored. The slope estimate of \(\hat{\beta}_1 = 6.62\) indicates that each additional boundary is associated with an average increase of about \(6.6\) runs, while the intercept \(\hat{\beta}_0 = 1.79\) represents the expected runs for a player who hits no boundaries. The predictor is highly statistically significant, with a p‑value less than \(2\times 10^{-16}\). The model explains \(93.45\%\) of the variation in runs \((R^2 = 0.9345)\), suggesting that boundary‑hitting alone accounts for most of the differences in player scoring, and the residual standard error of approximately \(5.17\) runs indicates that predictions are typically within about five runs of the actual values. Part of this fit is expected by construction, because every boundary adds 4 or 6 runs directly to the response; a slope above 6 suggests that players who hit more boundaries also tend to collect more non-boundary runs. Overall, boundary‑hitting is a strong predictor of runs in this dataset.
IPL |>
  ggplot(aes(x = boundaries, y = runs)) +
  geom_point(alpha = 0.3, color = "steelblue") +
  geom_smooth(method = "lm", se = FALSE, color = "darkred", linewidth = 1) +
  labs(
    title = "Relationship Between Boundaries and Runs",
    x = "Total Boundaries (Fours + Sixes)",
    y = "Runs Scored"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
The scatterplot clearly shows a strong positive linear relationship between total boundaries and runs scored. As the number of boundaries increases, the points rise sharply along the vertical axis, indicating that players who hit more boundaries consistently score more runs. The fitted regression line captures this trend effectively, with most points lying close to the line, reflecting the high \(R^2\) value from the model. The tight clustering around the line suggests that boundary‑hitting is a highly reliable predictor of run‑scoring, with relatively little unexplained variation. Overall, the visualization reinforces the numerical results by showing that boundary‑hitting is strongly and linearly associated with batting performance.
Further question: Does the linear relationship between boundaries and runs remain consistent across different players and teams?