Introduction

This Data Dive explores the IPL Player Performance Dataset by:

- Selecting two numeric variables, each paired with a created (mutated) numeric column, forming two variable pairs.
- Creating visualizations for each pair to examine trends, patterns, and potential outliers.
- Computing the correlation coefficient for each pair and interpreting the value in the context of the plots.
- Constructing confidence intervals for each response variable to draw conclusions about the underlying population.

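library(tidyverse)  # read_csv(), dplyr verbs, ggplot2; year() comes from lubridate (attached by tidyverse >= 2.0)
ipl_raw <- read_csv("C:/mayangup/SP26/ipl-data_Dataset 1.csv")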

## Rows: 24044 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (5): player, team, match_outcome, opposition_team, venue
## dbl  (16): match_id, runs, balls_faced, fours, sixes, wickets, overs_bowled,...
## date  (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Note on data preparation: The dataset includes only 5 matches from 2025, an incomplete season that would distort the calculations. To avoid this, I filtered out all rows from 2025 and used the resulting dataset, which contains complete seasons only, for the rest of the analysis.

IPL <- ipl_raw |>
  mutate(
    date = as.Date(date),
    season = year(date)
  ) |>
  filter(season < 2025)
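One way to confirm that a season is incomplete before dropping it is to count the distinct matches recorded per season. The sketch below uses a small hypothetical data frame (`toy`, with invented `match_id` and `date` values) rather than the real file, and base R's `format()` in place of `year()` so that it runs without extra packages.

```r
# Hypothetical toy data: two seasons with three matches each and one partial season.
toy <- data.frame(
  match_id = c(1, 1, 2, 3, 4, 4, 5, 6, 7),
  date = as.Date(c("2023-04-01", "2023-04-01", "2023-04-02", "2023-04-03",
                   "2024-04-01", "2024-04-01", "2024-04-02", "2024-04-03",
                   "2025-03-22"))
)
toy$season <- as.integer(format(toy$date, "%Y"))

# Number of distinct matches recorded in each season.
matches_per_season <- tapply(toy$match_id, toy$season,
                             function(id) length(unique(id)))
matches_per_season

# Keep only seasons before 2025, mirroring the filter used in the analysis.
toy_clean <- toy[toy$season < 2025, ]
```

A season whose match count is far below the others is a candidate for exclusion, which is the reasoning applied to 2025 here.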

Variable selection

To analyze bowling performance at the innings level, I create two new continuous variables using existing numeric columns. These created variables do not exist in the original dataset.

Pair 1: Runs conceded and Bowling Average

Original variable: \(runs\_conceded\)
Created variable: \[\large bowling\_average = \frac{runs\_conceded}{wickets}\]
- Explanatory: \(runs\_conceded\)
- Response: \(bowling\_average\)
Measures: how many runs a bowler concedes per wicket taken, a key measure of bowling effectiveness.

Pair 2: Balls bowled and Bowling Strike Rate

Original variable: \(balls\_bowled\)
Created variable: \[\large bowling\_strike\_rate = \frac {balls\_bowled}{wickets}\]
- Explanatory: \(balls\_bowled\)
- Response: \(bowling\_strike\_rate\)

Measures: how many balls a bowler delivers per wicket taken, another measure of bowling effectiveness.

Since both formulas require division by wickets, I remove innings where wickets = 0 to avoid undefined values.

ipl_bowling <- IPL |>
  filter(wickets > 0) |>
  mutate(
    bowling_average = runs_conceded / wickets,
    bowling_strike_rate = balls_bowled / wickets
  )
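The need for the `wickets > 0` filter can be seen from how R handles division by zero. The values below are hypothetical and are not taken from the dataset.

```r
# Hypothetical innings: two wicketless spells and one with two wickets.
runs_conceded <- c(24, 0, 30)
wickets       <- c(0, 0, 2)

bowling_average <- runs_conceded / wickets
bowling_average              # Inf, NaN, 15

# Non-finite values propagate into summaries such as the mean.
is.finite(bowling_average)
mean(bowling_average)        # NaN
```

Leaving such rows in would make the later correlations and t-tests undefined, so filtering them out first is the simpler choice.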

Visualization of each variable pair

Pair 1: Runs conceded and Bowling Average

ipl_bowling |>
  ggplot(mapping = aes(x = runs_conceded, y = bowling_average)) +
  geom_point(alpha = 0.3, color = "steelblue") +
  geom_smooth(method = "loess", se = FALSE, color = "darkblue") +
  labs(
    title = "Relationship Between Runs Conceded and Bowling Average",
    x = "Runs Conceded (per innings)",
    y = "Bowling Average (Runs per wicket)"
  ) +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

The scatterplot shows a clear positive relationship between runs conceded and bowling average. As bowlers concede more runs in an innings, their bowling average tends to rise, which is expected because bowling average is runs conceded divided by wickets, so runs conceded sits in the numerator. The LOESS (locally estimated scatterplot smoothing) curve rises steadily, indicating that higher-run innings consistently produce worse averages. There is substantial spread at lower run values, where some bowlers achieve extremely low averages by conceding very few runs for a wicket. Outliers are visible at both ends: very low averages (1–5) and very high averages (40–80), which reflect the natural variability of single-innings bowling performances.

Pair 2: Balls bowled and Bowling Strike Rate

ipl_bowling |>
  ggplot(mapping = aes(x = balls_bowled, y = bowling_strike_rate)) +
  geom_point(alpha = 0.3, color = "firebrick") +
  geom_smooth(method = "loess", se = FALSE, color = "darkred") +
  labs(
    title = "Relationship Between Balls Bowled and Bowling Strike Rate",
    x = "Balls Bowled (per innings)",
    y = "Bowling Strike Rate (Balls per wicket)"
  ) +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

The scatterplot shows a positive relationship between balls bowled and bowling strike rate. Short bowling spells (6–12 balls) display high variability, with some innings producing extremely low strike rates (e.g., a wicket off the first ball) and others producing very high strike rates. As bowlers deliver more balls in an innings, the strike rate gradually increases and becomes more stable, which is expected because strike rate is defined as balls per wicket. The LOESS curve rises smoothly, indicating that longer spells generally correspond to more balls per wicket. The remaining scatter reflects the natural variability of wicket-taking in single-innings performances.
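Because every retained innings has at least one wicket, strike rate can never exceed the number of balls bowled, which helps account for the upward trend. A small grid of hypothetical spell lengths and wicket counts (chosen for illustration, not drawn from the data) shows this ceiling.

```r
balls   <- c(6, 12, 18, 24)   # hypothetical spell lengths
wickets <- 1:4                # hypothetical wicket counts

# Every combination of balls / wickets.
strike_rate_grid <- outer(balls, wickets, "/")
dimnames(strike_rate_grid) <- list(balls = as.character(balls),
                                   wickets = as.character(wickets))
strike_rate_grid

# The largest value in each row is the one-wicket case, equal to balls bowled.
apply(strike_rate_grid, 1, max)
```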

Computing correlation coefficient

Correlation coefficients for both variable pairs

round(cor(ipl_bowling$runs_conceded, ipl_bowling$bowling_average), 2)

## [1] 0.74

round(cor(ipl_bowling$balls_bowled, ipl_bowling$bowling_strike_rate), 2)

## [1] 0.38
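For reference, `cor()` returns the Pearson coefficient by default, that is, the covariance of the two variables divided by the product of their standard deviations. A minimal sketch on hypothetical values (not the IPL data) that reproduces it by hand:

```r
# Hypothetical paired observations.
x <- c(18, 25, 31, 40, 22, 36)
y <- c(9, 25, 15.5, 40, 11, 18)

# Pearson correlation written out from its definition.
pearson_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# Should agree with the built-in function up to floating-point error.
all.equal(pearson_manual, cor(x, y))
```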

Correlation Interpretation for Pair 1: Runs Conceded vs Bowling Average

\(Correlation = 0.74\)

The correlation between runs conceded and bowling average is \(0.74\), indicating a strong positive relationship. The value is expected because the scatterplot shows a clear upward trend, and the LOESS curve rises steadily across the entire range.

Bowling average is directly derived from runs conceded (runs per wicket), so as bowlers concede more runs in an innings, their average naturally increases. The strong correlation is fully supported by the scatterplot.
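This built-in dependence can be illustrated with a small simulation, offered as a sketch under stated assumptions rather than a model of real innings: runs and wickets are drawn independently from made-up ranges, yet the derived average still correlates positively with runs because runs appears in its numerator.

```r
set.seed(42)
n <- 5000

# Assumed, purely illustrative ranges; runs and wickets are independent here.
sim_runs    <- sample(5:60, n, replace = TRUE)
sim_wickets <- sample(1:4, n, replace = TRUE)
sim_average <- sim_runs / sim_wickets

cor(sim_runs, sim_wickets)   # close to 0 by construction
cor(sim_runs, sim_average)   # clearly positive despite the independence
```

Part of the observed correlation is therefore structural, which is worth keeping in mind when reading it as evidence about bowling performance.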

Correlation Interpretation for Pair 2: Balls Bowled vs Bowling Strike Rate

\(Correlation = 0.38\)

The correlation between balls bowled and bowling strike rate is 0.38, which indicates a moderate positive relationship. This value makes sense because the scatterplot shows a general upward trend, but with substantial spread in the points.

Short bowling spells have highly variable strike rates, while longer bowling spells show a smoother increase. The LOESS curve rises gradually, confirming that although strike rate tends to increase with balls bowled, the relationship is not very strong. The moderate correlation is consistent with the plot.

Build Confidence Intervals for Each Response Variable

Response variable: \(bowling\_average\)

t.test(ipl_bowling$bowling_average)

## 
##  One Sample t-test
## 
## data:  ipl_bowling$bowling_average
## t = 152.34, df = 7244, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  20.14715 20.67242
## sample estimates:
## mean of x 
##  20.40979

\(Mean\ bowling\_average\): \(20.40979\)

\(95\%\ CI:(20.15, 20.67)\)

Using a one‑sample t‑test, the mean bowling average is 20.41 runs per wicket, with a 95% confidence interval from 20.15 to 20.67. We are therefore 95% confident that the true population mean of innings-level bowling average, among innings with at least one wicket, lies within this range. Because the sample is very large (7,245 innings), the interval is narrow. This suggests that, across the population of wicket-taking IPL innings, bowlers typically concede around 20 runs for every wicket they take.
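The interval reported by `t.test()` is the sample mean plus or minus a t critical value times the standard error. A minimal sketch, using a hypothetical helper `ci_mean()` and invented values rather than the IPL data, that reproduces it by hand:

```r
# Hypothetical helper: t-based confidence interval for a mean.
ci_mean <- function(x, level = 0.95) {
  n    <- length(x)
  se   <- sd(x) / sqrt(n)                       # standard error of the mean
  crit <- qt(1 - (1 - level) / 2, df = n - 1)   # two-sided t critical value
  mean(x) + c(-1, 1) * crit * se
}

# Invented sample values for illustration.
x <- c(12, 25, 18.5, 31, 9, 22, 40, 15)

ci_mean(x)
as.numeric(t.test(x)$conf.int)   # should match the line above
```

Because the standard error shrinks with the square root of the sample size, a sample of several thousand innings produces a much narrower interval than this eight-value example.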

Response variable: \(bowling\_strike\_rate\)

t.test(ipl_bowling$bowling_strike_rate)

## 
##  One Sample t-test
## 
## data:  ipl_bowling$bowling_strike_rate
## t = 205.71, df = 7244, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  16.74025 17.06236
## sample estimates:
## mean of x 
##   16.9013

\(Mean\ bowling\_strike\_rate\): \(16.9013\)

\(95\%\ CI:(16.74, 17.06)\)

Using a one‑sample t‑test, the mean bowling strike rate is 16.90 balls per wicket, with a 95% confidence interval from 16.74 to 17.06. We are therefore 95% confident that the true population mean of innings-level bowling strike rate, among innings with at least one wicket, lies within this range. With such a large sample (7,245 innings), the interval is tight, suggesting that in wicket-taking innings IPL bowlers take a wicket roughly every 17 balls on average. This provides a stable estimate of the population strike rate.
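As a consistency check, the interval can be recovered from the reported mean and t statistic alone, since with a null value of 0 the t statistic equals the mean divided by its standard error. The numbers below are copied from the strike-rate output; small differences in the trailing digits come from the rounding of `t`.

```r
# Values copied from the one-sample t-test output for bowling_strike_rate.
mean_sr <- 16.9013
t_stat  <- 205.71
df      <- 7244

se     <- mean_sr / t_stat          # standard error implied by t = mean / se
margin <- qt(0.975, df) * se        # half-width of the 95% interval
ci_sr  <- mean_sr + c(-1, 1) * margin

round(ci_sr, 2)                     # 16.74 17.06
```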

The analysis revealed two key relationships in IPL bowling performance. The linear association between runs conceded and bowling average is substantially stronger, as reflected in both the scatterplot and the higher correlation of \(0.74\), than the moderate and more dispersed relationship between balls bowled and bowling strike rate, with a correlation of \(0.38\).

Confidence intervals further clarified the population‑level behavior: the true mean bowling average is estimated to lie between \(20.15\) and \(20.67\), while the true mean strike rate is estimated to lie between \(16.74\) and \(17.06\).

The moderate relationship between balls bowled and strike rate reflects real cricket dynamics, where wicket‑taking ability varies more widely across bowlers and match situations.

Further question: How would the inclusion of additional variables, such as opposition team or match venue, influence the correlations observed in the bowling analysis?
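One simple way to start on this question would be to compute the correlation separately within each venue (or opposition team) and compare the results. The sketch below uses a hypothetical data frame with invented values; only the column names `venue`, `balls_bowled`, and `bowling_strike_rate` follow the analysis above.

```r
# Hypothetical innings-level data for two made-up venues.
toy <- data.frame(
  venue = rep(c("Venue A", "Venue B"), each = 5),
  balls_bowled = c(6, 12, 18, 24, 24, 12, 18, 24, 24, 6),
  bowling_strike_rate = c(6, 12, 9, 24, 8, 6, 18, 12, 24, 3)
)

# Pearson correlation computed within each venue.
cor_by_venue <- sapply(split(toy, toy$venue), function(d) {
  cor(d$balls_bowled, d$bowling_strike_rate)
})
cor_by_venue
```

Large differences between groups would suggest that the pooled correlation hides venue-specific behavior, although groups with few innings give unstable estimates.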

Week6_Datadive

Mayank Gupta

2026-02-17
