Let's start by loading the libraries required for this notebook.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ✔ readr 2.1.5
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(stats)
library(ggthemes)
library(purrr)
library(tsibble)
## Registered S3 method overwritten by 'tsibble':
## method from
## as_tibble.grouped_df dplyr
##
## Attaching package: 'tsibble'
##
## The following object is masked from 'package:lubridate':
##
## interval
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, union
library(pwr)
library(effsize)
library(fable)
## Loading required package: fabletools
library(fabletools)
library(feasts)
Let's read in the data and inspect its structure to get a basic idea of what it contains.
data <- read.csv("~/Documents/Rdocs/matches.csv", stringsAsFactors = TRUE)
str(data)
## 'data.frame': 1095 obs. of 20 variables:
## $ id : int 335982 335983 335984 335985 335986 335987 335988 335989 335990 335991 ...
## $ season : Factor w/ 17 levels "2007/08","2009",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ city : Factor w/ 36 levels "Abu Dhabi","Ahmedabad",..: 3 8 11 27 24 19 17 9 17 8 ...
## $ date : Factor w/ 823 levels "2008-04-18","2008-04-19",..: 1 2 2 3 3 4 5 6 7 8 ...
## $ match_type : Factor w/ 8 levels "3rd Place Play-Off",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ player_of_match: Factor w/ 291 levels "A Chandila","A Kumble",..: 38 151 152 175 58 263 278 158 288 114 ...
## $ venue : Factor w/ 58 levels "Arun Jaitley Stadium",..: 24 41 17 56 15 47 43 28 43 41 ...
## $ team1 : Factor w/ 19 levels "Chennai Super Kings",..: 17 7 4 11 9 14 2 1 2 7 ...
## $ team2 : Factor w/ 19 levels "Chennai Super Kings",..: 9 1 14 17 2 7 4 11 14 11 ...
## $ toss_winner : Factor w/ 19 levels "Chennai Super Kings",..: 17 1 14 11 2 7 2 11 14 11 ...
## $ toss_decision : Factor w/ 2 levels "bat","field": 2 1 1 1 1 1 1 2 2 2 ...
## $ winner : Factor w/ 19 levels "Chennai Super Kings",..: 9 1 4 17 9 14 4 1 14 7 ...
## $ result : Factor w/ 4 levels "no result","runs",..: 2 2 4 4 4 4 4 2 4 2 ...
## $ result_margin : int 140 33 9 5 5 6 9 6 3 66 ...
## $ target_runs : int 223 241 130 166 111 167 143 209 215 183 ...
## $ target_overs : num 20 20 20 20 20 20 20 20 20 20 ...
## $ super_over : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ method : Factor w/ 1 level "D/L": NA NA NA NA NA NA NA NA NA NA ...
## $ umpire1 : Factor w/ 62 levels "A Deshmukh","A Nand Kishore",..: 8 35 6 52 11 6 25 19 8 6 ...
## $ umpire2 : Factor w/ 62 levels "A Deshmukh","A Nand Kishore",..: 42 53 16 15 25 41 5 16 31 5 ...
Now let's convert date to a proper Date format (YYYY-MM-DD) so that we can analyze the result margin over time.
data$date <- as.Date(data$date, format = "%Y-%m-%d")
str(data$date)
## Date[1:1095], format: "2008-04-18" "2008-04-19" "2008-04-19" "2008-04-20" "2008-04-20" ...
Now it is in the correct format. Next, we will analyze result_margin over time, looking for trends and patterns.
summary(data$result_margin)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 6.00 8.00 17.26 20.00 146.00 19
data_clean <- data %>%
filter(!is.na(result_margin), !is.na(date))
head(data_clean)
Now that the data is cleaned, let's create a time-series object containing date and result_margin. Because several matches can be played on the same day, we first aggregate to the mean result_margin per date so the tsibble has a unique daily index.
aggregated_data <- data_clean %>%
group_by(date) %>%
summarise(result_margin = mean(result_margin, na.rm = TRUE)) %>%
ungroup()
ts_data <- aggregated_data %>%
as_tsibble(index = date)
ts_data %>%
ggplot(aes(x = date, y = result_margin)) +
geom_line() +
labs(title = "Result Margin Over Time (Aggregated)", x = "Date", y = "Margin of Victory")
Since the raw plot alone does not give a clear initial impression, we will fit a linear regression to check whether result_margin trends over time.
trend_model <- lm(result_margin ~ as.numeric(date), data = ts_data)
summary(trend_model)
##
## Call:
## lm(formula = result_margin ~ as.numeric(date), data = ts_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.709 -11.533 -8.979 4.195 121.942
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.3579381 6.5915294 3.240 0.00124 **
## as.numeric(date) -0.0002360 0.0003833 -0.616 0.53835
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.97 on 813 degrees of freedom
## Multiple R-squared: 0.0004659, Adjusted R-squared: -0.0007636
## F-statistic: 0.3789 on 1 and 813 DF, p-value: 0.5384
ts_data %>%
ggplot(aes(x = date, y = result_margin)) +
geom_line() +
geom_smooth(method = "lm", se = FALSE, color = "blue") +
labs(title = "Result Margin with Trendline", x = "Date", y = "Margin of Victory")
## `geom_smooth()` using formula = 'y ~ x'
The peaks represent matches with a higher margin of victory. The trendline in the plot is almost flat, suggesting that result_margin does not systematically increase or decrease over the years. This may be because players frequently change teams, which effectively resets the relative strength of the sides. If we restricted the analysis to a specific five-year window, we might find a stronger trend in result margin over time.
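As a rough sketch of that idea (the window boundaries below are arbitrary and purely illustrative), we could restrict the series to a recent five-year window and refit the linear trend:
recent_window <- ts_data %>%
  filter(date >= as.Date("2019-01-01"), date <= as.Date("2023-12-31"))
recent_trend <- lm(result_margin ~ as.numeric(date), data = recent_window)
summary(recent_trend)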
Now let's use STL decomposition together with the ACF and PACF to look for seasonality in the data. But before that, we will check for and handle any missing values or duplicates.
ts_data %>%
print()
## # A tsibble: 815 x 2 [1D]
## date result_margin
## <date> <dbl>
## 1 2008-04-18 140
## 2 2008-04-19 21
## 3 2008-04-20 5
## 4 2008-04-21 6
## 5 2008-04-22 9
## 6 2008-04-23 6
## 7 2008-04-24 3
## 8 2008-04-25 66
## 9 2008-04-26 8
## 10 2008-04-27 7
## # ℹ 805 more rows
duplicates_found <- ts_data %>%
duplicates()
## Using `date` as index variable.
print(duplicates_found)
## # A tibble: 0 × 2
## # ℹ 2 variables: date <date>, result_margin <dbl>
ts_data_filled <- ts_data %>%
fill_gaps()
print(ts_data_filled)
## # A tsibble: 5,883 x 2 [1D]
## date result_margin
## <date> <dbl>
## 1 2008-04-18 140
## 2 2008-04-19 21
## 3 2008-04-20 5
## 4 2008-04-21 6
## 5 2008-04-22 9
## 6 2008-04-23 6
## 7 2008-04-24 3
## 8 2008-04-25 66
## 9 2008-04-26 8
## 10 2008-04-27 7
## # ℹ 5,873 more rows
# Linearly interpolate the gaps introduced by fill_gaps()
ts_data_imputed <- ts_data_filled %>%
  mutate(result_margin = ifelse(is.na(result_margin),
                                zoo::na.approx(result_margin, na.rm = FALSE),
                                result_margin))
library(fable)
decomposed <- ts_data_imputed %>%
model(STL(result_margin ~ season(window = "periodic")))
components <- components(decomposed)
autoplot(components) +
labs(title = "STL Decomposition of Result Margin",
y = "Result Margin")
str(ts_data_imputed)
## tbl_ts [5,883 × 2] (S3: tbl_ts/tbl_df/tbl/data.frame)
## $ date : Date[1:5883], format: "2008-04-18" "2008-04-19" ...
## $ result_margin: num [1:5883] 140 21 5 6 9 6 3 66 8 7 ...
## - attr(*, "key")= tibble [1 × 1] (S3: tbl_df/tbl/data.frame)
## ..$ .rows: list<int> [1:1]
## .. ..$ : int [1:5883] 1 2 3 4 5 6 7 8 9 10 ...
## .. ..@ ptype: int(0)
## - attr(*, "index")= chr "date"
## ..- attr(*, "ordered")= logi TRUE
## - attr(*, "index2")= chr "date"
## - attr(*, "interval")= interval [1:1] 1D
## ..@ .regular: logi TRUE
summary(ts_data_imputed$result_margin)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 5.131 7.500 14.055 21.646 140.000
Repeated patterns in the seasonal component indicate periodic behavior in result_margin. This might be due to the repetitive structure of the tournament schedule. The slight decrease at the end of the trend component might indicate a long-term decline in result margin, i.e. one-sided matches becoming less common over the years. The remainder component captures how the series deviates from the trend within each season. Because the 2021 and 2022 seasons were held within a short span, the trend values are closer together over that period.
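To look more closely at that late-period decrease, we could plot the trend component on its own (a small sketch; trend is the column name fabletools assigns to the STL trend component):
components %>%
  ggplot(aes(x = date, y = trend)) +
  geom_line() +
  labs(title = "STL Trend Component of Result Margin", x = "Date", y = "Trend")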
ts_data_imputed %>%
ACF(result_margin) %>%
autoplot() +
labs(title = "Autocorrelation Function (ACF) for Result Margin (Imputed Data)",
y = "Autocorrelation")
Significant peaks at specific lags would confirm seasonality. The autocorrelation at lag 0 is always 1 by definition, and the strong correlation at short lags is partly an artifact of the linear interpolation used to fill non-match days. Beyond that, the autocorrelations are fairly consistent, which might indicate only a weak relationship between result margins across the years.
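As a complementary check on whether these autocorrelations are jointly significant, we could run a Ljung-Box test via features() (the choice of lag = 10 below is arbitrary and only for illustration):
ts_data_imputed %>%
  features(result_margin, ljung_box, lag = 10)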
ts_data_imputed %>%
PACF(result_margin) %>%
autoplot() +
labs(title = "Partial Autocorrelation Function (PACF) for Result Margin (Imputed Data)",
y = "Partial Autocorrelation")
The short-term dependencies suggest that match results may be influenced by factors such as recent team performance, player form, or ground conditions for matches played close together. The absence of a long-term pattern indicates that result_margin is not strongly affected by seasonal factors, such as specific phases of a tournament or particular years.
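One way to capture these short-term dependencies in a model is to let fable select an ARIMA specification automatically. This is a quick sketch rather than a tuned model, and the large share of interpolated values will influence the fit:
arima_fit <- ts_data_imputed %>%
  model(ARIMA(result_margin))
report(arima_fit)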
We could attempt further questions, but they need more data on weather and team strength: Are there external factors affecting the result margins? Do specific seasons influence results significantly? These could be answered by adding weather data and by checking the results for a specific team.
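As an illustration of the team-specific idea (the team chosen here is just an example), we could repeat the daily aggregation for matches won by a single side and plot its margins over time:
team_margins <- data_clean %>%
  filter(winner == "Chennai Super Kings") %>%
  group_by(date) %>%
  summarise(result_margin = mean(result_margin, na.rm = TRUE)) %>%
  ungroup() %>%
  as_tsibble(index = date)
autoplot(team_margins, result_margin) +
  labs(title = "Result Margin Over Time: Chennai Super Kings Wins",
       x = "Date", y = "Margin of Victory")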