library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.0 ✔ readr 2.2.0
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(xts)
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
##
## ######################### Warning from 'xts' package ##########################
## # #
## # The dplyr lag() function breaks how base R's lag() function is supposed to #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or #
## # source() into this session won't work correctly. #
## # #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop #
## # dplyr from breaking base R's lag() function. #
## # #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning. #
## # #
## ###############################################################################
##
## Attaching package: 'xts'
##
## The following objects are masked from 'package:dplyr':
##
## first, last
library(tsibble)
##
## Attaching package: 'tsibble'
##
## The following object is masked from 'package:zoo':
##
## index
##
## The following object is masked from 'package:lubridate':
##
## interval
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, union
pl <- read_csv("C:/Users/bfunk/Downloads/E0.csv")
## Rows: 380 Columns: 120
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): Div, Date, HomeTeam, AwayTeam, FTR, HTR, Referee
## dbl (112): FTHG, FTAG, HTHG, HTAG, HS, AS, HST, AST, HF, AF, HC, AC, HY, AY...
## time (1): Time
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
First step was to convert the date column in my data set to be workable.
pl$Date <- as.Date(pl$Date, format = "%d/%m/%Y")
I wanted to look how the goals within each game changes by time of the year. Do teams score more as the year goes on? Does the January transfer window affect anything? What about the cold weather like it does in American football? Does that stunt offense? Is there any connection at all?
pl <- pl |>
mutate(total_goals = FTHG + FTAG)
This graph shows every premier league game’s total goals plotted against their date. Nothing remarkable jumps out at me.
pl |>
ggplot(aes(x = Date, y = total_goals)) +
geom_point(size = 1) +
geom_smooth(span = 0.2, color = "blue", se = FALSE) +
labs(title = "Total Goals by Match Date") +
theme_hc()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
A shift to see if just home goals reveal anything. The dip in January is interesting but it could just be low sample size from that period.
pl |>
ggplot(aes(x = Date, y = FTHG)) +
geom_point(size = 1) +
geom_smooth(span = 0.2, color = "blue", se = FALSE) +
labs(title = "Home Goals by Match Date") +
theme_hc()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
To get more of a proper time series I decided to look at average goals by day to see what I can find. The two graphs above do not tell me much and do not satisfy much of anything.
pl_ts <- pl |>
select(Date, total_goals) |>
group_by(Date) |>
summarise(total_goals = mean(total_goals, na.rm = TRUE), .groups = "drop") |>
as_tsibble(index = Date) |>
fill_gaps(total_goals = 0)
pl_ts
## # A tsibble: 283 x 2 [1D]
## Date total_goals
## <date> <dbl>
## 1 2024-08-16 1
## 2 2024-08-17 2.17
## 3 2024-08-18 2.5
## 4 2024-08-19 2
## 5 2024-08-20 0
## 6 2024-08-21 0
## 7 2024-08-22 0
## 8 2024-08-23 0
## 9 2024-08-24 2.86
## 10 2024-08-25 4
## # ℹ 273 more rows
Immediately the days with no games are way too influential. The times during international break brings the trend line way down this does not accurately represent what i am looking for.
pl_ts |>
ggplot(aes(x = Date, y = total_goals)) +
geom_point(size = 1) +
geom_smooth(span = 0.2, color = "blue", se = FALSE) +
labs(title = "Average Goals by Match Date") +
theme_hc()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
pl_ts <- pl |>
mutate(total_goals = FTHG + FTAG) |>
group_by(Date) |>
summarise(total_goals = mean(total_goals, na.rm = TRUE), .groups = "drop") |>
as_tsibble(index = Date)
pl_ts
## # A tsibble: 109 x 2 [1D]
## Date total_goals
## <date> <dbl>
## 1 2024-08-16 1
## 2 2024-08-17 2.17
## 3 2024-08-18 2.5
## 4 2024-08-19 2
## 5 2024-08-24 2.86
## 6 2024-08-25 4
## 7 2024-08-31 3.14
## 8 2024-09-01 2.67
## 9 2024-09-14 2.38
## 10 2024-09-15 2
## # ℹ 99 more rows
Removing the dates with no games results in this. Mostly sporadic but you can start to see some trends. Like how at that start of the year total goals start slower as expected as teams are adjusting to new systems/managers/tactics while the peak seems to be right after the January window when players are rested and teams add players that fit best with their current system during the winter transfer window.
pl_ts |>
ggplot(aes(x = Date, y = total_goals)) +
geom_point(size = 1) +
geom_smooth(span = 0.2, color = "blue", se = FALSE) +
labs(title = "Average Goals by Match Date") +
theme_hc()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
The smooth trend line shows that for the most part average goals by match stays consistent over time.
pl_ts |>
ggplot(aes(x = Date, y = total_goals)) +
geom_line() +
geom_smooth(method = "lm", color = "blue", se = FALSE) +
labs(title = "Linear Trend in Average Goals per Match") +
theme_hc()
## `geom_smooth()` using formula = 'y ~ x'
trend_full <- lm(total_goals ~ as.numeric(Date), data = as_tibble(pl_ts))
summary(trend_full)
##
## Call:
## lm(formula = total_goals ~ as.numeric(Date), data = as_tibble(pl_ts))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4202 -0.6697 0.0798 0.5800 2.0804
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.000e+00 2.287e+01 0.131 0.896
## as.numeric(Date) -3.987e-06 1.138e-03 -0.004 0.997
##
## Residual standard error: 0.9572 on 107 degrees of freedom
## Multiple R-squared: 1.148e-07, Adjusted R-squared: -0.009346
## F-statistic: 1.228e-05 on 1 and 107 DF, p-value: 0.9972
With the very high p value and the non existent r squared there is unlikely to be a trend. The slope sits at essentially 0 showing no change overtime while looking at the full season. So I decided to break it into pre January transfer window.
pl_ts |>
filter_index("2024-08" ~ "2024-12") |>
ggplot(aes(x = Date, y = total_goals)) +
geom_line() +
geom_smooth(method = "lm", color = "blue", se = FALSE) +
labs(title = "Trend: pre January transfer ") +
theme_hc()
## `geom_smooth()` using formula = 'y ~ x'
As the season progresses the average goes up slightly. More significant than the previous graph. It wouldn’t make sense if the increase was dramatic but a slight uptick does make sense as the players and teams get used to their tactics and teammates.
pl_ts |>
filter_index("2025-01" ~ "2025-05") |>
ggplot(aes(x = Date, y = total_goals)) +
geom_line() +
geom_smooth(method = "lm", color = "blue", se = FALSE) +
labs(title = "Trend: Jan 2025 to May 2025") +
theme_hc()
## `geom_smooth()` using formula = 'y ~ x'
Looking at the second half post/during January transfer window it appears the mean of average goals goes downwards according to the line of best fit. This could be dragged down by 1 game 1 goal days or it clould be a sign of fatigue or as a result of injuries.
pl_week <- pl |>
mutate(week = floor_date(Date, "week")) |>
group_by(week) |>
summarise(avg_goals = mean(total_goals, na.rm = TRUE), .groups = "drop") |>
as_tsibble(index = week)
pl_week |>
ggplot(aes(x = week, y = avg_goals)) +
geom_line() +
geom_smooth(span = 0.3, color = "blue", se = FALSE) +
labs(title = "Weekly Average Goals Over Time") +
theme_hc()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
This goals by week graph is interesting because it does appear that scoring frequency slightly resembles a hill where as the season progresses offenses get more comfortable, but at a certain point fitness and health starts to set in and the average goals go down. I think this graph is the most accurate out of all of them because of how many more samples go into each mean, and I believe the end product makes sense.
acf(pl_week$avg_goals, main = "ACF of Weekly Average Goals")
pacf(pl_week$avg_goals, main = "PACF of Weekly Average Goals")
The AFC plot shows there’s no carryover pattern since all of the points are within the lines. The PACF chart also shows there is probably no relationship between average goals and weeks. This tells me the slight deviation from the line of best fit is more than likely just noise.
To conclude looking at how high scoring Premier League games are as they progress over the course of the season is an interesting concept with plausible theories, but the end result mostly tells me that there is no relationship between time of the year and goals per game in the premier league. There are little trends but after this data dive I have to conclude that it is mostly noise as high scoring and low scoring matches come at random times during the year, and some weeks/days have few enough games that the average can be easily biased.