
Introduction

My family and I have always been passionate Liverpool F.C. supporters, whether watching from home or travelling five hours to Anfield. I chose to analyse and forecast the daily Wikipedia page views for the “Liverpool F.C.” article because I wanted this project to be personal and engaging. The dataset offers a window into when and why the public takes an interest in Liverpool F.C., and it covers a five-year period from 10/03/2021 to 10/03/2026. By analysing it, we can identify seasonal patterns and apply time series forecasting to predict future online interest in the club.

Objectives

  • To analyse the history of “Liverpool F.C.” page views on Wikipedia.
  • To prepare this data as a data frame suitable for time series forecasting.
  • To forecast page views for the next 365 days using Prophet.

Defining Prophet

Prophet is a forecasting tool that specialises in daily data with strong seasonal patterns, and it can also cope with missing data points. Recurring effects such as weekends or particular months of the year usually drive the patterns in a time series, and as the project continues we will suggest our own explanations for the spikes and drops in our data.

In essence, Prophet describes a time series as the sum of several components. The formula is:

\[X(t)=M(t)+S(t)+Y(t)+\epsilon_t\]

Where:

  • \(X(t)\) is the quantity being modelled and forecast (the number of page views at time \(t\)).
  • \(M(t)\) represents the trend (how Liverpool’s overall popularity grows or shrinks over time).
  • \(S(t)\) represents seasonality (the weekly and yearly patterns, like weekend matchdays).
  • \(Y(t)\) represents the effects of holidays or any other significant one-off events.
  • \(\epsilon_t\) is the error term (any unusual changes that the model cannot predict).
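To make the additive structure concrete, here is a minimal sketch that builds a series from the four components and adds them together. All values are invented for illustration, and the variable names (`weekly_effect`, `decomposition`) are hypothetical; this is not how Prophet estimates the components, only how they combine once known.

```r
# Simulate an additive series X(t) = M(t) + S(t) + Y(t) + e_t.
# Every number here is invented; none comes from the real page-view data.
set.seed(1)
n <- 28                                   # four weeks of daily observations
t <- seq_len(n)

M <- 10000 + 5 * t                        # trend: slow linear growth
weekly_effect <- c(-500, -800, -600, -400, -900, 1700, 1500)  # seven-day pattern, sums to zero
S <- rep(weekly_effect, length.out = n)   # seasonality: repeats every 7 days
Y <- ifelse(t == 20, 30000, 0)            # one-off event on day 20
eps <- rnorm(n, mean = 0, sd = 300)       # random error

X <- M + S + Y + eps                      # the observed series

decomposition <- data.frame(t, M, S, Y, eps, X)
head(decomposition)
```

Plotting `X` against `t` would show the same kind of picture as the real data: a gentle baseline, a repeating weekly rhythm, and one large spike from the one-off event.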

Exploring the Dataset

The first step is to load the Prophet library alongside the dataset. Conveniently, ‘read.csv()’ returns a data frame, which is the format Prophet expects, so no conversion from a time series object is needed.

# Load the library and the data 
options(scipen = 999)
library(prophet)
## Loading required package: Rcpp
## Loading required package: rlang
football_data <- read.csv("data/football_data.csv")

# Look at the head (first 6 rows) of the data
head(football_data)
##         Date Liverpool.F.C.
## 1 2021-03-10          14572
## 2 2021-03-11          12183
## 3 2021-03-12           8134
## 4 2021-03-13           7893
## 5 2021-03-14           8173
## 6 2021-03-15          12571

Preparing the Data for Prophet

For Prophet to do what we want, the input must satisfy two requirements. Firstly, the data must be in a data frame, which we already have. Secondly, the columns must have specific names: the date column must be called ds, and the column of values to forecast must be called y.

# Rename columns to 'ds' and 'y'
colnames(football_data) <- c("ds", "y")

# Check that the changes have been made 
head(football_data)
##           ds     y
## 1 2021-03-10 14572
## 2 2021-03-11 12183
## 3 2021-03-12  8134
## 4 2021-03-13  7893
## 5 2021-03-14  8173
## 6 2021-03-15 12571

Now that this is done, we can use Prophet to analyse the data and forecast page views for the next year.
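Before modelling, it can be worth checking that the date column really is a date and that no calendar days are missing. The sketch below is a hedged illustration on a small invented data frame (`toy`, with one day deliberately left out), not a check that was run on the real dataset.

```r
# Toy data frame in the ds / y layout (values invented, one day removed on purpose)
toy <- data.frame(
  ds = c("2021-03-10", "2021-03-11", "2021-03-13", "2021-03-14"),
  y  = c(100, 120, 90, 95)
)

# read.csv() typically reads a date column as text, so convert it explicitly
toy$ds <- as.Date(toy$ds)

# Build the full run of calendar days and find any that are absent
all_days <- seq(min(toy$ds), max(toy$ds), by = "day")
missing_days <- all_days[!all_days %in% toy$ds]
missing_days
```

The same three steps could be applied to football_data by swapping it in for `toy`.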

Understanding the Growth with Linear Regression

Before using Prophet, it is worth fitting a simple linear regression model to understand the overall baseline growth of Liverpool’s Wikipedia page views over this five-year stretch. The equation is:

\[y_t = \beta_0 + \beta_1 t + \epsilon_t\]

Where:

  • \(y_t\) is the number of page views at time \(t\)
  • \(\beta_0\) is the y-intercept (baseline page views at the beginning of our data)
  • \(\beta_1\) is the slope (average daily change in page views)
  • \(t\) is the time index (Day 1, Day 2, etc.)
  • \(\epsilon_t\) is the error term

To estimate this, we fit a linear model against a numeric time index that numbers each day.

# Create a numeric time index (Day 1, Day 2, etc.)
football_data$time_index <- seq_len(nrow(football_data))

# Run the linear regression model
lin_model <- lm(y ~ time_index, data = football_data)

# Print the statistical summary
summary(lin_model)
## 
## Call:
## lm(formula = y ~ time_index, data = football_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -9690  -5787  -3230   1486 113111 
## 
## Coefficients:
##               Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 15415.0620   503.6149  30.609 <0.0000000000000002 ***
## time_index     -0.7744     0.4772  -1.623               0.105    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10760 on 1825 degrees of freedom
## Multiple R-squared:  0.001441,   Adjusted R-squared:  0.0008936 
## F-statistic: 2.633 on 1 and 1825 DF,  p-value: 0.1048
# Plot the raw data with the linear trend line
plot(football_data$time_index, football_data$y, type = "l", col = "darkgrey",
     main = "Linear Baseline Trend of Liverpool F.C. Page Views",
     xlab = "Time (Days)", ylab = "Page Views")
abline(lin_model, col = "red", lwd = 2)

Here we can see that the red baseline is almost flat at around 15,000 page views. The fitted slope \(\beta_1\) is about -0.77 views per day, and its p-value of 0.105 means it cannot be distinguished from zero at the 5% significance level, so the club’s general public interest has neither significantly grown nor declined over the last five years. The R-squared of roughly 0.001 also shows that a straight line explains almost none of the day-to-day variation. Clearly there have been many moments of extreme volatility (possibly from matchdays, transfer windows or cup finals), but these have not been enough to shift the base level of page views for Liverpool F.C. on Wikipedia. This long-term stability suits Prophet well, since the forecast can focus on repeating seasonal cycles instead of a changing baseline.
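The size of the slope can be put in context with a short calculation. The sketch below uses only the figures printed in the regression summary above; the variable names are my own. It computes a 95% confidence interval for the slope and the total fitted change in the baseline from the first day to the last.

```r
# Values copied from the summary(lin_model) output above
slope    <- -0.7744       # estimate for time_index (views per day)
slope_se <- 0.4772        # its standard error
df_resid <- 1825          # residual degrees of freedom
n_days   <- df_resid + 2  # observations = residual df + 2 fitted coefficients

# 95% confidence interval for the slope, using the t distribution
t_crit   <- qt(0.975, df = df_resid)
slope_ci <- slope + c(-1, 1) * t_crit * slope_se

# Fitted change in the baseline between the first and last day
total_change <- slope * (n_days - 1)

slope_ci
total_change
```

The interval spans zero, which agrees with the p-value of 0.105, and the fitted line falls by only about 1,400 views across the whole period, small compared with the residual standard error of 10,760.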

Analysing the Forecast

Now it is time for the fun part: fitting the Prophet model to the data prepared earlier and generating a 365-day forecast of Liverpool F.C. page views on Wikipedia.

The Main Prediction

# Build the model and forecast the next year of searches
m <- prophet(football_data)
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
future <- make_future_dataframe(m, periods = 365)
forecast <- predict(m, future)

# Plot the main forecast 
plot(m, forecast)

The plot above shows both our historical Liverpool F.C. data and a further 365-day forecast. The black dots are the observed daily views, and they include many extreme outliers, roughly in December and January. These may be linked to heightened interest while the transfer window is open, with rumours circulating about who Liverpool F.C. may sign or sell. Some of these days exceeded 100,000 daily views, which is fascinating.

The dark blue line is Prophet’s forecast. It suggests that future daily views will stay as steady as they have been since 2021, with no significant changes expected over the next year. This makes sense, since the views probably come mostly from loyal club supporters, so they should not change significantly unless something unpredictable happens to the club.

Behind the dark blue line there is a lighter blue band, which represents the model’s uncertainty. It is not a guarantee: by default Prophet draws an 80% uncertainty interval, so it is the range in which the daily Wikipedia views are expected to fall most of the time over the next 365 days. The band acts like a safety net: if the dark blue line is wrong, the actual views will most likely still fall within the light blue area.
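The numbers behind the plot can also be inspected directly. The following is a minimal sketch on simulated data (the `toy` series and its values are invented, and the `_toy` object names are hypothetical), showing the forecast columns that correspond to the line and the shaded band, and the `interval.width` argument that controls how wide the band is.

```r
library(prophet)

# Simulated daily series (invented values) standing in for the real page views
set.seed(42)
ds <- seq(as.Date("2021-03-10"), by = "day", length.out = 730)
weekend <- format(ds, "%u") %in% c("6", "7")
y <- 12000 + 2000 * weekend + rnorm(length(ds), sd = 800)
toy <- data.frame(ds = ds, y = y)

# interval.width sets the coverage of the shaded band (0.80 is Prophet's default)
m_toy <- prophet(toy, interval.width = 0.80)
future_toy <- make_future_dataframe(m_toy, periods = 30)
forecast_toy <- predict(m_toy, future_toy)

# yhat is the forecast line; yhat_lower and yhat_upper bound the shaded band
tail(forecast_toy[c("ds", "yhat", "yhat_lower", "yhat_upper")])
```

Raising `interval.width` (for example to 0.95) would widen the band, trading a tighter range for a higher chance that the actual views fall inside it.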

Other Interesting Things in the Data Model

With Prophet, we can also break the forecast into its components and look at each one separately. Here I separate it into trend, weekly seasonality and yearly seasonality so that we can understand what each contributes.

# Plot the trend, weekly, and yearly seasonality
prophet_plot_components(m, forecast)
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the prophet package.
##   Please report the issue at <https://github.com/facebook/prophet/issues>.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Several key insights can be drawn from these component plots.

From the trend chart, we can see that Prophet predicts Liverpool F.C.’s Wikipedia page views will decrease slightly over the next 365 days. One possible reason is that Liverpool won the Premier League in 2024/2025, which heightened interest in that period, and early 2026 may mark the start of a decline as fans turn their attention to the top-performing teams of the current 2025/2026 season.

The weekly seasonality chart behaves as we would expect: views are highest on weekend days, which align with Premier League matchdays, as most fixtures take place at the weekend. Fridays typically have the fewest fixtures, which explains the drop in views of the Liverpool F.C. page on that day.
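A simple cross-check on a weekly pattern like this is to average the raw views by day of the week. The sketch below does so on simulated data with a built-in weekend uplift (all values invented; `iso_day` and `weekday_means` are hypothetical names), so it illustrates the technique and does not reproduce the real result.

```r
# Simulated daily views (invented values) with a built-in weekend uplift
set.seed(7)
ds <- seq(as.Date("2021-03-15"), by = "day", length.out = 140)  # 20 full weeks, starting on a Monday
iso_day <- as.integer(format(ds, "%u"))                         # 1 = Monday ... 7 = Sunday
y <- 12000 + ifelse(iso_day >= 6, 3000, 0) + rnorm(length(ds), sd = 500)

# Average views per weekday, ordered Monday to Sunday
weekday_means <- tapply(y, iso_day, mean)
names(weekday_means) <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
round(weekday_means)
```

Running the same `tapply()` call on football_data$y would show whether the raw averages agree with the shape of Prophet’s weekly component.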

The yearly seasonality chart is also interesting. There are peaks in September, when the football calendar has just started, and in March, when the season reaches crunch time and things heat up towards the end. There are drops in April, when there are normally far fewer fixtures, and in June, when the off season for club football is more or less beginning.

Conclusion

In conclusion, this project successfully used Prophet’s time series forecasting to analyse and predict Wikipedia interest in Liverpool F.C. The initial linear regression model showed that the club has a very steady baseline of popularity despite serious day-to-day volatility, and Prophet helped us identify the patterns behind that volatility. The data showed that page views peak on weekend matchdays and around transfer windows, and dip on weekdays and during the summer off season. Prophet also predicted a very slight decrease in future Liverpool F.C. page views, but the recurring seasonal patterns remained dominant and highly predictable. This 365-day forecast gives a realistic, data-driven picture of how public interest in the club may develop over the next year, and shows the importance of maths in a real-world example.