
0. Meta’s Prophet & Dependencies Installation

0.1 Dependencies

library(zoo)
library(prophet)
library(dplyr)
library(forecast)
ChadwellHeath = readxl::read_excel("data/ChadwellHeath.xlsx")

0.2 Prophet

Following the instructions given to me, I first tried installing Prophet via part (b), but ran into this error:

Error: Failed to install 'prophet' from GitHub:   Could not find tools necessary to compile a package Call `pkgbuild::check_build_tools(debug = TRUE)` to diagnose the problem.

So I went with the normal installation outlined in (a), which worked without issues.

1. The Data

The data we will be analysing today comes from TfL’s large dataset on their website. I have chosen Chadwell Heath Station on the newly opened Elizabeth Line as this is one of my local stations that I use for commuting into London on a daily basis.

The station has a long history, as it sits on the Great Eastern Main Line out of London Liverpool Street. It has been served by many different trains over the years and was most recently taken over by the Elizabeth Line in 2022, making it part of a large east-west network with trains from here going as far as Reading and Heathrow Airport.

1.1 Formatting

The date column is stored in a numeric format, so R does not recognise it as a date. We can remedy this by converting the column to the Date class. For this evaluation, we will be looking at station entries over time.

X = ChadwellHeath$TravelDate
Y = ChadwellHeath$EntryTapCount
Z = ChadwellHeath$ExitTapCount

entries.df = data.frame(X,Y)

entries.df <- entries.df %>% mutate(X = as.Date(as.character(X), format="%Y%m%d"))

plot(entries.df,type = 'l'); summary(entries.df)

##        X                    Y       
##  Min.   :2022-01-01   Min.   :   0  
##  1st Qu.:2022-09-30   1st Qu.:4630  
##  Median :2023-07-10   Median :6009  
##  Mean   :2023-07-04   Mean   :5679  
##  3rd Qu.:2024-04-04   3rd Qu.:7116  
##  Max.   :2024-12-28   Max.   :8468
class(entries.df)
## [1] "data.frame"
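To make the conversion step concrete, here is a minimal sketch using made-up values in the same yyyymmdd layout (the vector `raw_dates` is hypothetical, not taken from the TfL file). It shows that the number has to pass through `as.character()` before `as.Date()` can apply the `"%Y%m%d"` format, and that an impossible date comes back as `NA` rather than raising an error, which is worth checking for after converting.

```r
# Hypothetical raw values in the same yyyymmdd numeric layout as TravelDate.
# The last value is deliberately invalid (month 13).
raw_dates <- c(20220101, 20220102, 20221340)

# Convert number -> character -> Date using an explicit format string
parsed <- as.Date(as.character(raw_dates), format = "%Y%m%d")

print(parsed)              # the invalid entry is returned as NA
print(class(parsed))       # "Date"
print(sum(is.na(parsed)))  # number of values that failed to parse
```

Running `sum(is.na(...))` on the converted column is a cheap way to confirm that every row was parsed.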

The data frame is useful, but class() shows that it is not yet a time series object. Let’s change that:

start_date <- min(entries.df$X)

entries.ts = ts(entries.df$Y, start = c(as.numeric(format(start_date, "%Y")),as.numeric(format(start_date, "%j"))), frequency = 365)

plot(entries.ts, main = 'Daily Entries at Chadwell Heath Station', xlab = 'Time', ylab = 'Entries')

class(entries.ts)
## [1] "ts"
summary(entries.ts)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    4630    6009    5679    7116    8468
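One caveat worth checking: ts() only sees a vector of values and assumes they are equally spaced, so any missing calendar days would shift later observations along the time axis. The row numbers in the forecast output later in this report (1424 rows including 365 future days, so 1059 historical rows) are fewer than the 1093 calendar days between 2022-01-01 and 2024-12-28, which suggests some days may be absent. The sketch below uses made-up values, not the TfL data, and shows one possible way to find such gaps and pad them with NA before building a ts object.

```r
# Synthetic example (made-up numbers): ten calendar days, two of them missing
all_days <- seq(as.Date("2022-01-01"), as.Date("2022-01-10"), by = "day")
observed <- data.frame(
  X = all_days[-c(4, 7)],
  Y = c(5, 6, 7, 9, 10, 12, 13, 14)
)

# Calendar days that have no row in the observed data
missing_days <- all_days[!all_days %in% observed$X]
print(missing_days)

# Left-join onto the complete calendar so that missing days appear as NA
full <- merge(data.frame(X = all_days), observed, by = "X", all.x = TRUE)
print(full)
```

How the NA values are then handled (left as they are, interpolated, or set to zero for closure days) is a modelling choice that depends on why the days are missing.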

The series now looks like a standard time series and does not exhibit any obviously heteroskedastic behaviour. While it is quite spiky, we can already make some preliminary observations using the decompose function, which splits a time series of the additive form

\[ X_t = m_t + S_t + Y_t \]

into its constituents so we can analyse the trend, seasonal and residual error components respectively.

entries.decomp <- decompose(entries.ts)

plot(entries.decomp)

Here are some initial observations we can get from this simple decomposition:

  • There is a definite positive trend in entries into the station over time. It plateaus a little midway through 2024, but that may just be due to the cut-off at the end of the data. We will investigate this positive trend further with Prophet.

  • Seasonality-wise, there is a recurring pattern each year during the summer period, as would be expected when people are away on holiday and schools and universities are closed.

  • The random component shows a slight recurring pattern, which suggests that it still holds information not explained by the seasonal or trend components of our time series.
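The additive form \( X_t = m_t + S_t + Y_t \) can be checked directly on a small synthetic series. The sketch below uses invented quarterly-style data (frequency 4) rather than the station data. It confirms that the three components returned by decompose() add back up to the original series wherever the trend is defined. It also shows that the trend, which decompose() estimates with a centred moving average one period wide, is NA at both ends. With frequency = 365 that gap is roughly half a year at each end, which would be consistent with the trend estimate stopping around mid-2024.

```r
set.seed(1)

# Synthetic series (made-up): linear trend + repeating pattern of period 4 + noise
t_index <- 1:48
x <- ts(0.5 * t_index + rep(c(3, -1, -4, 2), times = 12) + rnorm(48), frequency = 4)

d <- decompose(x)  # additive decomposition by default

# Rebuild the series from its components where the trend is available
recon <- d$trend + d$seasonal + d$random
ok <- !is.na(d$trend)
print(max(abs(recon[ok] - x[ok])))  # reconstruction error, effectively zero

# The centred moving average leaves the trend undefined at both ends
print(sum(is.na(d$trend)))

# The estimated seasonal figure is centred, so it sums to (numerically) zero
print(sum(d$figure))
```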

1.2 Meta’s Prophet Analysis

Let’s now use Meta’s Prophet to explore this time series further. The prophet function fits a forecasting model that gives us a more detailed analysis of our time series.

# entries.meta = prophet(entries.df)

We get an error here as the columns were not labelled correctly,

Error in fit.prophet(m, df, ...) : Dataframe must have columns 'ds' and 'y' with the dates and values respectively.

Let’s rename our columns accordingly…

entries.df <- entries.df %>% rename(y = Y, ds = X)

…and try again.

entries.meta = prophet(entries.df)
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.

This was successful! Let’s look at forecasting now, as we can get a lot more information from that.
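The column requirement from the error message can also be handled up front. The helper below, `to_prophet_df`, is a hypothetical convenience function written for this sketch and is not part of the prophet package. It renames two chosen columns to `ds` and `y` and checks their types before the data frame is passed on. The `demo` data frame holds made-up values.

```r
# Hypothetical helper (not part of the prophet package): build a data frame
# with the 'ds' and 'y' column names that prophet() asks for.
to_prophet_df <- function(df, date_col, value_col) {
  stopifnot(date_col %in% names(df), value_col %in% names(df))
  out <- data.frame(ds = df[[date_col]], y = df[[value_col]])
  # Fail early if the date column was never converted or the values are not numeric
  stopifnot(inherits(out$ds, "Date"), is.numeric(out$y))
  out
}

# Made-up example data with the same X / Y naming used earlier
demo <- data.frame(X = as.Date("2022-01-01") + 0:2, Y = c(10, 12, 9))
prophet_ready <- to_prophet_df(demo, "X", "Y")
print(names(prophet_ready))
```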

1.3 Forecasting

Now we can forecast the next 365 days and see what we find.

entries.forecast = make_future_dataframe(entries.meta, 365, freq = "day", include_history = TRUE)

entries.predictions <- predict(entries.meta, entries.forecast)

tail(entries.predictions)
##              ds    trend additive_terms additive_terms_lower
## 1419 2025-12-23 7244.065      -221.6965            -221.6965
## 1420 2025-12-24 7245.500      -268.3983            -268.3983
## 1421 2025-12-25 7246.935      -262.1186            -262.1186
## 1422 2025-12-26 7248.371      -641.2885            -641.2885
## 1423 2025-12-27 7249.806     -2656.0685           -2656.0685
## 1424 2025-12-28 7251.241     -4075.3390           -4075.3390
##      additive_terms_upper     weekly weekly_lower weekly_upper    yearly
## 1419            -221.6965   882.6443     882.6443     882.6443 -1104.341
## 1420            -268.3983   930.5214     930.5214     930.5214 -1198.920
## 1421            -262.1186  1020.1956    1020.1956    1020.1956 -1282.314
## 1422            -641.2885   711.9677     711.9677     711.9677 -1353.256
## 1423           -2656.0685 -1245.3879   -1245.3879   -1245.3879 -1410.681
## 1424           -4075.3390 -2621.5914   -2621.5914   -2621.5914 -1453.748
##      yearly_lower yearly_upper multiplicative_terms multiplicative_terms_lower
## 1419    -1104.341    -1104.341                    0                          0
## 1420    -1198.920    -1198.920                    0                          0
## 1421    -1282.314    -1282.314                    0                          0
## 1422    -1353.256    -1353.256                    0                          0
## 1423    -1410.681    -1410.681                    0                          0
## 1424    -1453.748    -1453.748                    0                          0
##      multiplicative_terms_upper yhat_lower yhat_upper trend_lower trend_upper
## 1419                          0   5971.500   8093.030    7179.817    7311.860
## 1420                          0   5991.412   7942.771    7180.951    7313.642
## 1421                          0   5938.864   8027.098    7182.127    7315.422
## 1422                          0   5615.196   7592.708    7183.333    7317.161
## 1423                          0   3568.505   5669.913    7184.665    7318.900
## 1424                          0   2102.161   4200.937    7185.858    7320.537
##          yhat
## 1419 7022.369
## 1420 6977.102
## 1421 6984.817
## 1422 6607.082
## 1423 4593.737
## 1424 3175.902
plot(entries.meta, entries.predictions, type = 'l')

Key Observations:

  • The forecast, shown by the dark blue line, trends upward overall, suggesting growing commuter activity. This would make sense if more people use public transport year on year as the population increases.

  • The sharp dips in the predictions line up with weekends and holiday periods: in the output above, the weekly component is strongly negative on 27 and 28 December 2025, which are a Saturday and a Sunday.

  • The uncertainty intervals widen toward the future, reflecting increased uncertainty in long-term predictions.
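These observations can be checked against the printed forecast rows. The sketch below copies the six rows from the tail(entries.predictions) output above into a small data frame (the name `tail_rows` is introduced here for illustration). It verifies three things: that yhat is the sum of the trend, weekly and yearly columns, as expected for an additive model; which days of the week the large negative weekly terms fall on; and how the width of the trend uncertainty band changes across the six days.

```r
# Values copied from the tail(entries.predictions) output above (rows 1419 to 1424)
tail_rows <- data.frame(
  ds          = as.Date("2025-12-23") + 0:5,
  trend       = c(7244.065, 7245.500, 7246.935, 7248.371, 7249.806, 7251.241),
  weekly      = c(882.6443, 930.5214, 1020.1956, 711.9677, -1245.3879, -2621.5914),
  yearly      = c(-1104.341, -1198.920, -1282.314, -1353.256, -1410.681, -1453.748),
  yhat        = c(7022.369, 6977.102, 6984.817, 6607.082, 4593.737, 3175.902),
  trend_lower = c(7179.817, 7180.951, 7182.127, 7183.333, 7184.665, 7185.858),
  trend_upper = c(7311.860, 7313.642, 7315.422, 7317.161, 7318.900, 7320.537)
)

# Additive model: yhat should equal trend + weekly + yearly up to print rounding
rebuilt <- with(tail_rows, trend + weekly + yearly)
print(max(abs(rebuilt - tail_rows$yhat)))

# Locale-independent weekday number: 0 = Sunday, 1 = Monday, ..., 6 = Saturday
tail_rows$wday <- as.POSIXlt(tail_rows$ds)$wday
print(tail_rows[, c("ds", "wday", "weekly")])

# Width of the trend uncertainty band on each forecast day
trend_width <- tail_rows$trend_upper - tail_rows$trend_lower
print(trend_width)
```

Six rows are only a small window, so the change in band width here is slight. The plot gives the fuller picture over the whole forecast horizon.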

1.4 Seasonality Analysis

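seasonal.entries.meta = prophet(entries.df, daily.seasonality = TRUE)
# Predict with the refitted model so the components plotted come from the same fit
seasonal.entries.predictions <- predict(seasonal.entries.meta, entries.forecast)
prophet_plot_components(seasonal.entries.meta, seasonal.entries.predictions)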

We can clearly see a weekly pattern in which passenger entries build up from the start of the week and reach an average peak on Thursday before falling off on Friday. There could be a number of reasons for this, but I would surmise that people tend to work from home on Mondays and/or Fridays to enjoy a longer weekend.
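A model-free way to cross-check a weekly pattern like this is to average the counts by day of the week. The sketch below does so on synthetic data: the dates, the baseline of 6000 and the `weekday_effect` values are all invented for illustration and are not the station figures. The same grouping could be applied to the real ds and y columns.

```r
# Synthetic daily counts (made-up numbers): four full weeks starting on a Monday
ds <- seq(as.Date("2022-01-03"), by = "day", length.out = 28)
weekday_effect <- c(Mon = 500, Tue = 800, Wed = 900, Thu = 1000,
                    Fri = 700, Sat = -1200, Sun = -2600)
y <- 6000 + rep(weekday_effect, times = 4)

# POSIXlt$wday is locale-independent: 0 = Sunday, 1 = Monday, ..., 6 = Saturday
wday <- as.POSIXlt(ds)$wday
day_names <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
wday_f <- factor(day_names[wday + 1],
                 levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))

# Mean count per weekday, then centred so the values are comparable in spirit
# to an additive weekly component
weekday_means <- tapply(y, wday_f, mean)
centred <- weekday_means - mean(weekday_means)
print(round(centred, 1))
```

On real data this simple average mixes in trend and yearly effects, so it is only a rough sanity check against the weekly component that Prophet estimates.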

The yearly component shows how average entries vary from month to month. We can see a sharp increase in entries from January as people get back to work, followed by fluctuations throughout Q1. In Q2 we see a steady increase with no major dips, before the expected valley in Q3 as schools finish and people go on holiday. Entries increase again in September, and traffic in Q4 is much higher than in the rest of the year until the expected December decline.

On the larger scale, the trend increases over the years, which makes sense as more and more people get to know the new line thanks to its continued success and ease of use.

2. Conclusions

To conclude, we have seen that passenger entries at Chadwell Heath are described fairly well by a time series model. There is clear seasonality on both the weekly and yearly scale, which gives us useful insight into the behaviour of commuters.

If I were to revisit this data after learning more about time series analysis, one aspect I would like to explore further is the residual error component. There appears to be more information stored there, which might be extracted by comparing it with other time series, such as a detailed rainfall dataset for the Chadwell Heath area.