Following the instructions given to me, I first tried installing Prophet via option (b), but ran into this error:
Error: Failed to install 'prophet' from GitHub: Could not find tools necessary to compile a package Call `pkgbuild::check_build_tools(debug = TRUE)` to diagnose the problem.
So I went with the normal installation outlined in (a), which worked without issues.
The data we will be analysing today comes from the large dataset that TfL publishes on its website. I have chosen Chadwell Heath station on the newly opened Elizabeth line, as it is one of my local stations and I use it daily to commute into London.
The station has a long history: it sits on the Great Eastern Main Line out of London Liverpool Street and has been served by many different trains over the years. Most recently, in 2022, it was taken over by the Elizabeth line, making it part of a large east-west network with trains running as far as Reading and Heathrow Airport.
The date column is stored in a numeric format, so R does not recognise it as an actual date. We can remedy this by converting the column into a date format. For this evaluation, we will be looking at station entries over time.
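library(dplyr)  # provides %>% and mutate()
X <- ChadwellHeath$TravelDate      # travel date stored as a number in yyyymmdd form
Y <- ChadwellHeath$EntryTapCount   # daily entry taps
Z <- ChadwellHeath$ExitTapCount    # daily exit taps (not used in this analysis)
entries.df <- data.frame(X, Y)
# Convert the numeric yyyymmdd values into proper Date objects
entries.df <- entries.df %>% mutate(X = as.Date(as.character(X), format = "%Y%m%d"))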
plot(entries.df, type = 'l'); summary(entries.df)
##        X                    Y
##  Min.   :2022-01-01   Min.   :   0
##  1st Qu.:2022-09-30   1st Qu.:4630
##  Median :2023-07-10   Median :6009
##  Mean   :2023-07-04   Mean   :5679
##  3rd Qu.:2024-04-04   3rd Qu.:7116
##  Max.   :2024-12-28   Max.   :8468
## [1] "data.frame"
While useful, the class() function shows that this is a data frame rather than a time series object. Let’s change that:
start_date <- min(entries.df$X)
entries.ts = ts(entries.df$Y, start = c(as.numeric(format(start_date, "%Y")),as.numeric(format(start_date, "%j"))), frequency = 365)
plot(entries.ts, main = 'Daily Entries at Chadwell Heath Station', xlab = 'Time', ylab = 'Entries')
## [1] "ts"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 4630 6009 5679 7116 8468
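The two conversion steps above (numeric date to Date, then data frame to ts) can be tried in isolation on a handful of values. The sketch below uses made-up counts and the hypothetical names raw.dates, counts and toy.ts; it is only meant to illustrate the mechanics, not to reproduce the station data.

```r
# Synthetic stand-in for the station data (the counts are made up)
raw.dates <- c(20220101, 20220102, 20220103, 20220104, 20220105)
counts <- c(1200, 1500, 5200, 5400, 5600)

# Numeric yyyymmdd -> character -> Date
dates <- as.Date(as.character(raw.dates), format = "%Y%m%d")

# Daily series: start = c(year, day of year), 365 observations per cycle
start_date <- min(dates)
toy.ts <- ts(counts,
             start = c(as.numeric(format(start_date, "%Y")),
                       as.numeric(format(start_date, "%j"))),
             frequency = 365)

class(dates)    # "Date"
class(toy.ts)   # "ts"
start(toy.ts)   # 2022 1
```

One caveat with frequency = 365 is that it ignores leap days. Since 2024 is a leap year, the position of each calendar day within the 365-step cycle shifts by one after 29 February 2024.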
This now looks a lot like a standard time series, and it does not exhibit any strongly heteroskedastic behaviour. While quite spiky, we can already make some preliminary observations using the function decompose, which splits a time series of the additive form
\[ X_t = m_t + S_t + Y_t \]
into its constituents so that we can analyse the trend \(m_t\), seasonal \(S_t\) and residual error \(Y_t\) components respectively.
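As a rough illustration of what decompose does with an additive series, the sketch below builds a synthetic daily series (a linear trend plus a yearly sine wave plus noise, all made up) and checks that the three returned components add back up to the original wherever the trend is defined. The names x.ts, x.dec and recon are hypothetical.

```r
set.seed(1)
n <- 3 * 365
day <- seq_len(n)

# Synthetic series: trend m_t + yearly seasonal S_t + noise Y_t
x <- 5000 + 2 * day + 800 * sin(2 * pi * day / 365) + rnorm(n, sd = 200)
x.ts <- ts(x, start = c(2022, 1), frequency = 365)

# Classical additive decomposition via moving averages
x.dec <- decompose(x.ts, type = "additive")
plot(x.dec)

# The components sum back to the data; the trend (and hence the
# residual) is NA for roughly half a cycle at each end of the series
recon <- x.dec$trend + x.dec$seasonal + x.dec$random
ok <- !is.na(recon)
max.diff <- max(abs(recon[ok] - x.ts[ok]))
```

Here max.diff should be zero up to floating point error, which is just the identity \(X_t = m_t + S_t + Y_t\) rearranged.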
Here are some initial observations we can get from this simple decomposition:
There is a definite positive trend in entries into the station over time. It plateaus a little midway through 2024, but that may just be an artefact of where the data is cut off. We will investigate this positive trend further with Prophet.
Seasonality-wise, there is a recurring pattern each year, with a dip during the summer. This is to be expected, as people are away on holiday and schools and universities are closed.
Our random component has a slight recurring pattern, which suggests that some structure remains that is not explained by the seasonal or trend components of our time series.
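One way to check for leftover structure in the random component is to look at its autocorrelation. The sketch below is an assumption-laden illustration on synthetic data, not the station data: it adds a made-up weekend drop to a daily series, decomposes it with frequency = 365, and inspects the autocorrelation of the residual at lag 7. Whether the pattern in the real residual is weekly is not established here.

```r
set.seed(42)
n <- 3 * 365
day <- seq_len(n)

# Synthetic series: trend + yearly cycle + weekend drop + noise
weekend <- ifelse((day %% 7) %in% c(0, 6), -2500, 0)
x <- 5000 + 2 * day + 800 * sin(2 * pi * day / 365) + weekend +
  rnorm(n, sd = 300)
x.ts <- ts(x, start = c(2022, 1), frequency = 365)

# A 365-step seasonal component cannot fully absorb a 7-day cycle
x.dec <- decompose(x.ts)
res <- x.dec$random[!is.na(x.dec$random)]

# Autocorrelation of the residual; element 1 is lag 0, so element 8 is lag 7
res.acf <- acf(res, lag.max = 21, plot = FALSE)
lag7 <- res.acf$acf[8]
```

In this toy setting lag7 should come out clearly positive, because part of the weekly cycle ends up in the residual. Running acf on the random component of the real decomposition would show whether something similar is happening there.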
Let’s now use Meta’s Prophet to explore this time series further.
The prophet command fits a model that gives us a more detailed analysis of our time series.
We get an error here because the columns were not labelled as Prophet expects:
Error in fit.prophet(m, df, ...) : Dataframe must have columns 'ds' and 'y' with the dates and values respectively.
Let’s rename our columns accordingly…
…and try again.
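The rename and refit can be sketched as follows. This is a minimal illustration on a synthetic data frame (toy.df and toy.meta are hypothetical names, and the values are made up); the model is only fitted if the prophet package is installed.

```r
set.seed(7)

# Toy stand-in for entries.df, with the same X/Y column layout
toy.df <- data.frame(
  X = seq(as.Date("2022-01-01"), by = "day", length.out = 120),
  Y = 5000 + 10 * seq_len(120) + rnorm(120, sd = 100)
)

# Prophet expects the date column to be called 'ds' and the values 'y'
names(toy.df) <- c("ds", "y")

# Fit only when prophet is available, so the sketch still runs without it
if (requireNamespace("prophet", quietly = TRUE)) {
  library(prophet)
  toy.meta <- prophet(toy.df)
}
```

With daily data, the fit prints a message about daily seasonality being disabled, as seen in the output that follows.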
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
This was successful! Let’s now look at forecasting, as we can get a lot more information from that.
We can forecast some future values and see what we find.
entries.forecast = make_future_dataframe(entries.meta, 365, freq = "day", include_history = TRUE)
entries.predictions <- predict(entries.meta, entries.forecast)
tail(entries.predictions)
## ds trend additive_terms additive_terms_lower
## 1419 2025-12-23 7244.065 -221.6965 -221.6965
## 1420 2025-12-24 7245.500 -268.3983 -268.3983
## 1421 2025-12-25 7246.935 -262.1186 -262.1186
## 1422 2025-12-26 7248.371 -641.2885 -641.2885
## 1423 2025-12-27 7249.806 -2656.0685 -2656.0685
## 1424 2025-12-28 7251.241 -4075.3390 -4075.3390
## additive_terms_upper weekly weekly_lower weekly_upper yearly
## 1419 -221.6965 882.6443 882.6443 882.6443 -1104.341
## 1420 -268.3983 930.5214 930.5214 930.5214 -1198.920
## 1421 -262.1186 1020.1956 1020.1956 1020.1956 -1282.314
## 1422 -641.2885 711.9677 711.9677 711.9677 -1353.256
## 1423 -2656.0685 -1245.3879 -1245.3879 -1245.3879 -1410.681
## 1424 -4075.3390 -2621.5914 -2621.5914 -2621.5914 -1453.748
## yearly_lower yearly_upper multiplicative_terms multiplicative_terms_lower
## 1419 -1104.341 -1104.341 0 0
## 1420 -1198.920 -1198.920 0 0
## 1421 -1282.314 -1282.314 0 0
## 1422 -1353.256 -1353.256 0 0
## 1423 -1410.681 -1410.681 0 0
## 1424 -1453.748 -1453.748 0 0
## multiplicative_terms_upper yhat_lower yhat_upper trend_lower trend_upper
## 1419 0 5971.500 8093.030 7179.817 7311.860
## 1420 0 5991.412 7942.771 7180.951 7313.642
## 1421 0 5938.864 8027.098 7182.127 7315.422
## 1422 0 5615.196 7592.708 7183.333 7317.161
## 1423 0 3568.505 5669.913 7184.665 7318.900
## 1424 0 2102.161 4200.937 7185.858 7320.537
## yhat
## 1419 7022.369
## 1420 6977.102
## 1421 6984.817
## 1422 6607.082
## 1423 4593.737
## 1424 3175.902
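The columns in this output are related in a simple way for an additive model: additive_terms is the sum of the weekly and yearly components, and yhat is the trend plus additive_terms (the multiplicative terms are all zero here). The sketch below copies the six printed rows into a small data frame, with the hypothetical name pred.tail, and checks both relationships up to print rounding.

```r
# Values copied from the tail(entries.predictions) output above
pred.tail <- data.frame(
  ds = as.Date(c("2025-12-23", "2025-12-24", "2025-12-25",
                 "2025-12-26", "2025-12-27", "2025-12-28")),
  trend = c(7244.065, 7245.500, 7246.935, 7248.371, 7249.806, 7251.241),
  weekly = c(882.6443, 930.5214, 1020.1956, 711.9677, -1245.3879, -2621.5914),
  yearly = c(-1104.341, -1198.920, -1282.314, -1353.256, -1410.681, -1453.748),
  additive_terms = c(-221.6965, -268.3983, -262.1186,
                     -641.2885, -2656.0685, -4075.3390),
  yhat = c(7022.369, 6977.102, 6984.817, 6607.082, 4593.737, 3175.902)
)

# additive_terms = weekly + yearly (differences are print rounding only)
err.additive <- max(abs(pred.tail$weekly + pred.tail$yearly -
                          pred.tail$additive_terms))

# yhat = trend + additive_terms (differences are print rounding only)
err.yhat <- max(abs(pred.tail$trend + pred.tail$additive_terms -
                      pred.tail$yhat))
```

Both errors should be well below 0.01. Note that the two rows with large negative weekly terms, 27 and 28 December 2025, fall on a Saturday and a Sunday, which is where the biggest drops in yhat come from.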
Key Observations:
The overall trend, shown by the dark blue line, is increasing. This suggests growing commuter activity, which would make sense if public transport use rises year on year as the population grows.
The sharp dips in the predictions likely correspond to seasonal effects such as weekends and holidays.
The uncertainty intervals widen toward the future, reflecting increased uncertainty in longer-term predictions.
seasonal.entries.meta = prophet(entries.df, daily.seasonality = TRUE)
prophet_plot_components(seasonal.entries.meta, entries.predictions)
We can clearly see a weekly pattern: passenger entries build up from the start of the week, reach an average peak on Thursday and then fall back on Friday. There could be a number of reasons for this, but I would surmise that people tend to work from home on Mondays and/or Fridays to enjoy a longer weekend.
The yearly component shows how average entries vary month by month through the year. There is a sharp increase in entries from January as people get back to work, followed by fluctuation through the rest of Q1. In Q2 we see a steady increase with no major dips, before the expected valley in Q3 as schools finish and people go on holiday. Entries increase again in September, and traffic in Q4 is much higher than in the rest of the year until the expected December decline.
On the larger scale, the trend is increasing over the years. This makes sense if more and more people are getting to know the new line, given its continued success and ease of use.
To conclude, we have seen that passenger entries at Chadwell Heath are described fairly well by a time series model with trend and seasonal components. There is clear seasonality on both the weekly and yearly scales, which gives useful insight into the behaviour of commuters.
If I were to revisit this data after learning more about time series analysis, one aspect I would like to explore further is the residual error component. There appears to be more information stored there, some of which might be extracted by comparing it with other time series, such as a detailed rainfall dataset for the Chadwell Heath area.