This week’s data dive is on Time Series Modelling, a statistical approach that uses time as an explanatory variable in a model. It is a crucial aspect of analyzing and forecasting time-dependent data: we study the patterns and characteristics of a sequence of observations collected over time to understand the underlying data-generating process and make accurate predictions about future values.
As discussed earlier in the week, the key components of time series modeling are trend, seasonality, cyclical patterns, and random noise (irregular fluctuations).
Time series modeling involves several steps, including data exploration and visualization, stationarity testing, model identification, parameter estimation, diagnostic checking, and forecasting. The choice of model depends on the characteristics of the data at hand, the presence of trend and seasonality, and the desired level of accuracy and interpretability.
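To make those components concrete before we touch the bike data, here is a quick illustrative sketch that decomposes R’s built-in AirPassengers series (a standard example dataset, not ours) into its trend, seasonal, and remainder parts:
# Illustration only: decompose a built-in monthly series into its components
example_decomp <- stl(AirPassengers, s.window = "periodic")
plot(example_decomp)  # panels from top to bottom: data, seasonal, trend, remainder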
Data Loading:
We will proceed to load our dataset here. However, since the regular bike sales dataset does not have a time-based column, a clean and reliable bike-related dataset was pulled from https://pageviews.wmcloud.org/. This dataset contains the total pageviews of the Wikipedia article on Bicycle from January 2017 to February 2024 (the range of data available as at the time of this analysis).
In a bid to find a correlation, I will tie the results and insights from this analysis back into what we’ve been seeing with the bike sales dataset over the past 11 weeks.
Now, let us load the pageviews dataset along with the necessary libraries, and take a look at its structure. The Date column is selected, and the ymd function is used to convert it to a Date format:
suppressMessages({
  suppressWarnings({
    # Load the libraries used throughout this analysis
    library(lubridate)
    library(tidyverse)
    library(ggthemes)
    library(ggrepel)
    library(tsibble)
    library(ggplot2)
    library(forecast)
    library(fpp3)
    # Read in the Wikipedia pageviews dataset
    bicycle_data <- read_csv("bike_pageviews.csv")
  })
})

# Convert the Date column from character to Date using lubridate's ymd()
bicycle_data$Date <- ymd(bicycle_data$Date)
In this step, the dataset is also read in with base R’s read.csv(), and the Date column is converted to the Date class using the as.Date() function, which encodes the time information. The format argument specifies the format of the date string, “%Y-%m-%d” (YYYY-MM-DD).
This conversion is necessary before we start our analysis because Date is a special class, and many time series analysis functions require the data in this format to allow proper date manipulation and analysis.
# Base-R route: read the CSV again and convert the Date column explicitly
data <- read.csv("bike_pageviews.csv")
data$Date <- as.Date(data$Date, format = "%Y-%m-%d")
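As a quick illustrative check (the date below is just an example value within our range), lubridate’s ymd() and base R’s as.Date() produce the same Date values for “YYYY-MM-DD” strings:
# Both conversions yield the same Date value
ymd("2017-01-15") == as.Date("2017-01-15", format = "%Y-%m-%d")  # TRUE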
Here, the only column available as a response variable is the Bicycle column; it can be analyzed over time because it holds the bicycle-related pageviews, a quantity that is not present in our usual bike sales dataset.
The line below creates a new variable named response_var and assigns it the values of the Bicycle column from the data object. This extracts the values from that column and stores them in response_var for further analysis or manipulation.
response_var <- data$Bicycle
It is important to first organize the dataset into a tsibble object, a specialized data structure for time series data; this ensures the bicycle pageviews dataset has a time index associated with it.
Once the data is structured as a tsibble, the next step is to visualize it over time with a plot. This quick glimpse can help identify patterns, trends, and anomalies over time.
# A tsibble needs a valid, non-missing time index
if (any(is.na(data$Date))) {
  stop("Column 'Date' must not contain NA.")
}

# Build the tsibble with Date as the index, then plot the series
bike_tsibble <- tsibble(data, index = Date)
plot(bike_tsibble, main = "Bicycle Pageviews over Time")
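As a side note, since the fpp3 packages are already loaded, the same series could also be drawn with autoplot() for a ggplot-styled version; this is just an optional alternative sketch, not a replacement for the plot above:
# Optional ggplot-based time plot of the same tsibble
autoplot(bike_tsibble, Bicycle) +
  labs(title = "Bicycle Pageviews over Time", x = "Date", y = "Daily pageviews")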
Several insights and observations can be gathered from the plot above; to make them concrete, I start by fitting a simple linear regression to quantify the overall trend:
trend_model <- lm(Bicycle ~ Date, data = bike_tsibble)
summary(trend_model)
##
## Call:
## lm(formula = Bicycle ~ Date, data = bike_tsibble)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -2021.6  -998.3  -494.8   451.8 27864.5
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11362.5514   949.5838  11.966   <2e-16 ***
## Date           -0.4637     0.0514  -9.023   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1953 on 2586 degrees of freedom
## Multiple R-squared: 0.03052, Adjusted R-squared: 0.03015
## F-statistic: 81.41 on 1 and 2586 DF, p-value: < 2.2e-16
First things first: the linear regression fitted above to detect the trend is a simple yet insightful model, and it already tells us a lot about the bicycle pageviews data. To fully understand what the output means, let me break down the numbers and statistics one after the other.
The Residuals section gives us an idea of how well the model fits the data. The minimum and maximum residuals are quite far apart, indicating that there are some data points where the model’s predictions are way off the actual values. However, the median residual is relatively close to 0, suggesting that for most data points the model’s predictions are reasonably accurate.
The estimated values of the intercept and slope for the regression line can be seen in the Coefficients section. The intercept value of 11362.5514 tells us that when the Date is 0 (which, since R counts dates as days since 1970-01-01, lies far outside our data and isn’t meaningful on its own), the predicted bicycle pageviews would be around 11,362.
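A quick check of that date encoding, just to make the intercept’s reference point explicit:
# R stores Date values as days since the origin 1970-01-01
as.numeric(as.Date("1970-01-01"))  # 0  (the "Date = 0" in the regression)
as.numeric(as.Date("2017-01-01"))  # 17167 days after the origin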
More importantly, the slope coefficient of -0.4637 for Date is statistically significant, as indicated by the ‘***’ stars, meaning that as time progresses, the model predicts a slight decrease in bicycle pageviews. Now, this might seem counterintuitive at first, but I’ll come back to that in a bit.
The Residual standard error of
1953 gives us an idea of how much the
actual data points tend to deviate from the model’s predictions, on
average. A lower value would indicate a better fit, but for this type of
data, I must say that a value of 1953
isn’t too bad.
The Multiple R-squared value of
0.03052 tells us that about
3% of the variation in bicycle pageviews
can be explained by the Date variable
alone. This is a pretty low value, suggesting that there are likely
other important factors influencing the pageviews that aren’t accounted
for in this simple model.
Now, let me circle back to that negative slope for
Date. While it might seem counterintuitive
at first, it’s important to remember that this is a linear model, and
it’s trying to fit a straight line to the entire dataset. What we’re
likely seeing here is the model’s attempt to capture the overall
decreasing trend in pageviews over the years, while ignoring the
seasonal fluctuations.
However, we know that in reality the time series plot shows the bicycle pageviews exhibiting a clear seasonal pattern, with peaks and troughs occurring at different times of the year. I fitted this simple linear model first to give a general understanding and a rough approximation of the overall trend; it does not capture the seasonal variations.
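For illustration only, a regression that includes seasonal terms alongside the trend could be sketched with fable’s TSLM(); the season(period = "week") term below is my own assumption (weekly dummies for the daily index), not something fitted earlier in this analysis:
# Sketch: time series regression with a trend and weekly seasonal dummies
tslm_model <- bike_tsibble %>%
  model(tslm = TSLM(Bicycle ~ trend() + season(period = "week")))
report(tslm_model)  # compare its R-squared with the ~3% from the plain lm() above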
So, while this model only provides some basic insights, what to do next? Well, let me proceed to explore more sophisticated modeling techniques, starting with Exponential Smoothing (ETS) models and following with others, to better capture the nuances and complexities in this bicycle pageviews data.
ETS (Exponential Smoothing) Models:
These are univariate time series models that use exponential smoothing techniques to forecast future values. They can handle different types of trend and seasonal patterns.
# Fit an automatically selected ETS model and forecast the next 12 periods (days)
ets_model <- bike_tsibble %>%
  model(ets = ETS(Bicycle))
ets_forecast <- ets_model %>%
  forecast(h = 12)
autoplot(ets_forecast)
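To see exactly which error, trend, and seasonal components the automatic ETS() selection picked (they are not specified by hand above), the fitted model can be inspected; an optional quick check:
# Print the chosen ETS specification and its smoothing parameters
report(ets_model)
# Optionally, plot the estimated level/trend/seasonal components
components(ets_model) %>% autoplot()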
Insights from the ETS model:
The model visualization reveals a distinct cyclical pattern in Wikipedia page views related to bicycles, indicating the presence of seasonality influenced by factors such as cycling seasons, events, or user interests. Despite the seasonal fluctuations, an overall upward trend is observed, suggesting a gradual increase in traffic and engagement with bicycle-related content on the platform over time. The shaded areas represent confidence intervals, with the darker shade depicting a narrower range of uncertainty and the lighter shade depicting a wider range.
This ETS model highlights the page views influenced by seasonal factors, while exhibiting an upward trend over time. The ETS modeling approach forecasts future values, taking into account the observed trends and seasonality. This model alongside the analysis can help understand the seasonal patterns, trends, and variability in page views related to bicycles.
Going deeper into the analysis of the data to really pinpoint the trend and seasonality, I will proceed to engage some more tools to help us understand the behavior behind the pageviews over time.
The overall aim of the process is to decompose the time series to
visually examine long-term trends and recurring seasonal patterns in the
bicycle usage data, and also to identify potential model structure by
using the Autocorrelation Function (ACF)
and
Partial Autocorrelation Function (PACF) to
further illustrate and confirm the presence of seasonality in the data.
These functions plot the correlation between the time series and its
lagged values, which can help identify periodic patterns:
# Convert to a base ts object with weekly frequency (7 observations per cycle)
bike_ts <- ts(bike_tsibble$Bicycle, frequency = 7)

# STL decomposition into seasonal, trend and remainder components
decomposed_bike_ts <- stl(bike_ts, s.window = "periodic")
plot(decomposed_bike_ts)

# Autocorrelation and partial autocorrelation of the series
acf(bike_ts)
pacf(bike_ts)
The first plot, the Seasonal and Trend decomposition using Loess (STL), displays the components of the time series. The top panel shows the observed data, which appears to have a seasonal pattern and an increasing trend over time. The second panel depicts the seasonal component, revealing a repeating weekly pattern. The third panel displays the trend component, indicating an overall upward trend in the data, and the bottom panel shows the remainder.
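To put rough numbers on how strong these components are, feasts also provides STL-based strength features; this is an optional sketch on the tsibble built earlier, and the exact values will depend on the data:
# STL-based strength measures: values near 1 indicate a strong trend / strong seasonality
bike_tsibble %>%
  features(Bicycle, feat_stl)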
The second plot is the Autocorrelation Function (ACF) of the time series, which measures the correlation between the series and its lagged values. The plot shows significant spikes at lags 1 and 7, indicating strong positive autocorrelation at these lags. This pattern suggests the presence of a weekly seasonal component in the data.
And lastly, the
Partial Autocorrelation Function (PACF)
measures the correlation between the time series and its lagged values,
after accounting for the effects of intermediate lags. The plot shows
significant spikes at lags 1 and 7, suggesting that the seasonal
component is best modeled using an order of 7 (corresponding to the
weekly seasonality) in an autoregressive model.
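As an optional companion view, feasts’ gg_tsdisplay() can show the series together with its ACF and PACF in a single figure; this repeats the diagnostics above in a different presentation:
# Combined time plot, ACF and PACF in one figure
bike_tsibble %>%
  gg_tsdisplay(Bicycle, plot_type = "partial")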
These critical plots provide insights into the seasonality, trend, and autocorrelation structure of the time series data. The decomposition plot reveals a weekly seasonal pattern and an increasing trend. The ACF and PACF plots confirm the presence of a strong weekly seasonal component and suggest the appropriate order for modeling the seasonality in an ARIMA model, which is done below:
ARIMA (Autoregressive Integrated Moving Average) Models:
ARIMA models are widely used for univariate time series forecasting as they capture the autocorrelation structure of the data and can handle non-stationarity through differencing.
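Since ARIMA() selects its differencing automatically, this step is optional, but the number of ordinary and seasonal differences the series appears to need can be checked up front with feasts’ unit-root features:
# How many first differences / seasonal differences does the series appear to need?
bike_tsibble %>% features(Bicycle, unitroot_ndiffs)
bike_tsibble %>% features(Bicycle, unitroot_nsdiffs)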
This conducts time series forecasting for the Bicycle variable by first fitting an ARIMA model to the historical bicycle data and then generating forecasts for future time points.
# Fit an automatically selected ARIMA model and forecast one week ahead
arima_model <- bike_tsibble %>%
  model(arima = ARIMA(Bicycle))
arima_forecast <- arima_model %>%
  forecast(h = 7)
autoplot(arima_forecast)
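To see which non-seasonal and seasonal orders the automatic ARIMA() search settled on (none are specified by hand above), the fitted model can again be inspected with report():
# Print the selected ARIMA orders and coefficient estimates
report(arima_model)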
Insights from the ARIMA model:
The plot shows an overall upward trend in page views, indicating a growing interest in bicycle-related content over the analyzed time period. However, the data exhibits a distinct cyclical pattern, suggesting the presence of seasonality influenced by factors such as weather conditions, cycling seasons, or bicycle-related sporting events.
The forecast pageviews remain relatively stable across the forecast period. The relatively narrow prediction intervals indicate low uncertainty in the model’s predictions, with the darker shade indicating a narrower range of uncertainty and the lighter shade depicting a wider range.
The presence of peaks and troughs in the plot highlights periods of higher and lower page view activity, respectively. These fluctuations could be attributed to various factors, including cycling events, holidays, weather conditions, or changes in user behavior, but overall, the model shows reasonable confidence in its forecasts.
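As a rough, in-sample way of comparing the ETS and ARIMA fits (not a substitute for a proper hold-out evaluation), fabletools’ accuracy() can be applied to both fitted models:
# In-sample accuracy measures (RMSE, MAE, etc.) for the two fitted models
bind_rows(
  accuracy(ets_model),
  accuracy(arima_model)
)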
After gathering all the insights from the models, the numbers, and the statistics, it is best to now tie the results from the pageviews data back to the bike sales dataset, to truly understand the background factors that influence bike sales globally.
One thing was very evident throughout, especially looking back at our previous analyses and data dives: the seasonal pattern observed in the bicycle pageviews data is a clear indication that interest in bicycles is influenced by seasonal factors, such as weather conditions, light leisure exercise, and outdoor activities.
We can deduce that during the warmer months of spring and summer, people are more likely to engage in outdoor activities, including cycling for recreation, exercise, or commuting. This increased interest is reflected in the higher pageviews during these periods.
Conversely, during the colder winter months, outdoor activities tend to decrease, leading to a lower demand for bicycle-related content and, consequently, lower pageviews. This seasonal pattern can be valuable for businesses, organizations, or even the bike manufacturers themselves, as it allows them to plan and adjust their strategies, and marketing campaigns to align with the seasonal fluctuations in demand.
For example, bike manufacturers could anticipate higher sales during the spring and summer months and adjust their inventory and promotional efforts accordingly. Similarly, cycling event organizers could schedule their events during peak seasons to maximize attendance and engagement.
Understanding the seasonal pattern can also aid in resource allocation, staffing, and budgeting decisions, ensuring that resources are optimized to meet the varying demand throughout the year.
Overall, this analysis aims to explore the time aspect of the data,
identify trends and seasonality, and provide a comprehensive
understanding of the temporal patterns present in the
Bicycle variable.
As mentioned in the observation above, the seasonal plot clearly shows peaks occurring around April-May and troughs around December-January. This aligns with the initial observation from the time series plot and confirms the presence of a seasonal pattern driven by the warmer months and increased outdoor activities.
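The seasonal plot referred to here can be reproduced with feasts’ gg_season(), which overlays each year of pageviews on a common within-year axis; a sketch, assuming the tsibble built earlier:
# Seasonal plot: one line per year, making the April-May peaks and
# December-January troughs easier to see
bike_tsibble %>%
  gg_season(Bicycle, period = "year") +
  labs(title = "Seasonal plot of bicycle pageviews", y = "Daily pageviews")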
Looking back to week 6 of our data dive, where we discussed Confidence Intervals, we came to a vital conclusion based on the analysis: the confidence interval provides valuable insight into the behavior of the population from which the sample was drawn, offering a range within which the true proportion of interest likely falls. This information is crucial for making informed decisions, understanding the population’s tendencies, and guiding further research or policy decisions related to bike purchasing behaviors.
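In the same spirit, the prediction intervals behind the forecast plots above can be pulled out explicitly with hilo(); the 80% and 95% levels below match the default shaded bands and are used here just for illustration:
# Extract 80% and 95% prediction intervals from the ARIMA forecasts
arima_forecast %>%
  hilo(level = c(80, 95))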