This week’s data dive is on Time Series Modelling, a statistical approach that uses time as an explanatory variable in a model. It is a crucial aspect of analyzing and forecasting time-dependent data: we study the patterns and characteristics of a sequence of observations collected over time to understand the underlying data-generating process and make accurate predictions about future values.
As discussed earlier in the week, the key components of time series modeling are trend, seasonality, cyclical patterns, and random noise (irregular fluctuations).
Time series modeling involves several steps, including data exploration and visualization, stationarity testing, model identification, parameter estimation, diagnostic checking, and forecasting. The choice of model depends on the characteristics of the data at hand, the presence of trend and seasonality, and the desired level of accuracy and interpretability.
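To make those components concrete before we touch the bike data, here is a quick illustrative sketch that decomposes R’s built-in AirPassengers series (a standard example dataset, not ours) into its trend, seasonal, and remainder parts:
# Illustration only: decompose a built-in monthly series into its components
example_decomp <- stl(AirPassengers, s.window = "periodic")
plot(example_decomp)  # panels from top to bottom: data, seasonal, trend, remainder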
Data Loading:
We will proceed to load our dataset here. However, since the regular bike sales dataset does not have a time-based column, a clean and reliable bike-related dataset was pulled from https://pageviews.wmcloud.org/. This dataset contains the total pageviews of the Wikipedia article on Bicycle from January 2017 to February 2024 (the range of data available as at the time of this analysis).
In a bid to find a correlation, I will tie the results and insights from this analysis back into what we’ve been seeing with the bike sales dataset over the past 11 weeks.
Now, let us load the pageviews dataset along with the necessary libraries, and take a look at its structure. The Date column is selected, and the ymd function is used to convert it to a Date format:
suppressMessages({
  suppressWarnings({
    # Load the libraries used throughout this analysis
    library(lubridate)
    library(tidyverse)
    library(ggthemes)
    library(ggrepel)
    library(tsibble)
    library(ggplot2)
    library(forecast)
    library(fpp3)
    # Read in the Wikipedia pageviews dataset
    bicycle_data <- read_csv("bike_pageviews.csv")
  })
})

# Convert the Date column from character to Date using lubridate's ymd()
bicycle_data$Date <- ymd(bicycle_data$Date)
In this step, the dataset is also read in with base R’s read.csv(), and the Date column is converted to the Date class using the as.Date() function, which encodes the time information. The format argument specifies the format of the date string, “%Y-%m-%d” (YYYY-MM-DD).
This conversion is necessary before we start our analysis because Date is a special class, and many time series analysis functions require the data in this format to allow proper date manipulation and analysis.
# Base-R route: read the CSV again and convert the Date column explicitly
data <- read.csv("bike_pageviews.csv")
data$Date <- as.Date(data$Date, format = "%Y-%m-%d")
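As a quick illustrative check (the date below is just an example value within our range), lubridate’s ymd() and base R’s as.Date() produce the same Date values for “YYYY-MM-DD” strings:
# Both conversions yield the same Date value
ymd("2017-01-15") == as.Date("2017-01-15", format = "%Y-%m-%d")  # TRUE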
Here, the only column available as a response variable is the Bicycle column; it can be analyzed over time because it holds the bicycle-related pageviews, a quantity that is not present in our usual bike sales dataset.
The line below creates a new variable named response_var and assigns it the values of the Bicycle column from the data object. This extracts the values from that column and stores them in response_var for further analysis or manipulation.
response_var <- data$Bicycle
It is important to first organize the dataset into a tsibble object, a specialized data structure for time series data; this ensures the bicycle pageviews dataset has a time index associated with it.
Once the data is structured as a tsibble, the next step is to visualize it over time with a plot. This quick glimpse can help identify patterns, trends, and anomalies over time.
# A tsibble needs a valid, non-missing time index
if (any(is.na(data$Date))) {
  stop("Column 'Date' must not contain NA.")
}

# Build the tsibble with Date as the index, then plot the series
bike_tsibble <- tsibble(data, index = Date)
plot(bike_tsibble, main = "Bicycle Pageviews over Time")
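As a side note, since the fpp3 packages are already loaded, the same series could also be drawn with autoplot() for a ggplot-styled version; this is just an optional alternative sketch, not a replacement for the plot above:
# Optional ggplot-based time plot of the same tsibble
autoplot(bike_tsibble, Bicycle) +
  labs(title = "Bicycle Pageviews over Time", x = "Date", y = "Daily pageviews")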
Several insights and observations can be gathered from the plot above; to make them concrete, I start by fitting a simple linear regression to quantify the overall trend:
trend_model <- lm(Bicycle ~ Date, data = bike_tsibble)
summary(trend_model)
##
## Call:
## lm(formula = Bicycle ~ Date, data = bike_tsibble)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -2021.6  -998.3  -494.8   451.8 27864.5
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11362.5514   949.5838  11.966   <2e-16 ***
## Date           -0.4637     0.0514  -9.023   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1953 on 2586 degrees of freedom
## Multiple R-squared: 0.03052, Adjusted R-squared: 0.03015
## F-statistic: 81.41 on 1 and 2586 DF, p-value: < 2.2e-16
First things first: the linear regression fitted above to detect the trend is a simple yet insightful model, and it already tells us a lot about the bicycle pageviews data. To fully understand what the output means, let me break down the numbers and statistics one after the other.
The Residuals section gives us an idea of how well the model fits the data. The minimum and maximum residuals are quite far apart, indicating that there are some data points where the model’s predictions are way off the actual values. However, the median residual is relatively close to 0, suggesting that for most data points the model’s predictions are reasonably accurate.
The estimated values of the intercept and slope for the regression line can be seen in the Coefficients section. The intercept value of 11362.5514 tells us that when the Date is 0 (which, since R counts dates as days since 1970-01-01, lies far outside our data and isn’t meaningful on its own), the predicted bicycle pageviews would be around 11,362.
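A quick check of that date encoding, just to make the intercept’s reference point explicit:
# R stores Date values as days since the origin 1970-01-01
as.numeric(as.Date("1970-01-01"))  # 0  (the "Date = 0" in the regression)
as.numeric(as.Date("2017-01-01"))  # 17167 days after the origin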
More importantly, the slope coefficient of -0.4637 for Date is statistically significant, as indicated by the ‘***’ stars, meaning that as time progresses, the model predicts a slight decrease in bicycle pageviews. Now, this might seem counterintuitive at first, but I’ll come back to that in a bit.
The Residual standard error of
1953 gives us an idea of how much the
actual data points tend to deviate from the model’s predictions, on
average. A lower value would indicate a better fit, but for this type of
data, I must say that a value of 1953
isn’t too bad.
The Multiple R-squared value of
0.03052 tells us that about
3% of the variation in bicycle pageviews
can be explained by the Date variable
alone. This is a pretty low value, suggesting that there are likely
other important factors influencing the pageviews that aren’t accounted
for in this simple model.
Now, let me circle back to that negative slope for
Date. While it might seem counterintuitive
at first, it’s important to remember that this is a linear model, and
it’s trying to fit a straight line to the entire dataset. What we’re
likely seeing here is the model’s attempt to capture the overall
decreasing trend in pageviews over the years, while ignoring the
seasonal fluctuations.
However, we know that in reality the time series plot shows the bicycle pageviews exhibiting a clear seasonal pattern, with peaks and troughs occurring at different times of the year. I fitted this simple linear model first to give a general understanding and a rough approximation of the overall trend; it does not capture the seasonal variations.
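For illustration only, a regression that includes seasonal terms alongside the trend could be sketched with fable’s TSLM(); the season(period = "week") term below is my own assumption (weekly dummies for the daily index), not something fitted earlier in this analysis:
# Sketch: time series regression with a trend and weekly seasonal dummies
tslm_model <- bike_tsibble %>%
  model(tslm = TSLM(Bicycle ~ trend() + season(period = "week")))
report(tslm_model)  # compare its R-squared with the ~3% from the plain lm() above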
So, while this model only provides some basic insights, what to do next? Well, let me proceed to explore more sophisticated modeling techniques, starting with Exponential Smoothing (ETS) models and following with others, to better capture the nuances and complexities in this bicycle pageviews data.
ETS (Exponential Smoothing) Models:
These are univariate time series models that use exponential smoothing techniques to forecast future values. They can handle different types of trend and seasonal patterns.
# Fit an automatically selected ETS model and forecast the next 12 periods (days)
ets_model <- bike_tsibble %>%
  model(ets = ETS(Bicycle))
ets_forecast <- ets_model %>%
  forecast(h = 12)
autoplot(ets_forecast)
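To see exactly which error, trend, and seasonal components the automatic ETS() selection picked (they are not specified by hand above), the fitted model can be inspected; an optional quick check:
# Print the chosen ETS specification and its smoothing parameters
report(ets_model)
# Optionally, plot the estimated level/trend/seasonal components
components(ets_model) %>% autoplot()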
Insights from the ETS model:
The model visualization reveals a distinct cyclical pattern in Wikipedia page views related to bicycles, indicating the presence of seasonality influenced by factors such as cycling seasons, events, or user interests. Despite the seasonal fluctuations, an overall upward trend is observed, suggesting a gradual increase in traffic and engagement with bicycle-related content on the platform over time. The shaded areas represent confidence intervals, with the darker shade depicting a narrower range of uncertainty and the lighter shade depicting a wider range.
This ETS model highlights the page views influenced by seasonal factors, while exhibiting an upward trend over time. The ETS modeling approach forecasts future values, taking into account the observed trends and seasonality. This model alongside the analysis can help understand the seasonal patterns, trends, and variability in page views related to bicycles.
Going deeper into the analysis of the data to really pinpoint the trend and seasonality, I will proceed to engage some more tools to help us understand the behavior behind the pageviews over time.
The overall aim of the process is to decompose the time series to
visually examine long-term trends and recurring seasonal patterns in the
bicycle usage data, and also to identify potential model structure by
using the Autocorrelation Function (ACF)
and
Partial Autocorrelation Function (PACF) to
further illustrate and confirm the presence of seasonality in the data.
These functions plot the correlation between the time series and its
lagged values, which can help identify periodic patterns:
# Convert to a base ts object with weekly frequency (7 observations per cycle)
bike_ts <- ts(bike_tsibble$Bicycle, frequency = 7)

# STL decomposition into seasonal, trend and remainder components
decomposed_bike_ts <- stl(bike_ts, s.window = "periodic")
plot(decomposed_bike_ts)

# Autocorrelation and partial autocorrelation of the series
acf(bike_ts)
pacf(bike_ts)
The first plot, the Seasonal and Trend decomposition using Loess (STL), displays the components of the time series. The top panel shows the observed data, which appears to have a seasonal pattern and an increasing trend over time. The second panel depicts the seasonal component, revealing a repeating weekly pattern. The third panel displays the trend component, indicating an overall upward trend in the data, and the bottom panel shows the remainder.
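To put rough numbers on how strong these components are, feasts also provides STL-based strength features; this is an optional sketch on the tsibble built earlier, and the exact values will depend on the data:
# STL-based strength measures: values near 1 indicate a strong trend / strong seasonality
bike_tsibble %>%
  features(Bicycle, feat_stl)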
The second plot is the Autocorrelation Function (ACF) of the time series, which measures the correlation between the series and its lagged values. The plot shows significant spikes at lags 1 and 7, indicating strong positive autocorrelation at these lags. This pattern suggests the presence of a weekly seasonal component in the data.
And lastly, the
Partial Autocorrelation Function (PACF)
measures the correlation between the time series and its lagged values,
after accounting for the effects of intermediate lags. The plot shows
significant spikes at lags 1 and 7, suggesting that the seasonal
component is best modeled using an order of 7 (corresponding to the
weekly seasonality) in an autoregressive model.
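As an optional companion view, feasts’ gg_tsdisplay() can show the series together with its ACF and PACF in a single figure; this repeats the diagnostics above in a different presentation:
# Combined time plot, ACF and PACF in one figure
bike_tsibble %>%
  gg_tsdisplay(Bicycle, plot_type = "partial")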
These critical plots provide insights into the seasonality, trend, and autocorrelation structure of the time series data. The decomposition plot reveals a weekly seasonal pattern and an increasing trend. The ACF and PACF plots confirm the presence of a strong weekly seasonal component and suggest the appropriate order for modeling the seasonality in an ARIMA model, which is done below:
ARIMA (Autoregressive Integrated Moving Average) Models:
ARIMA models are widely used for univariate time series forecasting as they capture the autocorrelation structure of the data and can handle non-stationarity through differencing.
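Since ARIMA() selects its differencing automatically, this step is optional, but the number of ordinary and seasonal differences the series appears to need can be checked up front with feasts’ unit-root features:
# How many first differences / seasonal differences does the series appear to need?
bike_tsibble %>% features(Bicycle, unitroot_ndiffs)
bike_tsibble %>% features(Bicycle, unitroot_nsdiffs)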
This conducts time series forecasting for the Bicycle variable by first fitting an ARIMA model to the historical bicycle data and then generating forecasts for future time points.
# Fit an automatically selected ARIMA model and forecast one week ahead
arima_model <- bike_tsibble %>%
  model(arima = ARIMA(Bicycle))
arima_forecast <- arima_model %>%
  forecast(h = 7)
autoplot(arima_forecast)
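To see which non-seasonal and seasonal orders the automatic ARIMA() search settled on (none are specified by hand above), the fitted model can again be inspected with report():
# Print the selected ARIMA orders and coefficient estimates
report(arima_model)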
Insights from the ARIMA model:
The plot shows an overall upward trend in page views, indicating a growing interest in bicycle-related content over the analyzed time period. However, the data exhibits a distinct cyclical pattern, suggesting the presence of seasonality influenced by factors such as weather conditions, cycling seasons, or bicycle-related sporting events.
The forecast pageviews remain relatively stable across the forecast period. The relatively narrow prediction intervals indicate low uncertainty in the model’s predictions, with the darker shade indicating a narrower range of uncertainty and the lighter shade depicting a wider range.
The presence of peaks and troughs in the plot highlights periods of higher and lower page view activity, respectively. These fluctuations could be attributed to various factors, including cycling events, holidays, weather conditions, or changes in user behavior, but overall, the model shows reasonable confidence in its forecasts.
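As a rough, in-sample way of comparing the ETS and ARIMA fits (not a substitute for a proper hold-out evaluation), fabletools’ accuracy() can be applied to both fitted models:
# In-sample accuracy measures (RMSE, MAE, etc.) for the two fitted models
bind_rows(
  accuracy(ets_model),
  accuracy(arima_model)
)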
After gathering all the insights from the models, the numbers, and the statistics, it is best to now tie the results from the pageviews data back to the bike sales dataset, to truly understand the background factors that influence bike sales globally.
One thing was very evident throughout, especially looking back at our previous analyses and data dives: the seasonal pattern observed in the bicycle pageviews data is a clear indication that interest in bicycles is influenced by seasonal factors, such as weather conditions, light leisure exercise, and outdoor activities.
We can deduce that during the warmer months of spring and summer, people are more likely to engage in outdoor activities, including cycling for recreation, exercise, or commuting. This increased interest is reflected in the higher pageviews during these periods.
Conversely, during the colder winter months, outdoor activities tend to decrease, leading to a lower demand for bicycle-related content and, consequently, lower pageviews. This seasonal pattern can be valuable for businesses, organizations, or even the bike manufacturers themselves, as it allows them to plan and adjust their strategies, and marketing campaigns to align with the seasonal fluctuations in demand.
For example, bike manufacturers could anticipate higher sales during the spring and summer months and adjust their inventory and promotional efforts accordingly. Similarly, cycling event organizers could schedule their events during peak seasons to maximize attendance and engagement.
Understanding the seasonal pattern can also aid in resource allocation, staffing, and budgeting decisions, ensuring that resources are optimized to meet the varying demand throughout the year.
Overall, this analysis aims to explore the time aspect of the data,
identify trends and seasonality, and provide a comprehensive
understanding of the temporal patterns present in the
Bicycle variable.
As mentioned in the observation above, the seasonal plot clearly shows peaks occurring around April-May and troughs around December-January. This aligns with the initial observation from the time series plot and confirms the presence of a seasonal pattern driven by the warmer months and increased outdoor activities.
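The seasonal plot referred to here can be reproduced with feasts’ gg_season(), which overlays each year of pageviews on a common within-year axis; a sketch, assuming the tsibble built earlier:
# Seasonal plot: one line per year, making the April-May peaks and
# December-January troughs easier to see
bike_tsibble %>%
  gg_season(Bicycle, period = "year") +
  labs(title = "Seasonal plot of bicycle pageviews", y = "Daily pageviews")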
Looking back to week 6 of our data dive, where we discussed Confidence Intervals, we came to a vital conclusion based on the analysis: the confidence interval provides valuable insight into the behavior of the population from which the sample was drawn, offering a range within which the true proportion of interest likely falls. This information is crucial for making informed decisions, understanding the population’s tendencies, and guiding further research or policy decisions related to bike purchasing behaviors.
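In the same spirit, the prediction intervals behind the forecast plots above can be pulled out explicitly with hilo(); the 80% and 95% levels below match the default shaded bands and are used here just for illustration:
# Extract 80% and 95% prediction intervals from the ARIMA forecasts
arima_forecast %>%
  hilo(level = c(80, 95))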