knitr::opts_chunk$set(
message = FALSE,
warning = FALSE,
fig.align = "center",
comment = "#>" )

Introduction

In this project, we will perform forecasting using Wikipedia data from January 2020 to March 2024. This Wikipedia data provides information on the number of viewers accessing Wikipedia articles related to “Mitsubishi Pajero Sport”. The Wikipedia data is obtained using the article_pageviews() function from the pageviews library.

Data Preparation

Import Library

library(pageviews)
library(dplyr)
library(forecast)
library(plotly)
library(TTR)
library(MLmetrics)
library(tseries)

Load Dataset

wiki_ds <- article_pageviews(project = "en.wikipedia", 
                  article = "Mitsubishi Pajero Sport", 
                  start =as.Date('2020-01-01'), end = as.Date("2024-03-31"))
str(wiki_ds)

#> 'data.frame':    1552 obs. of  8 variables:
#>  $ project    : chr  "wikipedia" "wikipedia" "wikipedia" "wikipedia" ...
#>  $ language   : chr  "en" "en" "en" "en" ...
#>  $ article    : chr  "Mitsubishi_Pajero_Sport" "Mitsubishi_Pajero_Sport" "Mitsubishi_Pajero_Sport" "Mitsubishi_Pajero_Sport" ...
#>  $ access     : chr  "all-access" "all-access" "all-access" "all-access" ...
#>  $ agent      : chr  "all-agents" "all-agents" "all-agents" "all-agents" ...
#>  $ granularity: chr  "daily" "daily" "daily" "daily" ...
#>  $ date       : POSIXct, format: "2020-01-01" "2020-01-02" ...
#>  $ views      : num  443 523 488 476 524 562 522 501 548 537 ...

From the data frame above, the data has 8 columns, 1552 rows and the data types for each column.

Data Preprocessing

Data Cleaning

wiki_ds2 <-  wiki_ds %>% select(date, views) #select the data used, namely the date and views
head(wiki_ds2)

#>         date views
#> 1 2020-01-01   443
#> 2 2020-01-02   523
#> 3 2020-01-03   488
#> 4 2020-01-04   476
#> 5 2020-01-05   524
#> 6 2020-01-06   562

The requirements for data to be treated in Time Series analysis include:

Should be no missing values
The data must be sorted by time
The data must not have any gaps in time.

Missing Value

colSums(is.na(wiki_ds2))

#>  date views 
#>     0     0

In the dataset above, has no missing value data in any columns.

Data Time Series

Data Type Time Series

To process and model time series data in R, we need to convert the data into a time series object using the ts() function.

Data = the observed values of y, which in this case are the views or number of viewers in the Wikipedia data.

Start = the starting time of the time series, which is 2020.

Frequency = the number of observations within one seasonal pattern. Since the seasonal pattern is annual, the frequency is 365.25.

wiki_ds_ts <- ts(wiki_ds2$views, start = c(2020,1), frequency = 365.25)
autoplot(wiki_ds_ts, series = "Actual")

Decompose time series data

The main idea of decomposition is to decompose the three components of a time series object (trend, seasonal, residual).

Trend -> represents the overall mean movement.

Seasonal -> refers to the pattern of data per frequency.

Error -> represents the values not accounted for by the Trend and Seasonal components.

To decompose a time series object into these three components, we can use the decompose() function.

wiki_ds_dec <- decompose(wiki_ds_ts)
autoplot(wiki_ds_dec)

Check seasonal data

autoplot(wiki_ds_dec$x - wiki_ds_dec$trend)

Observe whether the trend is increasing or decreasing by removing the seasonal effect, we can use the adjusted seasonal component, which involves removing the seasonal effect from the data. This allows us to focus solely on the trend component of the time series.

autoplot(wiki_ds_dec$x - wiki_ds_dec$seasonal)

From the plot results above, it can be said that the seasonal pattern of the Wikipedia data is random. We can assume that the Wikipedia data does not have a seasonal component. Meanwhile, for the trend pattern of the Wikipedia data, it also does not exhibit an increasing or decreasing trend.

Build Model

Cross Validation

The concept of cross-validation (CV) in time series is the proportion of training data to build the model will be larger than the test data. There are condition that must be met:

The training data will use the earliest data.
The test data will use the most recent data.

wiki_ds_train <- wiki_ds_ts %>% 
  head(-365.25)

wiki_ds_test <- wiki_ds_ts %>% 
  tail(365.25)

Simple Exponential Smoothing

The function used to create an exponential smoothing model is ‘ets’ (error, trend, seasonal), and the parameters within the function are as follows :

y = The time series to be modeled
model = The exponential smoothing model to be used. Some options include “ANN” (additive error, additive trend, additive seasonality), “AAN” (additive error, additive trend, non-additive seasonality), “MNN” (multiplicative error, multiplicative trend, non-additive seasonality), and others.

A = additive M = multiplicative N = None Z = auto

error = Additive trend = None seasonal = None

SES (Simple Exponential Smoothing) is suitable for data that do not have a trend and do not have seasonality because SES will smooth out its errors. The coefficient for smoothing the error is called alpha.

wiki_ds_SES <- ets(wiki_ds_train, model = "ANN")
autoplot(wiki_ds_SES$fitted, series = "ANN") +
  autolayer(wiki_ds_train, series = "Actual")

To compare errors using different models such as Simple Exponential Smoothing (SES), Holt’s Linear Trend Method, and Holt-Winters Method to obtain the smallest error:

Double Exponential Smoothing (Holt)

wiki_ds_holt <- ets(wiki_ds_train, model = "AAN")
autoplot(wiki_ds_holt$fitted, series = "AAN") +
  autolayer(wiki_ds_train, series = "Actual")

Triple Exponential Smoothing ( Holt Winter )

wiki_ds_hw <- HoltWinters(wiki_ds_train, beta = F, gamma = F)

Comparing the error

# Simple Exponential Smoothing
sqrt(wiki_ds_SES$mse)

#> [1] 49.27023

# Double Exponential Smoothing (Holt)
sqrt(wiki_ds_holt$mse)

#> [1] 49.28245

# ## Triple Exponential Smoothing ( Holt Winter )
sqrt(MSE(wiki_ds_hw$fitted[,1], wiki_ds_train))

#> [1] 49.32267

From the RMSE results above, it can be said that the SES model is the model with the smallest error compared to other models. Therefore, We will use this SES model for the next steps.

ARIMA

ARIMA models are widely used for time series forecasting tasks and can handle a wide range of time series data patterns, including trends, seasonality, and irregularities. They are flexible and powerful tools for capturing the underlying patterns and making predictions in time series data.

ARIMA model :

AR(p) = The value of p indicates how many previous data points of variable y are used by the AutoRegressive component.

I(d) = The value of d indicates how many times the data is differenced until stationarity is achieved.

MA(q) = The value of q indicates how many errors are smoothed by the Moving Average component.

Note : The requirement for data in an ARIMA model is that it should be stationary.

Check stationary data

Stationarity in time series means that the time series data fluctuates around its mean and has constant variance.

wiki_ds_train %>% 
  adf.test()

#> 
#>  Augmented Dickey-Fuller Test
#> 
#> data:  .
#> Dickey-Fuller = -4.1746, Lag order = 10, p-value = 0.01
#> alternative hypothesis: stationary

From the test results above, it states that the Wikipedia data is stationary with a p-value < 0.05, indicating that differencing is necessary for the data.

# differencing 
wiki_ds_train %>% diff() %>% kpss.test()

#> 
#>  KPSS Test for Level Stationarity
#> 
#> data:  .
#> KPSS Level = 0.0075122, Truncation lag parameter = 7, p-value = 0.1

After differencing once, the time series data is now stationary (p-value > 0.05) according to the kpss.test.

The number of differencing operations required to make the data stationary will become the value for d in the ARIMA(p,d,q) model.

d = 1 -> from the KPSS test result.

Cek PACF

wiki_ds_train %>% 
  pacf()

The PACF (Partial Autocorrelation Function) result indicates that the first lag has the highest spike, followed by the second lag. This suggests that we need to tune the AR component in the ARIMA model.

Tuning Model

# AR= 2, I=0, MA=0
wiki_ds_train_2 <- arima(wiki_ds_train, order = c(2,0,0))
wiki_ds_train_2$residuals %>% 
  pacf()

# AR= 2, I=1, MA=4
wiki_ds_train_214 <- arima(wiki_ds_train, order = c(2,1,4))
wiki_ds_train_214$residuals %>% 
  pacf()

From the tuning results above, we can use the ARIMA model: 2,1,4 which provides many lags that do not have autocorrelation.

Auto ARIMA

ARIMA models can also be created automatically using the auto.arima() function.

wiki_ds_auto <- auto.arima(wiki_ds_train, seasonal = F)
sqrt(MSE(wiki_ds_auto$fitted, wiki_ds_train))

#> [1] 47.7578

MAPE(wiki_ds_auto$fitted, wiki_ds_train)

#> [1] 0.07154913

Based on the Mean Absolute Percentage Error (MAPE) results, the Auto ARIMA model has a relatively small error of 7.1%.

Then, compare the AIC values of each model that has been created.

wiki_ds_train_214$aic

#> [1] 12546.36

wiki_ds_auto$aic

#> [1] 12543.59

The result shows that the Auto ARIMA model is the model with the smallest AIC value. We will use this Auto ARIMA model because a large AIC value discards a lot of information from the data.

Next, we will compare the RMSE values of the models that have been created, and we will choose the best one for predicting the data.

sqrt(wiki_ds_SES$mse)

#> [1] 49.27023

sqrt(wiki_ds_holt$mse)

#> [1] 49.28245

sqrt(MSE(wiki_ds_hw$fitted[,1], wiki_ds_train))

#> [1] 49.32267

sqrt(MSE(wiki_ds_auto$fitted, wiki_ds_train))

#> [1] 47.7578

Based on MSE result from all model above, we using model Auto Arima to continue in forecasting process.

Forecasting

Using Auto Arima Model

forecasting_wiki_ds_auto <- forecast(wiki_ds_auto, h = 365.25)
forecasting_wiki_ds_auto %>% 
  autoplot() +
  autolayer(wiki_ds_test)

Evaluation

Error Checking

RMSE check

sqrt(RMSE(forecasting_wiki_ds_auto$mean, wiki_ds_test))

#> [1] 11.86109

Based on the Root Mean Squared Error (RMSE) results, the model error is relatively small at 13.617.

MAPE check

MAPE(forecasting_wiki_ds_auto$mean, wiki_ds_test)

#> [1] 0.1671436

Based on the Mean Absolute Percentage Error (MAPE) results, the error is 16.7%.

Assumption check

Normal Distribution Check

hist(forecasting_wiki_ds_auto$residuals, breaks = 100)

Based on plot as above, the errors result are normally distributed. The Errors are concentrated around the value of 0, indicating that there are no extreme errors.

Autocorrelation Check

Box.test(wiki_ds_auto$residuals, type = "Ljung-Box")

#> 
#>  Box-Ljung test
#> 
#> data:  wiki_ds_auto$residuals
#> X-squared = 0.00017512, df = 1, p-value = 0.9894

Based on result as above, p-value > 0.5, Our model residuals show no autocorrelation.

Conclusion

From the evaluation of the model that has been conducted, it can be concluded that the Auto ARIMA model is a good model. This is seen from the error check results with RMSE and MAPE values that are relatively small. Additionally, from the autocorrelation assumption test results, a p-value > 0.5 was obtained, indicating no autocorrelation in the Auto ARIMA model.

This model can be used by the company as a reference to assess customer interest in its products in the future. The company can understand customer trends in seeking information about the product. From here, the company can see whether the product attracts significant interest from customers or otherwise. By understanding customer interest and market trends, the company can make strategic decisions to develop its products or adjust its marketing strategies.

Time series analysis on Wikipedia site visitors data for the product of (Mitsubishi Pajero Sport)

Rastiti Ratnasari

2024-04-07