Introduction
In this project, we will perform forecasting using Wikipedia data from January 2020 to March 2024. This Wikipedia data provides information on the number of viewers accessing Wikipedia articles related to “Mitsubishi Pajero Sport”. The Wikipedia data is obtained using the article_pageviews() function from the pageviews library.
Data Preparation
Import Library
Load Dataset
wiki_ds <- article_pageviews(project = "en.wikipedia",
article = "Mitsubishi Pajero Sport",
start =as.Date('2020-01-01'), end = as.Date("2024-03-31"))
str(wiki_ds)#> 'data.frame': 1552 obs. of 8 variables:
#> $ project : chr "wikipedia" "wikipedia" "wikipedia" "wikipedia" ...
#> $ language : chr "en" "en" "en" "en" ...
#> $ article : chr "Mitsubishi_Pajero_Sport" "Mitsubishi_Pajero_Sport" "Mitsubishi_Pajero_Sport" "Mitsubishi_Pajero_Sport" ...
#> $ access : chr "all-access" "all-access" "all-access" "all-access" ...
#> $ agent : chr "all-agents" "all-agents" "all-agents" "all-agents" ...
#> $ granularity: chr "daily" "daily" "daily" "daily" ...
#> $ date : POSIXct, format: "2020-01-01" "2020-01-02" ...
#> $ views : num 443 523 488 476 524 562 522 501 548 537 ...
From the data frame above, the data has 8 columns, 1552 rows and the data types for each column.
Data Preprocessing
Data Cleaning
wiki_ds2 <- wiki_ds %>% select(date, views) #select the data used, namely the date and views
head(wiki_ds2)#> date views
#> 1 2020-01-01 443
#> 2 2020-01-02 523
#> 3 2020-01-03 488
#> 4 2020-01-04 476
#> 5 2020-01-05 524
#> 6 2020-01-06 562
The requirements for data to be treated in Time Series analysis include:
Should be no missing values
The data must be sorted by time
The data must not have any gaps in time.
Data Time Series
Data Type Time Series
To process and model time series data in R, we need to convert the data into a time series object using the ts() function.
Data = the observed values of y, which in this case are the views or number of viewers in the Wikipedia data.
Start = the starting time of the time series, which is 2020.
Frequency = the number of observations within one seasonal pattern. Since the seasonal pattern is annual, the frequency is 365.25.
wiki_ds_ts <- ts(wiki_ds2$views, start = c(2020,1), frequency = 365.25)
autoplot(wiki_ds_ts, series = "Actual")Decompose time series data
The main idea of decomposition is to decompose the three components of a time series object (trend, seasonal, residual).
Trend -> represents the overall mean movement.
Seasonal -> refers to the pattern of data per frequency.
Error -> represents the values not accounted for by the Trend and Seasonal components.
To decompose a time series object into these three components, we can use the decompose() function.
Check seasonal data
Observe whether the trend is increasing or decreasing by removing the seasonal effect, we can use the adjusted seasonal component, which involves removing the seasonal effect from the data. This allows us to focus solely on the trend component of the time series.
From the plot results above, it can be said that the seasonal pattern of the Wikipedia data is random. We can assume that the Wikipedia data does not have a seasonal component. Meanwhile, for the trend pattern of the Wikipedia data, it also does not exhibit an increasing or decreasing trend.
Build Model
Cross Validation
The concept of cross-validation (CV) in time series is the proportion of training data to build the model will be larger than the test data. There are condition that must be met:
The training data will use the earliest data.
The test data will use the most recent data.
Simple Exponential Smoothing
The function used to create an exponential smoothing model is ‘ets’ (error, trend, seasonal), and the parameters within the function are as follows :
- y = The time series to be modeled
- model = The exponential smoothing model to be used. Some options include “ANN” (additive error, additive trend, additive seasonality), “AAN” (additive error, additive trend, non-additive seasonality), “MNN” (multiplicative error, multiplicative trend, non-additive seasonality), and others.
A = additive M = multiplicative N = None Z = auto
error = Additive trend = None seasonal = None
SES (Simple Exponential Smoothing) is suitable for data that do not have a trend and do not have seasonality because SES will smooth out its errors. The coefficient for smoothing the error is called alpha.
wiki_ds_SES <- ets(wiki_ds_train, model = "ANN")
autoplot(wiki_ds_SES$fitted, series = "ANN") +
autolayer(wiki_ds_train, series = "Actual")To compare errors using different models such as Simple Exponential Smoothing (SES), Holt’s Linear Trend Method, and Holt-Winters Method to obtain the smallest error:
Double Exponential Smoothing (Holt)
wiki_ds_holt <- ets(wiki_ds_train, model = "AAN")
autoplot(wiki_ds_holt$fitted, series = "AAN") +
autolayer(wiki_ds_train, series = "Actual")Triple Exponential Smoothing ( Holt Winter )
Comparing the error
#> [1] 49.27023
#> [1] 49.28245
#> [1] 49.32267
From the RMSE results above, it can be said that the SES model is the model with the smallest error compared to other models. Therefore, We will use this SES model for the next steps.
ARIMA
ARIMA models are widely used for time series forecasting tasks and can handle a wide range of time series data patterns, including trends, seasonality, and irregularities. They are flexible and powerful tools for capturing the underlying patterns and making predictions in time series data.
ARIMA model :
AR(p) = The value of p indicates how many previous data points of variable y are used by the AutoRegressive component.
I(d) = The value of d indicates how many times the data is differenced until stationarity is achieved.
MA(q) = The value of q indicates how many errors are smoothed by the Moving Average component.
Note : The requirement for data in an ARIMA model is that it should be stationary.
Check stationary data
Stationarity in time series means that the time series data fluctuates around its mean and has constant variance.
#>
#> Augmented Dickey-Fuller Test
#>
#> data: .
#> Dickey-Fuller = -4.1746, Lag order = 10, p-value = 0.01
#> alternative hypothesis: stationary
From the test results above, it states that the Wikipedia data is stationary with a p-value < 0.05, indicating that differencing is necessary for the data.
#>
#> KPSS Test for Level Stationarity
#>
#> data: .
#> KPSS Level = 0.0075122, Truncation lag parameter = 7, p-value = 0.1
After differencing once, the time series data is now stationary (p-value > 0.05) according to the kpss.test.
The number of differencing operations required to make the data stationary will become the value for d in the ARIMA(p,d,q) model.
d = 1 -> from the KPSS test result.
Cek PACF
The PACF (Partial Autocorrelation Function) result indicates that the first lag has the highest spike, followed by the second lag. This suggests that we need to tune the AR component in the ARIMA model.
Tuning Model
# AR= 2, I=0, MA=0
wiki_ds_train_2 <- arima(wiki_ds_train, order = c(2,0,0))
wiki_ds_train_2$residuals %>%
pacf()# AR= 2, I=1, MA=4
wiki_ds_train_214 <- arima(wiki_ds_train, order = c(2,1,4))
wiki_ds_train_214$residuals %>%
pacf()From the tuning results above, we can use the ARIMA model: 2,1,4 which provides many lags that do not have autocorrelation.
Auto ARIMA
ARIMA models can also be created automatically using the auto.arima() function.
wiki_ds_auto <- auto.arima(wiki_ds_train, seasonal = F)
sqrt(MSE(wiki_ds_auto$fitted, wiki_ds_train))#> [1] 47.7578
#> [1] 0.07154913
Based on the Mean Absolute Percentage Error (MAPE) results, the Auto ARIMA model has a relatively small error of 7.1%.
Then, compare the AIC values of each model that has been created.
#> [1] 12546.36
#> [1] 12543.59
The result shows that the Auto ARIMA model is the model with the smallest AIC value. We will use this Auto ARIMA model because a large AIC value discards a lot of information from the data.
Next, we will compare the RMSE values of the models that have been created, and we will choose the best one for predicting the data.
#> [1] 49.27023
#> [1] 49.28245
#> [1] 49.32267
#> [1] 47.7578
Based on MSE result from all model above, we using model Auto Arima to continue in forecasting process.
Forecasting
Evaluation
Error Checking
RMSE check
#> [1] 11.86109
Based on the Root Mean Squared Error (RMSE) results, the model error is relatively small at 13.617.
Assumption check
Normal Distribution Check
Based on plot as above, the errors result are normally distributed. The Errors are concentrated around the value of 0, indicating that there are no extreme errors.
Conclusion
From the evaluation of the model that has been conducted, it can be concluded that the Auto ARIMA model is a good model. This is seen from the error check results with RMSE and MAPE values that are relatively small. Additionally, from the autocorrelation assumption test results, a p-value > 0.5 was obtained, indicating no autocorrelation in the Auto ARIMA model.
This model can be used by the company as a reference to assess customer interest in its products in the future. The company can understand customer trends in seeking information about the product. From here, the company can see whether the product attracts significant interest from customers or otherwise. By understanding customer interest and market trends, the company can make strategic decisions to develop its products or adjust its marketing strategies.