Time Series Agriculture
Introduction
Library
#devtools::install_github("FinYang/tsdl")
# load library
library(dplyr) # data wrangling
library(lubridate) # date manipulation
library(forecast) # time series library
library(TTR) # for Simple moving average function
library(MLmetrics) # calculate error
library(ggplot2) # plotting
library(tidyr) # data tidying
library(zoo) # yearmon date helper
# Data visualisation
library(plotly)
library(xts)
library(TSstudio)
library(tseries) # stationarity test
Dataset
In this case we will use tsdl, the Time Series Data Library package. This package contains time series data from many subjects, and every series it provides is already stored as a time series (ts) object.
library(tsdl)
Check which subjects are available in the Data Library
tsdl
#> Time Series Data Library: 648 time series
#>
#> Frequency
#> Subject 0.1 0.25 1 4 5 6 12 13 52 365 Total
#> Agriculture 0 0 37 0 0 0 3 0 0 0 40
#> Chemistry 0 0 8 0 0 0 0 0 0 0 8
#> Computing 0 0 6 0 0 0 0 0 0 0 6
#> Crime 0 0 1 0 0 0 2 1 0 0 4
#> Demography 1 0 9 2 0 0 3 0 0 2 17
#> Ecology 0 0 23 0 0 0 0 0 0 0 23
#> Finance 0 0 23 5 0 0 20 0 2 1 51
#> Health 0 0 8 0 0 0 6 0 1 0 15
#> Hydrology 0 0 42 0 0 0 78 1 0 6 127
#> Industry 0 0 9 0 0 0 2 0 1 0 12
#> Labour market 0 0 3 4 0 0 17 0 0 0 24
#> Macroeconomic 0 0 18 33 0 0 5 0 0 0 56
#> Meteorology 0 0 18 0 0 0 17 0 0 12 47
#> Microeconomic 0 0 27 1 0 0 7 0 1 0 36
#> Miscellaneous 0 0 4 0 1 1 3 0 1 0 10
#> Physics 0 0 12 0 0 0 4 0 0 0 16
#> Production 0 0 4 14 0 0 28 1 1 0 48
#> Sales 0 0 10 3 0 0 24 0 9 0 46
#> Sport 0 1 1 0 0 0 0 0 0 0 2
#> Transport and tourism 0 0 1 1 0 0 12 0 0 0 14
#> Tree-rings 0 0 34 0 0 0 1 0 0 0 35
#> Utilities 0 0 2 1 0 0 8 0 0 0 11
#> Total 1 1 300 64 1 1 240 3 16 21 648
In this case I will use the Agriculture data with frequency 12 (monthly), and we need to subset it:
agri <- subset(tsdl, 'Agriculture', 12)
agri <- agri[[3]]
agri
#> Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
#> 1962 578.3 609.8 628.4 665.6 713.8 707.2 628.4 588.1 576.3 566.5 561.1 571.4
#> 1963 589.1 615.3 641.2 682.8 728.5 726.4 648.0 605.8 591.5 576.3 573.2 587.1
#> 1964 616.6 648.6 675.5 715.3 756.0 746.7 665.7 627.4 612.8 599.9 602.7 622.5
#> 1965 646.1 676.1 696.1 732.5 767.8 767.0 689.3 641.2 624.0 609.7 610.8 623.5
#> 1966 664.7 690.3 722.6 766.0 796.3 809.6 721.7 684.4 670.6 654.9 654.4 675.5
#> 1967 700.1 725.1 748.2 795.4 821.8 828.9 753.1 708.9 690.9 674.5 669.6 685.3
#> 1968 704.0 730.5 760.9 807.6 842.4 838.0 768.8 726.6 711.2 693.2 686.9 698.1
#> 1969 720.7 750.1 770.8 816.7 855.2 857.3 786.5 750.1 735.6 709.9 700.1 720.7
#> 1970 736.4 768.5 792.4 836.0 869.9 871.5 804.1 768.8 750.8 733.4 721.4 737.4
#> 1971 789.4 821.8 844.4 890.8 924.9 926.3 853.2 818.9 801.5 785.5 774.1 785.5
#> 1972 811.0 838.6 873.9 913.1 943.6 948.6 877.8 839.5 820.8 795.3 777.2 790.4
#> 1973 806.1 840.3 867.0 911.1 939.6 937.5 865.0 821.8 795.4 776.6 771.1 787.4
#> 1974 813.0 845.7 872.9 915.2 951.4 960.8 891.5 851.3 826.9 797.3 784.3 798.2
Data Wrangling
The time series data we get from tsdl needs to be checked for NA values.
length(agri)
#> [1] 156
table(is.na(agri))
#>
#> FALSE
#> 156
The total length of the data is 156 and we get FALSE 156 times, which means the data has no NA values. Besides NAs, we should also check the distribution of our time series data; we can use the summary function for this.
summary(agri)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 561.1 675.2 749.1 746.5 817.2 960.8
The minimum of our time series data is 561.1, which means the data contains no zero values; this matters because a multiplicative model requires strictly positive values.
I will also convert it to a data frame, which will help us visualize the data later.
agri_df <- data.frame(Y = as.matrix(agri), date = as_date(yearmon(time(agri))))
Exploratory Data Analysis
In this part, I will explore our time series object. We will try to understand it better: whether we can find a trend, seasonality, or other insights based on visualization.
Additive or Multiplicative
There are basically two ways to model the seasonality of a time series: additive and multiplicative. In both cases, the series is described by three main components, which will be used for forecasting (a short sketch of the two forms follows the list below):
- trend (T): the global movement of the mean throughout an interval.
- seasonal (S): the pattern captured in each seasonal interval.
- error (E): the pattern/value that cannot be captured by either trend or seasonal.
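As a minimal sketch of the two forms: an additive series follows Y[t] = T[t] + S[t] + E[t], while a multiplicative series follows Y[t] = T[t] * S[t] * E[t]. The decompose() function used below supports both via its type argument (here applied to the agri object from above):
# additive decomposition: Y[t] = T[t] + S[t] + E[t]
dec_add <- decompose(agri, type = "additive")
# multiplicative decomposition: Y[t] = T[t] * S[t] * E[t]
dec_mult <- decompose(agri, type = "multiplicative")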
Next we will visualize our time series object to find out whether it is multiplicative or additive.
agri %>% autoplot()
Decompose
In this part we will decompose the data, which means we will separate it into the three components: trend, seasonality, and error.
agri %>%
decompose() %>%
autoplot()
Based on the plot above, the data has both a trend and a seasonal pattern.
Let’s take a closer look.
agri_decom <- decompose(x = agri)
Trend
agri_decom$trend %>% autoplot()
Seasonality
agri_decom$seasonal %>% autoplot()
We can say the decomposition captures the seasonal pattern well, because it shows a repeating pattern with a fixed period. We can try to interpret the seasonality further; let's process the data.
agri_df %>%
mutate(
seasonal = agri_decom$seasonal,
date = month(ymd(date), label = T, abbr = F)
) %>%
distinct(date, seasonal) %>%
ggplot(mapping = aes(x = date, y = seasonal)) +
geom_col() +
theme_bw()
Cross Validation
The cross-validation scheme for time series should not be sampled randomly, but split sequentially. In this case we take the last 36 observations as test data and keep the rest as training data.
test_agri <- tail(agri, 36)
train_agri <- head(agri, -length(test_agri))
Build Model
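As a sanity check (a sketch using only base R's stats::window, not part of the original code), the same split can be expressed with explicit dates: the series runs monthly from January 1962 to December 1974, so the last 36 observations start in January 1972.
# equivalent date-based split: train on 1962-1971, test on 1972-1974
train_agri2 <- window(agri, end = c(1971, 12))
test_agri2 <- window(agri, start = c(1972, 1))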
Simple Moving Average
The moving average model is probably the most naive approach to time series modelling. It simply states that the next observation is the mean of the past observations inside a window of length n (here n = 5).
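To make that idea concrete, here is a minimal hand-rolled sketch of a simple moving average, equivalent to TTR::SMA up to the leading NA values (sma_manual is a hypothetical helper, not a library function):
# each smoothed value is the mean of the current and previous n - 1 observations
sma_manual <- function(x, n) stats::filter(x, rep(1 / n, n), sides = 1)
# sma_manual(train_agri, 5) should match SMA(train_agri, n = 5)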
Model
agri_sma5 <- SMA(x = train_agri, n = 5)
agri %>%
autoplot() +
autolayer(agri_sma5, series = "SMA (5)")
Forecast
In the plot above, we applied the moving average model with a monthly window; the red line is the smoothed time series. Now that we have the SMA model, we will forecast with the forecast function, using a horizon equal to the length of our test_agri data.
forecast_sma5 <- forecast(agri_sma5, h = length(test_agri))
Plot
Let's plot it; as we know, this model smooths the historical data.
# Plot for SMA5
train_agri %>%
autoplot() +
autolayer(test_agri, series = "test") +
autolayer(agri_sma5, series = "SMA5") +
autolayer(forecast_sma5, series = "forecast SMA5")
Exponential Smoothing
Exponential smoothing uses a similar logic to the moving average, but this time a different, decreasing weight is assigned to each observation. In other words, less importance is given to observations the further we move from the present.
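As a minimal sketch of that weighting (simple exponential smoothing with a smoothing parameter alpha; ses_manual is a hypothetical helper, not from any of the loaded packages): each smoothed value mixes the newest observation with the previous smoothed value, so the weight on older observations decays geometrically.
# s[t] = alpha * y[t] + (1 - alpha) * s[t - 1]
ses_manual <- function(y, alpha) {
  s <- numeric(length(y))
  s[1] <- y[1]
  for (t in 2:length(y)) s[t] <- alpha * y[t] + (1 - alpha) * s[t - 1]
  s
}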
Exponential smoothing methods are divided into 3 types:
- Simple Exponential Smoothing (SES)
- Suitable for time series data that has neither trend nor seasonality
- Double Exponential Smoothing (Holt)
- For non-seasonal data that has a trend
- Triple Exponential Smoothing (Holt-Winters)
- For data that has both trend and seasonality.
Based on our exploratory data analysis above, we know the data has both trend and seasonality, so the method we will use in this case is Holt-Winters, i.e. Triple Exponential Smoothing.
Triple Exponential Smoothing
This method extends double exponential smoothing by adding a seasonal smoothing factor. Our decomposition suggested the seasonal pattern is additive, but in this case I will fit both the additive and the multiplicative variants to compare the results.
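As a side note (an alternative I am sketching here, not used in the rest of this post), the forecast package's ets() function fits the same exponential smoothing family and can select the error/trend/season specification automatically:
# let ets() pick the ETS(error, trend, season) form by information criterion
agri_ets <- ets(train_agri)
summary(agri_ets)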
Model
# Model with Additive
agri_holtw_add <- HoltWinters(x = train_agri, seasonal = "additive")
agri_holtw_add
#> Holt-Winters exponential smoothing with trend and additive seasonal component.
#>
#> Call:
#> HoltWinters(x = train_agri, seasonal = "additive")
#>
#> Smoothing parameters:
#> alpha: 0.6834431
#> beta : 0.004742946
#> gamma: 1
#>
#> Coefficients:
#> [,1]
#> a 840.460031
#> b 1.609754
#> s1 -18.811509
#> s2 5.367149
#> s3 24.377151
#> s4 66.050611
#> s5 96.397411
#> s6 94.857248
#> s7 20.276912
#> s8 -20.858395
#> s9 -43.055125
#> s10 -63.611411
#> s11 -73.124237
#> s12 -54.960031
# Model with Multiplicative
agri_holtw_mult <- HoltWinters(x = train_agri, seasonal = "multiplicative")
agri_holtw_mult
#> Holt-Winters exponential smoothing with trend and multiplicative seasonal component.
#>
#> Call:
#> HoltWinters(x = train_agri, seasonal = "multiplicative")
#>
#> Smoothing parameters:
#> alpha: 0.6008551
#> beta : 0.008171636
#> gamma: 1
#>
#> Coefficients:
#> [,1]
#> a 855.0274754
#> b 1.8976096
#> s1 0.9636666
#> s2 0.9962179
#> s3 1.0238697
#> s4 1.0825297
#> s5 1.1280956
#> s6 1.1320383
#> s7 1.0403368
#> s8 0.9877602
#> s9 0.9556470
#> s10 0.9236406
#> s11 0.9033004
#> s12 0.9186839
Forecast
# Forecast for Additive
forecast_holtw_add <- forecast(object = agri_holtw_add, h = length(test_agri))
# Forecast for Multiplicative
forecast_holtw_mult <- forecast(object = agri_holtw_mult, h = length(test_agri))
Plot
# Plot for Additive
train_agri %>%
autoplot() +
autolayer(test_agri, series = "test") +
autolayer(agri_holtw_add$fitted[,1], series = "fitted values (additive)") +
autolayer(forecast_holtw_add$mean, series = "forecast data test (additive)")
# Plot for Multiplicative
train_agri %>%
autoplot() +
autolayer(test_agri, series = "test") +
autolayer(agri_holtw_mult$fitted[,1], series = "fitted values (multiplicative)") +
autolayer(forecast_holtw_mult$mean, series = "forecast data test (multiplicative)")
ARIMA
ARIMA is an abbreviation of Autoregressive Integrated Moving Average. While exponential smoothing methods do not make any assumptions about correlations between successive values of the time series, in some cases you can build a better predictive model by taking those correlations into account. ARIMA models include an explicit statistical model for the irregular component of a time series, which allows for non-zero autocorrelation in that component.
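Before fitting, it can help to look at that correlation structure directly. A quick sketch using the forecast package's ACF/PACF plots (an extra diagnostic, not part of the original workflow):
# autocorrelation and partial autocorrelation of the training series
ggAcf(train_agri)
ggPacf(train_agri)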
Differencing Time Series
ARIMA models are defined for stationary time series. Therefore, if we start off with a non-stationary time series, we will need to ‘difference’ it until we obtain a stationary series.
Let's check whether our time series data is stationary, using the Augmented Dickey-Fuller test:
- H0: Data is not stationary
- H1: Data is stationary
If the p-value > 0.05, we fail to reject H0, i.e. the data is not stationary.
adf.test(x = agri)
#>
#> Augmented Dickey-Fuller Test
#>
#> data: agri
#> Dickey-Fuller = -8.6086, Lag order = 5, p-value = 0.01
#> alternative hypothesis: stationary
We got p-value = 0.01, which is less than 0.05, so our data is considered stationary.
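Even so, auto.arima may still decide to difference the series, as we will see below. As a quick cross-check (a sketch using the forecast package's helpers), we can ask how many regular and seasonal differences are suggested:
# estimated differencing orders for the training series
ndiffs(train_agri)  # regular differences
nsdiffs(train_agri) # seasonal differences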
Model
We build the ARIMA model using the auto.arima function; this function automatically searches for the optimal ARIMA specification.
agri_auto_arima <- auto.arima(train_agri, seasonal = T)
summary(agri_auto_arima)
#> Series: train_agri
#> ARIMA(0,1,0)(0,1,1)[12]
#>
#> Coefficients:
#> sma1
#> -0.5912
#> s.e. 0.1216
#>
#> sigma^2 = 37.26: log likelihood = -347.4
#> AIC=698.8 AICc=698.91 BIC=704.14
#>
#> Training set error measures:
#> ME RMSE MAE MPE MAPE MASE
#> Training set 0.3097327 5.737024 3.967596 0.03836055 0.5489748 0.1634624
#> ACF1
#> Training set -0.1397488
The model we get is ARIMA(0,1,0)(0,1,1)[12]. Even though we checked before that the data is stationary, auto.arima still applies differencing: in the ARIMA(p,d,q) notation the regular differencing order is d = 1, and the seasonal part also uses one seasonal difference (D = 1) together with a seasonal MA term.
Forecast
We will forecast using the model and look at the result.
forecast_auto_arima <- forecast(agri_auto_arima, h = length(test_agri))
Plot
train_agri %>%
autoplot() +
autolayer(test_agri, series = "Test") +
autolayer(forecast_auto_arima$mean, series = "Forecast Auto ARIMA")
Model Evaluation
Simple Moving Average
# evaluation data train
accuracy(agri_sma5, train_agri)
#> ME RMSE MAE MPE MAPE ACF1 Theil's U
#> Test set 2.571207 51.27293 46.96569 -0.03798591 6.542348 0.8025764 1.537611
# evaluation data test
accuracy(forecast_sma5$mean, test_agri)
#> ME RMSE MAE MPE MAPE ACF1 Theil's U
#> Test set -53.11716 82.51525 67.49712 -6.567009 8.163933 0.8694295 2.622548
mape_sma <- MAPE(forecast_sma5$mean, test_agri) * 100
result_mape <- data.frame(
"Model" = "Simple Moving Average",
"MAPE" = round(mape_sma)
)
Exponential Smoothing
# evaluation data train (Additive)
accuracy(agri_holtw_add$fitted[,1], train_agri)
#> ME RMSE MAE MPE MAPE ACF1 Theil's U
#> Test set 0.9624909 6.974615 5.474194 0.1213887 0.7566958 0.2711501 0.2157291
# evaluation data test (Additive)
accuracy(forecast_holtw_add$mean, test_agri)
#> ME RMSE MAE MPE MAPE ACF1 Theil's U
#> Test set -21.39375 26.91494 23.17621 -2.556067 2.764986 0.8782123 0.8649529
# evaluation data train (Multiplicative)
accuracy(agri_holtw_mult$fitted[,1], train_agri)
#> ME RMSE MAE MPE MAPE ACF1 Theil's U
#> Test set 0.9830681 7.526608 5.93729 0.1518749 0.8076113 0.3471797 0.2264993
# evaluation data test (Multiplicative)
accuracy(forecast_holtw_mult$mean, test_agri)
#> ME RMSE MAE MPE MAPE ACF1 Theil's U
#> Test set -42.43108 46.61575 42.43108 -4.957409 4.957409 0.8749004 1.437177
mape_holts_add <- MAPE(forecast_holtw_add$mean, test_agri)*100
mape_holts_mult <- MAPE(forecast_holtw_mult$mean, test_agri) * 100
mape_holts <- data.frame(
"Model" = c("Holts Winter (Additive)", "Holts Winter (Multiplicative)"),
"MAPE" = c(round(mape_holts_add), round(mape_holts_mult))
)
result_mape <- rbind(result_mape, mape_holts)
ARIMA
# evaluation data train
accuracy(agri_auto_arima$fitted, train_agri)
#> ME RMSE MAE MPE MAPE ACF1 Theil's U
#> Test set 0.3097327 5.737024 3.967596 0.03836055 0.5489748 -0.1397488 0.1745911
# evaluation data test
accuracy(forecast_auto_arima$mean, test_agri)
#> ME RMSE MAE MPE MAPE ACF1 Theil's U
#> Test set -41.99823 48.74341 42.03759 -4.999546 5.00405 0.9129892 1.553211
mape_arima <- MAPE(forecast_auto_arima$mean, test_agri) * 100
mape_auto_arima <- data.frame(
"Model" = "ARIMA",
"MAPE" = round(mape_arima)
)
result_mape <- rbind(result_mape, mape_auto_arima)
Assumptions
After getting the results from each model (Moving Average, Exponential Smoothing, and ARIMA), we check the residual assumptions.
No-Autocorrelation
To check for no-autocorrelation we will use the Box.test function (a one-call alternative is sketched after the hypotheses below). We expect the p-value from each model to be > 0.05, so that we fail to reject H0.
- H0 : No autocorrelation in the forecast errors
- H1 : there is an autocorrelation in the forecast errors
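As a convenient one-call alternative (a sketch, not used in the checks below), forecast::checkresiduals() runs a Ljung-Box test and plots the residual diagnostics together, e.g. for the ARIMA model:
# Ljung-Box test plus residual ACF and histogram in one call
checkresiduals(agri_auto_arima)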
Simple Moving Average
# Simple Moving Average
Box.test(x = forecast_sma5$residuals, type = "Ljung-Box")
#>
#> Box-Ljung test
#>
#> data: forecast_sma5$residuals
#> X-squared = 51.265, df = 1, p-value = 0.0000000000008072
Exponential Smoothing
# Triple Exponential Smoothing with Additive Seasonal
Box.test(x = forecast_holtw_add$residuals, type = "Ljung-Box")
#>
#> Box-Ljung test
#>
#> data: forecast_holtw_add$residuals
#> X-squared = 8.163, df = 1, p-value = 0.004275
# Triple Exponential Smoothing with Multiplicative Seasonal
Box.test(x = forecast_holtw_mult$residuals, type = "Ljung-Box")
#>
#> Box-Ljung test
#>
#> data: forecast_holtw_mult$residuals
#> X-squared = 13.383, df = 1, p-value = 0.000254
ARIMA
# ARIMA
Box.test(x = forecast_auto_arima$residuals, type = "Ljung-Box")
#>
#> Box-Ljung test
#>
#> data: forecast_auto_arima$residuals
#> X-squared = 2.4026, df = 1, p-value = 0.1211
Normality Residuals
The function we use for this check is shapiro.test, and we expect to fail to reject H0, i.e. a p-value > 0.05.
- H0 : residuals are normally distributed
- H1 : residuals are not normally distributed
Simple Moving Average
# Simple Moving Average
shapiro.test(forecast_sma5$residuals)
#>
#> Shapiro-Wilk normality test
#>
#> data: forecast_sma5$residuals
#> W = 0.98932, p-value = 0.501
Exponential Smoothing
# Triple Exponential Smoothing with Additive Seasonal
shapiro.test(forecast_holtw_add$residuals)
#>
#> Shapiro-Wilk normality test
#>
#> data: forecast_holtw_add$residuals
#> W = 0.97247, p-value = 0.02425
# Triple Exponential Smoothing with Multiplicative Seasonal
shapiro.test(forecast_holtw_mult$residuals)
#>
#> Shapiro-Wilk normality test
#>
#> data: forecast_holtw_mult$residuals
#> W = 0.98462, p-value = 0.2498
ARIMA
# ARIMA
shapiro.test(forecast_auto_arima$residuals)
#>
#> Shapiro-Wilk normality test
#>
#> data: forecast_auto_arima$residuals
#> W = 0.89769, p-value = 0.0000001498
Summary
Let's take a look at the MAPE result table for all the models we built:
result_mape
#> Model MAPE
#> 1 Simple Moving Average 8
#> 2 Holts Winter (Additive) 3
#> 3 Holts Winter (Multiplicative) 5
#> 4 ARIMA 5
- Based on MAPE, Holt-Winters with additive seasonality has the smallest MAPE, so we can say it is the most suitable model for predicting our agriculture dataset among the models tried in this case.
- Based on the no-autocorrelation check, out of the 4 models used in this case, only ARIMA shows no autocorrelation in its forecast errors.
- Based on the normality-of-residuals check, out of the 4 models, Simple Moving Average and Holt-Winters (Multiplicative) have normally distributed residuals.