Time Series Agriculture

Introduction

Library

#devtools::install_github("FinYang/tsdl")
# load library
library(dplyr) # data wrangling
library(lubridate) # date manipulation
library(forecast) # time series library
library(TTR) # for Simple moving average function
library(MLmetrics) # calculate error
library(ggplot2)
library(tidyr)
library(zoo)
# Data visualisation
library(plotly)
library(xts)
library(TSstudio)
library(tseries)

Dataset

In this case we will use tsdl The Time Series Data Library package. This package contain time series data from a lot of subject. Data provide on this package already convert as Time Series object.

library(tsdl)

Check kind of subject inside the Data Library

tsdl

#> Time Series Data Library: 648 time series  
#> 
#>                        Frequency
#> Subject                 0.1 0.25   1   4   5   6  12  13  52 365 Total
#>   Agriculture             0    0  37   0   0   0   3   0   0   0    40
#>   Chemistry               0    0   8   0   0   0   0   0   0   0     8
#>   Computing               0    0   6   0   0   0   0   0   0   0     6
#>   Crime                   0    0   1   0   0   0   2   1   0   0     4
#>   Demography              1    0   9   2   0   0   3   0   0   2    17
#>   Ecology                 0    0  23   0   0   0   0   0   0   0    23
#>   Finance                 0    0  23   5   0   0  20   0   2   1    51
#>   Health                  0    0   8   0   0   0   6   0   1   0    15
#>   Hydrology               0    0  42   0   0   0  78   1   0   6   127
#>   Industry                0    0   9   0   0   0   2   0   1   0    12
#>   Labour market           0    0   3   4   0   0  17   0   0   0    24
#>   Macroeconomic           0    0  18  33   0   0   5   0   0   0    56
#>   Meteorology             0    0  18   0   0   0  17   0   0  12    47
#>   Microeconomic           0    0  27   1   0   0   7   0   1   0    36
#>   Miscellaneous           0    0   4   0   1   1   3   0   1   0    10
#>   Physics                 0    0  12   0   0   0   4   0   0   0    16
#>   Production              0    0   4  14   0   0  28   1   1   0    48
#>   Sales                   0    0  10   3   0   0  24   0   9   0    46
#>   Sport                   0    1   1   0   0   0   0   0   0   0     2
#>   Transport and tourism   0    0   1   1   0   0  12   0   0   0    14
#>   Tree-rings              0    0  34   0   0   0   1   0   0   0    35
#>   Utilities               0    0   2   1   0   0   8   0   0   0    11
#>   Total                   1    1 300  64   1   1 240   3  16  21   648

In this case i will use Agriculture data with Frequency 12 or Monthly. and here we need to subset it:

agri <- subset(tsdl, 'Agriculture', 12)
agri <- agri[[3]]
agri

#>        Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
#> 1962 578.3 609.8 628.4 665.6 713.8 707.2 628.4 588.1 576.3 566.5 561.1 571.4
#> 1963 589.1 615.3 641.2 682.8 728.5 726.4 648.0 605.8 591.5 576.3 573.2 587.1
#> 1964 616.6 648.6 675.5 715.3 756.0 746.7 665.7 627.4 612.8 599.9 602.7 622.5
#> 1965 646.1 676.1 696.1 732.5 767.8 767.0 689.3 641.2 624.0 609.7 610.8 623.5
#> 1966 664.7 690.3 722.6 766.0 796.3 809.6 721.7 684.4 670.6 654.9 654.4 675.5
#> 1967 700.1 725.1 748.2 795.4 821.8 828.9 753.1 708.9 690.9 674.5 669.6 685.3
#> 1968 704.0 730.5 760.9 807.6 842.4 838.0 768.8 726.6 711.2 693.2 686.9 698.1
#> 1969 720.7 750.1 770.8 816.7 855.2 857.3 786.5 750.1 735.6 709.9 700.1 720.7
#> 1970 736.4 768.5 792.4 836.0 869.9 871.5 804.1 768.8 750.8 733.4 721.4 737.4
#> 1971 789.4 821.8 844.4 890.8 924.9 926.3 853.2 818.9 801.5 785.5 774.1 785.5
#> 1972 811.0 838.6 873.9 913.1 943.6 948.6 877.8 839.5 820.8 795.3 777.2 790.4
#> 1973 806.1 840.3 867.0 911.1 939.6 937.5 865.0 821.8 795.4 776.6 771.1 787.4
#> 1974 813.0 845.7 872.9 915.2 951.4 960.8 891.5 851.3 826.9 797.3 784.3 798.2

Data Wrangling

Time Series data we get from tsdl need to be checked whether there is NA inside data.

length(agri)

#> [1] 156

table(is.na(agri))

#> 
#> FALSE 
#>   156

We can see total length of data is 156 and we get FALSE 156, it means data has no NA. Not only NA we should check, we try to know how distribution from our time series data. we can use summary syntax to know distribution of our data.

summary(agri)

#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   561.1   675.2   749.1   746.5   817.2   960.8

Min of our time series data is 561.1 it means our data doesn’t have 0 value.

I will try to reverse it to data frame, it can help us to visualize data later.

agri_df <- data.frame(Y = as.matrix(agri), date = as_date(yearmon(time(agri))))

Exploratory Data Analysis

In this part, i will try to explore our time series object. We will try to understand more our object, whether we can found trend, seasonal, or other interpretation or insight based on visualization.

Additive or Multiplicative

There are basically two methods to analyze the seasonality of a Time Series: additive and multiplicative. While to categorize Time Series object there is 3 main components which will be calculated for forecasting. These components are:

trend (T): the movement of mean, globally, throughout an interval.
seasonal (S): the pattern captured on each seasonal interval.
error (E): the pattern/value that cannot be captured by both trend and seasonal.

Next we will try visualize our time series object to know either its multiplicative or additive.

agri %>% autoplot()

Decompose

In this part we will try decompose data, which mean we will try to separate 3 components: Trend, Seasonality and Error.

agri %>%
  decompose() %>%
  autoplot()

based on plot above, the data have a trend and seasonal.

Let’s take more closer.

agri_decom <- decompose(x = agri)

Trend

agri_decom$trend %>% autoplot()

Seasonality

agri_decom$seasonal %>% autoplot()

we can said our decompose catch seasonal pattern well, because it show repeating pattern with a fixed period of time / same. We can try interprate more based on seasonal, lets try processing the data.

agri_df %>% 
  mutate(
    seasonal = agri_decom$seasonal,
    date = month(ymd(date), label = T, abbr = F)
  ) %>% 
  distinct(date, seasonal) %>% 
  ggplot(mapping = aes(x = date, y = seasonal)) +
  geom_col() +
  theme_bw()

Cross Validation

The cross-validation scheme for time series should not be sampled randomly, but splitted sequentially. In this case we would take 36 last data as test data and keep rest data as train data.

test_agri <- tail(agri, 36)
train_agri <- head(agri, -length(test_agri))

Build Model

Simple Moving Average

The moving average model is probably the most naive approach to time series modelling. This model simply states that the next observation is the mean of all past observations.

Model

agri_sma5 <- SMA(x = train_agri, n = 5)

agri %>%
  autoplot() +
  autolayer(agri_sma5, series = "SMA (5)")

Forecast

In the plot above, we applied the moving average model to a monthly window. The red line is smoothed the time series. We already have SMA model and we will forecast using forecast and length data we guess is length from our [agri_test] data.

forecast_sma5 <- forecast(agri_sma5, h = length(test_agri))

Plot

We try to plot it, as we know this model will smooth historical data.

# Plot for SMA5
train_agri %>% 
  autoplot() + 
  autolayer(test_agri, series = "test") + 
  autolayer(agri_sma5, series = "SMA5") +
  autolayer(forecast_sma5, series = "forecast SMA5")

Exponential Smoothing

Exponential smoothing uses a similar logic to moving average, but this time, a different decreasing weight is assigned to each observations. In other words, less importance is given to observations as we move further from the present.

Exponential smoothing method is divided into 3 type:

Simple Exponential Smoothing (SES)
- It suitable for time series data that does not have trends and seasonality
Double Exponential Smoothing (Holt)
- For non-seasonal time range data (there are trends)
Triple Exponential Smoothing (Holt Winters)
- For time frame data that have trend and seasonal.

So based on our Exploratory data analysis before, we know that the data have trend and seasonal. Method that we will use in this case is [Holt Winters] or Triple Exponential Smoothing.

Triple Exponential Smoothing

This method extends double exponential smoothing, by adding a seasonal smoothing factor. We know that our seasonal pattern is an additive, but in this case i will try to do both in additive and multiplicative to know more the result.

Model

# Model with Additive
agri_holtw_add <- HoltWinters(x = train_agri, seasonal = "additive")
agri_holtw_add

#> Holt-Winters exponential smoothing with trend and additive seasonal component.
#> 
#> Call:
#> HoltWinters(x = train_agri, seasonal = "additive")
#> 
#> Smoothing parameters:
#>  alpha: 0.6834431
#>  beta : 0.004742946
#>  gamma: 1
#> 
#> Coefficients:
#>           [,1]
#> a   840.460031
#> b     1.609754
#> s1  -18.811509
#> s2    5.367149
#> s3   24.377151
#> s4   66.050611
#> s5   96.397411
#> s6   94.857248
#> s7   20.276912
#> s8  -20.858395
#> s9  -43.055125
#> s10 -63.611411
#> s11 -73.124237
#> s12 -54.960031

# Model with Multiplicative
agri_holtw_mult <- HoltWinters(x = train_agri, seasonal = "multiplicative")
agri_holtw_mult

#> Holt-Winters exponential smoothing with trend and multiplicative seasonal component.
#> 
#> Call:
#> HoltWinters(x = train_agri, seasonal = "multiplicative")
#> 
#> Smoothing parameters:
#>  alpha: 0.6008551
#>  beta : 0.008171636
#>  gamma: 1
#> 
#> Coefficients:
#>            [,1]
#> a   855.0274754
#> b     1.8976096
#> s1    0.9636666
#> s2    0.9962179
#> s3    1.0238697
#> s4    1.0825297
#> s5    1.1280956
#> s6    1.1320383
#> s7    1.0403368
#> s8    0.9877602
#> s9    0.9556470
#> s10   0.9236406
#> s11   0.9033004
#> s12   0.9186839

Forecast

# Forecast for Additive
forecast_holtw_add <- forecast(object = agri_holtw_add, h = length(test_agri))

# Forecast for Multiplicative
forecast_holtw_mult <- forecast(object = agri_holtw_mult, h = length(test_agri))

Plot

# Plot for Additive
train_agri %>% 
  autoplot() + 
  autolayer(test_agri, series = "test") + 
  autolayer(agri_holtw_add$fitted[,1], series = "fitted values (additive)") +
  autolayer(forecast_holtw_add$mean, series = "forecast data test (additive)")

# Plot for Multiplicative
train_agri %>% 
  autoplot() + 
  autolayer(test_agri, series = "test") + 
  autolayer(agri_holtw_mult$fitted[,1], series = "fitted values (multiplicative)") +
  autolayer(forecast_holtw_mult$mean, series = "forecast data test (multiplicative)")

ARIMA

ARIMA abbrevation of Autoregressive Integrated Moving Average. While exponential smoothing methods do not make any assumptions about correlations between successive values of the time series, in some cases you can make a better predictive model by taking correlations in the data into account. Autoregressive Integrated Moving Average (ARIMA) models include an explicit statistical model for the irregular component of a time series, that allows for non-zero autocorrelation in the irregular component.

Differencing Time Series

ARIMA models are defined for stationary time series. Therefore, if we start off with a non-stationary time series, will need to ‘difference’ the time series until we obtain a stationary time series.

We check our time series data, is it stationary?

H0: Data not stationary
H1: Data stationary

If p-value > 0.05 means H0: Data not stationary

adf.test(x = agri)

#> 
#>  Augmented Dickey-Fuller Test
#> 
#> data:  agri
#> Dickey-Fuller = -8.6086, Lag order = 5, p-value = 0.01
#> alternative hypothesis: stationary

We got p-value = 0.01 which is less than 0.05. Our data is considered stationary.

Model

We make ARIMA model using function auto.arima, this function automate will give us optimal result of arima model.

agri_auto_arima <- auto.arima(train_agri, seasonal = T)

summary(agri_auto_arima)

#> Series: train_agri 
#> ARIMA(0,1,0)(0,1,1)[12] 
#> 
#> Coefficients:
#>          sma1
#>       -0.5912
#> s.e.   0.1216
#> 
#> sigma^2 = 37.26:  log likelihood = -347.4
#> AIC=698.8   AICc=698.91   BIC=704.14
#> 
#> Training set error measures:
#>                     ME     RMSE      MAE        MPE      MAPE      MASE
#> Training set 0.3097327 5.737024 3.967596 0.03836055 0.5489748 0.1634624
#>                    ACF1
#> Training set -0.1397488

Model we get is ARIMA(0,1,0), eventhough before we already check that the data is stationary, but using auto.arima function they are still doing differencing which is 1. We can see it from ARIMA(p,d,q) so differencing based on model is d = 1.

Forecast

we will forecast using model and let see result

forecast_auto_arima <- forecast(agri_auto_arima, h=length(test_agri))

Plot

train_agri %>% 
  autoplot() +
  autolayer(test_agri, series = "Test") + 
  autolayer(forecast_auto_arima$mean, series = "Forecast Auto ARIMA")

Model Evaluation

Simple Moving Average

# evaluation data train
accuracy(agri_sma5, train_agri)

#>                ME     RMSE      MAE         MPE     MAPE      ACF1 Theil's U
#> Test set 2.571207 51.27293 46.96569 -0.03798591 6.542348 0.8025764  1.537611

# evaluation data test
accuracy(forecast_sma5$mean, test_agri)

#>                 ME     RMSE      MAE       MPE     MAPE      ACF1 Theil's U
#> Test set -53.11716 82.51525 67.49712 -6.567009 8.163933 0.8694295  2.622548

mape_sma <- MAPE(forecast_sma5$mean, test_agri)*100

result_mape <- data.frame(
  "Model" = "Simple Moving Average",
  "MAPE" = round(mape_sma)
)

Exponential Smoothing

# evaluation data train (Additive)
accuracy(agri_holtw_add$fitted[,1], train_agri)

#>                 ME     RMSE      MAE       MPE      MAPE      ACF1 Theil's U
#> Test set 0.9624909 6.974615 5.474194 0.1213887 0.7566958 0.2711501 0.2157291

# evaluation data test (Additive)
accuracy(forecast_holtw_add$mean, test_agri)

#>                 ME     RMSE      MAE       MPE     MAPE      ACF1 Theil's U
#> Test set -21.39375 26.91494 23.17621 -2.556067 2.764986 0.8782123 0.8649529

# evaluation data train (Multiplicative)
accuracy(agri_holtw_mult$fitted[,1], train_agri)

#>                 ME     RMSE     MAE       MPE      MAPE      ACF1 Theil's U
#> Test set 0.9830681 7.526608 5.93729 0.1518749 0.8076113 0.3471797 0.2264993

# evaluation data test (Multiplicative)
accuracy(forecast_holtw_mult$mean, test_agri)

#>                 ME     RMSE      MAE       MPE     MAPE      ACF1 Theil's U
#> Test set -42.43108 46.61575 42.43108 -4.957409 4.957409 0.8749004  1.437177

mape_holts_add <- MAPE(forecast_holtw_add$mean, test_agri)*100

mape_holts_mult <- MAPE(forecast_holtw_mult$mean, test_agri)*100

mape_holts <- data.frame(
  "Model" = c("Holts Winter (Additive)", "Holts Winter (Multiplicative)"),
  "MAPE" = c(round(mape_holts_add), round(mape_holts_mult))
)

result_mape <- rbind(result_mape, mape_holts)

ARIMA

# evaluation data train 
accuracy(agri_auto_arima$fitted, train_agri)

#>                 ME     RMSE      MAE        MPE      MAPE       ACF1 Theil's U
#> Test set 0.3097327 5.737024 3.967596 0.03836055 0.5489748 -0.1397488 0.1745911

# evaluation data test
accuracy(forecast_auto_arima$mean, test_agri)

#>                 ME     RMSE      MAE       MPE    MAPE      ACF1 Theil's U
#> Test set -41.99823 48.74341 42.03759 -4.999546 5.00405 0.9129892  1.553211

mape_arima <- MAPE(forecast_auto_arima$mean, test_agri)*100

mape_auto_arima <- data.frame(
  "Model" = "ARIMA",
  "MAPE" = round(mape_arima)
)

result_mape <- rbind(result_mape, mape_auto_arima)

Assumptions

After we get result from each model Moving Average, Exponential Smoothing, and ARIMA. We try to do assumption check.

No-Autocorrelation

We want check no-autocorrelation we will use Box.test function. We expect p-value from our model is > 0.05 to accept H0.

H0 : No autocorrelation in the forecast errors
H1 : there is an autocorrelation in the forecast errors

Simple Moving Average

# Simple Moving Average
Box.test(x = forecast_sma5$residuals, type = "Ljung-Box")

#> 
#>  Box-Ljung test
#> 
#> data:  forecast_sma5$residuals
#> X-squared = 51.265, df = 1, p-value = 0.0000000000008072

Exponential Smoothing

# Triple Exponential Smoothing with Additive Seasonal
Box.test(x = forecast_holtw_add$residuals, type = "Ljung-Box")

#> 
#>  Box-Ljung test
#> 
#> data:  forecast_holtw_add$residuals
#> X-squared = 8.163, df = 1, p-value = 0.004275

# Triple Exponential Smoothing with Multiplicative Seasonal
Box.test(x = forecast_holtw_mult$residuals, type = "Ljung-Box")

#> 
#>  Box-Ljung test
#> 
#> data:  forecast_holtw_mult$residuals
#> X-squared = 13.383, df = 1, p-value = 0.000254

ARIMA

# ARIMA
Box.test(x = forecast_auto_arima$residuals, type = "Ljung-Box")

#> 
#>  Box-Ljung test
#> 
#> data:  forecast_auto_arima$residuals
#> X-squared = 2.4026, df = 1, p-value = 0.1211

Normality Residuals

Function we use in this checking is shapiro.test and we expect to accepting H0 where p-value > 0.05.

H0 : residuals are normally distributed
H1 : residuals are not normally distributed

Simple Moving Average

# Simple Moving Average
shapiro.test(forecast_sma5$residuals)

#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  forecast_sma5$residuals
#> W = 0.98932, p-value = 0.501

Exponential Smoothing

# Triple Exponential Smoothing with Additive Seasonal
shapiro.test(forecast_holtw_add$residuals)

#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  forecast_holtw_add$residuals
#> W = 0.97247, p-value = 0.02425

# Triple Exponential Smoothing with Multiplicative Seasonal
shapiro.test(forecast_holtw_mult$residuals)

#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  forecast_holtw_mult$residuals
#> W = 0.98462, p-value = 0.2498

ARIMA

# ARIMA
shapiro.test(forecast_auto_arima$residuals)

#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  forecast_auto_arima$residuals
#> W = 0.89769, p-value = 0.0000001498

Summary

Lets take look our MAPE result table from all our model that we made before:

result_mape

#>                           Model MAPE
#> 1         Simple Moving Average    8
#> 2       Holts Winter (Additive)    3
#> 3 Holts Winter (Multiplicative)    5
#> 4                         ARIMA    5

Based on MAPE Holts Winter with Additive seasonal have smallest MAPE, and we can say this model the most suitable model to predict our agriculture dataset compare to other model we have in this case.
Based on No-Autocorrelation checking, From 4 model we use in this case, only ARIMA which have no-autocorrelation in forecasting errors.
Based on Normality Residuals checking, From 4 model we use in this case, Simple Moving Average and Holts Winter (Multiplicative) have residuals distributed normally.