CLIMATE - TIME SERIES FORECASTING

1. Business Problem

Hi, welcome to my LBB about time series - forecasting This LBB is about climate time series - forecasting in the city of Delhi, India. We want to train the model on Weather Forecasting for Indian climate. Let’s check it out!

2. Import Library and Read Data

We can do import library and read data by using library() and read.csv() function. We use dim() function to check the dimension data of each object. Then, we use head() and tail() to check the first and the last 6 rows data.

# load library

library(tidyverse) #data manipulation
library(lubridate) # date manipulation
library(forecast) # time series library
library(TTR) # for Simple moving average function
library(MLmetrics) # calculate error
library(tseries) # adf.test
library(fpp) # usconsumption

# read data

climate_train <- read.csv("input_data/DailyDelhiClimateTrain.csv")

climate_test <- read.csv("input_data/DailyDelhiClimateTest.csv")

# check data
dim(climate_train)

#> [1] 1462    5

dim(climate_test)

#> [1] 114   5

# check data
head(climate_train)

# check data
head(climate_test)

The dimension of data object climate_train is 1462 rows and 5 columns. While the dimension data object climate_test is 114 rows and 5 columns. Next, we do data wrangling.

3. Data Wrangling

In this data wrangling, we can do check data type/stucture, missing value and outlier.

Description of variable : - date : Date of format YYYY-MM-DD - meantemp : Mean temperature averaged out from multiple 3 hour intervals in a day. - humidity : Humidity value for the day (units are grams of water vapor per cubic meter volume of air). - wind_speed : Wind speed measured in kmph. - meanpressure : Pressure reading of weather (measure in atm)

3.1. Check data type/structure

We use glimpse() to check the data type of the objects.

# check data type/structure
glimpse(climate_train)

#> Rows: 1,462
#> Columns: 5
#> $ date         <chr> "2013-01-01", "2013-01-02", "2013-01-03", "2013-01-04", "~
#> $ meantemp     <dbl> 10.000000, 7.400000, 7.166667, 8.666667, 6.000000, 7.0000~
#> $ humidity     <dbl> 84.50000, 92.00000, 87.00000, 71.33333, 86.83333, 82.8000~
#> $ wind_speed   <dbl> 0.0000000, 2.9800000, 4.6333333, 1.2333333, 3.7000000, 1.~
#> $ meanpressure <dbl> 1015.667, 1017.800, 1018.667, 1017.167, 1016.500, 1018.00~

# check data type/structure
glimpse(climate_test)

#> Rows: 114
#> Columns: 5
#> $ date         <chr> "2017-01-01", "2017-01-02", "2017-01-03", "2017-01-04", "~
#> $ meantemp     <dbl> 15.91304, 18.50000, 17.11111, 18.70000, 18.38889, 19.3181~
#> $ humidity     <dbl> 85.86957, 77.22222, 81.88889, 70.05000, 74.94444, 79.3181~
#> $ wind_speed   <dbl> 2.743478, 2.894444, 4.016667, 4.545000, 3.300000, 8.68181~
#> $ meanpressure <dbl> 59.000, 1018.278, 1018.333, 1015.700, 1014.333, 1011.773,~

Which data variable thet we need to change? - date : chr -> date

# climate_train - change data type/structure : `date` from chr to date 
climate_train <- climate_train %>% 
  mutate(date = lubridate::ymd(date))
glimpse(climate_train)

#> Rows: 1,462
#> Columns: 5
#> $ date         <date> 2013-01-01, 2013-01-02, 2013-01-03, 2013-01-04, 2013-01-~
#> $ meantemp     <dbl> 10.000000, 7.400000, 7.166667, 8.666667, 6.000000, 7.0000~
#> $ humidity     <dbl> 84.50000, 92.00000, 87.00000, 71.33333, 86.83333, 82.8000~
#> $ wind_speed   <dbl> 0.0000000, 2.9800000, 4.6333333, 1.2333333, 3.7000000, 1.~
#> $ meanpressure <dbl> 1015.667, 1017.800, 1018.667, 1017.167, 1016.500, 1018.00~

# climate_test - change data type/structure : `date` from chr to date 
climate_test <- climate_test %>% 
  mutate(date = lubridate::ymd(date))

glimpse(climate_test)

#> Rows: 114
#> Columns: 5
#> $ date         <date> 2017-01-01, 2017-01-02, 2017-01-03, 2017-01-04, 2017-01-~
#> $ meantemp     <dbl> 15.91304, 18.50000, 17.11111, 18.70000, 18.38889, 19.3181~
#> $ humidity     <dbl> 85.86957, 77.22222, 81.88889, 70.05000, 74.94444, 79.3181~
#> $ wind_speed   <dbl> 2.743478, 2.894444, 4.016667, 4.545000, 3.300000, 8.68181~
#> $ meanpressure <dbl> 59.000, 1018.278, 1018.333, 1015.700, 1014.333, 1011.773,~

Based on the process above, we use ymd() from lubridate packages to change the data type related to date.

3.2. Check missing value

We use anyNA(), is.na() and colSums() function to check the missing value.

# check missing value - climate_train
anyNA(climate_train)

#> [1] FALSE

climate_train %>% 
  is.na() %>% 
  colSums()

#>         date     meantemp     humidity   wind_speed meanpressure 
#>            0            0            0            0            0

# check missing value - climate_test
anyNA(climate_test)

#> [1] FALSE

climate_test %>% 
  is.na() %>% 
  colSums()

#>         date     meantemp     humidity   wind_speed meanpressure 
#>            0            0            0            0            0

There is no missing value on the data. We can continue the next progress.

3.3. Check outlier

boxplot(scale(climate_train[,2:5]))

On the object climate_train, there is outlier on the attributes wind_speed and meanpressure.

boxplot(scale(climate_test[,2:5]))

On the object climate_test, there is outlier on the attributes meanpressure.

We can use histogram to check distribution data on the objects climate_train and climate_test.

# Histogram for each Attribute - climate_train
climate_train %>%
  gather(Attributes, value, 2:5) %>%
  ggplot(aes(x=value, fill=Attributes)) +
  geom_histogram(colour="black", show.legend=FALSE) +
  facet_wrap(~Attributes, scales="free_x") +
  labs(x="Values", y="Frequency",
       title="Climate-train Attributes - Histograms") +
  theme_bw()

# Histogram for each Attribute - climate_test
climate_test %>%
  gather(Attributes, value, 2:5) %>%
  ggplot(aes(x=value, fill=Attributes)) +
  geom_histogram(colour="black", show.legend=FALSE) +
  facet_wrap(~Attributes, scales="free_x") +
  labs(x="Values", y="Frequency",
       title="Climate-train Attributes - Histograms") +
  theme_bw()

There is outlier at attribute meanpressure at points 1000 - 1100 on the objects climate_train and climate_test.

4. Data Processing

In this data processing, there are some steps to do, such as : change data into time series format, do decompose and model fitting. In this LBB, We use data object climate_train only.

4.1 Time Series

We find the range time for atribute date on the objects by use range() function.

# range
range(climate_train$date)

#> [1] "2013-01-01" "2017-01-01"

Use ts() function to change data into time series format.

# object train_ts
train_ts <- ts(data = climate_train$meantemp,# target variable
                  start = c(2013,1,1), #start period
                  frequency = 365)# daily data so we use 365 days
head(train_ts,12)

#> Time Series:
#> Start = c(2013, 1) 
#> End = c(2013, 12) 
#> Frequency = 365 
#>  [1] 10.000000  7.400000  7.166667  8.666667  6.000000  7.000000  7.000000
#>  [8]  8.857143 14.000000 11.000000 15.714286 14.000000

# check missing value - train_ts
anyNA(train_ts)

#> [1] FALSE

# train_ts %>% 
#   is.na() %>% 
#   colSums()

There is no missing value on the data train_ts. We can do the next process.

We check the data visualization to get the pattern.

# visualization object train_ts
train_ts %>% 
  autoplot()

We can see the pattern data for object train_ts is additive. Use this type data for the next decompose() function.

4.2. decompose

# decompose - train_ts
train_dc<- decompose(train_ts, type = "additive") 
train_dc%>% autoplot()

From the result on above, we can see the pattern is the same for some periode. So we can say type data seasonal is additive.

Before do model fitting, we do splitting data, with : * Data train: 2013 - 2015 (3 tahun) and save with the object name clim_train. * Data test: 2016 (1 tahun) and save with the object name clim_test.

Then, we continue the next process by checking the missing value on the data clim_train and clim_test.

# test use `tail()`
clim_test <- train_ts %>% tail(365) # take 12 last data

# train use `head()`
clim_train <- train_ts %>% head(-365) #  don't take 12 last data

# check missing value - clim_train, clim_test, train_dc
anyNA(clim_train)

#> [1] FALSE

anyNA(clim_test)

#> [1] FALSE

anyNA(train_dc)

#> [1] FALSE

# climate_train %>% 
#   is.na() %>% 
#   colSums()

There is no missing value on the data clim_train, clim_test and train_dc. We can continue to do the next process which is the fitting model.

4.3. Model Fitting

For model fitting, we use 3 kind models, such as : Simple Moving Average (SMA), Triple Exponential Smoothing (Holts Winter) and Auto Regresive Integrated Moving Average (ARIMA).

4.3.1. Model fitting with Simple Moving Average (SMA)

In this model, we use SMA() function from data train_ts with

# Build SMA model with orde = 3 and 10 
train_sma3 <- SMA(x = train_ts, n = 3)
train_sma10 <- SMA(x = train_ts, n = 10)

Based on the result of train_sma3 and train_sma10, there is missing value in the some rows. We cannot drop NA because it can cause the missing data.

# anyNA(train_sma3)
# train_sma_3 <-drop_na(train_sma3, "NA")
# train_sma_3

# Visualization data result from SMA model

# model_sma <- train_ts %>% 
#   autoplot() +
#   forecast::autolayer(train_sma3) +
#   forecast::autolayer(train_sma10)

class(train_sma3)

#> [1] "numeric"

This model is simple and it can not do the accuracy() function from forecasting packages. So we stop the process for the sma model here. The next model is model fitting by Triple Exponential Smoothing (Holts Winter).

4.3.2. Model fitting with Triple Exponential Smoothing (Holts Winter)

In this model, we use HoltWinters() function with data clim_train and additive for the seasonal. Then, we save the data model with the object name model_train.

# model fitting with train_ts
model_train <- HoltWinters(x = clim_train, seasonal = "additive")

model_train

#> Holt-Winters exponential smoothing with trend and additive seasonal component.
#> 
#> Call:
#> HoltWinters(x = clim_train, seasonal = "additive")
#> 
#> Smoothing parameters:
#>  alpha: 0.6770537
#>  beta : 0
#>  gamma: 1
#> 
#> Coefficients:
#>               [,1]
#> a     2.602363e+01
#> b     7.786026e-04
#> s1   -1.198739e+01
#> s2   -1.201948e+01
#> s3   -1.208472e+01
#> s4   -1.335531e+01
#> s5   -1.341066e+01
#> s6   -1.273301e+01
#> s7   -1.179621e+01
#> s8   -1.185330e+01
#> s9   -1.225452e+01
#> s10  -1.196065e+01
#> s11  -1.046138e+01
#> s12  -9.993057e+00
#> s13  -1.159085e+01
#> s14  -1.183360e+01
#> s15  -1.214046e+01
#> s16  -1.222146e+01
#> s17  -1.049410e+01
#> s18  -1.022124e+01
#> s19  -1.077042e+01
#> s20  -9.972734e+00
#> s21  -8.614474e+00
#> s22  -1.073655e+01
#> s23  -1.185561e+01
#> s24  -1.056419e+01
#> s25  -8.944911e+00
#> s26  -8.754930e+00
#> s27  -9.452906e+00
#> s28  -9.464328e+00
#> s29  -9.385187e+00
#> s30  -9.786707e+00
#> s31  -8.705496e+00
#> s32  -7.170183e+00
#> s33  -6.867985e+00
#> s34  -7.347541e+00
#> s35  -6.330757e+00
#> s36  -5.164852e+00
#> s37  -8.854825e+00
#> s38  -1.071593e+01
#> s39  -1.193176e+01
#> s40  -1.137633e+01
#> s41  -1.152675e+01
#> s42  -9.496429e+00
#> s43  -1.024789e+01
#> s44  -1.176785e+01
#> s45  -1.077741e+01
#> s46  -9.761975e+00
#> s47  -9.490288e+00
#> s48  -9.047401e+00
#> s49  -8.720213e+00
#> s50  -8.227624e+00
#> s51  -8.042738e+00
#> s52  -8.364720e+00
#> s53  -7.908749e+00
#> s54  -6.944919e+00
#> s55  -7.304191e+00
#> s56  -6.738413e+00
#> s57  -7.697524e+00
#> s58  -9.552880e+00
#> s59  -8.870290e+00
#> s60  -7.907670e+00
#> s61  -6.435495e+00
#> s62  -5.384627e+00
#> s63  -5.619107e+00
#> s64  -6.529959e+00
#> s65  -4.999649e+00
#> s66  -3.872528e+00
#> s67  -4.543274e+00
#> s68  -3.994859e+00
#> s69  -3.735427e+00
#> s70  -4.767116e+00
#> s71  -4.914178e+00
#> s72  -4.895460e+00
#> s73  -1.898573e+00
#> s74   7.963801e-01
#> s75   1.930300e-01
#> s76  -1.622856e+00
#> s77  -2.219351e+00
#> s78  -1.414227e+00
#> s79  -1.516522e-01
#> s80  -1.477008e+00
#> s81  -1.720296e+00
#> s82  -3.185110e-01
#> s83   7.962206e-01
#> s84  -9.144463e-01
#> s85  -5.987792e-01
#> s86  -4.509573e-02
#> s87  -8.908691e-01
#> s88  -1.177850e-01
#> s89   4.493147e-01
#> s90   2.254912e+00
#> s91  -4.489233e-01
#> s92  -5.144484e-01
#> s93   2.443984e+00
#> s94   4.886696e+00
#> s95   2.861116e+00
#> s96   2.198044e+00
#> s97   1.499231e+00
#> s98   1.942867e+00
#> s99   3.287239e+00
#> s100  2.770645e+00
#> s101  2.452824e+00
#> s102  3.220669e+00
#> s103  4.363059e+00
#> s104  2.254409e+00
#> s105  2.766362e+00
#> s106  1.214408e+00
#> s107  1.648391e+00
#> s108  2.318209e+00
#> s109  2.606305e+00
#> s110  3.669800e+00
#> s111  4.169232e+00
#> s112  4.621317e+00
#> s113  5.288762e+00
#> s114  5.729720e+00
#> s115  6.285048e+00
#> s116  6.891578e+00
#> s117  7.304470e+00
#> s118  8.559226e+00
#> s119  9.174117e+00
#> s120  9.114483e+00
#> s121  7.964882e+00
#> s122  7.354071e+00
#> s123  6.158270e+00
#> s124  5.889066e+00
#> s125  7.045911e+00
#> s126  6.318557e+00
#> s127  6.228680e+00
#> s128  5.528708e+00
#> s129  4.677026e+00
#> s130  3.374079e+00
#> s131 -3.650718e-01
#> s132  1.864083e+00
#> s133  3.389892e+00
#> s134  5.828839e+00
#> s135  6.332719e+00
#> s136  7.229460e+00
#> s137  6.170136e+00
#> s138  6.757834e+00
#> s139  8.500125e+00
#> s140  8.624006e+00
#> s141  9.755148e+00
#> s142  7.194046e+00
#> s143  7.376386e+00
#> s144  5.134703e+00
#> s145  5.925452e+00
#> s146  8.177301e+00
#> s147  9.805533e+00
#> s148  7.842959e+00
#> s149  9.795927e+00
#> s150  6.933521e+00
#> s151  8.306507e+00
#> s152  8.486835e+00
#> s153  1.080591e+01
#> s154  1.183884e+01
#> s155  1.259195e+01
#> s156  1.379096e+01
#> s157  1.354257e+01
#> s158  1.350898e+01
#> s159  1.238927e+01
#> s160  1.227420e+01
#> s161  9.418578e+00
#> s162  5.102697e+00
#> s163  4.080900e+00
#> s164  8.659279e+00
#> s165  1.098113e+01
#> s166  1.210620e+01
#> s167  1.035881e+01
#> s168  1.090681e+01
#> s169  1.148372e+01
#> s170  1.091575e+01
#> s171  1.058657e+01
#> s172  8.115540e+00
#> s173  9.977863e+00
#> s174  4.790759e+00
#> s175  8.177398e+00
#> s176  9.829933e+00
#> s177  1.049189e+01
#> s178  7.905602e+00
#> s179  7.322194e+00
#> s180  7.155751e+00
#> s181  6.922562e+00
#> s182  6.350669e+00
#> s183  7.922029e+00
#> s184  9.314715e+00
#> s185  4.509044e+00
#> s186  6.554235e+00
#> s187  7.325155e+00
#> s188  5.387483e+00
#> s189  4.255590e+00
#> s190  5.819591e+00
#> s191  5.178832e+00
#> s192  6.454543e+00
#> s193  6.435384e+00
#> s194  7.587076e+00
#> s195  5.708966e+00
#> s196  5.283614e+00
#> s197  4.991974e+00
#> s198  6.251457e+00
#> s199  4.340584e+00
#> s200  4.872696e+00
#> s201  4.841755e+00
#> s202  5.554948e+00
#> s203  5.324924e+00
#> s204  4.220181e+00
#> s205  4.820982e+00
#> s206  4.961836e+00
#> s207  5.797714e+00
#> s208  5.849649e+00
#> s209  6.440888e+00
#> s210  5.515838e+00
#> s211  5.229031e+00
#> s212  5.779635e+00
#> s213  7.517130e+00
#> s214  6.737311e+00
#> s215  5.865234e+00
#> s216  4.678301e+00
#> s217  3.794647e+00
#> s218  3.569938e+00
#> s219  2.191900e+00
#> s220  4.158276e+00
#> s221  4.487895e+00
#> s222  5.848860e+00
#> s223  6.231407e+00
#> s224  4.810258e+00
#> s225  3.349423e+00
#> s226  2.670219e+00
#> s227  3.516602e+00
#> s228  3.032772e+00
#> s229  4.292575e+00
#> s230  3.364114e+00
#> s231  2.845801e+00
#> s232  4.083058e+00
#> s233  5.190897e+00
#> s234  6.616419e+00
#> s235  6.202298e+00
#> s236  6.609839e+00
#> s237  7.158783e+00
#> s238  5.311021e+00
#> s239  6.121021e+00
#> s240  5.187595e+00
#> s241  4.207190e+00
#> s242  5.199997e+00
#> s243  6.087984e+00
#> s244  6.922225e+00
#> s245  7.070691e+00
#> s246  6.374606e+00
#> s247  6.792467e+00
#> s248  7.172922e+00
#> s249  5.561277e+00
#> s250  5.521302e+00
#> s251  4.979986e+00
#> s252  4.866787e+00
#> s253  6.582701e+00
#> s254  6.811925e+00
#> s255  6.315701e+00
#> s256  7.115734e+00
#> s257  6.182785e+00
#> s258  5.146724e+00
#> s259  4.740289e+00
#> s260  4.968169e+00
#> s261  4.200353e+00
#> s262  2.839160e+00
#> s263  3.112356e+00
#> s264  3.910461e+00
#> s265  4.812431e+00
#> s266  4.765990e+00
#> s267  4.482814e+00
#> s268  4.310929e+00
#> s269  4.675208e+00
#> s270  4.570342e+00
#> s271  3.477899e+00
#> s272  3.164084e+00
#> s273  3.949424e+00
#> s274  4.648610e+00
#> s275  2.333475e+00
#> s276  2.818144e+00
#> s277  3.809919e+00
#> s278  4.749270e+00
#> s279  4.206872e+00
#> s280  4.932077e+00
#> s281  4.874837e+00
#> s282  2.047586e+00
#> s283  1.834447e+00
#> s284  2.556250e+00
#> s285  1.777814e+00
#> s286  1.442464e+00
#> s287  1.492022e+00
#> s288  2.339262e+00
#> s289  1.609453e+00
#> s290  1.325911e+00
#> s291  5.807043e-01
#> s292  2.745502e-01
#> s293  1.237791e+00
#> s294  8.498795e-01
#> s295  5.606081e-01
#> s296 -1.466818e-01
#> s297 -1.248188e+00
#> s298 -1.775563e+00
#> s299 -2.944453e+00
#> s300 -1.660302e+00
#> s301 -1.743883e+00
#> s302 -1.458948e+00
#> s303 -1.275488e+00
#> s304 -1.796295e+00
#> s305 -2.541282e+00
#> s306 -3.066512e+00
#> s307 -4.992446e+00
#> s308 -4.227070e+00
#> s309 -4.373254e+00
#> s310 -5.265555e+00
#> s311 -5.032966e+00
#> s312 -6.453988e+00
#> s313 -7.052221e+00
#> s314 -7.305308e+00
#> s315 -7.117091e+00
#> s316 -7.017000e+00
#> s317 -7.081618e+00
#> s318 -7.693883e+00
#> s319 -8.958847e+00
#> s320 -7.938604e+00
#> s321 -7.048718e+00
#> s322 -5.892132e+00
#> s323 -6.417941e+00
#> s324 -6.316405e+00
#> s325 -7.351809e+00
#> s326 -5.859603e+00
#> s327 -4.359161e+00
#> s328 -2.790935e+00
#> s329 -4.577928e+00
#> s330 -4.738208e+00
#> s331 -5.604709e+00
#> s332 -5.655111e+00
#> s333 -6.110323e+00
#> s334 -6.610442e+00
#> s335 -5.985412e+00
#> s336 -7.208772e+00
#> s337 -8.101604e+00
#> s338 -8.218172e+00
#> s339 -8.282624e+00
#> s340 -7.916427e+00
#> s341 -8.475868e+00
#> s342 -7.120550e+00
#> s343 -8.286985e+00
#> s344 -9.198028e+00
#> s345 -8.401650e+00
#> s346 -7.867668e+00
#> s347 -8.390164e+00
#> s348 -8.556913e+00
#> s349 -1.036563e+01
#> s350 -1.013688e+01
#> s351 -9.495622e+00
#> s352 -9.487741e+00
#> s353 -9.614455e+00
#> s354 -9.752089e+00
#> s355 -9.680601e+00
#> s356 -1.064618e+01
#> s357 -1.072739e+01
#> s358 -1.075701e+01
#> s359 -1.095665e+01
#> s360 -1.218979e+01
#> s361 -1.305511e+01
#> s362 -1.279254e+01
#> s363 -1.136668e+01
#> s364 -1.063268e+01
#> s365 -1.202363e+01

We check the data model with variable fitted.

head(model_train$fitted, 10)

#> Time Series:
#> Start = c(2014, 1) 
#> End = c(2014, 10) 
#> Frequency = 365 
#>              xhat    level        trend    season
#> 2014.000 13.60364 24.66347 0.0007786026 -11.06061
#> 2014.003 11.08200 24.50945 0.0007786026 -13.42822
#> 2014.005 12.52996 24.45470 0.0007786026 -11.92552
#> 2014.008 12.88023 24.43520 0.0007786026 -11.55574
#> 2014.011 12.37850 24.43243 0.0007786026 -12.05471
#> 2014.014 11.42272 24.43084 0.0007786026 -13.00890
#> 2014.016 12.12759 24.43558 0.0007786026 -12.30877
#> 2014.019 11.85167 24.44670 0.0007786026 -12.59581
#> 2014.022 12.80911 24.46327 0.0007786026 -11.65494
#> 2014.025 12.35665 24.48045 0.0007786026 -12.12458

We check the visualization of model_ _train by using plot() function.

plot(model_train, main="The Daily Report of Climate")

The plot shows the original time series in black, and the forecasts as a red line. We can see from the picture that the forecasts time series agree pretty well with the original values, although they tend to lag behind the original values a little bit.

As a measure of the accuracy of the forecasts, we can calculate the sum of squared errors for the in-sample forecast errors, which is stored in the named element SSE:

# model_train$seasonal
# model_train$alpha
# model_train$beta
# model_train$gamma

# accuracy : way 1 by using `SSE`
model_train$SSE

#> [1] 2289.19

That is, here the sum-of-squared-errors is 2289.19.

We have the second way to measure the accuracy of the forecasts. We continue the next process is to see the error measurement with the accuracy() function from the forecast package. We take the column xhat of the data model_train$fitted based on the data clim_train.

# accuracy : way 2 by using `accuracy()` function
forecast::accuracy(object = model_train$fitted[,1], clim_train)

#>                   ME    RMSE     MAE        MPE     MAPE      ACF1 Theil's U
#> Test set 0.001594478 1.76842 1.18208 -0.1981344 5.143228 0.1470544  1.144072

Base on the result above, we get the accuracy measurement for the holt winters model are : MAPE : 5.14 % MAE : 1.18 % RMSE : 1.77 %

4.3.3. Model fitting with Autoregresive Integrated Moving Average (ARIMA)

In this model, we use adf.test() to check if the data is already stationer.

clim_train %>% 
  adf.test()

#> 
#>  Augmented Dickey-Fuller Test
#> 
#> data:  .
#> Dickey-Fuller = -1.7903, Lag order = 10, p-value = 0.6671
#> alternative hypothesis: stationary

We make the hypothesis of the stationary data as like: H0: data no-stationary H1: data stationary

Based on the data above, we find the P-value (0.6671) > the alpha(0.05), so it means the data is not stationary. We must set the differencing in the seasonal data to make the data stationary.

clim_train %>% 
  diff(lag = 12) %>% #untuk diff seasonal (D = 1)
  diff(lag = 1) %>%  #untuk diff trend (d=1)
  adf.test()

#> 
#>  Augmented Dickey-Fuller Test
#> 
#> data:  .
#> Dickey-Fuller = -9.7415, Lag order = 10, p-value = 0.01
#> alternative hypothesis: stationary

Based on the data above, we find the P-value (0.01) < the alpha(0.05), so it means the data is already stationary. Then we use the tsdisplay() function to see the adjusted data clim_train graphic.

clim_train %>% 
  diff(lag = 12) %>% 
  diff(lag = 1) %>% 
  tsdisplay(lag.max = 20)

Based on the graphic above, we set the combined model ARIMA.

For seasosal : AR
* P : 0 * D : 1 * Q : 0

For all data : MA * p: 1,2,3 * d: 1 * q: 0,1

Combine the model

SARIMA: (p, d, q)(P, D, Q)[m]

(1,1,0)(0,1,0)[12]
(1,1,1)(0,1,0)[12]
(2,1,0)(0,1,0)[12]
(2,1,1)(0,1,0)[12]
(3,1,0)(0,1,0)[12]
(3,1,1)(0,1,0)[12]

Use Arima() and auto.arima() function to create model Arima and save the data model into the new objects.

# create model using auto.arima()
sarima1 <- Arima(y = clim_train, order = c(1,1,0), seasonal =  c(0,1,0))
sarima2 <- Arima(y = clim_train, order = c(1,1,1), seasonal =  c(0,1,0))
sarima3 <- Arima(y = clim_train, order = c(2,1,0), seasonal =  c(0,1,0))
sarima4 <- Arima(y = clim_train, order = c(2,1,1), seasonal =  c(0,1,0))
sarima5 <- Arima(y = clim_train, order = c(3,1,0), seasonal =  c(0,1,0))
sarima6 <- Arima(y = clim_train, order = c(3,1,1), seasonal =  c(0,1,0))
sarima_auto <- auto.arima(y = clim_train)

After building the combined model, we check the aic value for each model.

# check AIC value
sarima1$aic

#> [1] 3249.425

sarima2$aic

#> [1] 3160.268

sarima3$aic

#> [1] 3242.748

sarima4$aic

#> [1] 3162.192

sarima_auto$aic

#> [1] 3150.97

5. Forecasting and Evaluation Model

# forecast, h = 18
clim_forecast <- forecast(object = sarima_auto, h = 18)
clim_forecast

#>           Point Forecast     Lo 80    Hi 80     Lo 95    Hi 95
#> 2016.0055      14.491288 11.829716 17.15286 10.420765 18.56181
#> 2016.0082      13.666039 10.379749 16.95233  8.640092 18.69198
#> 2016.0110      13.667601 10.097068 17.23813  8.206942 19.12826
#> 2016.0137      11.759262  8.048331 15.47019  6.083883 17.43464
#> 2016.0164       9.450647  5.668157 13.23314  3.665827 15.23547
#> 2016.0192       9.873726  6.054237 13.69322  4.032321 15.71513
#> 2016.0219      10.533547  6.694793 14.37230  4.662680 16.40441
#> 2016.0247      11.058766  7.209946 14.90759  5.172504 16.94503
#> 2016.0274      10.952031  7.097941 14.80612  5.057709 16.84635
#> 2016.0301      10.590259  6.733408 14.44711  4.691714 16.48880
#> 2016.0329      12.224839  8.366541 16.08314  6.324081 18.12560
#> 2016.0356      11.981777  8.122720 15.84083  6.079858 17.88370
#> 2016.0384      12.361802  8.502347 16.22126  6.459275 18.26433
#> 2016.0411      12.990442  9.130778 16.85011  7.087595 18.89329
#> 2016.0438      13.493077  9.633304 17.35285  7.590063 19.39609
#> 2016.0466      12.994986  9.135155 16.85482  7.091884 18.89809
#> 2016.0493      13.746369  9.886508 17.60623  7.843221 19.64952
#> 2016.0521      13.372370  9.512493 17.23225  7.469198 19.27554

# menggunakan autoplot dan autolayer
clim_train %>% 
  autoplot()+
  autolayer(clim_test, series = "data test")+
  autolayer(clim_forecast$mean, series = "forecast")+
  autolayer(sarima_auto$fitted, series = "fitted values")

# accuracy
forecast::accuracy(object = clim_forecast$mean, clim_test)

#>                ME     RMSE      MAE      MPE     MAPE      ACF1 Theil's U
#> Test set 3.005627 4.165464 3.317985 17.94666 20.48799 0.7312294  3.263169

Base on the result above, we get the accuracy measurement for the holt winters model are : MAPE : 20.49 % MAE : 3.32 % RMSE : 4.16 %

6. Asumption

Assumptions in the time series are tested to measure whether the residuals obtained from modeling results are good enough to describe and capture information on the data. Why use residual data? Because by using residual data, we can get information from actual data as well as from prediction results using models.

6.1. No-autocorrelation residual

H0 : residual has no-autocorrelation H1 : residual has autocorrelation

We noted that p-value > 0.05 (alpha), so the residual have no-autocorrelation.

# use visualization plot
sarima_auto$residuals %>% 
  tsdisplay(lag.max = 20)

We hope the model have no lag that over blue line.

# use Ljung-Box test
Box.test(sarima_auto$residuals, type = "Ljung-Box")

#> 
#>  Box-Ljung test
#> 
#> data:  sarima_auto$residuals
#> X-squared = 0.065276, df = 1, p-value = 0.7983

From the result above, we can conclude the assumption that the model with p-value 0.7983 have no autocorrelation (The model data meet H0 : residual has no-autocorrelation).

6.2. Normality residual

H0: Residual have norm distribution H1: Residual have no norm distribution

We noted that p-value > 0.05 (alpha), so the residual have norm distribution.

# use saphiro.test()
shapiro.test(sarima_auto$residuals)

#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  sarima_auto$residuals
#> W = 0.92283, p-value < 2.2e-16

hist(sarima_auto$residuals)

From the result above, we can conclude the assumption that the model with p-value 0.00000000000000022 have norm distribution (The model data meet H0: Residual have norm distribution).

7. Conclusion

We recommended that the data meet the assumption (have norm distribution and no autocorrelation) and the best model used is Holt-Winters with least error.

8. Reference

I get the data from https://www.kaggle.com/sumanthvrao/daily-climate-time-series-data/activity .

LBB-TS-Melia Sie

Melia Sie

1/3/2022