1. Intro

Climate is the long-term pattern and character of the weather in a place or area. In this case study, we will forecast the daily mean temperature in the city of Delhi, India.

2. Import Library and Read Data

# load library
library(tidyverse) 
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate) 
library(forecast) 
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(TTR) 
library(MLmetrics) 
## 
## Attaching package: 'MLmetrics'
## 
## The following object is masked from 'package:base':
## 
##     Recall
library(tseries) 
library(fpp) 
## Loading required package: fma
## Loading required package: expsmooth
## Loading required package: lmtest
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
# read data

climate_train <- read.csv("DailyDelhiClimateTrain.csv")

climate_test <- read.csv("DailyDelhiClimateTest.csv")

# check data
dim(climate_train)
## [1] 1462    5
# check data dimensions
dim(climate_test)
## [1] 114   5
# check data
head(climate_train)
# check data
head(climate_test)

3. Data Wrangling

Description of the variables:

  • date: Date in the format YYYY-MM-DD
  • meantemp: Mean temperature averaged out from multiple 3-hour intervals in a day.
  • humidity: Humidity value for the day (units are grams of water vapor per cubic meter volume of air).
  • wind_speed: Wind speed measured in kmph.
  • meanpressure: Pressure reading of weather (measured in atm).
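
To get a quick numeric overview of these variables (minimum, quartiles, mean, and maximum), we could run summary() on both objects; this is a minimal sketch, not part of the original output.

# quick numeric overview of all variables
summary(climate_train)
summary(climate_test)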

3.1. Check data type/structure

# check data type/structure
glimpse(climate_train)
## Rows: 1,462
## Columns: 5
## $ date         <chr> "2013-01-01", "2013-01-02", "2013-01-03", "2013-01-04", "…
## $ meantemp     <dbl> 10.000000, 7.400000, 7.166667, 8.666667, 6.000000, 7.0000…
## $ humidity     <dbl> 84.50000, 92.00000, 87.00000, 71.33333, 86.83333, 82.8000…
## $ wind_speed   <dbl> 0.0000000, 2.9800000, 4.6333333, 1.2333333, 3.7000000, 1.…
## $ meanpressure <dbl> 1015.667, 1017.800, 1018.667, 1017.167, 1016.500, 1018.00…
# check data type/structure
glimpse(climate_test)
## Rows: 114
## Columns: 5
## $ date         <chr> "2017-01-01", "2017-01-02", "2017-01-03", "2017-01-04", "…
## $ meantemp     <dbl> 15.91304, 18.50000, 17.11111, 18.70000, 18.38889, 19.3181…
## $ humidity     <dbl> 85.86957, 77.22222, 81.88889, 70.05000, 74.94444, 79.3181…
## $ wind_speed   <dbl> 2.743478, 2.894444, 4.016667, 4.545000, 3.300000, 8.68181…
## $ meanpressure <dbl> 59.000, 1018.278, 1018.333, 1015.700, 1014.333, 1011.773,…
# climate_train - change data type/structure : `date` from chr to date 
climate_train <- climate_train %>% 
  mutate(date = lubridate::ymd(date))
glimpse(climate_train)
## Rows: 1,462
## Columns: 5
## $ date         <date> 2013-01-01, 2013-01-02, 2013-01-03, 2013-01-04, 2013-01-…
## $ meantemp     <dbl> 10.000000, 7.400000, 7.166667, 8.666667, 6.000000, 7.0000…
## $ humidity     <dbl> 84.50000, 92.00000, 87.00000, 71.33333, 86.83333, 82.8000…
## $ wind_speed   <dbl> 0.0000000, 2.9800000, 4.6333333, 1.2333333, 3.7000000, 1.…
## $ meanpressure <dbl> 1015.667, 1017.800, 1018.667, 1017.167, 1016.500, 1018.00…
# climate_test - change data type/structure : `date` from chr to date 
climate_test <- climate_test %>% 
  mutate(date = lubridate::ymd(date))

glimpse(climate_test)
## Rows: 114
## Columns: 5
## $ date         <date> 2017-01-01, 2017-01-02, 2017-01-03, 2017-01-04, 2017-01-…
## $ meantemp     <dbl> 15.91304, 18.50000, 17.11111, 18.70000, 18.38889, 19.3181…
## $ humidity     <dbl> 85.86957, 77.22222, 81.88889, 70.05000, 74.94444, 79.3181…
## $ wind_speed   <dbl> 2.743478, 2.894444, 4.016667, 4.545000, 3.300000, 8.68181…
## $ meanpressure <dbl> 59.000, 1018.278, 1018.333, 1015.700, 1014.333, 1011.773,…

3.2. Check missing value

We use the anyNA(), is.na(), and colSums() functions to check for missing values.

# check missing value - climate_train
anyNA(climate_train)
## [1] FALSE
climate_train %>% 
  is.na() %>% 
  colSums()
##         date     meantemp     humidity   wind_speed meanpressure 
##            0            0            0            0            0
# check missing value - climate_test
anyNA(climate_test)
## [1] FALSE
climate_test %>% 
  is.na() %>% 
  colSums()
##         date     meantemp     humidity   wind_speed meanpressure 
##            0            0            0            0            0

3.3. Check outlier

boxplot(scale(climate_train[,2:5]))

In the object climate_train, there are outliers in the wind_speed and meanpressure attributes.

boxplot(scale(climate_test[,2:5]))

In the object climate_test, there are outliers in the meanpressure attribute.
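
As a rough cross-check of the boxplots, we could count how many observations fall outside the usual 1.5 * IQR whiskers. The helper count_outliers below is hypothetical, a minimal sketch rather than part of the original analysis.

# hypothetical helper: count values outside the 1.5 * IQR whiskers
count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr, na.rm = TRUE)
}

sapply(climate_train[, 2:5], count_outliers)
sapply(climate_test[, 2:5], count_outliers)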

We can also use histograms to check the distribution of the data in the objects climate_train and climate_test.

# Histogram for each Attribute - climate_train
climate_train %>%
  gather(Attributes, value, 2:5) %>%
  ggplot(aes(x=value, fill=Attributes)) +
  geom_histogram(colour="black", show.legend=FALSE) +
  facet_wrap(~Attributes, scales="free_x") +
  labs(x="Values", y="Frequency",
       title="Climate-train Attributes - Histograms") +
  theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Histogram for each Attribute - climate_test
climate_test %>%
  gather(Attributes, value, 2:5) %>%
  ggplot(aes(x=value, fill=Attributes)) +
  geom_histogram(colour="black", show.legend=FALSE) +
  facet_wrap(~Attributes, scales="free_x") +
  labs(x="Values", y="Frequency",
       title="Climate-train Attributes - Histograms") +
  theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

4. Data Processing

In this data processing stage there are several steps: converting the data to time series format, performing decomposition, and fitting models. In this LBB, we only use the data object climate_train.

4.1 Time Series

We find the time range of the date attribute using the range() function.

# range
range(climate_train$date)
## [1] "2013-01-01" "2017-01-01"

We use the ts() function to convert the data into time series format.

# object train_ts
train_ts <- ts(data = climate_train$meantemp, # target variable
               start = c(2013, 1),             # start period
               frequency = 365)                # daily data, so frequency = 365
head(train_ts,12)
## Time Series:
## Start = c(2013, 1) 
## End = c(2013, 12) 
## Frequency = 365 
##  [1] 10.000000  7.400000  7.166667  8.666667  6.000000  7.000000  7.000000
##  [8]  8.857143 14.000000 11.000000 15.714286 14.000000

We visualize the data to identify its pattern.

# visualization object train_ts
train_ts %>% 
  autoplot()

The pattern of train_ts appears additive: the seasonal fluctuations have a roughly constant amplitude. We use this type in the decompose() function below.

4.2. Decompose

# decompose - train_ts
train_dc <- decompose(train_ts, type = "additive") 
train_dc %>% autoplot()

From the results above, we can see that the seasonal pattern repeats with roughly the same magnitude across periods, which confirms that the additive seasonal type is appropriate.
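
We can also peek at the estimated components stored in the decomposition object; a minimal sketch:

# inspect the first values of the estimated trend and seasonal components
# (the trend starts with NAs because decompose() uses a centred moving average)
head(train_dc$trend, 10)
head(train_dc$seasonal, 10)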

Before fitting the model, we split the data into two parts:

  • Training data: 2013-2015 (3 years), saved as the object clim_train.
  • Test data: 2016 (1 year), saved as the object clim_test.

We then proceed to the next step by checking for missing values in the clim_train and clim_test data.

# test set: use `tail()`
clim_test <- train_ts %>% tail(365) # take the last 365 observations (one year)

# train set: use `head()`
clim_train <- train_ts %>% head(-365) # drop the last 365 observations
# check missing value - clim_train, clim_test, train_dc
anyNA(clim_train)
## [1] FALSE
anyNA(clim_test)
## [1] FALSE
anyNA(train_dc)
## [1] FALSE

There are no missing values in the data clim_train, clim_test, and train_dc. We can proceed to the next process, which is fitting the model.

4.3. Model Fitting

For model fitting, we use three types of models: Simple Moving Average (SMA), Triple Exponential Smoothing (Holt Winters), and Autoregressive Integrated Moving Average (ARIMA).

4.3.1. Model fitting with Simple Moving Average (SMA)

In this model, we apply the SMA() function to train_ts with orders n = 3 and n = 10.

# Build SMA models with order n = 3 and n = 10
train_sma3 <- SMA(x = train_ts, n = 3)
train_sma10 <- SMA(x = train_ts, n = 10)

Looking at train_sma3 and train_sma10, the first few values are NA because the moving-average window needs n observations before it can produce a value. We cannot simply drop these NAs, because doing so would shift the smoothed series out of alignment with the original data.
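
We can see these leading NAs directly; a minimal sketch:

# the first n - 1 values are NA because the moving-average window is not yet full
head(train_sma3, 5)
head(train_sma10, 12)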

class(train_sma3)
## [1] "numeric"

Because SMA() returns a plain numeric vector rather than a model or forecast object, we cannot evaluate it directly with the accuracy() function from the forecast package. We therefore stop here for the SMA model; the next model is fitted with Triple Exponential Smoothing (Holt-Winters).
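
Before moving on, if we still wanted a rough error estimate for the SMA fit, we could compare the non-NA smoothed values against the actual series manually, for example with MLmetrics; a minimal sketch, not part of the original workflow:

# manual in-sample error of the SMA(3) fit, skipping the leading NAs
valid <- !is.na(train_sma3)
MLmetrics::MAPE(y_pred = as.numeric(train_sma3)[valid],
                y_true = as.numeric(train_ts)[valid]) * 100 # MAPE in percent
MLmetrics::RMSE(y_pred = as.numeric(train_sma3)[valid],
                y_true = as.numeric(train_ts)[valid])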

4.3.2. Model fitting with Triple Exponential Smoothing (Holt-Winters)

In this model, we use the HoltWinters() function with the data clim_train and the seasonal type set to additive. We then save the model data with the object name model_train.

# model fitting with clim_train
model_train <- HoltWinters(x = clim_train, seasonal = "additive")

model_train
## Holt-Winters exponential smoothing with trend and additive seasonal component.
## 
## Call:
## HoltWinters(x = clim_train, seasonal = "additive")
## 
## Smoothing parameters:
##  alpha: 0.6770537
##  beta : 0
##  gamma: 1
## 
## Coefficients:
##               [,1]
## a     2.602363e+01
## b     7.786026e-04
## s1   -1.198739e+01
## s2   -1.201948e+01
## s3   -1.208472e+01
## s4   -1.335531e+01
## s5   -1.341066e+01
## s6   -1.273301e+01
## s7   -1.179621e+01
## s8   -1.185330e+01
## s9   -1.225452e+01
## s10  -1.196065e+01
## s11  -1.046138e+01
## s12  -9.993057e+00
## s13  -1.159085e+01
## s14  -1.183360e+01
## s15  -1.214046e+01
## s16  -1.222146e+01
## s17  -1.049410e+01
## s18  -1.022124e+01
## s19  -1.077042e+01
## s20  -9.972734e+00
## s21  -8.614474e+00
## s22  -1.073655e+01
## s23  -1.185561e+01
## s24  -1.056419e+01
## s25  -8.944911e+00
## s26  -8.754930e+00
## s27  -9.452906e+00
## s28  -9.464328e+00
## s29  -9.385187e+00
## s30  -9.786707e+00
## s31  -8.705496e+00
## s32  -7.170183e+00
## s33  -6.867985e+00
## s34  -7.347541e+00
## s35  -6.330757e+00
## s36  -5.164852e+00
## s37  -8.854825e+00
## s38  -1.071593e+01
## s39  -1.193176e+01
## s40  -1.137633e+01
## s41  -1.152675e+01
## s42  -9.496429e+00
## s43  -1.024789e+01
## s44  -1.176785e+01
## s45  -1.077741e+01
## s46  -9.761975e+00
## s47  -9.490288e+00
## s48  -9.047401e+00
## s49  -8.720213e+00
## s50  -8.227624e+00
## s51  -8.042738e+00
## s52  -8.364720e+00
## s53  -7.908749e+00
## s54  -6.944919e+00
## s55  -7.304191e+00
## s56  -6.738413e+00
## s57  -7.697524e+00
## s58  -9.552880e+00
## s59  -8.870290e+00
## s60  -7.907670e+00
## s61  -6.435495e+00
## s62  -5.384627e+00
## s63  -5.619107e+00
## s64  -6.529959e+00
## s65  -4.999649e+00
## s66  -3.872528e+00
## s67  -4.543274e+00
## s68  -3.994859e+00
## s69  -3.735427e+00
## s70  -4.767116e+00
## s71  -4.914178e+00
## s72  -4.895460e+00
## s73  -1.898573e+00
## s74   7.963801e-01
## s75   1.930300e-01
## s76  -1.622856e+00
## s77  -2.219351e+00
## s78  -1.414227e+00
## s79  -1.516522e-01
## s80  -1.477008e+00
## s81  -1.720296e+00
## s82  -3.185110e-01
## s83   7.962206e-01
## s84  -9.144463e-01
## s85  -5.987792e-01
## s86  -4.509573e-02
## s87  -8.908691e-01
## s88  -1.177850e-01
## s89   4.493147e-01
## s90   2.254912e+00
## s91  -4.489233e-01
## s92  -5.144484e-01
## s93   2.443984e+00
## s94   4.886696e+00
## s95   2.861116e+00
## s96   2.198044e+00
## s97   1.499231e+00
## s98   1.942867e+00
## s99   3.287239e+00
## s100  2.770645e+00
## s101  2.452824e+00
## s102  3.220669e+00
## s103  4.363059e+00
## s104  2.254409e+00
## s105  2.766362e+00
## s106  1.214408e+00
## s107  1.648391e+00
## s108  2.318209e+00
## s109  2.606305e+00
## s110  3.669800e+00
## s111  4.169232e+00
## s112  4.621317e+00
## s113  5.288762e+00
## s114  5.729720e+00
## s115  6.285048e+00
## s116  6.891578e+00
## s117  7.304470e+00
## s118  8.559226e+00
## s119  9.174117e+00
## s120  9.114483e+00
## s121  7.964882e+00
## s122  7.354071e+00
## s123  6.158270e+00
## s124  5.889066e+00
## s125  7.045911e+00
## s126  6.318557e+00
## s127  6.228680e+00
## s128  5.528708e+00
## s129  4.677026e+00
## s130  3.374079e+00
## s131 -3.650718e-01
## s132  1.864083e+00
## s133  3.389892e+00
## s134  5.828839e+00
## s135  6.332719e+00
## s136  7.229460e+00
## s137  6.170136e+00
## s138  6.757834e+00
## s139  8.500125e+00
## s140  8.624006e+00
## s141  9.755148e+00
## s142  7.194046e+00
## s143  7.376386e+00
## s144  5.134703e+00
## s145  5.925452e+00
## s146  8.177301e+00
## s147  9.805533e+00
## s148  7.842959e+00
## s149  9.795927e+00
## s150  6.933521e+00
## s151  8.306507e+00
## s152  8.486835e+00
## s153  1.080591e+01
## s154  1.183884e+01
## s155  1.259195e+01
## s156  1.379096e+01
## s157  1.354257e+01
## s158  1.350898e+01
## s159  1.238927e+01
## s160  1.227420e+01
## s161  9.418578e+00
## s162  5.102697e+00
## s163  4.080900e+00
## s164  8.659279e+00
## s165  1.098113e+01
## s166  1.210620e+01
## s167  1.035881e+01
## s168  1.090681e+01
## s169  1.148372e+01
## s170  1.091575e+01
## s171  1.058657e+01
## s172  8.115540e+00
## s173  9.977863e+00
## s174  4.790759e+00
## s175  8.177398e+00
## s176  9.829933e+00
## s177  1.049189e+01
## s178  7.905602e+00
## s179  7.322194e+00
## s180  7.155751e+00
## s181  6.922562e+00
## s182  6.350669e+00
## s183  7.922029e+00
## s184  9.314715e+00
## s185  4.509044e+00
## s186  6.554235e+00
## s187  7.325155e+00
## s188  5.387483e+00
## s189  4.255590e+00
## s190  5.819591e+00
## s191  5.178832e+00
## s192  6.454543e+00
## s193  6.435384e+00
## s194  7.587076e+00
## s195  5.708966e+00
## s196  5.283614e+00
## s197  4.991974e+00
## s198  6.251457e+00
## s199  4.340584e+00
## s200  4.872696e+00
## s201  4.841755e+00
## s202  5.554948e+00
## s203  5.324924e+00
## s204  4.220181e+00
## s205  4.820982e+00
## s206  4.961836e+00
## s207  5.797714e+00
## s208  5.849649e+00
## s209  6.440888e+00
## s210  5.515838e+00
## s211  5.229031e+00
## s212  5.779635e+00
## s213  7.517130e+00
## s214  6.737311e+00
## s215  5.865234e+00
## s216  4.678301e+00
## s217  3.794647e+00
## s218  3.569938e+00
## s219  2.191900e+00
## s220  4.158276e+00
## s221  4.487895e+00
## s222  5.848860e+00
## s223  6.231407e+00
## s224  4.810258e+00
## s225  3.349423e+00
## s226  2.670219e+00
## s227  3.516602e+00
## s228  3.032772e+00
## s229  4.292575e+00
## s230  3.364114e+00
## s231  2.845801e+00
## s232  4.083058e+00
## s233  5.190897e+00
## s234  6.616419e+00
## s235  6.202298e+00
## s236  6.609839e+00
## s237  7.158783e+00
## s238  5.311021e+00
## s239  6.121021e+00
## s240  5.187595e+00
## s241  4.207190e+00
## s242  5.199997e+00
## s243  6.087984e+00
## s244  6.922225e+00
## s245  7.070691e+00
## s246  6.374606e+00
## s247  6.792467e+00
## s248  7.172922e+00
## s249  5.561277e+00
## s250  5.521302e+00
## s251  4.979986e+00
## s252  4.866787e+00
## s253  6.582701e+00
## s254  6.811925e+00
## s255  6.315701e+00
## s256  7.115734e+00
## s257  6.182785e+00
## s258  5.146724e+00
## s259  4.740289e+00
## s260  4.968169e+00
## s261  4.200353e+00
## s262  2.839160e+00
## s263  3.112356e+00
## s264  3.910461e+00
## s265  4.812431e+00
## s266  4.765990e+00
## s267  4.482814e+00
## s268  4.310929e+00
## s269  4.675208e+00
## s270  4.570342e+00
## s271  3.477899e+00
## s272  3.164084e+00
## s273  3.949424e+00
## s274  4.648610e+00
## s275  2.333475e+00
## s276  2.818144e+00
## s277  3.809919e+00
## s278  4.749270e+00
## s279  4.206872e+00
## s280  4.932077e+00
## s281  4.874837e+00
## s282  2.047586e+00
## s283  1.834447e+00
## s284  2.556250e+00
## s285  1.777814e+00
## s286  1.442464e+00
## s287  1.492022e+00
## s288  2.339262e+00
## s289  1.609453e+00
## s290  1.325911e+00
## s291  5.807043e-01
## s292  2.745502e-01
## s293  1.237791e+00
## s294  8.498795e-01
## s295  5.606081e-01
## s296 -1.466818e-01
## s297 -1.248188e+00
## s298 -1.775563e+00
## s299 -2.944453e+00
## s300 -1.660302e+00
## s301 -1.743883e+00
## s302 -1.458948e+00
## s303 -1.275488e+00
## s304 -1.796295e+00
## s305 -2.541282e+00
## s306 -3.066512e+00
## s307 -4.992446e+00
## s308 -4.227070e+00
## s309 -4.373254e+00
## s310 -5.265555e+00
## s311 -5.032966e+00
## s312 -6.453988e+00
## s313 -7.052221e+00
## s314 -7.305308e+00
## s315 -7.117091e+00
## s316 -7.017000e+00
## s317 -7.081618e+00
## s318 -7.693883e+00
## s319 -8.958847e+00
## s320 -7.938604e+00
## s321 -7.048718e+00
## s322 -5.892132e+00
## s323 -6.417941e+00
## s324 -6.316405e+00
## s325 -7.351809e+00
## s326 -5.859603e+00
## s327 -4.359161e+00
## s328 -2.790935e+00
## s329 -4.577928e+00
## s330 -4.738208e+00
## s331 -5.604709e+00
## s332 -5.655111e+00
## s333 -6.110323e+00
## s334 -6.610442e+00
## s335 -5.985412e+00
## s336 -7.208772e+00
## s337 -8.101604e+00
## s338 -8.218172e+00
## s339 -8.282624e+00
## s340 -7.916427e+00
## s341 -8.475868e+00
## s342 -7.120550e+00
## s343 -8.286985e+00
## s344 -9.198028e+00
## s345 -8.401650e+00
## s346 -7.867668e+00
## s347 -8.390164e+00
## s348 -8.556913e+00
## s349 -1.036563e+01
## s350 -1.013688e+01
## s351 -9.495622e+00
## s352 -9.487741e+00
## s353 -9.614455e+00
## s354 -9.752089e+00
## s355 -9.680601e+00
## s356 -1.064618e+01
## s357 -1.072739e+01
## s358 -1.075701e+01
## s359 -1.095665e+01
## s360 -1.218979e+01
## s361 -1.305511e+01
## s362 -1.279254e+01
## s363 -1.136668e+01
## s364 -1.063268e+01
## s365 -1.202363e+01

We inspect the model's fitted values, which are stored in the fitted component.

head(model_train$fitted, 10)
## Time Series:
## Start = c(2014, 1) 
## End = c(2014, 10) 
## Frequency = 365 
##              xhat    level        trend    season
## 2014.000 13.60364 24.66347 0.0007786026 -11.06061
## 2014.003 11.08200 24.50945 0.0007786026 -13.42822
## 2014.005 12.52996 24.45470 0.0007786026 -11.92552
## 2014.008 12.88023 24.43520 0.0007786026 -11.55574
## 2014.011 12.37850 24.43243 0.0007786026 -12.05471
## 2014.014 11.42272 24.43084 0.0007786026 -13.00890
## 2014.016 12.12759 24.43558 0.0007786026 -12.30877
## 2014.019 11.85167 24.44670 0.0007786026 -12.59581
## 2014.022 12.80911 24.46327 0.0007786026 -11.65494
## 2014.025 12.35665 24.48045 0.0007786026 -12.12458

We visualize model_train using the plot() function.

plot(model_train, main="The Daily Report of Climate")

The plot shows the original time series in black and the fitted (one-step-ahead) values as a red line. The fitted series agrees fairly well with the original values, although it tends to lag slightly behind them.

As a measure of the accuracy of the forecasts, we can calculate the sum of squared errors for the in-sample forecast errors, which is stored in the named element SSE.

# accuracy : way 1 by using `SSE`
model_train$SSE
## [1] 2289.19

That is, the sum-of-squared-errors is 2289.19.

A second way to measure the accuracy of the model is the accuracy() function from the forecast package. We compare the xhat column of model_train$fitted against the actual values in clim_train.

# accuracy : way 2 by using `accuracy()` function
forecast::accuracy(object = model_train$fitted[,1], clim_train)
##                   ME    RMSE     MAE        MPE     MAPE      ACF1 Theil's U
## Test set 0.001594478 1.76842 1.18208 -0.1981344 5.143228 0.1470544  1.144072

Based on the results above, the accuracy measures for the Holt-Winters model are: MAPE = 5.14%, MAE = 1.18, and RMSE = 1.77.
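
Note that these figures are computed on the in-sample fitted values. For a more direct comparison with the ARIMA results later, we could also forecast the test period with the Holt-Winters model and score it against clim_test; a minimal sketch, not part of the original analysis:

# forecast the full test period with the Holt-Winters model and evaluate it
hw_forecast <- forecast(model_train, h = length(clim_test))
forecast::accuracy(object = hw_forecast$mean, clim_test)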

4.3.3. Model fitting with Autoregressive Integrated Moving Average (ARIMA)

For this model, we first use adf.test() to check whether the data is stationary.

clim_train %>% 
  adf.test()
## 
##  Augmented Dickey-Fuller Test
## 
## data:  .
## Dickey-Fuller = -1.7903, Lag order = 10, p-value = 0.6671
## alternative hypothesis: stationary

The hypotheses for the ADF stationarity test are as follows:

  • H0: The data is not stationary.
  • H1: The data is stationary.

Based on the output above, the p-value (0.6671) is greater than alpha (0.05). This means we fail to reject the null hypothesis, so we cannot conclude that the data is stationary. We therefore apply differencing (seasonal and regular) to make the data stationary.

clim_train %>% 
  diff(lag = 12) %>% # seasonal differencing (D = 1)
  diff(lag = 1) %>%  # regular (trend) differencing (d = 1)
  adf.test()
## Warning in adf.test(.): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  .
## Dickey-Fuller = -9.7415, Lag order = 10, p-value = 0.01
## alternative hypothesis: stationary

Based on the output above, the p-value (0.01) is less than alpha (0.05), which means the differenced data is stationary. We then use the tsdisplay() function to inspect the ACF and PACF of the differenced clim_train series.

library(forecast)

# Set the margins to smaller values
par(mar = c(3, 3, 2, 1))

clim_train_diff <- clim_train %>% 
  diff(lag = 12) %>% 
  diff(lag = 1)

tsdisplay(clim_train_diff, lag.max = 20)

Based on the plots above, we set up candidate SARIMA models.

For the seasonal part: P = 0, D = 1, Q = 0.

For the non-seasonal part: p = 1, 2, 3; d = 1; q = 0, 1.

Combining these in SARIMA (p, d, q)(P, D, Q)[m] notation gives the candidates:

  • (1,1,0)(0,1,0)[12]
  • (1,1,1)(0,1,0)[12]
  • (2,1,0)(0,1,0)[12]
  • (2,1,1)(0,1,0)[12]
  • (3,1,0)(0,1,0)[12]
  • (3,1,1)(0,1,0)[12]

We use the Arima() and auto.arima() functions to build these models and save each one as a new object.

# create candidate models using Arima() and auto.arima()
sarima1 <- Arima(y = clim_train, order = c(1,1,0), seasonal =  c(0,1,0))
sarima2 <- Arima(y = clim_train, order = c(1,1,1), seasonal =  c(0,1,0))
sarima3 <- Arima(y = clim_train, order = c(2,1,0), seasonal =  c(0,1,0))
sarima4 <- Arima(y = clim_train, order = c(2,1,1), seasonal =  c(0,1,0))
sarima5 <- Arima(y = clim_train, order = c(3,1,0), seasonal =  c(0,1,0))
sarima6 <- Arima(y = clim_train, order = c(3,1,1), seasonal =  c(0,1,0))
sarima_auto <- auto.arima(y = clim_train)

After building the candidate models, we check the AIC value of each one; a lower AIC indicates a better fit.

# check AIC value
sarima1$aic
## [1] 3249.425
sarima2$aic
## [1] 3160.268
sarima3$aic
## [1] 3242.748
sarima4$aic
## [1] 3162.192
sarima_auto$aic
## [1] 3150.97
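
The auto.arima() model has the lowest AIC among the candidates shown. To compare the candidates more compactly, we could collect all the AIC values in one named vector and sort them; a minimal sketch:

# collect the AIC of every candidate model and rank them (lower is better)
aic_values <- c(sarima1 = sarima1$aic, sarima2 = sarima2$aic,
                sarima3 = sarima3$aic, sarima4 = sarima4$aic,
                sarima5 = sarima5$aic, sarima6 = sarima6$aic,
                sarima_auto = sarima_auto$aic)
sort(aic_values)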

5. Forecasting and Model Evaluation

We forecast 18 periods ahead with the selected sarima_auto model and then evaluate the forecasts against the test data.

# forecast, h = 18
clim_forecast <- forecast(object = sarima_auto, h = 18)
clim_forecast
##           Point Forecast     Lo 80    Hi 80     Lo 95    Hi 95
## 2016.0055      14.491288 11.829716 17.15286 10.420765 18.56181
## 2016.0082      13.666039 10.379749 16.95233  8.640093 18.69198
## 2016.0110      13.667601 10.097068 17.23813  8.206943 19.12826
## 2016.0137      11.759262  8.048332 15.47019  6.083883 17.43464
## 2016.0164       9.450648  5.668157 13.23314  3.665828 15.23547
## 2016.0192       9.873727  6.054237 13.69322  4.032321 15.71513
## 2016.0219      10.533547  6.694794 14.37230  4.662680 16.40441
## 2016.0247      11.058766  7.209946 14.90759  5.172504 16.94503
## 2016.0274      10.952031  7.097941 14.80612  5.057709 16.84635
## 2016.0301      10.590259  6.733408 14.44711  4.691715 16.48880
## 2016.0329      12.224839  8.366541 16.08314  6.324081 18.12560
## 2016.0356      11.981777  8.122720 15.84083  6.079859 17.88370
## 2016.0384      12.361802  8.502347 16.22126  6.459275 18.26433
## 2016.0411      12.990442  9.130778 16.85011  7.087595 18.89329
## 2016.0438      13.493077  9.633304 17.35285  7.590063 19.39609
## 2016.0466      12.994986  9.135156 16.85482  7.091884 18.89809
## 2016.0493      13.746369  9.886508 17.60623  7.843221 19.64952
## 2016.0521      13.372370  9.512493 17.23225  7.469198 19.27554
clim_train %>% 
  autoplot()+
  autolayer(clim_test, series = "data test")+
  autolayer(clim_forecast$mean, series = "forecast")+
  autolayer(sarima_auto$fitted, series = "fitted values")

# accuracy
forecast::accuracy(object = clim_forecast$mean, clim_test)
##                ME     RMSE      MAE      MPE     MAPE      ACF1 Theil's U
## Test set 3.005627 4.165464 3.317985 17.94666 20.48799 0.7312294  3.263169

Based on the results above, the accuracy measures for the SARIMA (auto.arima) model on the test data are: MAPE = 20.49%, MAE = 3.32, and RMSE = 4.17.
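
As a cross-check, the same MAPE can be computed directly with MLmetrics, since the forecast horizon starts exactly where clim_test starts; a minimal sketch:

# cross-check the test MAPE with MLmetrics (the result is a proportion, so multiply by 100)
MLmetrics::MAPE(y_pred = as.numeric(clim_forecast$mean),
                y_true = head(as.numeric(clim_test), length(clim_forecast$mean))) * 100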

6. Assumptions

Time series assumptions are tested to check whether the residuals from the model are good enough, that is, whether the model has captured the information in the data. We work with the residuals because they combine information from both the actual data and the model's predictions.

6.1. No autocorrelation in residuals

  • H0 : residuals have no autocorrelation
  • H1 : residuals have autocorrelation

If the p-value > 0.05 (alpha), we fail to reject H0, so the residuals have no autocorrelation.

# Reduce the margins for the plot
par(mar = c(3, 3, 2, 1) + 0.1)

# Use visualization plot
sarima_auto$residuals %>%
  tsdisplay(lag.max = 20)

We expect the residual ACF and PACF to show no lags crossing the blue significance bounds.

# use Ljung-Box test
Box.test(sarima_auto$residuals, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  sarima_auto$residuals
## X-squared = 0.065276, df = 1, p-value = 0.7983

From the result above, with a p-value of 0.7983 (> 0.05), we fail to reject H0, so the residuals have no autocorrelation and the no-autocorrelation assumption is satisfied.

6.2. Normality of residuals

  • H0 : residuals are normally distributed
  • H1 : residuals are not normally distributed

If the p-value > 0.05 (alpha), we fail to reject H0 and conclude the residuals are normally distributed; otherwise we reject H0.

# use shapiro.test()
shapiro.test(sarima_auto$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  sarima_auto$residuals
## W = 0.92283, p-value < 2.2e-16
hist(sarima_auto$residuals)

From the results above, with a p-value smaller than 2.2e-16 (far below alpha = 0.05), we reject H0 and conclude that the residuals are not normally distributed, so the normality assumption is not satisfied.
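
To see this departure from normality visually, we could also draw a normal Q-Q plot of the residuals; a minimal sketch:

# normal Q-Q plot of the residuals; strong curvature or heavy tails indicate non-normality
qqnorm(as.numeric(sarima_auto$residuals))
qqline(as.numeric(sarima_auto$residuals), col = "red")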

7. Conclusion

Based on our analysis, we recommend further pre-processing of the data so that the residual assumptions, normally distributed residuals and no autocorrelation, are met. Among the models we fitted, the Holt-Winters model is the best for this data because it produces the smallest error.