Hi, welcome to my LBB about time series - forecasting This LBB is about climate time series - forecasting in the city of Delhi, India. We want to train the model on Weather Forecasting for Indian climate. Let’s check it out!
We can do import library and read data by using library() and read.csv() function. We use dim() function to check the dimension data of each object. Then, we use head() and tail() to check the first and the last 6 rows data.
# load library
library(tidyverse) #data manipulation
library(lubridate) # date manipulation
library(forecast) # time series library
library(TTR) # for Simple moving average function
library(MLmetrics) # calculate error
library(tseries) # adf.test
library(fpp) # usconsumption# read data
climate_train <- read.csv("input_data/DailyDelhiClimateTrain.csv")
climate_test <- read.csv("input_data/DailyDelhiClimateTest.csv")
# check data
dim(climate_train)#> [1] 1462 5
dim(climate_test)#> [1] 114 5
# check data
head(climate_train)# check data
head(climate_test)The dimension of data object climate_train is 1462 rows and 5 columns. While the dimension data object climate_test is 114 rows and 5 columns. Next, we do data wrangling.
In this data wrangling, we can do check data type/stucture, missing value and outlier.
Description of variable : - date : Date of format YYYY-MM-DD - meantemp : Mean temperature averaged out from multiple 3 hour intervals in a day. - humidity : Humidity value for the day (units are grams of water vapor per cubic meter volume of air). - wind_speed : Wind speed measured in kmph. - meanpressure : Pressure reading of weather (measure in atm)
We use glimpse() to check the data type of the objects.
# check data type/structure
glimpse(climate_train)#> Rows: 1,462
#> Columns: 5
#> $ date <chr> "2013-01-01", "2013-01-02", "2013-01-03", "2013-01-04", "~
#> $ meantemp <dbl> 10.000000, 7.400000, 7.166667, 8.666667, 6.000000, 7.0000~
#> $ humidity <dbl> 84.50000, 92.00000, 87.00000, 71.33333, 86.83333, 82.8000~
#> $ wind_speed <dbl> 0.0000000, 2.9800000, 4.6333333, 1.2333333, 3.7000000, 1.~
#> $ meanpressure <dbl> 1015.667, 1017.800, 1018.667, 1017.167, 1016.500, 1018.00~
# check data type/structure
glimpse(climate_test)#> Rows: 114
#> Columns: 5
#> $ date <chr> "2017-01-01", "2017-01-02", "2017-01-03", "2017-01-04", "~
#> $ meantemp <dbl> 15.91304, 18.50000, 17.11111, 18.70000, 18.38889, 19.3181~
#> $ humidity <dbl> 85.86957, 77.22222, 81.88889, 70.05000, 74.94444, 79.3181~
#> $ wind_speed <dbl> 2.743478, 2.894444, 4.016667, 4.545000, 3.300000, 8.68181~
#> $ meanpressure <dbl> 59.000, 1018.278, 1018.333, 1015.700, 1014.333, 1011.773,~
Which data variable thet we need to change? - date : chr -> date
# climate_train - change data type/structure : `date` from chr to date
climate_train <- climate_train %>%
mutate(date = lubridate::ymd(date))
glimpse(climate_train)#> Rows: 1,462
#> Columns: 5
#> $ date <date> 2013-01-01, 2013-01-02, 2013-01-03, 2013-01-04, 2013-01-~
#> $ meantemp <dbl> 10.000000, 7.400000, 7.166667, 8.666667, 6.000000, 7.0000~
#> $ humidity <dbl> 84.50000, 92.00000, 87.00000, 71.33333, 86.83333, 82.8000~
#> $ wind_speed <dbl> 0.0000000, 2.9800000, 4.6333333, 1.2333333, 3.7000000, 1.~
#> $ meanpressure <dbl> 1015.667, 1017.800, 1018.667, 1017.167, 1016.500, 1018.00~
# climate_test - change data type/structure : `date` from chr to date
climate_test <- climate_test %>%
mutate(date = lubridate::ymd(date))
glimpse(climate_test)#> Rows: 114
#> Columns: 5
#> $ date <date> 2017-01-01, 2017-01-02, 2017-01-03, 2017-01-04, 2017-01-~
#> $ meantemp <dbl> 15.91304, 18.50000, 17.11111, 18.70000, 18.38889, 19.3181~
#> $ humidity <dbl> 85.86957, 77.22222, 81.88889, 70.05000, 74.94444, 79.3181~
#> $ wind_speed <dbl> 2.743478, 2.894444, 4.016667, 4.545000, 3.300000, 8.68181~
#> $ meanpressure <dbl> 59.000, 1018.278, 1018.333, 1015.700, 1014.333, 1011.773,~
Based on the process above, we use ymd() from lubridate packages to change the data type related to date.
We use anyNA(), is.na() and colSums() function to check the missing value.
# check missing value - climate_train
anyNA(climate_train)#> [1] FALSE
climate_train %>%
is.na() %>%
colSums()#> date meantemp humidity wind_speed meanpressure
#> 0 0 0 0 0
# check missing value - climate_test
anyNA(climate_test)#> [1] FALSE
climate_test %>%
is.na() %>%
colSums()#> date meantemp humidity wind_speed meanpressure
#> 0 0 0 0 0
There is no missing value on the data. We can continue the next progress.
boxplot(scale(climate_train[,2:5]))On the object climate_train, there is outlier on the attributes wind_speed and meanpressure.
boxplot(scale(climate_test[,2:5])) On the object
climate_test, there is outlier on the attributes meanpressure.
We can use histogram to check distribution data on the objects climate_train and climate_test.
# Histogram for each Attribute - climate_train
climate_train %>%
gather(Attributes, value, 2:5) %>%
ggplot(aes(x=value, fill=Attributes)) +
geom_histogram(colour="black", show.legend=FALSE) +
facet_wrap(~Attributes, scales="free_x") +
labs(x="Values", y="Frequency",
title="Climate-train Attributes - Histograms") +
theme_bw()# Histogram for each Attribute - climate_test
climate_test %>%
gather(Attributes, value, 2:5) %>%
ggplot(aes(x=value, fill=Attributes)) +
geom_histogram(colour="black", show.legend=FALSE) +
facet_wrap(~Attributes, scales="free_x") +
labs(x="Values", y="Frequency",
title="Climate-train Attributes - Histograms") +
theme_bw()There is outlier at attribute meanpressure at points 1000 - 1100 on the objects climate_train and climate_test.
In this data processing, there are some steps to do, such as : change data into time series format, do decompose and model fitting. In this LBB, We use data object climate_train only.
We find the range time for atribute date on the objects by use range() function.
# range
range(climate_train$date)#> [1] "2013-01-01" "2017-01-01"
Use ts() function to change data into time series format.
# object train_ts
train_ts <- ts(data = climate_train$meantemp,# target variable
start = c(2013,1,1), #start period
frequency = 365)# daily data so we use 365 days
head(train_ts,12)#> Time Series:
#> Start = c(2013, 1)
#> End = c(2013, 12)
#> Frequency = 365
#> [1] 10.000000 7.400000 7.166667 8.666667 6.000000 7.000000 7.000000
#> [8] 8.857143 14.000000 11.000000 15.714286 14.000000
# check missing value - train_ts
anyNA(train_ts)#> [1] FALSE
# train_ts %>%
# is.na() %>%
# colSums()There is no missing value on the data train_ts. We can do the next process.
We check the data visualization to get the pattern.
# visualization object train_ts
train_ts %>%
autoplot()We can see the pattern data for object train_ts is additive. Use this type data for the next decompose() function.
# decompose - train_ts
train_dc<- decompose(train_ts, type = "additive")
train_dc%>% autoplot()From the result on above, we can see the pattern is the same for some periode. So we can say type data seasonal is additive.
Before do model fitting, we do splitting data, with : * Data train: 2013 - 2015 (3 tahun) and save with the object name clim_train. * Data test: 2016 (1 tahun) and save with the object name clim_test.
Then, we continue the next process by checking the missing value on the data clim_train and clim_test.
# test use `tail()`
clim_test <- train_ts %>% tail(365) # take 12 last data
# train use `head()`
clim_train <- train_ts %>% head(-365) # don't take 12 last data# check missing value - clim_train, clim_test, train_dc
anyNA(clim_train)#> [1] FALSE
anyNA(clim_test)#> [1] FALSE
anyNA(train_dc)#> [1] FALSE
# climate_train %>%
# is.na() %>%
# colSums()There is no missing value on the data clim_train, clim_test and train_dc. We can continue to do the next process which is the fitting model.
For model fitting, we use 3 kind models, such as : Simple Moving Average (SMA), Triple Exponential Smoothing (Holts Winter) and Auto Regresive Integrated Moving Average (ARIMA).
In this model, we use SMA() function from data train_ts with
# Build SMA model with orde = 3 and 10
train_sma3 <- SMA(x = train_ts, n = 3)
train_sma10 <- SMA(x = train_ts, n = 10)Based on the result of train_sma3 and train_sma10, there is missing value in the some rows. We cannot drop NA because it can cause the missing data.
# anyNA(train_sma3)
# train_sma_3 <-drop_na(train_sma3, "NA")
# train_sma_3# Visualization data result from SMA model
# model_sma <- train_ts %>%
# autoplot() +
# forecast::autolayer(train_sma3) +
# forecast::autolayer(train_sma10) class(train_sma3)#> [1] "numeric"
This model is simple and it can not do the accuracy() function from forecasting packages. So we stop the process for the sma model here. The next model is model fitting by Triple Exponential Smoothing (Holts Winter).
In this model, we use HoltWinters() function with data clim_train and additive for the seasonal. Then, we save the data model with the object name model_train.
# model fitting with train_ts
model_train <- HoltWinters(x = clim_train, seasonal = "additive")
model_train#> Holt-Winters exponential smoothing with trend and additive seasonal component.
#>
#> Call:
#> HoltWinters(x = clim_train, seasonal = "additive")
#>
#> Smoothing parameters:
#> alpha: 0.6770537
#> beta : 0
#> gamma: 1
#>
#> Coefficients:
#> [,1]
#> a 2.602363e+01
#> b 7.786026e-04
#> s1 -1.198739e+01
#> s2 -1.201948e+01
#> s3 -1.208472e+01
#> s4 -1.335531e+01
#> s5 -1.341066e+01
#> s6 -1.273301e+01
#> s7 -1.179621e+01
#> s8 -1.185330e+01
#> s9 -1.225452e+01
#> s10 -1.196065e+01
#> s11 -1.046138e+01
#> s12 -9.993057e+00
#> s13 -1.159085e+01
#> s14 -1.183360e+01
#> s15 -1.214046e+01
#> s16 -1.222146e+01
#> s17 -1.049410e+01
#> s18 -1.022124e+01
#> s19 -1.077042e+01
#> s20 -9.972734e+00
#> s21 -8.614474e+00
#> s22 -1.073655e+01
#> s23 -1.185561e+01
#> s24 -1.056419e+01
#> s25 -8.944911e+00
#> s26 -8.754930e+00
#> s27 -9.452906e+00
#> s28 -9.464328e+00
#> s29 -9.385187e+00
#> s30 -9.786707e+00
#> s31 -8.705496e+00
#> s32 -7.170183e+00
#> s33 -6.867985e+00
#> s34 -7.347541e+00
#> s35 -6.330757e+00
#> s36 -5.164852e+00
#> s37 -8.854825e+00
#> s38 -1.071593e+01
#> s39 -1.193176e+01
#> s40 -1.137633e+01
#> s41 -1.152675e+01
#> s42 -9.496429e+00
#> s43 -1.024789e+01
#> s44 -1.176785e+01
#> s45 -1.077741e+01
#> s46 -9.761975e+00
#> s47 -9.490288e+00
#> s48 -9.047401e+00
#> s49 -8.720213e+00
#> s50 -8.227624e+00
#> s51 -8.042738e+00
#> s52 -8.364720e+00
#> s53 -7.908749e+00
#> s54 -6.944919e+00
#> s55 -7.304191e+00
#> s56 -6.738413e+00
#> s57 -7.697524e+00
#> s58 -9.552880e+00
#> s59 -8.870290e+00
#> s60 -7.907670e+00
#> s61 -6.435495e+00
#> s62 -5.384627e+00
#> s63 -5.619107e+00
#> s64 -6.529959e+00
#> s65 -4.999649e+00
#> s66 -3.872528e+00
#> s67 -4.543274e+00
#> s68 -3.994859e+00
#> s69 -3.735427e+00
#> s70 -4.767116e+00
#> s71 -4.914178e+00
#> s72 -4.895460e+00
#> s73 -1.898573e+00
#> s74 7.963801e-01
#> s75 1.930300e-01
#> s76 -1.622856e+00
#> s77 -2.219351e+00
#> s78 -1.414227e+00
#> s79 -1.516522e-01
#> s80 -1.477008e+00
#> s81 -1.720296e+00
#> s82 -3.185110e-01
#> s83 7.962206e-01
#> s84 -9.144463e-01
#> s85 -5.987792e-01
#> s86 -4.509573e-02
#> s87 -8.908691e-01
#> s88 -1.177850e-01
#> s89 4.493147e-01
#> s90 2.254912e+00
#> s91 -4.489233e-01
#> s92 -5.144484e-01
#> s93 2.443984e+00
#> s94 4.886696e+00
#> s95 2.861116e+00
#> s96 2.198044e+00
#> s97 1.499231e+00
#> s98 1.942867e+00
#> s99 3.287239e+00
#> s100 2.770645e+00
#> s101 2.452824e+00
#> s102 3.220669e+00
#> s103 4.363059e+00
#> s104 2.254409e+00
#> s105 2.766362e+00
#> s106 1.214408e+00
#> s107 1.648391e+00
#> s108 2.318209e+00
#> s109 2.606305e+00
#> s110 3.669800e+00
#> s111 4.169232e+00
#> s112 4.621317e+00
#> s113 5.288762e+00
#> s114 5.729720e+00
#> s115 6.285048e+00
#> s116 6.891578e+00
#> s117 7.304470e+00
#> s118 8.559226e+00
#> s119 9.174117e+00
#> s120 9.114483e+00
#> s121 7.964882e+00
#> s122 7.354071e+00
#> s123 6.158270e+00
#> s124 5.889066e+00
#> s125 7.045911e+00
#> s126 6.318557e+00
#> s127 6.228680e+00
#> s128 5.528708e+00
#> s129 4.677026e+00
#> s130 3.374079e+00
#> s131 -3.650718e-01
#> s132 1.864083e+00
#> s133 3.389892e+00
#> s134 5.828839e+00
#> s135 6.332719e+00
#> s136 7.229460e+00
#> s137 6.170136e+00
#> s138 6.757834e+00
#> s139 8.500125e+00
#> s140 8.624006e+00
#> s141 9.755148e+00
#> s142 7.194046e+00
#> s143 7.376386e+00
#> s144 5.134703e+00
#> s145 5.925452e+00
#> s146 8.177301e+00
#> s147 9.805533e+00
#> s148 7.842959e+00
#> s149 9.795927e+00
#> s150 6.933521e+00
#> s151 8.306507e+00
#> s152 8.486835e+00
#> s153 1.080591e+01
#> s154 1.183884e+01
#> s155 1.259195e+01
#> s156 1.379096e+01
#> s157 1.354257e+01
#> s158 1.350898e+01
#> s159 1.238927e+01
#> s160 1.227420e+01
#> s161 9.418578e+00
#> s162 5.102697e+00
#> s163 4.080900e+00
#> s164 8.659279e+00
#> s165 1.098113e+01
#> s166 1.210620e+01
#> s167 1.035881e+01
#> s168 1.090681e+01
#> s169 1.148372e+01
#> s170 1.091575e+01
#> s171 1.058657e+01
#> s172 8.115540e+00
#> s173 9.977863e+00
#> s174 4.790759e+00
#> s175 8.177398e+00
#> s176 9.829933e+00
#> s177 1.049189e+01
#> s178 7.905602e+00
#> s179 7.322194e+00
#> s180 7.155751e+00
#> s181 6.922562e+00
#> s182 6.350669e+00
#> s183 7.922029e+00
#> s184 9.314715e+00
#> s185 4.509044e+00
#> s186 6.554235e+00
#> s187 7.325155e+00
#> s188 5.387483e+00
#> s189 4.255590e+00
#> s190 5.819591e+00
#> s191 5.178832e+00
#> s192 6.454543e+00
#> s193 6.435384e+00
#> s194 7.587076e+00
#> s195 5.708966e+00
#> s196 5.283614e+00
#> s197 4.991974e+00
#> s198 6.251457e+00
#> s199 4.340584e+00
#> s200 4.872696e+00
#> s201 4.841755e+00
#> s202 5.554948e+00
#> s203 5.324924e+00
#> s204 4.220181e+00
#> s205 4.820982e+00
#> s206 4.961836e+00
#> s207 5.797714e+00
#> s208 5.849649e+00
#> s209 6.440888e+00
#> s210 5.515838e+00
#> s211 5.229031e+00
#> s212 5.779635e+00
#> s213 7.517130e+00
#> s214 6.737311e+00
#> s215 5.865234e+00
#> s216 4.678301e+00
#> s217 3.794647e+00
#> s218 3.569938e+00
#> s219 2.191900e+00
#> s220 4.158276e+00
#> s221 4.487895e+00
#> s222 5.848860e+00
#> s223 6.231407e+00
#> s224 4.810258e+00
#> s225 3.349423e+00
#> s226 2.670219e+00
#> s227 3.516602e+00
#> s228 3.032772e+00
#> s229 4.292575e+00
#> s230 3.364114e+00
#> s231 2.845801e+00
#> s232 4.083058e+00
#> s233 5.190897e+00
#> s234 6.616419e+00
#> s235 6.202298e+00
#> s236 6.609839e+00
#> s237 7.158783e+00
#> s238 5.311021e+00
#> s239 6.121021e+00
#> s240 5.187595e+00
#> s241 4.207190e+00
#> s242 5.199997e+00
#> s243 6.087984e+00
#> s244 6.922225e+00
#> s245 7.070691e+00
#> s246 6.374606e+00
#> s247 6.792467e+00
#> s248 7.172922e+00
#> s249 5.561277e+00
#> s250 5.521302e+00
#> s251 4.979986e+00
#> s252 4.866787e+00
#> s253 6.582701e+00
#> s254 6.811925e+00
#> s255 6.315701e+00
#> s256 7.115734e+00
#> s257 6.182785e+00
#> s258 5.146724e+00
#> s259 4.740289e+00
#> s260 4.968169e+00
#> s261 4.200353e+00
#> s262 2.839160e+00
#> s263 3.112356e+00
#> s264 3.910461e+00
#> s265 4.812431e+00
#> s266 4.765990e+00
#> s267 4.482814e+00
#> s268 4.310929e+00
#> s269 4.675208e+00
#> s270 4.570342e+00
#> s271 3.477899e+00
#> s272 3.164084e+00
#> s273 3.949424e+00
#> s274 4.648610e+00
#> s275 2.333475e+00
#> s276 2.818144e+00
#> s277 3.809919e+00
#> s278 4.749270e+00
#> s279 4.206872e+00
#> s280 4.932077e+00
#> s281 4.874837e+00
#> s282 2.047586e+00
#> s283 1.834447e+00
#> s284 2.556250e+00
#> s285 1.777814e+00
#> s286 1.442464e+00
#> s287 1.492022e+00
#> s288 2.339262e+00
#> s289 1.609453e+00
#> s290 1.325911e+00
#> s291 5.807043e-01
#> s292 2.745502e-01
#> s293 1.237791e+00
#> s294 8.498795e-01
#> s295 5.606081e-01
#> s296 -1.466818e-01
#> s297 -1.248188e+00
#> s298 -1.775563e+00
#> s299 -2.944453e+00
#> s300 -1.660302e+00
#> s301 -1.743883e+00
#> s302 -1.458948e+00
#> s303 -1.275488e+00
#> s304 -1.796295e+00
#> s305 -2.541282e+00
#> s306 -3.066512e+00
#> s307 -4.992446e+00
#> s308 -4.227070e+00
#> s309 -4.373254e+00
#> s310 -5.265555e+00
#> s311 -5.032966e+00
#> s312 -6.453988e+00
#> s313 -7.052221e+00
#> s314 -7.305308e+00
#> s315 -7.117091e+00
#> s316 -7.017000e+00
#> s317 -7.081618e+00
#> s318 -7.693883e+00
#> s319 -8.958847e+00
#> s320 -7.938604e+00
#> s321 -7.048718e+00
#> s322 -5.892132e+00
#> s323 -6.417941e+00
#> s324 -6.316405e+00
#> s325 -7.351809e+00
#> s326 -5.859603e+00
#> s327 -4.359161e+00
#> s328 -2.790935e+00
#> s329 -4.577928e+00
#> s330 -4.738208e+00
#> s331 -5.604709e+00
#> s332 -5.655111e+00
#> s333 -6.110323e+00
#> s334 -6.610442e+00
#> s335 -5.985412e+00
#> s336 -7.208772e+00
#> s337 -8.101604e+00
#> s338 -8.218172e+00
#> s339 -8.282624e+00
#> s340 -7.916427e+00
#> s341 -8.475868e+00
#> s342 -7.120550e+00
#> s343 -8.286985e+00
#> s344 -9.198028e+00
#> s345 -8.401650e+00
#> s346 -7.867668e+00
#> s347 -8.390164e+00
#> s348 -8.556913e+00
#> s349 -1.036563e+01
#> s350 -1.013688e+01
#> s351 -9.495622e+00
#> s352 -9.487741e+00
#> s353 -9.614455e+00
#> s354 -9.752089e+00
#> s355 -9.680601e+00
#> s356 -1.064618e+01
#> s357 -1.072739e+01
#> s358 -1.075701e+01
#> s359 -1.095665e+01
#> s360 -1.218979e+01
#> s361 -1.305511e+01
#> s362 -1.279254e+01
#> s363 -1.136668e+01
#> s364 -1.063268e+01
#> s365 -1.202363e+01
We check the data model with variable fitted.
head(model_train$fitted, 10)#> Time Series:
#> Start = c(2014, 1)
#> End = c(2014, 10)
#> Frequency = 365
#> xhat level trend season
#> 2014.000 13.60364 24.66347 0.0007786026 -11.06061
#> 2014.003 11.08200 24.50945 0.0007786026 -13.42822
#> 2014.005 12.52996 24.45470 0.0007786026 -11.92552
#> 2014.008 12.88023 24.43520 0.0007786026 -11.55574
#> 2014.011 12.37850 24.43243 0.0007786026 -12.05471
#> 2014.014 11.42272 24.43084 0.0007786026 -13.00890
#> 2014.016 12.12759 24.43558 0.0007786026 -12.30877
#> 2014.019 11.85167 24.44670 0.0007786026 -12.59581
#> 2014.022 12.80911 24.46327 0.0007786026 -11.65494
#> 2014.025 12.35665 24.48045 0.0007786026 -12.12458
We check the visualization of model_ _train by using plot() function.
plot(model_train, main="The Daily Report of Climate")The plot shows the original time series in black, and the forecasts as a red line. We can see from the picture that the forecasts time series agree pretty well with the original values, although they tend to lag behind the original values a little bit.
As a measure of the accuracy of the forecasts, we can calculate the sum of squared errors for the in-sample forecast errors, which is stored in the named element SSE:
# model_train$seasonal
# model_train$alpha
# model_train$beta
# model_train$gamma
# accuracy : way 1 by using `SSE`
model_train$SSE#> [1] 2289.19
That is, here the sum-of-squared-errors is 2289.19.
We have the second way to measure the accuracy of the forecasts. We continue the next process is to see the error measurement with the accuracy() function from the forecast package. We take the column xhat of the data model_train$fitted based on the data clim_train.
# accuracy : way 2 by using `accuracy()` function
forecast::accuracy(object = model_train$fitted[,1], clim_train)#> ME RMSE MAE MPE MAPE ACF1 Theil's U
#> Test set 0.001594478 1.76842 1.18208 -0.1981344 5.143228 0.1470544 1.144072
Base on the result above, we get the accuracy measurement for the holt winters model are : MAPE : 5.14 % MAE : 1.18 % RMSE : 1.77 %
In this model, we use adf.test() to check if the data is already stationer.
clim_train %>%
adf.test()#>
#> Augmented Dickey-Fuller Test
#>
#> data: .
#> Dickey-Fuller = -1.7903, Lag order = 10, p-value = 0.6671
#> alternative hypothesis: stationary
We make the hypothesis of the stationary data as like: H0: data no-stationary H1: data stationary
Based on the data above, we find the P-value (0.6671) > the alpha(0.05), so it means the data is not stationary. We must set the differencing in the seasonal data to make the data stationary.
clim_train %>%
diff(lag = 12) %>% #untuk diff seasonal (D = 1)
diff(lag = 1) %>% #untuk diff trend (d=1)
adf.test()#>
#> Augmented Dickey-Fuller Test
#>
#> data: .
#> Dickey-Fuller = -9.7415, Lag order = 10, p-value = 0.01
#> alternative hypothesis: stationary
Based on the data above, we find the P-value (0.01) < the alpha(0.05), so it means the data is already stationary. Then we use the tsdisplay() function to see the adjusted data clim_train graphic.
clim_train %>%
diff(lag = 12) %>%
diff(lag = 1) %>%
tsdisplay(lag.max = 20)Based on the graphic above, we set the combined model ARIMA.
For seasosal : AR
* P : 0 * D : 1 * Q : 0
For all data : MA * p: 1,2,3 * d: 1 * q: 0,1
Combine the model
SARIMA: (p, d, q)(P, D, Q)[m]
Use Arima() and auto.arima() function to create model Arima and save the data model into the new objects.
# create model using auto.arima()
sarima1 <- Arima(y = clim_train, order = c(1,1,0), seasonal = c(0,1,0))
sarima2 <- Arima(y = clim_train, order = c(1,1,1), seasonal = c(0,1,0))
sarima3 <- Arima(y = clim_train, order = c(2,1,0), seasonal = c(0,1,0))
sarima4 <- Arima(y = clim_train, order = c(2,1,1), seasonal = c(0,1,0))
sarima5 <- Arima(y = clim_train, order = c(3,1,0), seasonal = c(0,1,0))
sarima6 <- Arima(y = clim_train, order = c(3,1,1), seasonal = c(0,1,0))
sarima_auto <- auto.arima(y = clim_train)After building the combined model, we check the aic value for each model.
# check AIC value
sarima1$aic#> [1] 3249.425
sarima2$aic#> [1] 3160.268
sarima3$aic#> [1] 3242.748
sarima4$aic#> [1] 3162.192
sarima_auto$aic#> [1] 3150.97
# forecast, h = 18
clim_forecast <- forecast(object = sarima_auto, h = 18)
clim_forecast#> Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
#> 2016.0055 14.491288 11.829716 17.15286 10.420765 18.56181
#> 2016.0082 13.666039 10.379749 16.95233 8.640092 18.69198
#> 2016.0110 13.667601 10.097068 17.23813 8.206942 19.12826
#> 2016.0137 11.759262 8.048331 15.47019 6.083883 17.43464
#> 2016.0164 9.450647 5.668157 13.23314 3.665827 15.23547
#> 2016.0192 9.873726 6.054237 13.69322 4.032321 15.71513
#> 2016.0219 10.533547 6.694793 14.37230 4.662680 16.40441
#> 2016.0247 11.058766 7.209946 14.90759 5.172504 16.94503
#> 2016.0274 10.952031 7.097941 14.80612 5.057709 16.84635
#> 2016.0301 10.590259 6.733408 14.44711 4.691714 16.48880
#> 2016.0329 12.224839 8.366541 16.08314 6.324081 18.12560
#> 2016.0356 11.981777 8.122720 15.84083 6.079858 17.88370
#> 2016.0384 12.361802 8.502347 16.22126 6.459275 18.26433
#> 2016.0411 12.990442 9.130778 16.85011 7.087595 18.89329
#> 2016.0438 13.493077 9.633304 17.35285 7.590063 19.39609
#> 2016.0466 12.994986 9.135155 16.85482 7.091884 18.89809
#> 2016.0493 13.746369 9.886508 17.60623 7.843221 19.64952
#> 2016.0521 13.372370 9.512493 17.23225 7.469198 19.27554
# menggunakan autoplot dan autolayer
clim_train %>%
autoplot()+
autolayer(clim_test, series = "data test")+
autolayer(clim_forecast$mean, series = "forecast")+
autolayer(sarima_auto$fitted, series = "fitted values")# accuracy
forecast::accuracy(object = clim_forecast$mean, clim_test)#> ME RMSE MAE MPE MAPE ACF1 Theil's U
#> Test set 3.005627 4.165464 3.317985 17.94666 20.48799 0.7312294 3.263169
Base on the result above, we get the accuracy measurement for the holt winters model are : MAPE : 20.49 % MAE : 3.32 % RMSE : 4.16 %
Assumptions in the time series are tested to measure whether the residuals obtained from modeling results are good enough to describe and capture information on the data. Why use residual data? Because by using residual data, we can get information from actual data as well as from prediction results using models.
H0 : residual has no-autocorrelation H1 : residual has autocorrelation
We noted that p-value > 0.05 (alpha), so the residual have no-autocorrelation.
# use visualization plot
sarima_auto$residuals %>%
tsdisplay(lag.max = 20)We hope the model have no lag that over blue line.
# use Ljung-Box test
Box.test(sarima_auto$residuals, type = "Ljung-Box")#>
#> Box-Ljung test
#>
#> data: sarima_auto$residuals
#> X-squared = 0.065276, df = 1, p-value = 0.7983
From the result above, we can conclude the assumption that the model with p-value 0.7983 have no autocorrelation (The model data meet H0 : residual has no-autocorrelation).
H0: Residual have norm distribution H1: Residual have no norm distribution
We noted that p-value > 0.05 (alpha), so the residual have norm distribution.
# use saphiro.test()
shapiro.test(sarima_auto$residuals)#>
#> Shapiro-Wilk normality test
#>
#> data: sarima_auto$residuals
#> W = 0.92283, p-value < 2.2e-16
hist(sarima_auto$residuals)From the result above, we can conclude the assumption that the model with p-value 0.00000000000000022 have norm distribution (The model data meet H0: Residual have norm distribution).
We recommended that the data meet the assumption (have norm distribution and no autocorrelation) and the best model used is Holt-Winters with least error.
I get the data from https://www.kaggle.com/sumanthvrao/daily-climate-time-series-data/activity .