Weather is something humans face every day. From one day to the next it can shift between rain, scorching heat, snow, or cloud cover. In contrast to climate, weather is limited in duration and in spatial extent, typically at the scale of a city or municipality.
Because weather matters so much to daily life, forecast information is published every day in the news, on the internet, on television, and on other platforms. For example, the weather determines whether today is a good day to dry clothes, or whether one should bring an umbrella or raincoat when going outdoors.
This notebook aims to explain weather changes in Delhi, India, using time series analysis and forecasting. The data were obtained from Kaggle.com under the title "Daily Climate time series data".
The first stage is data preparation, inspection, and getting to know the data. Three crucial requirements when assessing time series data are listed below (a quick check is sketched after the list):
1. The data must be sorted in time order, from the first observation to the last.
2. The data must be complete, with no missing values.
3. The time interval between observations must be consistent.
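As a small illustration, here is a minimal sketch of how these three requirements could be checked in R; the helper name check_ts_ready and the column names are assumptions for illustration, not part of the project code.
# Minimal sketch: check ordering, completeness, and interval consistency
# (check_ts_ready, date_col, value_col are hypothetical names)
library(lubridate)
check_ts_ready <- function(df, date_col = "date", value_col = "meantemp") {
  d <- ymd(df[[date_col]])
  list(
    sorted     = !is.unsorted(d),                 # 1. ordered from first to last period
    complete   = !anyNA(df[[value_col]]),         # 2. no missing values
    consistent = length(unique(diff(d))) == 1     # 3. equal spacing between observations
  )
}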
It is also crucial to convert the character data type (the default when the file is first read) to a date data type.
Time series analysis studies how data move over a range of time, sampled at equal intervals. In this notebook, we will create time series objects and visualize them; visualization helps us recognize the characteristics of the data. Two components to look for are:
1. Trend: the general long-run direction of the series across the whole time range, commonly described as upward or downward.
2. Seasonality: a pattern that repeats at a fixed period. In the illustration referred to below, the series with seasonality (right) shows a repeating pattern, while the series without it (left) looks random and irregular.
In time series, seasonality is divided into two types, additive and multiplicative:
Additive:
\[Y_t = T_t + S_t + E_t\]
Data = Trend + Seasonal + Error
Multiplicative:
\[Y_t = T_t \times S_t \times E_t\]
Data = Trend * Seasonal * Error
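As a small illustration (using the built-in AirPassengers series, which is not part of this project's data), the decompose() function lets us choose between the two forms:
# Additive:       decompose(x, type = "additive")   (the default)
# Multiplicative: decompose(x, type = "multiplicative")
plot(decompose(AirPassengers, type = "multiplicative"))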
Where machine learning problems such as regression or classification usually produce predictions, in time series analysis the predictions are called forecasts. Based on the available data, a time series algorithm can forecast some period into the future; forecasting the weather on a particular day, for example, can be done with time series methods.
After forecasting, we can evaluate the results by comparing the forecasts with testing data, i.e., data the model has not seen. As with regression metrics, SSE, MSE, MAE, and MAPE can be applied here, choosing whichever suits the data and the evaluation goal. In this notebook, we will use MAE to evaluate the time series models.
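For reference, MAE is simply the average absolute difference between actual and forecast values; a minimal sketch (the function name mae is our own, not from a package):
# Mean Absolute Error: average of |actual - forecast|
mae <- function(actual, forecast) {
  mean(abs(actual - forecast))
}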
A time series model also comes with assumptions that must be met, much like the assumptions of a regression model. The two are similar in several respects: both have an intercept, each observation carries a weight, and the result is a continuous value.
Two assumption tests will be carried out: normality of residuals and no autocorrelation. The normality test checks whether the residuals/errors are normally distributed, using the Shapiro-Wilk test. The autocorrelation test checks whether the residuals are correlated with each other, using the Ljung-Box test. We hope to find normally distributed residuals and no autocorrelation.
Importing relevant libraries.
library(tidyverse)
library(lubridate) # date manipulation
library(padr) # TS padding
library(zoo) # TS missing value imputation
library(fpp) # TS dataset
library(TSstudio) # TS interactive viz
library(forecast) # forecasting algorithms
library(TTR) # SMA function
library(tseries) # adf.test
theme_set(theme_minimal())
# Read the data
weat_train <- read.csv("DailyDelhiClimateTrain.csv")
weat_test <- read.csv("DailyDelhiClimateTest.csv")
The data come already split into two parts, for training and testing. We simply read each file into an object named for its role.
glimpse(weat_train)#> Rows: 1,462
#> Columns: 5
#> $ date <chr> "2013-01-01", "2013-01-02", "2013-01-03", "2013-01-04", "~
#> $ meantemp <dbl> 10.0, 7.4, 7.2, 8.7, 6.0, 7.0, 7.0, 8.9, 14.0, 11.0, 15.7~
#> $ humidity <dbl> 84, 92, 87, 71, 87, 83, 79, 64, 51, 62, 51, 74, 75, 88, 7~
#> $ wind_speed <dbl> 0.00, 2.98, 4.63, 1.23, 3.70, 1.48, 6.30, 7.14, 12.50, 7.~
#> $ meanpressure <dbl> 1016, 1018, 1019, 1017, 1016, 1018, 1020, 1019, 1017, 101~
glimpse(weat_test)#> Rows: 114
#> Columns: 5
#> $ date <chr> "2017-01-01", "2017-01-02", "2017-01-03", "2017-01-04", "~
#> $ meantemp <dbl> 16, 18, 17, 19, 18, 19, 15, 16, 15, 12, 11, 12, 13, 13, 1~
#> $ humidity <dbl> 86, 77, 82, 70, 75, 79, 96, 84, 81, 72, 72, 75, 67, 74, 7~
#> $ wind_speed <dbl> 2.7, 2.9, 4.0, 4.5, 3.3, 8.7, 10.0, 1.9, 6.5, 9.4, 9.8, 6~
#> $ meanpressure <dbl> 59, 1018, 1018, 1016, 1014, 1012, 1011, 1016, 1016, 1017,~
The training data has 1,462 rows and five columns; the test data has 114 rows and five columns. In this case, we want to analyze the average temperature (mean temperature) in the city of Delhi, so only the date and meantemp columns will be used in the analysis.
weat_train <- weat_train %>%
mutate(date = ymd(date)) %>%
arrange(date) %>%
select(date, meantemp) %>%
pad() %>%
mutate(meantemp = na.aggregate(meantemp)) # replace any NA with the series mean
glimpse(weat_train)#> Rows: 1,462
#> Columns: 2
#> $ date <date> 2013-01-01, 2013-01-02, 2013-01-03, 2013-01-04, 2013-01-05, ~
#> $ meantemp <dbl> 10.0, 7.4, 7.2, 8.7, 6.0, 7.0, 7.0, 8.9, 14.0, 11.0, 15.7, 14~
weat_test <- weat_test %>%
mutate(date = ymd(date)) %>%
arrange(date) %>%
select(date, meantemp) %>%
pad() %>%
mutate(meantemp = na.aggregate(meantemp)) # replace any NA with the series mean
glimpse(weat_test)#> Rows: 114
#> Columns: 2
#> $ date <date> 2017-01-01, 2017-01-02, 2017-01-03, 2017-01-04, 2017-01-05, ~
#> $ meantemp <dbl> 16, 18, 17, 19, 18, 19, 15, 16, 15, 12, 11, 12, 13, 13, 16, 1~
The code above converts the date column from character to date type and keeps only the date and meantemp columns for further use.
One of the requirements in time series analysis is that no time interval may be skipped. The code above anticipates this with pad(), which fills any gap in the dates, and then fills the padded values with an aggregate value, namely the mean of the series. Filling with the mean is more suitable than filling with 0 because the data are weather data.
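To make the behaviour concrete, here is a toy example (hypothetical dates and values, not the Delhi data) of how pad() inserts a missing day and how the resulting NA can then be filled with the mean:
library(dplyr)
library(padr)
library(zoo)
toy <- data.frame(
  date  = as.Date(c("2013-01-01", "2013-01-03")),  # 2013-01-02 is missing
  value = c(10, 12)
)
toy %>%
  pad() %>%                                               # inserts 2013-01-02 with value = NA
  mutate(value = na.fill(value, fill = mean(value, na.rm = TRUE)))  # fills the NA with the mean (11)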
colSums(is.na(weat_train))#> date meantemp
#> 0 0
The check above shows no missing (NA) values, so no imputation is needed.
The data used has daily intervals. The training data spans from January 1, 2013, to January 1, 2017.
#Training
range(weat_train$date)#> [1] "2013-01-01" "2017-01-01"
The testing data covers January 1, 2017, to April 24, 2017.
#testing
range(weat_test$date)#> [1] "2017-01-01" "2017-04-24"
plotly::ggplotly(
ggplot(data = weat_train, aes(x = date, y = meantemp)) +
geom_line())
The plot above visualizes the average temperature training data from 2013 to 2017. The temperature ranges from about 7 degrees to close to 40 degrees, with a rise-and-fall pattern each year, reflecting the influence of the seasons on the city's daily temperature. From the visualization, two questions must be answered: (a) does the data have a trend? (b) does the data have seasonality?
The data shows a rise-and-fall pattern that repeats. However, if a trend line is drawn, no clear trend appears. We conclude that the temperature data has no obvious trend but does have seasonality (seasonal patterns).
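As a quick visual check (a sketch, not part of the original analysis), a straight trend line can be overlaid on the series; a near-flat line supports the "no clear trend" reading:
# ggplot2 comes from library(tidyverse) loaded above
ggplot(weat_train, aes(x = date, y = meantemp)) +
  geom_line() +
  geom_smooth(method = "lm", se = FALSE)  # linear trend line over the whole period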
We create two time series objects, quarterly and yearly. Why? Delhi is a subtropical city with four seasons, so the seasons affect the daily average temperature. The quarterly analysis is meant to reveal trend and seasonality within a season; and since one year contains four season changes, the yearly object should expose both the seasonal pattern and any trend, up or down, in the average annual temperature.
# quarter
weat_ts_q <- ts(data = weat_train$meantemp,
frequency = 7*4*3,
start = c(2013, 1))
# yearly
weat_ts_y <- ts(data = weat_train$meantemp,
frequency = 7*4*12,
start = c(2013, 1))
With decomposition, we can see the components of trend, seasonality, and error (remainder). The quarterly time series appears to have no trend; it fluctuates, and its seasonal pattern has a very large variance.
# Time Series Quarterly
weat_ts_q %>%
decompose() %>%
autoplot()
In the yearly decomposition, the trend component tends to increase from 2013 to 2017; an increase appears even though the trend fluctuates. Still, from this decomposition we argue that the data cannot firmly be said to have a trend, since a solid and consistent trend is hard to see.
Given the current global warming phenomenon, this analysis does not claim that there is no increase in the average temperature of the earth's surface. The data used limit the analysis to a single place over a short period; with more data, other conclusions might be reached.
# Time Series Yearly
weat_ts_y %>%
decompose() %>%
autoplot()
The ts() function can only handle a single seasonality, so we fail to obtain a smooth trend (there are still ups and downs). To overcome this, we can model multiple seasonality.
Since our case is the daily average temperature of a city, the temperature is affected both by the seasons and by the increase in the earth's average temperature. Seasonal changes occur every three months, and we assume the increase in the earth's average temperature plays out over years. Therefore, we specify two seasonal periods, quarterly and yearly, and use msts() to create a multiple-seasonality time series.
train_ts <- msts(data = weat_train$meantemp,
seasonal.periods = c(7*4*3, 7*4*12))
train_ts %>% mstl() %>% autoplot()
By decomposing with mstl(), we obtain a trend that tends to increase. We conclude that the quarterly and yearly seasonal patterns have succeeded in capturing the data's structure, and we can now use this time series object for smoothing or forecasting.
For forecasting, we now know that the data has a (weak) trend and seasonality, so the first algorithm we use is Holt-Winters exponential smoothing (triple exponential smoothing).
We set seasonal = "additive" because the seasonality pattern is additive; because the trend is weak, beta could also be fixed at FALSE, but here we let HoltWinters() estimate all smoothing parameters.
# fitting model with seasonal = additive
weat_hw <- HoltWinters(train_ts, seasonal = "additive")
weat_hw$alpha#> alpha
#> 0.73
weat_hw$beta#> beta
#> 0
weat_hw$gamma#> gamma
#> 1
The fitted Holt-Winters model estimates alpha at 0.73 and gamma at 1. This shows that the trend (beta) is essentially not taken into account, while smoothing of the level (alpha) and of the seasonality (gamma) has a large effect.
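For reference, the additive Holt-Winters update equations, as documented for R's HoltWinters() function, show how each parameter acts: with alpha = 0.73 the level reacts strongly to recent observations, and with gamma = 1 each seasonal index is rebuilt entirely from the newest observation of that season.
\[a_t = \alpha (y_t - s_{t-p}) + (1-\alpha)(a_{t-1} + b_{t-1})\]
\[b_t = \beta (a_t - a_{t-1}) + (1-\beta) b_{t-1}\]
\[s_t = \gamma (y_t - a_t) + (1-\gamma) s_{t-p}\]
\[\hat{y}_{t+h} = a_t + h\, b_t + s_{t-p+1+((h-1) \bmod p)}\]
Here \(a_t\) is the level, \(b_t\) the trend, \(s_t\) the seasonal component, and \(p\) the seasonal period.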
# forecasting holt winters
weat_hw_forecast <- forecast(weat_hw, h = 114) # forecast horizon is 114 because the test set has 114 observations
# model accuracy
acc_hw <- data.frame(accuracy(weat_hw_forecast, weat_test$meantemp))
acc_hw
# model fitting with ETS
weat_ets <- stlm(train_ts, method = "ets")
# forecasting
weat_ets_forecast <- forecast(weat_ets, h = 114)
# model accuracy
acc_ets <- data.frame(accuracy(weat_ets_forecast, weat_test$meantemp))
acc_ets
# visualization
train_ts %>%
autoplot(series = "Actual") +
autolayer(weat_ets$fitted, series = "ETS")
# forecasting mean temperature for 3 months
plot(forecast(weat_ets, h = 114))
# model fitting
weat_arima <- stlm(train_ts, method = "arima")
# forecasting mean temperature for 3 months
weat_arima_forecast <- forecast(weat_arima, h = 114)
# model accuracy
acc_arima<- data.frame(accuracy(weat_arima_forecast, weat_test$meantemp))
acc_arima
# visualization
train_ts %>%
autoplot(series = "Actual") +
autolayer(weat_arima$fitted, series = "ARIMA")
# forecasting mean temperature for 3 months
plot(forecast(weat_arima, h = 114))
# model fitting
weat_tbats <- tbats(train_ts)
# forecasting mean temperature for 3 months
weat_tbats_forecast <- forecast(weat_tbats, h = 114)
# model accuracy
acc_tbats <- data.frame(accuracy(weat_tbats_forecast, weat_test$meantemp))
acc_tbats
# visualization
train_ts %>%
autoplot(series = "Actual") +
autolayer(weat_tbats$fitted, series = "TBATS")
# forecasting mean temperature for 3 months
plot(forecast(weat_tbats, h = 114))
The last model used is auto ARIMA (SARIMA), which can also serve as a baseline model.
# weat_auto <- auto.arima(train_ts, seasonal = T)
# weat_auto
# saveRDS(weat_auto, file = "auto_arima.RDS")
weat_autoarima <- readRDS(file = "auto_arima.RDS")
weat_autoarima#> Series: train_ts
#> ARIMA(5,1,3)
#>
#> Coefficients:
#> ar1 ar2 ar3 ar4 ar5 ma1 ma2 ma3
#> 1.24 0.099 -0.38 0.079 -0.047 -1.5 0.067 0.42
#> s.e. 0.79 1.335 0.49 0.106 0.055 0.8 1.533 0.74
#>
#> sigma^2 estimated as 2.53: log likelihood=-2749
#> AIC=5516 AICc=5516 BIC=5563
# forecasting
weat_auto_forecast <- forecast(weat_autoarima, h = 114)
# accuracy of auto ARIMA model
acc_auto <- data.frame(accuracy(weat_auto_forecast, weat_test$meantemp))
acc_auto
train_ts %>%
autoplot(series = "Actual") +
autolayer(weat_autoarima$fitted, series = "ARIMA(5,1,3)")
# forecasting mean temperature for 3 months
plot(forecast(weat_autoarima, h = 114))
Of the five forecasting models we have built, we use MAE as the performance reference. Based on the lowest MAE on the training and test sets, the ARIMA-with-STLM and ETS models score well: ARIMA has an MAE of 0.97 on training and 5.10 on testing, while ETS has 0.98 and 4.36. We conclude that the ARIMA and ETS models are the most reliable for forecasting.
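To line the models up side by side, the MAE values could be collected into a single table; a sketch (the model labels are ours, the acc_* objects are the accuracy tables computed above):
# Gather training and test MAE for every model into one comparison table
data.frame(
  model     = c("Holt-Winters", "ETS (STLM)", "ARIMA (STLM)", "TBATS", "auto.arima"),
  MAE_train = c(acc_hw["Training set", "MAE"], acc_ets["Training set", "MAE"],
                acc_arima["Training set", "MAE"], acc_tbats["Training set", "MAE"],
                acc_auto["Training set", "MAE"]),
  MAE_test  = c(acc_hw["Test set", "MAE"], acc_ets["Test set", "MAE"],
                acc_arima["Test set", "MAE"], acc_tbats["Test set", "MAE"],
                acc_auto["Test set", "MAE"])
)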
acc_arima
acc_auto
acc_ets
acc_hw
acc_tbats
We now test the assumptions for the ARIMA models. We hope the models fulfill two assumptions: normality of residuals and no autocorrelation. We use the Shapiro-Wilk test for the first and the Ljung-Box test for the second.
We produced two ARIMA models, namely ARIMA with STLM and ARIMA from auto.arima(). We therefore test the normality-of-residuals assumption on both.
H0 : Residuals are normally distributed [p-value > alpha (0.05)]
H1 : Residuals are not normally distributed [p-value < alpha (0.05)]
hist(weat_arima$residuals, breaks = 100)
shapiro.test(weat_arima$residuals)
#>
#> Shapiro-Wilk normality test
#>
#> data: weat_arima$residuals
#> W = 1, p-value = 0.00000001
shapiro.test(weat_autoarima$residuals)#>
#> Shapiro-Wilk normality test
#>
#> data: weat_autoarima$residuals
#> W = 1, p-value = 0.000000000000005
Both ARIMA models have very small p-values (below 0.05), so we reject H0 and conclude that neither model has normally distributed residuals.
For the same two ARIMA models (ARIMA with STLM and auto.arima()), we now test the autocorrelation assumption.
H0 : There is no autocorrelation [p-value > alpha (0.05)]
H1 : There is autocorrelation [p-value < alpha (0.05)]
Box.test(weat_arima$residuals, type = "Ljung-Box")#>
#> Box-Ljung test
#>
#> data: weat_arima$residuals
#> X-squared = 0.1, df = 1, p-value = 0.7
Box.test(weat_autoarima$residuals, type = "Ljung-Box")#>
#> Box-Ljung test
#>
#> data: weat_autoarima$residuals
#> X-squared = 0.004, df = 1, p-value = 0.9
The Ljung-Box p-values for both ARIMA models are above alpha. Therefore we fail to reject H0, i.e., the residuals show no symptoms of autocorrelation.
Forecasting methods allow us to anticipate future events. The algorithms used in this project are Holt-Winters exponential smoothing, ETS, ARIMA, and TBATS. Before forecasting, it is crucial to analyze the seasonality patterns and trend first; in this project we found multiple seasonality in the data.
Looking at the MAE values of all models, the smallest belong to the ARIMA-with-STLM and ETS models; both produce the lowest MAE on the training and testing datasets, indicating the best capability among the models compared. However, the models we have used come with several caveats.
A pattern common to all models is that they fit the training data very well, yet all of them show a reasonably high MAE when forecasting. This can indicate overfitting to the training data, making the models less accurate on the test data. Even so, we can still use these models for forecasting, with the caveat that we must be careful and optimize the model as much as possible.
Finally, in the assumption tests, the ARIMA models do not pass the normality-of-residuals test. To address this, we can transform the data before fitting the model; the Box-Cox transformation is a common choice for this normality problem, as sketched below.
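A minimal sketch of what that could look like, re-using stlm() with its lambda argument; this is an illustration under our own naming, not something run in this project:
# Estimate a Box-Cox lambda from the training series and refit the STLM-ARIMA on the transformed data
lambda <- BoxCox.lambda(train_ts)
weat_arima_bc <- stlm(train_ts, method = "arima", lambda = lambda)
shapiro.test(weat_arima_bc$residuals)   # re-check the normality-of-residuals assumption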
Given that our ARIMA model does not meet one of the two assumptions, another option is to use the ETS model instead.