1 Preface

knitr::include_graphics("Pictures for projects/weather pict.png")

Time series weather forecasting is a critical aspect of meteorology, with applications ranging from short-term prediction of severe storms to long-term climate modeling. Accurate weather forecasting relies on complex algorithms and vast amounts of data, and time series analysis has made significant contributions to the field. Time series techniques have been used for decades to forecast weather patterns and continue to be a valuable tool for meteorologists and researchers. From daily weather forecasts to long-term climate projections, the study of time series data is essential for understanding and predicting weather patterns. In this session I'm going to apply machine learning models to forecast temperature over a period of time using the "Climate Data Daily IDN" data set.

Objective: Predict temperature for a period of time

2 Library Preparation

library(tidyverse)  # data wrangling (dplyr, tidyr) and plotting
library(lubridate)  # date parsing, e.g. dmy()
library(ggplot2)    # plotting (also attached by tidyverse)
library(plotly)     # interactive plots via ggplotly()
library(glue)       # string interpolation for plot labels
library(padr)       # pad() to complete missing dates
library(forecast)   # time series models: stlm(), tbats(), forecast()
library(stats)      # base statistics, e.g. Box.test(), shapiro.test()
library(xts)        # time series objects (attaches zoo, used for as.yearmon())
library(TTR) 
library(tseries) 
library(MLmetrics)  # MAPE()
library(devtools)

3 Read Data & Data Information

In this section, let's save our data set into the climate variable:

climate <- read.csv("DataSet/Ina Temperature/climate_data.csv")
climate

DATA INFORMATION

  1. date : date of the recorded data
  2. Tn : minimum temperature (°C)
  3. Tx : maximum temperature (°C)
  4. Tavg : average temperature (°C)
  5. RH_avg : average humidity (%)
  6. RR : rainfall (mm)
  7. ss : duration of sunshine (hour)
  8. ff_x : maximum wind speed (m/s)
  9. ddd_x : wind direction at maximum speed (°)
  10. ff_avg : average wind speed (m/s)
  11. ddd_car : most frequent wind direction (°)
  12. station_id : id of the station that recorded the data. Details of each station can be found in station_detail.csv

As you may have noticed, our climate data frame stores several variables recorded at numerous regions in Indonesia, each identified by a unique station_id. The source of the data set also provides another file called station_detail.csv, which stores detailed information about each station_id. Since we're going to predict for only one specific region, we need station_detail.csv to look up a region's name from its station_id.

Save station_detail.csv to a variable called station:

station <- read.csv("DataSet/Ina Temperature/station_detail.csv")
station
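With both frames loaded, an equivalent one-step way to pull out a single region's rows would be to join the two data frames on station_id; this is just a sketch, assuming station_detail.csv exposes station_id and region_name columns (the next section does the same thing in two explicit steps):

# Hypothetical one-step alternative: join climate to station, then filter by region
climate %>% 
  left_join(station, by = "station_id") %>% 
  filter(region_name == "Kota Adm. Jakarta Pusat")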

4 Data Wrangling

In this section we're going to clean our data frame. Let's narrow the objective to predicting the average temperature of the Jakarta Pusat region; therefore we first need to filter the station data frame.

1. Filter Region Name

station %>% 
  filter(region_name == "Kota Adm. Jakarta Pusat")

2. Subset Data based on Condition

Now that we have the station_id for Jakarta Pusat, we can subset our climate data frame and store Jakarta's climate data in the jkt variable:

jkt <- climate[climate$station_id == 96745,]
jkt

3. Check NA Value

# Check column that contains NA value
colSums(is.na(jkt))
#>       date         Tn         Tx       Tavg     RH_avg         RR         ss 
#>          0         56         27         12         12        342        124 
#>       ff_x      ddd_x     ff_avg    ddd_car station_id 
#>          0          0          1          0          0
# Sum all NA Value
sum(is.na(jkt))
#> [1] 574

From the chunk above we learn that the jkt data frame has 7 columns containing NA values, 574 NA values in total. However, our main focus is the Tavg column, since we only want to predict the average temperature of Jakarta Pusat. Therefore, I'm going to fill the NA values in the Tavg column only.

4. Range Date

Before we fill the NA values in the Tavg column, recall that our data frame stores daily records, so we need to make sure the jkt data frame contains an observation for every date from the first day of the initial year to the last date of the data. This step needs to be done to satisfy the time series precondition of evenly spaced observations.

Let’s check the range of jkt’s date column:

range(jkt$date)
#> [1] "01-01-2010" "31-12-2020"
head(jkt, 3)

Range: 1st January 2010 to 31st December 2020; the date column is written in day-month-year format.
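Since the dates are in day-month-year order, lubridate's dmy() is the appropriate parser; a quick one-line check:

dmy("31-12-2020") # parses a day-month-year string into a Date
#> [1] "2020-12-31"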

5. Data Adjustment

In this step, we're going to:

  1. Change the date column to a Date type using the dmy() function
  2. Select only the date and Tavg columns using the select() function
  3. Recheck and include every date from 1st January 2010 to 31st December 2020 using the pad() function
  4. Save the result to a new variable called jkt.temp
jkt.temp <- jkt %>% 
  mutate(date = dmy(date)) %>% 
  select(date, Tavg) %>% 
  pad()
jkt.temp
# Recheck NA Value
colSums(is.na(jkt.temp))
#> date Tavg 
#>    0   26

The total number of NA values increased, which means the jkt data frame was missing records for several days; pad() inserted those dates with NA values for Tavg.
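To see exactly which dates are affected, we can list the rows where Tavg is still missing; this covers both the originally missing readings and the dates that pad() had to insert:

# Rows where Tavg is still NA after padding
jkt.temp %>% filter(is.na(Tavg))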

6. Fill NA Values

To fill the NA values I'm going to use the previous day's data, under the assumption that yesterday's, today's, and tomorrow's temperatures will be close to each other; in other words, yesterday's temperature may influence today's, and today's may in turn determine tomorrow's. Therefore I'm going to set the .direction parameter of the fill() function to "down".
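As a tiny illustration of that behaviour (hypothetical values, not from our data), fill() with .direction = "down" carries the last observed value forward over each run of NAs:

# fill() carries the last non-NA value downward over each NA run
tibble(Tavg = c(27.1, NA, NA, 28.0)) %>% 
  fill(Tavg, .direction = "down")

Applying the same idea to our Tavg column: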

jkt.temp <- jkt.temp %>% fill(Tavg, .direction='down')

# Recheck NA Value
colSums(is.na(jkt.temp))
#> date Tavg 
#>    0    0
jkt.temp

5 EDA

avg.temp <- jkt.temp %>% 
  mutate(yearmonth = as.yearmon(date)) %>% # as.yearmon() comes from zoo (attached via xts)
  group_by(yearmonth) %>% 
  summarize(temp = round(mean(Tavg), 1)) %>% 
  mutate(label = glue(
    "Period: {yearmonth}
     Temperature: {temp}"
  ))
avg.temp
plot1 <- ggplot(avg.temp, aes(x = yearmonth, y = temp)) +
  geom_line(col = "orange") +
  geom_point(aes(text = label)) +
  labs(title = "Jakarta Pusat's Monthly Average Temperature",
       x = "Period",
       y = "Temperature") +
  theme_minimal() +
  theme(plot.title = element_text(size = 14, face = "bold"))
  

ggplotly(plot1, tooltip = 'text')

Insight: Both the lowest and the highest monthly average temperatures occurred in 2014; compared to other years, 2014 has a wider range of average temperatures.
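As a quick check of this insight, we can pull out the months that hold the minimum and maximum monthly averages from the avg.temp frame built above:

# Months holding the lowest and highest monthly average temperature
avg.temp %>% 
  filter(temp == min(temp) | temp == max(temp)) %>% 
  select(yearmonth, temp)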

6 TS Object

Our data frame is ready; now we can create the time series object with the ts() function and do some analysis. I'm going to set these arguments for the ts object:

  1. data = jkt.temp$Tavg (the series acts as both target and predictor)
  2. start = c(2010, 1) (start from year 2010, day 1)
  3. end = c(2020, 365) (end at day 365 of 2020)
  4. frequency = 365 (daily data, so each seasonal period is one year)
jkt.ts <- ts(data = jkt.temp$Tavg,
             start = c(2010, 1),
             end = c(2020, 365),
             frequency = 365)
jkt.ts %>% autoplot

6.1 Decompose

To check the detailed information in our ts object, we can use the decompose method as in the example below:

decompose(jkt.ts) %>% autoplot()

Highlight Point: The decomposition shows that the trend line is not smooth; there is still fluctuation, which means there is uncaptured seasonality. In other words, our data has multi-seasonal properties.

6.2 Handling Multi Seasonality

To handle multi-seasonal data we can use the msts() function, which allows us to specify all of the frequencies that might be relevant. As our previous object used a frequency of 365, let's add a longer seasonal period that might be present:

jkt.temp$Tavg %>% 
  msts(seasonal.periods = c(365, 365*2)) %>% # yearly to 2 years period
  mstl() %>% 
  autoplot() 

> Highlight Point: The trend line is smoother than in the previous plot; however, we need a higher frequency to reduce the fluctuation further.

jkt.temp$Tavg %>% 
  msts(seasonal.periods = c(365, 365*3)) %>% 
  mstl() %>% 
  autoplot() 

> Highlight Point: The trend line is smooth enough, so we can use this setup.

Now we can apply the last setup to create the ts object and save it to the jkt.msts variable:

jkt.msts <- jkt.temp$Tavg %>% 
  msts(seasonal.periods = c(365, 365*3))

length(jkt.msts)
#> [1] 4018

7 Cross Validation

Before we build models to predict the temperature, we need to split our data into train and test sets so we can measure how well our models perform.

I'm going to split the data sequentially: data.test is the last 365 days of the series, and data.train contains all the earlier days, excluding the data.test rows.

data.test <- tail(jkt.msts, 365)
data.train <- head(jkt.msts,  -length(data.test))
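A quick sanity check: the two pieces should add back up to the full series with no overlap.

# The train/test split should cover the whole series exactly (expected: TRUE)
length(data.train) + length(data.test) == length(jkt.msts)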

8 Build Model

9 STLM Model

To use the STLM model I'm going to set the parameter lambda to 0, which applies a log (Box-Cox) transformation to the series before fitting:

model.stlm <- data.train %>%
  stlm(lambda = 0) %>% 
  forecast(h = length(data.test))
plot(model.stlm)

The plot above shows the prediction for our test data. To get a clearer view of the comparison between the predictions and the actual values, we can look at the plot below:

data.train %>% 
  autoplot() +
  autolayer(data.test, series = "actual") +
  autolayer(model.stlm$fitted, series = "fitted values") +
  autolayer(model.stlm$mean, series = "forecast data test")

It seems that our STLM model's predictions perform well; however, to make sure, we can look at the MAPE, which measures how much error our model produces:

MAPE(model.stlm$mean, data.test)*100
#> [1] 3.287645
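For reference, MLmetrics' MAPE() is simply the mean absolute percentage error, so the same number can be reproduced by hand; a minimal sketch (mape_manual is just an illustrative helper name):

# Mean absolute percentage error computed manually, already scaled to percent
mape_manual <- function(y_pred, y_true) {
  mean(abs((y_true - y_pred) / y_true)) * 100
}
mape_manual(model.stlm$mean, data.test) # should match MAPE(...) * 100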

Our STLM model has a low error rate of about 3.3%. To get the best prediction we'll try a few different types of models for comparison, and then we can conclude which model performs better.

9.1 TBATS Model

A TBATS model differs from dynamic harmonic regression in that the seasonality is allowed to change slowly over time in a TBATS model, while harmonic regression terms force the seasonal patterns to repeat periodically without changing.
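For contrast, a dynamic harmonic regression would encode the seasonality as Fourier terms supplied as external regressors, which fixes the seasonal shape; below is a rough sketch under our two seasonal periods (K = c(5, 5) is an arbitrary illustrative choice, and fitting it on the full training set can be slow):

# Dynamic harmonic regression sketch: the seasonal pattern is fixed by Fourier
# terms, unlike TBATS where the seasonal components may evolve over time
xreg.train <- fourier(data.train, K = c(5, 5))
fit.dhr <- auto.arima(data.train, xreg = xreg.train, seasonal = FALSE)
f.dhr <- forecast(fit.dhr, xreg = fourier(data.train, K = c(5, 5), h = length(data.test)))

With that distinction in mind, let's fit the TBATS model: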

model.tbats <- data.train %>%
            tbats(use.box.cox = TRUE,      # apply a Box-Cox transformation to stabilise the distribution
                  use.trend = FALSE,       # our data doesn't show a significant trend
                  use.damped.trend = FALSE)
f.tbats <- forecast(model.tbats, h = length(data.test))
plot(f.tbats)

data.train %>% 
  autoplot() +
  autolayer(data.test, series = "actual") + 
  autolayer(model.tbats$fitted, series = "fitted values") +
  autolayer(f.tbats$mean, series = "forecast data test")

MAPE(f.tbats$mean, data.test)* 100
#> [1] 2.337441

Compared to the STLM model, the TBATS model has a lower MAPE: about 2.3%.

9.2 ARIMA Model

Here I fit an STL decomposition with an ARIMA model on the seasonally adjusted series, training on log(data.train); because the model is fitted on the log scale, the forecasts need to be back-transformed with exp() before they are compared to data.test.

model.arima <- stlm(y = log(data.train), s.window = 365, method = "arima")
f.arima <- forecast(model.arima, h = length(data.test))

MAPE(exp(f.arima$mean), data.test)*100
#> [1] 2.545157
data.train %>% 
  autoplot() +
  autolayer(data.test, series = "actual") + 
  autolayer(exp(f.arima$fitted), series = "fitted values") +
  autolayer(exp(f.arima$mean), series = "forecast data test")

> Our ARIMA model also performs well, with a MAPE of about 2.5%.

9.3 Model Evaluation

All our models show good performance; let's compare them in more detail:

model.accuracy <- rbind(accuracy(as.vector(model.stlm$mean), data.test),
      accuracy(as.vector(f.tbats$mean), data.test),
      accuracy(as.vector(exp(f.arima$mean)), data.test)
      )

rownames(model.accuracy) <- c("STLM", "TBATS", "ARIMA")
model.accuracy <- as.data.frame(model.accuracy)
model.accuracy

From the data frame above (MAPE and RMSE), we can conclude that TBATS performs best among the three models.

10 Assumptions

All models show good performance and accuracy; however, time series analysis has 2 assumptions that need to be satisfied:

Normality of residuals:

H0 : residuals are normally distributed
H1 : residuals are not normally distributed

No autocorrelation:

H0 : no autocorrelation in the forecast errors
H1 : there is autocorrelation in the forecast errors

For both tests we reject H0 when the p-value is below 0.05.

10.1 STLM Model Assumption

shapiro.test(model.stlm$residuals)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  model.stlm$residuals
#> W = 0.98703, p-value < 0.00000000000000022
hist(model.stlm$residuals, main = "Residuals of STLM Model", xlab = "Residuals", col = "#75aadb")

Box.test(model.stlm$residuals, type = "Ljung-Box")
#> 
#>  Box-Ljung test
#> 
#> data:  model.stlm$residuals
#> X-squared = 84.818, df = 1, p-value < 0.00000000000000022

Highlight: both the Shapiro-Wilk test (residual normality) and the Ljung-Box test (autocorrelation) for the STLM model show that neither assumption is satisfied (p-value < 0.05).

10.2 TBATS Model Assumption

shapiro.test(f.tbats$residuals)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  f.tbats$residuals
#> W = 0.96365, p-value < 0.00000000000000022
hist(f.tbats$residuals, main = "Residuals of TBATS Model", xlab = "Residuals", col = "#75aadb")

Box.test(f.tbats$residuals, type = "Ljung-Box")
#> 
#>  Box-Ljung test
#> 
#> data:  f.tbats$residuals
#> X-squared = 0.0040806, df = 1, p-value = 0.9491

Highlight: the TBATS model shows no autocorrelation (p-value > 0.05), yet its residuals are not normally distributed (p-value < 0.05).

10.3 ARIMA Model Assumption

shapiro.test(f.arima$residuals)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  f.arima$residuals
#> W = 0.98401, p-value < 0.00000000000000022
hist(f.arima$residuals, main = "Residuals of ARIMA Model", xlab = "Residuals", col = "#75aadb")

Box.test(f.arima$residuals, type = "Ljung-Box")
#> 
#>  Box-Ljung test
#> 
#> data:  f.arima$residuals
#> X-squared = 0.35975, df = 1, p-value = 0.5486

Highlight: the ARIMA model shows no autocorrelation (p-value > 0.05), yet its residuals are not normally distributed (p-value < 0.05).

11 Summary

Although our models perform well (low MAPE), none of the 3 models satisfies both time series assumptions. To deal with this condition, we might need a more robust and advanced model.