knitr::include_graphics("Pictures for projects/weather pict.png")
Time series weather forecasting is a critical aspect of meteorology, with applications ranging from short-term prediction of severe storms to long-term climate modeling. Accurate weather forecasting relies on complex algorithms and vast amounts of data, and time series analysis has made significant contributions to the field. Time series techniques have been used for decades to forecast weather patterns and continue to be a valuable tool for meteorologists and researchers. From daily weather forecasts to long-term climate projections, the study of time series data is essential for understanding and predicting weather patterns. In this session I'm going to apply machine learning models to forecast temperature over a period of time using the "Climate Data Daily IDN" data set.
Objective: Predict temperature for a period of time
library(tidyverse)
library(lubridate)
library(ggplot2)
library(plotly)
library(glue)
library(padr)
library(forecast)
library(stats)
library(xts)
library(TTR)
library(tseries)
library(MLmetrics)
library(devtools)
In this section let's save our data set into a variable called climate:
climate <- read.csv("DataSet/Ina Temperature/climate_data.csv")
climate
DATA INFORMATION
date : date of recorded data
Tn : min temperature (°C)
Tx : max temperature (°C)
Tavg : average temperature (°C)
RH_avg : average humidity (%)
RR : rainfall (mm)
ss : duration of sunshine (hour)
ff_x : max wind speed (m/s)
ddd_x : wind direction at maximum speed (°)
ff_avg : avg wind speed (m/s)
ddd_car : most frequent wind direction (°)
station_id : id of the station which recorded the data. Details of the station can be found in station_detail.csv
As you may have noticed, our climate data frame stores several variables from numerous regions in Indonesia (identified by the unique number of station_id). The link where the data set comes from also provides another data set called station_detail.csv, which stores detailed information about each station_id. As we're going to predict for only one specific region, we need station_detail.csv to get the region name corresponding to its station_id.
Save station_detail.csv to a variable called station:
station <- read.csv("DataSet/Ina Temperature/station_detail.csv")
station
In this section we're going to clean our data frame. Let's specify our objective: to predict the average temperature of the Jakarta Pusat region. Therefore we need to filter the station data frame.
1. Filter Region Name
station %>%
  filter(region_name == "Kota Adm. Jakarta Pusat")
2. Subset Data based on Condition
Now that we have the station_id information for Jakarta Pusat, we can subset our climate data frame and put Jakarta's climate data into the jkt variable:
jkt <- climate[climate$station_id == 96745,]
jkt
3. Check NA Value
# Check which columns contain NA values
colSums(is.na(jkt))
#>       date         Tn         Tx       Tavg     RH_avg         RR         ss
#>          0         56         27         12         12        342        124
#>       ff_x      ddd_x     ff_avg    ddd_car station_id
#>          0          0          1          0          0
# Sum all NA Value
sum(is.na(jkt))
#> [1] 574
From the chunk above we learn that the jkt data frame has 7 columns that contain NA values, 574 NA values in total. However, our main focus is the Tavg column, as we're only going to predict the average temperature of Jakarta Pusat. Therefore, I'm going to fill the NA values in the Tavg column only.
4. Range Date
Before we fill the NA values in the Tavg column, note that our data frame stores daily data, so we need to make sure that the jkt data frame records every day from the first date of the initial year to the last date of the data. This step needs to be done in order to satisfy the time series object precondition of evenly spaced observations.
Let’s check the range of jkt’s date
column:
range(jkt$date)
#> [1] "01-01-2010" "31-12-2020"
head(jkt, 3)
Range: 1st January 2010 to 31st December 2020; the date column is written in day-month-year format.
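As a quick sanity check (a sketch of mine, not part of the original workflow), we can compare the number of calendar days in this range against the number of rows in jkt; any shortfall is exactly what pad() will have to fill in later:
# Hypothetical check: how many daily rows would a complete 2010-2020 record contain?
expected_days <- as.integer(dmy("31-12-2020") - dmy("01-01-2010")) + 1
expected_days          # 4018 calendar days (2012, 2016 and 2020 are leap years)
nrow(jkt)              # if this is smaller, some dates are missing from jkt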
5. Data Adjustment
In this step, we're going to:
- convert the date column to a date type using the dmy() function
- keep only the date and Tavg columns using the select() function
- complete the missing dates with the pad() function
- save the result to jkt.temp
jkt.temp <- jkt %>%
mutate(date = dmy(date)) %>%
select(date, Tavg) %>%
pad()
jkt.temp
# Recheck NA Value
colSums(is.na(jkt.temp))
#> date Tavg
#>    0   26
The total number of NA values has increased, which means the jkt data frame was missing several days of information.
6. Fill NA Value
To fill the NA values I'm going to use the previous day's data, with the assumption that yesterday's, today's, and tomorrow's temperatures are close to each other; in other words, yesterday's temperature may influence today's temperature, and today's temperature may in turn determine tomorrow's. Therefore I'm going to pass down to the .direction parameter of the fill() function:
jkt.temp <- jkt.temp %>% fill(Tavg, .direction='down')
# Recheck NA Value
colSums(is.na(jkt.temp))
#> date Tavg
#>    0    0
jkt.temp
Before modeling, let's look at the monthly average temperature. I'm going to aggregate the daily data by month and save it to avg.temp:
avg.temp <- jkt.temp %>%
mutate(yearmonth = as.yearmon(date)) %>%
group_by(yearmonth) %>%
summarize(temp = round(mean(Tavg), 1)) %>%
mutate(label = glue(
"Period: {yearmonth}
Temperature: {temp}"
))
avg.temp
plot1 <- ggplot(avg.temp, aes(x = yearmonth, y = temp)) +
geom_line(col = "orange") +
geom_point(aes(text = label)) +
labs(title = "Jakarta Pusat's Monthly Average Temperature",
x = "Period",
y = "Temperature") +
theme_minimal() +
theme(plot.title = element_text(size = 14, face = "bold"))
ggplotly(plot1, tooltip = 'text')
Insight: Both the lowest and the highest monthly average temperatures occurred in 2014; compared to other years, 2014 has a wider range of average temperature.
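To back this up with numbers, here is a small sketch (assuming the avg.temp data frame built above) that computes the spread of monthly averages per year:
# Sketch: yearly spread (max - min) of the monthly average temperature
avg.temp %>%
  mutate(year = floor(as.numeric(yearmonth))) %>%   # yearmon values are numeric year + month fraction
  group_by(year) %>%
  summarize(spread = max(temp) - min(temp)) %>%
  arrange(desc(spread))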
Our data frame is ready; now we can create the time series object with the ts() function and do some analysis. I'm going to set these arguments for the ts object:
jkt.ts <- ts(data = jkt.temp$Tavg,
start = c(2010, 1),
end = c(2020, 365),
frequency = 365)
jkt.ts %>% autoplot()
To check the detailed information from our ts object, we can use the decomposition method as in the example below:
decompose(jkt.ts) %>% autoplot()
Highlight Point: The decomposition shows that the trend line does not have a smooth stroke; there is still fluctuation occurring, which means there is still uncaptured seasonality. In other words, our data has multi-seasonal properties.
To handle multi-seasonal data, we can use the msts() function, which allows us to specify all of the frequencies that might be relevant. As our previous object used a frequency of 365, let's try adding a longer seasonal period that might occur:
jkt.temp$Tavg %>%
msts(seasonal.periods = c(365, 365*2)) %>% # yearly to 2 years period
mstl() %>%
autoplot()
> Highlight Point: The trend line is smoother than in the previous plot; however, we need a higher frequency to further reduce the fluctuation.
jkt.temp$Tavg %>%
msts(seasonal.periods = c(365, 365*3)) %>%
mstl() %>%
autoplot()
> Highlight Point: The trend line is smooth enough; we can use this setup.
Now we can apply the last setup to create the ts object and save it to the jkt.msts variable:
jkt.msts <- jkt.temp$Tavg %>%
msts(seasonal.periods = c(365, 365*3))
length(jkt.msts)
#> [1] 4018
Before we build models to predict the temperature, we need to split our data into train and test sets to measure how well our models perform. I'm going to create data.test and data.train sequentially, where data.test is the last 365 days of the data and data.train contains the remaining initial days, excluding the data.test rows:
data.test <- tail(jkt.msts, 365)
data.train <- head(jkt.msts, -length(data.test))
To use the STLM model I'm going to set the parameter lambda to 0, which applies a log transformation to the series:
model.stlm <- data.train %>%
stlm(lambda = 0) %>%
forecast(h = length(data.test))
plot(model.stlm)
The plot above shows the prediction for our test data; to get a clearer view of the comparison between the prediction and the actual values, see the plot below:
data.train %>%
autoplot() +
autolayer(data.test, series = "actual") +
autolayer(model.stlm$fitted, series = "fitted values") +
autolayer(model.stlm$mean, series = "forecast data test")
It seems that our STLM model's prediction performs well; however, to make sure, we can check the MAPE, which measures how much error our model produces:
MAPE(model.stlm$mean, data.test)*100
#> [1] 3.287645
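For clarity, MAPE here is the mean absolute percentage error; a minimal hand-rolled equivalent of the MLmetrics call above would be:
# Sketch: MAPE = mean(|actual - forecast| / |actual|) * 100
mean(abs(as.numeric(data.test) - as.numeric(model.stlm$mean)) /
       abs(as.numeric(data.test))) * 100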
Our STLM model has a low error rate of about 3.3%. To get the best prediction we should compare several different types of models, so we can conclude which one performs best.
A TBATS model differs from dynamic harmonic regression in that the seasonality is allowed to change slowly over time in a TBATS model, while harmonic regression terms force the seasonal patterns to repeat periodically without changing.
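For context only (this is my illustration, not part of the original analysis), a minimal sketch of that dynamic harmonic regression alternative would regress on Fourier terms with auto.arima(); the K values below are arbitrary choices:
# Sketch: dynamic harmonic regression with fixed Fourier seasonal terms
train.msts <- msts(as.numeric(data.train), seasonal.periods = c(365, 365 * 3))
K <- c(5, 5)                                  # number of harmonics per seasonal period (assumed values)
fit.dhr <- auto.arima(train.msts,
                      xreg = fourier(train.msts, K = K),
                      seasonal = FALSE)       # seasonality is carried entirely by the Fourier terms
f.dhr <- forecast(fit.dhr,
                  xreg = fourier(train.msts, K = K, h = length(data.test)))
MAPE(as.numeric(f.dhr$mean), as.numeric(data.test)) * 100
With that contrast in mind, here is the TBATS model itself: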
model.tbats <- data.train %>%
      tbats(use.box.cox = T,       # apply a Box-Cox transformation to stabilize the variance
            use.trend = F,         # our data doesn't have a significant trend
            use.damped.trend = F)
f.tbats <- forecast(model.tbats,h=length(data.test))
plot(f.tbats)
data.train %>%
autoplot() +
autolayer(data.test, series = "actual") +
autolayer(model.tbats$fitted, series = "fitted values") +
  autolayer(f.tbats$mean, series = "forecast data test")
MAPE(f.tbats$mean, data.test)* 100
#> [1] 2.337441
Compared to the STLM model, the TBATS model has a lower MAPE of about 2.3%.
The third model combines STL decomposition with an ARIMA model fitted to the log-transformed training data:
model.arima <- stlm(y = log(data.train), s.window = 365, method = "arima")
f.arima <- forecast(model.arima, h = length(data.test))
MAPE(exp(f.arima$mean), data.test)*100
#> [1] 2.545157
data.train %>%
autoplot() +
autolayer(data.test, series = "actual") +
autolayer(exp(f.arima$fitted), series = "fitted values") +
autolayer(exp(f.arima$mean), series = "forecast data test")
> Our ARIMA model also shows strong performance, with a MAPE of about 2.5%.
All our models show strong performance, but let's compare them in more detail:
model.accuracy <- rbind(accuracy(as.vector(model.stlm$mean), data.test),
accuracy(as.vector(f.tbats$mean), data.test),
accuracy(as.vector(exp(f.arima$mean)), data.test)
)
rownames(model.accuracy) <- c("STLM", "TBATS", "ARIMA")
model.accuracy <- as.data.frame(model.accuracy)
model.accuracy
From the data frame above (MAPE and RMSE), we can conclude that TBATS has the best performance among the three models.
All models show good performance and accuracy; however, time series analysis has 2 assumptions that need to be satisfied:
Normality of residuals:
H0: residuals are normally distributed
H1: residuals are not normally distributed
No autocorrelation:
H0: no autocorrelation in the forecast errors
H1: there is autocorrelation in the forecast errors
shapiro.test(model.stlm$residuals)
#>
#> Shapiro-Wilk normality test
#>
#> data: model.stlm$residuals
#> W = 0.98703, p-value < 0.00000000000000022
hist(model.stlm$residuals, main = "Residuals of STLM Model", xlab = "Residuals", col = "#75aadb")
Box.test(model.stlm$residuals, type = "Ljung-Box")
#>
#> Box-Ljung test
#>
#> data: model.stlm$residuals
#> X-squared = 84.818, df = 1, p-value < 0.00000000000000022
Highlight: both the Shapiro-Wilk test (residual normality) and the Ljung-Box test (autocorrelation) for the STLM model show that neither assumption is satisfied (p-value < 0.05).
shapiro.test(f.tbats$residuals)
#>
#> Shapiro-Wilk normality test
#>
#> data: f.tbats$residuals
#> W = 0.96365, p-value < 0.00000000000000022
hist(f.tbats$residuals, main = "Residuals of TBATS Model", xlab = "Residuals", col = "#75aadb")
Box.test(f.tbats$residuals, type = "Ljung-Box")
#>
#> Box-Ljung test
#>
#> data: f.tbats$residuals
#> X-squared = 0.0040806, df = 1, p-value = 0.9491
Highlight: the TBATS model shows no autocorrelation (p-value > 0.05), yet its residuals are not normally distributed (p-value < 0.05).
shapiro.test(f.arima$residuals)
#>
#> Shapiro-Wilk normality test
#>
#> data: f.arima$residuals
#> W = 0.98401, p-value < 0.00000000000000022
hist(f.arima$residuals, main = "Residuals of ARIMA Model", xlab = "Residuals", col = "#75aadb")
Box.test(f.arima$residuals, type = "Ljung-Box")
#>
#> Box-Ljung test
#>
#> data: f.arima$residuals
#> X-squared = 0.35975, df = 1, p-value = 0.5486
Highlight: the ARIMA model shows no autocorrelation (p-value > 0.05), yet its residuals are not normally distributed (p-value < 0.05).
Although our models perform well (low MAPE), none of the three models satisfies both time series assumptions. To deal with this condition, we might need a more robust and advanced model.
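As one possible direction (my suggestion, not a conclusion of this analysis), a neural network autoregression such as forecast::nnetar() could be tried and evaluated in exactly the same way as the models above:
# Sketch: one candidate for a more advanced model, a neural network autoregression.
# nnetar() fits a feed-forward network on lagged values; results vary between runs, hence the seed.
set.seed(100)
model.nnetar <- nnetar(data.train)
f.nnetar <- forecast(model.nnetar, h = length(data.test))
MAPE(as.numeric(f.nnetar$mean), as.numeric(data.test)) * 100
# The same Shapiro-Wilk and Ljung-Box checks used above can then be applied to
# na.omit(residuals(model.nnetar)) to see whether the assumptions hold any better.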