1 Introduction: Climate in Rio

At the beginning we will read the dataset. The data to be used is city temperature data in the city of Rio de Janeiro, Brazil. This data is city temperature measurement data from 1973 - 2019. In this data there is time information in the yearmonth column and sea level height in the degree column where the unit used is Celsius.

According to a climate change adaptation report from the city government, average air temperatures in Rio have increased by about 0.05 degrees Celsius per year in recent decades. As part of a partnership with NASA’s Earth Science Division, city leaders are working to better understand and adapt to the heat. Planners are using publicly available temperature data from NASA satellites to help devise ways of “climate-proofing” the city from rising temperatures and natural disasters.

2 Basic Concept: Time Series

2.1 Time Series

Time series is a method of analyzing and processing data which the values are affected by time. The action of predicting future values based on its value in the previous period of time is called forecasting.

The data which formatted into a time series (ts) object must have some characteristics: - no missing intervals - no missing values - data should be ordered by time

2.2 Exploratory Data Analysis

A ts object can be decomposed into 3 main components which will be calculated for forecasting. These components are: - trend (T): the movement of mean, globally, throughout an interval; - seasonal (S): the pattern captured on each seasonal interval; - error (E): the pattern/value that cannot be captured by both trend and seasonal.

The relationship between those 3 components of ts object can be either additive or multiplicative, which can be seen on the line plot of the original data: - additive: T + S + E - multiplicative: T * S * E

Most of the analytic tools for time series are available for additive time series data. Even so, we can apply standard additive time series analysis to a multiplicative time series data by previously log-transformed our data. This action will transform our multiplicative time series data into additive time series data. Another way is to state this characteristic when building a time series model. This characteristic can also be stated during decomposition of a ts object during exploratory data analysis.

Besides forecasting, time series object can also be very useful for data analysis. Using seasonality adjustment (removing seasonal of a time series object) we can analyze the trend and detect anomaly events throughout the time series interval.

3 Modelling

3.1 Import Library

library(dplyr) # for data wrangling
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(lubridate) # to dea with date
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(padr) # for padding
## Warning: package 'padr' was built under R version 4.2.3
library(forecast) # time series library
## Warning: package 'forecast' was built under R version 4.2.3
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(tseries) # for adf.test
## Warning: package 'tseries' was built under R version 4.2.3
library(MLmetrics) # calculate error
## 
## Attaching package: 'MLmetrics'
## The following object is masked from 'package:base':
## 
##     Recall

3.2 Read Data

rio <- read.csv("data_input/station_rio_new.csv")
rio <- rio %>%
  mutate(yearmonth=mdy(yearmonth))
glimpse(rio)
## Rows: 564
## Columns: 2
## $ yearmonth <date> 1973-01-01, 1973-02-01, 1973-03-01, 1973-04-01, 1973-05-01,…
## $ degree    <dbl> 27.73, 27.73, 27.73, 27.73, 27.73, 27.73, 27.73, 27.73, 27.7…

3.3 Data Preprocessing

Checking no missing sequence data

range(rio$yearmonth)
## [1] "1973-01-01" "2019-12-01"
all(seq.Date(from = as.Date("1973-01-01"), to = as.Date("2019-12-01"), by = "month") == rio$yearmonth)
## [1] TRUE

We have checked no missing value

4 Modeling

4.1 TS Object & EDA

rio_ts <- ts(data = rio$degree,
   start =  c(1973,1),
   frequency = 12) # monthly seasonality
rio_ts %>% autoplot()

In this section, I tried to explore whether our timeseries object has trend and seasonal properties (one-seasonal/multiseasonal).

rio_deco <- decompose(rio_ts)
autoplot(rio_deco)

4.2 Cross Validation

The cross-validation scheme for time series should not be sampled randomly, but splitted sequentially.

rio_test <- tail(rio_ts, 12)
rio_train <- head(rio_ts, -12) 

4.3 Model Building

For model building, I will compare between two of the widely used time series modeling in business and industry: ETS Holt-Winters and Seasonal Arima. I use ETS Holt-Winters because my data contain seasonal. I also want to compare it between seasonal Arima to check whether seasonal ARIMA can give better forecasting performance.

# ets Holt-Winters
rio_ets <- stlm(rio_train, method = "ets", lambda = 0) # no log transformation for addivite data
# SARIMA
rio_arima <- stlm(rio_train, method = "arima", lambda = 0)

4.4 Forecast & Evaluation

# forecast
rio_ets_f <- forecast(rio_ets, h = 28)
rio_arima_f <- forecast(rio_arima, h = 28)
# visualization
library(ggplot2)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
a <- autoplot(rio_ets_f, series = "ETS", fcol = "red") +
  autolayer(rio_ts, series = "Actual", color = "black") + 
  labs(subtitle = "Climate changes on Rio, from Jan - Dec 2020",
       y = "Degree") +
  theme_minimal()

b <- autoplot(rio_arima_f, series = "ARIMA", fcol = "blue") +
  autolayer(rio_ts, series = "Actual", color = "black") +
  labs(subtitle = "Climate changes on Rio, from Jan - Dec 2020",
       y = "Degree") +
  theme_minimal()

grid.arrange(a,b)

## Assumption Check

  1. Normality: Shapiro.test
  • H0 : residuals are normally distributed
  • H1 : residuals are not normally distributed
shapiro.test(rio_arima_f$residuals) # p-value < 0.05; reject H0; accept H1 
## 
##  Shapiro-Wilk normality test
## 
## data:  rio_arima_f$residuals
## W = 0.32384, p-value < 2.2e-16
hist(rio_arima_f$residuals, breaks = 20)

plot(rio_arima_f$residuals)

  1. Autocorrelation: Box.test - Ljng-Box
  • H0 : No autocorrelation in the forecast errors
  • H1 : there is an autocorrelation in the forecast errors
Box.test(rio_arima_f$residuals, type = "Ljung-Box") # there is not enough data to reject H0
## 
##  Box-Ljung test
## 
## data:  rio_arima_f$residuals
## X-squared = 4.2565e-05, df = 1, p-value = 0.9948

Based on the assumption check, there is no autocorrelation on our forecast residuals (p-value > 0.05). Still, our forecast’s residuals are not distributed normally, therefore it’s residuals may not be appeared around its mean as seen in the histogram. But, if we inspect the distribution of residuals through a line plot, it is actually resembles the error plot from our time series object decomposition.

In a time series, such errors might emerge from various unpredictable events and is actually quite unavoidable. One strategy to overcome it is to analyze what kinds of unpredictable events that might occur and occurs frequently. This can be done by time series analysis using seasonality adjustment. From that insight, Rio government can forecast the smart strategies for dealing with such events.