Ozone is a crucial gas present in the Earth’s upper atmosphere. It forms an essential layer that envelops our planet. This layer, often referred to as the ozone layer, exists predominantly in the stratosphere, a region extending about 10 to 30 miles above the Earth’s surface. Its primary function is to act as a protective shield, absorbing a significant portion of the sun’s harmful ultraviolet (UV) radiation. Without the ozone layer, these UV rays would reach the Earth’s surface in much higher intensities, posing a serious threat to all forms of life.
The chemical composition of ozone is quite interesting. It is made up of three oxygen atoms (O₃), which is different from the oxygen we are familiar with in the air we breathe, which has only two oxygen atoms (O₂). This unique structure of three oxygen atoms is what gives ozone its ability to absorb UV radiation effectively. It is this absorption characteristic that makes the ozone layer so vital for protecting life on Earth.
Understanding the significance of ozone extends beyond its composition and protective function. The ozone layer is crucial for maintaining a balance in the planet’s ecosystems. It protects humans from harmful UV rays that can lead to skin cancer, cataracts, and other health issues. Moreover, it shields the environment, preserving the integrity of various ecosystems and wildlife. This protective layer ensures that life on Earth can thrive without the constant threat of UV radiation damage.
Consequently, predicting the thickness of the ozone layer becomes a matter of great importance. Monitoring and predicting changes in the ozone layer help us understand the health of our environment. It also guides us in taking necessary precautions to protect ourselves, especially during periods when the ozone layer might be thinner. This forecasting is not just about immediate health concerns but is also crucial for long-term environmental strategies and policies aimed at preserving the Earth’s atmosphere.
In this forecast, we acquire data by scraping it from OMI/OMPS Ozone Time Series Data. This tool allows us to retrieve ozone values as a time series for a specified range within certain longitude and latitude coordinates. For this project, we set the longitude to 106.8106 and the latitude to -6.261493, specifically targeting the coordinates of South Jakarta. The data is retrieved for a timespan of 5 years (January 1st 2019 - January 1st 2024). The data obtained from scraping is presented in the following columns:
yyyy: the year when the specific record was retrieved.
MM: the month when the specific record was retrieved.
dd: the day of the month when the specific record was retrieved.
DDD: the day of the year when the specific record was retrieved.
O3_Nearest: daily average total column ozone values in Dobson units.
O3_Interpolated: daily average total column ozone values in Dobson units, interpolated.
InterpolPos: a logical value indicating whether interpolation was applied.

Given the background and dataset, our aim in this project is:
To analyze the periodic components of ozone values and utilize them to forecast future ozone levels.
Below are the packages that we will use during the analysis.
# for data loading - data preprocessing
library(dplyr)
library(tidyr)
library(padr)
library(zoo)
library(lubridate)
# data visualization
library(ggplot2)
library(plotly)
library(glue)
# modeling - evaluation
library(prophet)
library(forecast)
library(TTR)
library(tseries)
library(fpp)
library(MLmetrics)

The data was retrieved in two batches: January 2019 - December 2020 and January 2021 - early January 2024, so we have 2 .csv files in total. We will load them as df1 and df2, and subsequently merge df1 and df2 into df.
df1 <- read.csv('data/SatO3DataServlet_January2019-December2020.csv')
df2 <- read.csv('data/SatO3DataServlet_January2021-January2024.csv')

In the initial phase of our analysis, we will focus on data cleansing to guarantee precision in our analysis and modeling process. Data cleansing involves identifying and rectifying any inaccuracies, corruptions, formatting errors, duplications, or incomplete information within our dataset. Let's check our data structure.
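The merging code is not shown here; a minimal sketch, assuming both files share the same column layout, could be:

# stack the two batches row-wise into a single dataframe and inspect its structure
df <- bind_rows(df1, df2)
glimpse(df)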
#> Rows: 1,826
#> Columns: 7
#> $ yyyy <int> 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, …
#> $ MM <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ dd <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
#> $ DDD <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
#> $ O3_Nearest <dbl> 256, 0, 259, 257, 257, 258, 262, 262, 0, 261, 273, 267…
#> $ O3_Interpolated <dbl> 255.546, 0.000, 258.719, 257.352, 257.908, 259.386, 26…
#> $ InterpolPos <chr> " Y", " N", " Y", " …
As seen above, our datetime details are separated into individual columns, which is impractical for time series analysis. Therefore, we will create a new column that combines all these time details.
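The code for this step is not shown; a sketch using lubridate's make_date(), with the column names from the structure above, could be:

# combine the separate year, month, and day columns into a single Date column
df <- df %>% mutate(datetime = make_date(year = yyyy, month = MM, day = dd))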
Recheck our data.
#> Rows: 1,826
#> Columns: 8
#> $ yyyy <int> 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, …
#> $ MM <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ dd <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
#> $ DDD <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
#> $ O3_Nearest <dbl> 256, 0, 259, 257, 257, 258, 262, 262, 0, 261, 273, 267…
#> $ O3_Interpolated <dbl> 255.546, 0.000, 258.719, 257.352, 257.908, 259.386, 26…
#> $ InterpolPos <chr> " Y", " N", " Y", " …
#> $ datetime <date> 2019-01-01, 2019-01-02, 2019-01-03, 2019-01-04, 2019-…
For our analysis, we will select only the columns of interest: datetime and O3_Interpolated. Both columns will be saved as a new dataframe named ozone.
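A sketch of this step; since the later cleaning code refers to these columns as date and O3, the rename is included here as an assumption:

# keep only the datetime and interpolated ozone columns, renamed for the later steps
ozone <- df %>% select(date = datetime, O3 = O3_Interpolated)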
As we may already know, time series data is self-explanatory: it consists of a series of data points ordered in time. Consequently, these time points must be arranged from past to present. Accordingly, we will ensure that our data is arranged in the required format.
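A one-line sketch of the ordering step, assuming the column names above:

# arrange the observations from the earliest to the latest date
ozone <- ozone %>% arrange(date)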
The second requirement that needs to be fulfilled is time completeness. Time completeness ensures that our data has a constant interval between each point. In our case, the data is in a daily format, meaning the interval between each data point is one day.
We will create full_date as a complete time sequence
covering our timespan. Next, we will compare this full time sequence
with our time sequence to identify any differences, checking for any
missing times.
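A sketch of the completeness check, assuming the five-year retrieval span:

# build the complete daily sequence covering the retrieval period
full_date <- seq(from = as.Date("2019-01-01"), to = as.Date("2024-01-01"), by = "day")
length(full_date)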
#> [1] 1827
If our data is complete, then it should have 1827 rows in it.
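The comparison code is not shown; one hedged way to find the missing date is:

# how many rows do we actually have, and which dates are absent from the data?
dim(ozone)
full_date[!full_date %in% ozone$date]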
#> [1] 1826 2
#> [1] "2023-12-03"
Since our data only has 1826 rows, we assume that one day is missing, which is December 3rd, 2023. We will address this by performing time padding. We will restructure our data to include a complete timeline.
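A sketch of the padding step with padr, assuming the date column created above:

# pad() inserts a row for the missing day, filling its values with NA
ozone <- ozone %>% pad(interval = "day")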
As we can see, the record for December 3rd, 2023 is identified as
NA. We will impute this missing value with 0.
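A sketch of the imputation; using 0 as the placeholder mirrors how gaps appear in the raw feed:

# replace the NA introduced by padding with 0
ozone <- ozone %>% mutate(O3 = replace_na(O3, 0))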
We observe that on several days, specifically 534 out of 1827 days in our dataset, the record shows a value of 0. This means that approximately 30% of the days are recorded with a value of 0.
There could be many reasons for this, but it is most likely due to a recording error by the satellite. To address this, we will impute the 0 values with the ozone value from the previous day.
# replace zeros with NA
ozone <- ozone %>% mutate(O3 = ifelse(O3 == 0, NA, O3))
# fill NA with the previous non-NA value
ozone <- ozone %>% fill(O3, .direction = "down")

Now that our data is clean, we are ready to move on to the next steps.
Before we begin modeling the time series, let’s first examine our data for any patterns. Visualizing the data is key here, as it helps us spot trends and anomalies that might not be obvious just from numbers alone. This step is crucial for choosing the right approach to our analysis.
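The plotting code is not shown in the source; a minimal sketch of such a visualization (the object name p0 is ours) could be:

# plot the daily ozone series and make it interactive with plotly
p0 <- ggplot(ozone, aes(x = date, y = O3)) +
  geom_line(color = "steelblue") +
  labs(x = "Time", y = "O3 (Dobson units)",
       title = "Daily Total Column Ozone over South Jakarta") +
  theme_minimal()
ggplotly(p0)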
From this visualization, we can see a quite interesting pattern: the ozone values rise and fall in a recurring cycle within each year.
To further investigate this, we will explore the times at which the ozone values fall below 250 Dobson units and above 275 Dobson units.
below_250 <- ozone %>% filter(O3 <= 250)
below_250$month_name <- month(below_250$date, label = TRUE, abbr = FALSE)
table(below_250$month_name)
#>
#> January February March April May June July August
#> 26 10 11 18 38 31 47 8
#> September October November December
#> 0 4 8 28
above_275 <- ozone %>% filter(O3 >= 275)
above_275$month_name <- month(above_275$date, label = TRUE, abbr = FALSE)
table(above_275$month_name)
#>
#> January February March April May June July August
#> 0 8 13 8 3 0 1 0
#> September October November December
#> 4 34 11 8
These tables suggest that ozone values tend to reach their lowest points around the start, middle, and end of the year (January, May-July, December), while they tend to peak roughly twice a year, around March-April and October-November.
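The code producing the output below is not shown; it is presumably something along these lines:

# the minimum and maximum of the cleaned ozone values over the five years
range(ozone$O3)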
#> [1] 235.908 340.408
Furthermore, over the five-year period the ozone values range from roughly 236 to 340 Dobson units, with the large majority of days falling below 275. Given that the average global total column ozone is around 300 Dobson units, this range is considered relatively low.
We have gained insights into the data’s seasonality, identifying the periods of peak and low values. In this section, we will experiment with building models to forecast future ozone levels. We will explore two techniques: Prophet by Meta and Holt-Winters exponential smoothing. Before proceeding, we will divide our data into two parts: the training set and the testing set. Specifically, we will use data from 2019 to 2022 as the training set and data from 2023 as the testing set.
cutoff <- as.Date("2023-01-01")
train <- ozone %>% filter(date < cutoff)
test <- ozone %>% filter(date >= cutoff)

#> [1] 1461 2
#> [1] 366 2
The model will use 1461 days of data to learn the trends and seasonality. The remaining data will be used for model evaluation.
prophet is a package released by Meta, designed for handling time series data using a Generalized Additive Model (GAM). In this framework, time series data is composed of trend, seasonality, and error.
As a starting point, prophet requires our data to be
organized into two columns: ds and y.
ds: the datetime details.
y: the numerical value.

To accommodate this requirement, we will rename our columns.
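A sketch of the renaming, assuming the column names used during cleaning:

# prophet expects the columns ds (date) and y (value to forecast)
train <- train %>% rename(ds = date, y = O3)
test  <- test %>% rename(ds = date, y = O3)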
For the modeling phase, we will create the model object using the prophet() function and feed our training data into it. By using prophet(), we declare the base model, keeping the default values for every parameter. With the code below, our model will learn the corresponding trend and seasonality.
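The fitting code is not shown; a minimal sketch is:

# declare and fit the base model with default parameters on the training set
m_base <- prophet(train)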
Next, we want our model to predict future values. The first thing we
do is provide a list of times we want to predict. Here, we set the
parameter periods to 366, indicating our intention to
forecast one year ahead. The choice of 366 is customized to our needs
and our data. That is, our data is in daily format. Hence, the
periods is set to the same unit as our data. If we want to
forecast for 6 months, then periods = 180. Furthermore,
setting periods = 366 makes it easier for us to perform
model evaluation.
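The code referred to below is not shown in the source; a sketch is:

# build the dataframe of dates to predict: the history plus 366 days ahead
future_base <- make_future_dataframe(m_base, periods = 366)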
By running the above code, we create a future_base
dataframe. future_base includes the timespan of our
historical data and 366 days after the last day in our training
data.
Subsequently, we use the predict() function to make
predictions by passing the model m_base and a list of times
we want to predict in future_base.
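A sketch of the prediction step:

# generate forecasts for every date listed in future_base
forecast_base <- predict(m_base, future_base)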
The model returns several columns as prediction results, but we focus on the following:

ds: the datetime.
yhat: the predicted value at the specific time.
trend: the long-term movement of our data.
weekly: weekly seasonality, reflecting the cyclic pattern over 7 days.
yearly: yearly seasonality, indicating the cyclic pattern over 365 days.

Remember that prophet models our data using a Generalized Additive Model (GAM). In this context:
\[yhat = trend + weekly + yearly\] Our prediction comprises the trend and seasonal components.
prophet offers the capability to visualize both the
actual data and the predicted values. This can be achieved using the
plot() function by passing both the model and the
forecasting results to it.
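A sketch of the call:

# plot the actual data points together with the fitted and forecast values
plot(m_base, forecast_base)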
We can plot the forecast in a more interactive way with the following code.
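prophet ships a dygraphs-based plot for this; a sketch is:

# interactive version of the forecast plot
dyplot.prophet(m_base, forecast_base)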
Additionally, we can plot the results by component, allowing us to visualize the trend and seasonality separately.
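A sketch of the component plot:

# plot the trend, weekly, and yearly components separately
prophet_plot_components(m_base, forecast_base)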
💡 Insight:
In the next step, we will evaluate our model’s performance when making predictions on the training set and testing set. To do this, we need to subset our forecast_base. Forecast values before the cutoff date represent the forecast for the training set, and the remaining values are the forecast for the testing set.
forecast_train <- filter(forecast_base, ds < cutoff)
forecast_test <- filter(forecast_base, ds >= cutoff)

For the evaluation metric, we will use MAPE due to its ease of interpretability.
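The evaluation code is not shown; a sketch using MLmetrics (MAPE(y_pred, y_true)) is:

# mean absolute percentage error on the training and testing forecasts
MAPE(forecast_train$yhat, train$y)
MAPE(forecast_test$yhat, test$y)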
#> [1] 0.06395343
#> [1] 0.01897595
💡 Insight:
For a clear representation of our actual data and the prediction, let’s plot them together.
p1 <- ggplot() +
# plot actual train
geom_jitter(data=train,
mapping = aes(ds, y,
color = 'Actual Train')) +
# plot prediction of training set
geom_line(data=forecast_train,
mapping = aes(as.Date(ds),
yhat,
color = 'Prediction'),
size = 2) +
# plot actual test
geom_jitter(data=test,
mapping = aes(ds, y,
color = 'Actual Test')) +
# plot prediction of testing set
geom_line(data=forecast_test ,
mapping = aes(as.Date(ds), yhat,
color = 'Prediction'),
size = 2) +
xlab("Time") +
ylab("O3") +
ggtitle("Actual and Prediction 03 Values") +
scale_color_manual(values = c("Actual Train" = "black",
"Prediction" = "red",
"Actual Test" = "darkgrey")) +
theme_minimal()
ggplotly(p1)

As we can see from the plot, our prediction line has failed to fit the actual values of the testing set. We can also observe that our red prediction line is overly fitted to the training data, failing to capture the general pattern.
We can address this issue by reducing the model complexity.
Prophet offers a number of parameters that can be
adjusted when defining a model object. From the previous analysis, we
identified an anomaly where the trend prediction significantly differs
from that of previous years: the ozone values are increasing throughout
the year. To address this, we will attempt to tune the parameter related
to the trend component: changepoint.prior.scale. This
parameter controls the flexibility of the trend. By default,
changepoint.prior.scale is set to 0.05. Within the
recommended range of 0.001 to 0.5, the greater the value, the more
flexible the trend becomes. In other words, it better fits the training
data, although this increases the probability of overfitting. In this
section, we will set the parameter value to 0.001.
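A sketch of the tuned model definition:

# a less flexible trend via a smaller changepoint prior
m_tune <- prophet(train, changepoint.prior.scale = 0.001)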
The subsequent steps are the same as before.
future_tune <- make_future_dataframe(m_tune, periods = 366)
forecast_tune <- predict(m_tune, future_tune)

forecast_train_tune <- filter(forecast_tune, ds < cutoff)
forecast_test_tune <- filter(forecast_tune, ds >= cutoff)

#> [1] 0.03773785
#> [1] 0.02203781
Compared to the base model, our tuned model’s performance is significantly better. The gap between the training performance and the testing performance is noticeably smaller than with the base model. Although the training error increases slightly, the change is not significant. This suggests that we can be more confident that our model is not overfitting.
Below is code for plotting the actual and predicted data.
p2 <- ggplot() +
# plot actual train
geom_jitter(data=train,
mapping = aes(ds, y, color = 'Actual Train')) +
# plot prediction of training set
geom_line(data=forecast_train_tune,
mapping = aes(as.Date(ds), yhat, color = 'Prediction'),
size = 2) +
# plot actual test
geom_jitter(data=test,
mapping = aes(ds, y, color = 'Actual Test')) +
# plot prediction of testing set
geom_line(data=forecast_test_tune ,
mapping = aes(as.Date(ds), yhat, color = 'Prediction'),
size = 2) +
xlab("Time") +
ylab("O3") +
ggtitle("Actual and Prediction 03 Values") +
scale_color_manual(values = c("Actual Train" = "black",
"Prediction" = "red",
"Actual Test" = "darkgrey")) +
theme_minimal()
ggplotly(p2)

In this section, we will explore another method: exponential smoothing. To provide a quick introduction, exponential smoothing can be categorized into three types: simple exponential smoothing, double exponential smoothing, and triple exponential smoothing.
In the previous section, we observed that our data exhibits both seasonality and trend. Therefore, it is appropriate to model the data using the last of these methods, triple exponential smoothing, also known as Holt-Winters.
Before modeling the data with Holt-Winters, we first convert our time
series data into a time series object using ts(). This
function includes start and frequency
parameters, which denote the starting point of the time series data and
the timespan over which the pattern repetition occurs, respectively.
Once we have our time series object, we can plot them by
autoplot() function.
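The conversion code is not shown; a sketch, assuming a seasonal period of 365 days and the object name ozone_train_ts of our own choosing, is:

# convert the training values into a daily time series starting on January 1st, 2019
ozone_train_ts <- ts(train$y, start = c(2019, 1), frequency = 365)
autoplot(ozone_train_ts)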
The plot shows the same pattern as the previous plot we created with prophet: a sinusoidal pattern in which the ozone values peak around the middle of each 6-month period, resulting in two peaks per year. We can further analyze and verify the trend and seasonality with the following code.
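The decomposition code is not shown; a hedged sketch is a classical decomposition of the training series:

# split the series into trend, seasonal, and remainder components and plot them
ozone_decomp <- decompose(ozone_train_ts)
autoplot(ozone_decomp)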
Now, let’s create our Holt-Winters exponential smoothing model using
the HoltWinters() function. We will pass our training data
to the function and specify the type of seasonality.
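A sketch of the fit; additive seasonality and the object name m_hw are assumptions here, since the settings used by the author are not shown:

# fit a Holt-Winters (triple exponential smoothing) model on the training series
m_hw <- HoltWinters(ozone_train_ts, seasonal = "additive")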
Let’s plot our fitted and actual training data.
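The base plot method for HoltWinters objects does exactly this:

# observed training data in black, fitted values in red
plot(m_hw)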
As can be seen from the plot above, the Holt-Winters fitted line (in red) follows the actual data (in black) closely, indicating the model’s ability to learn the general pattern. Let’s now examine the model’s performance on the training data.
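The evaluation code is not shown; one hedged way to reproduce such an accuracy table is to compare the in-sample fitted values against the observed training series with forecast's accuracy():

# compare the fitted values (xhat) with the observed training series
accuracy(m_hw$fitted[, "xhat"], ozone_train_ts)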
#> ME RMSE MAE MPE MAPE ACF1 Theil's U
#> Test set 0.07375392 7.406057 5.1959 -0.01380456 2.00436 0.3842999 1.159542
From these metrics, we will focus solely on MAPE for a fair comparison. Our model exhibits an error rate of 2.00%, which we can consider a good metric. We will also assess the model’s performance on the testing data. Since our testing data consists of 366 observations, we will use this model to forecast 366 days ahead.
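The forecasting and evaluation code is not shown; a sketch (the object name hw_forecast is ours) is:

# forecast 366 days ahead and measure MAPE against the testing set
hw_forecast <- forecast(m_hw, h = 366)
MAPE(hw_forecast$mean, test$y)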
#> [1] 0.03370775
On the testing set, our model achieves a MAPE score of about 3.37%. This metric does not show a significant gap compared to the training metric. Therefore, we can conclude that this model is well-fit, neither underfitting nor overfitting.
Let’s plot the predictions of our model to gain a clearer picture of whether it accurately captures the correct trend and seasonality.
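A sketch of the forecast plot:

# plot the historical series together with the 366-day forecast and its intervals
autoplot(hw_forecast)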
The prediction is represented by the blue and grey parts of the graph.
As we can see, for 2023, the model successfully depicts the same pattern
as in previous years, where there are two peaks and one lowest point in
ozone values. This consistency in capturing the peaks and troughs of
ozone values suggests that the model is effectively identifying and
replicating the underlying seasonal trends observed in historical
data.
In this article, we have experimented with modeling the time series of ozone layer thickness in the area of South Jakarta from January 1st, 2019, to January 1st, 2024. We tried the following models:

Prophet Model: Base Modeling
Prophet Model: Fine Tuning
Holt-Winters Exponential Smoothing
Based on the results above, the Holt-Winters model and the fine-tuned
Prophet model are the most suitable for modeling the time
series of ozone layer thickness. In the future, we can experiment with
more advanced techniques, such as deep learning, to model time series
data.