1 Introduction

1.1 Background

Ozone is a crucial gas present in the Earth’s upper atmosphere. It forms an essential layer that envelops our planet. This layer, often referred to as the ozone layer, exists predominantly in the stratosphere, a region extending about 10 to 30 miles above the Earth’s surface. Its primary function is to act as a protective shield, absorbing a significant portion of the sun’s harmful ultraviolet (UV) radiation. Without the ozone layer, these UV rays would reach the Earth’s surface in much higher intensities, posing a serious threat to all forms of life.

The chemical composition of ozone is quite interesting. It is made up of three oxygen atoms (O₃), which is different from the oxygen we are familiar with in the air we breathe, which has only two oxygen atoms (O₂). This unique structure of three oxygen atoms is what gives ozone its ability to absorb UV radiation effectively. It is this absorption characteristic that makes the ozone layer so vital for protecting life on Earth.

Understanding the significance of ozone extends beyond its composition and protective function. The ozone layer is crucial for maintaining a balance in the planet’s ecosystems. It protects humans from harmful UV rays that can lead to skin cancer, cataracts, and other health issues. Moreover, it shields the environment, preserving the integrity of various ecosystems and wildlife. This protective layer ensures that life on Earth can thrive without the constant threat of UV radiation damage.

Consequently, predicting the thickness of the ozone layer becomes a matter of great importance. Monitoring and predicting changes in the ozone layer help us understand the health of our environment. It also guides us in taking necessary precautions to protect ourselves, especially during periods when the ozone layer might be thinner. This forecasting is not just about immediate health concerns but is also crucial for long-term environmental strategies and policies aimed at preserving the Earth’s atmosphere.

1.2 Dataset

In this forecast, we acquire data by scraping it from OMI/OMPS Ozone Time Series Data. This tool allows us to retrieve ozone values as a time series for a specified range within certain longitude and latitude coordinates. For this project, we set the longitude to 106.8106 and the latitude to -6.261493, specifically targeting the coordinates of South Jakarta. The data is retrieved for a timespan of 5 years (January 1st 2019 - January 1st 2024). The data obtained from scraping is presented in the following columns:

  • yyyy: the year when specific data was retrieved.
  • MM: the month when specific data was retrieved.
  • dd: the day of the month when specific data was retrieved.
  • DD: the day of the year when specific data was retrieved.
  • O3_Nearest: daily average total column ozone values in
  • O3_Interpolated: daily average total column ozone values in Dobson units, interpolated.
  • InterpolPos: a logical value indicating the interpolation process.

1.3 Problem Statement

Given the background and dataset, our aim in this project is:

To analyze the periodic components of ozone values and utilize them to forecast future ozone levels.

2 Load Packages

Below are the packages that we will use during the analysis.

# for data loading - data preprocessing
library(dplyr)
library(tidyr)
library(padr)
library(zoo)  
library(lubridate)

# data visualization
library(ggplot2)
library(plotly)
library(glue)

# modeling - evaluation
library(prophet)
library(forecast)  
library(TTR)       
library(tseries)   
library(fpp)       
library(MLmetrics)

3 Load Dataset

The retrieval of data is done twice: January 2019 - December 2020 and January 2021 - early January 2024. For these, we have 2 .csv files in total. We will load them as df1 and df2. Subsequently, we will merge df1 and df2 as df.

df1 <- read.csv('data/SatO3DataServlet_January2019-December2020.csv')
df2 <- read.csv('data/SatO3DataServlet_January2021-January2024.csv')
head(df1, 3)
head(df2, 3)
df <- rbind(df1, df2)
head(df)

4 Data Preparation

In the initial phase of our analysis, we will focus on data cleansing to guarantee precision in our analysis and modeling process. Data cleansing involves identifying and rectifying any inaccuracies, corruptions, formatting errors, duplications, or incomplete information within our dataset. Let’s check our data structure.

glimpse(df)
#> Rows: 1,826
#> Columns: 7
#> $ yyyy            <int> 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, …
#> $ MM              <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ dd              <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
#> $ DDD             <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
#> $ O3_Nearest      <dbl> 256, 0, 259, 257, 257, 258, 262, 262, 0, 261, 273, 267…
#> $ O3_Interpolated <dbl> 255.546, 0.000, 258.719, 257.352, 257.908, 259.386, 26…
#> $ InterpolPos     <chr> "           Y", "           N", "           Y", "     …

As seen above, our datetime details are separated into individual columns, which is impractical for time series analysis. Therefore, we will create a new column that combines all these time details.

df$datetime <- as.Date(paste(df$yyyy, df$MM, df$dd, sep = "-"), format = "%Y-%m-%d")

Recheck our data.

glimpse(df)
#> Rows: 1,826
#> Columns: 8
#> $ yyyy            <int> 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, …
#> $ MM              <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ dd              <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
#> $ DDD             <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
#> $ O3_Nearest      <dbl> 256, 0, 259, 257, 257, 258, 262, 262, 0, 261, 273, 267…
#> $ O3_Interpolated <dbl> 255.546, 0.000, 258.719, 257.352, 257.908, 259.386, 26…
#> $ InterpolPos     <chr> "           Y", "           N", "           Y", "     …
#> $ datetime        <date> 2019-01-01, 2019-01-02, 2019-01-03, 2019-01-04, 2019-…

For our analysis, we will select only the columns of interest: datetime and O3_Interpolated. Both of columns will be saved as a new dataframe named ozone.

ozone <- df %>% select(datetime, O3_Interpolated)
colnames(ozone) <- c('date','O3')
head(ozone, 5)

As we may already know, time series data is self-explanatory: it consists of a series of data points ordered in time. Consequently, these time points must be arranged from past to present. Accordingly, we will ensure that our data is arranged in the required format.

ozone <- ozone %>%  arrange(date)

The second requirement that needs to be fulfilled is time completeness. Time completeness ensures that our data has a constant interval between each point. In our case, the data is in a daily format, meaning the interval between each data point is one day.

We will create full_date as a complete time sequence covering our timespan. Next, we will compare this full time sequence with our time sequence to identify any differences, checking for any missing times.

full_date = seq.Date(from = min(ozone$date), 
                     to = max(ozone$date), 
                     by = "day")
length(full_date)
#> [1] 1827

If our data is complete, then it should have 1827 rows in it.

dim(ozone)
#> [1] 1826    2
missing_dates <- as.Date(setdiff(c(full_date), c(ozone$date)))
missing_dates
#> [1] "2023-12-03"

Since our data only has 1826 rows, we assume that one day is missing, which is December 3rd, 2023. We will address this by performing time padding. We will restructure our data to include a complete timeline.

ozone <- ozone %>% pad() 
tail(ozone, 30)

As we can see, the record for December 3rd, 2023 is identified as NA. We will impute this missing value with 0.

ozone <- ozone %>% 
  mutate(O3 = na.fill(O3, fill = 0))
tail(ozone, 30)

We observe that on several days, specifically 534 out of 1827 days in our dataset, the record shows a value of 0. This means that approximately 30% of the days are recorded with a value of 0.

ozone %>% filter(O3 == 0)

There could be many reasons for this, but it is most likely due to a recording error by the satellite. To address this, we will impute the 0 value with the ozone value from the previous day.

# replace zeros with NA
ozone <- ozone %>% mutate(O3 = ifelse(O3 == 0, NA, O3))

# fill NA with the previous non-NA value
ozone <- ozone %>% fill(O3, .direction = "down")

Now that our data is clean, we are ready to move on to the next steps.

5 Exploratory Data Analysis

Before we begin modeling the time series, let’s first examine our data for any patterns. Visualizing the data is key here, as it helps us spot trends and anomalies that might not be obvious just from numbers alone. This step is crucial for choosing the right approach to our analysis.

ggplot(data = ozone, 
       aes(x = date, 
           y = O3)) +
  geom_point(color = "#00008B") +
  theme_minimal()

From this visualization, we can see that there are quite interesting patterns:

  • Over a timespan of five years, the ozone thickness pattern follows a sinusoidal shape.
  • Within a one-year timespan, the ozone thickness pattern generally shows a peak twice.

To further investigate this, we will explore the times at which the ozone values fall below 250 Dobson units and above 275 Dobson units.

below_250 <- ozone %>% filter(O3 <= 250)
below_250$month_name <- month(below_250$date, label = TRUE, abbr = FALSE)
table(below_250$month_name)
#> 
#>   January  February     March     April       May      June      July    August 
#>        26        10        11        18        38        31        47         8 
#> September   October  November  December 
#>         0         4         8        28
above_275 <- ozone %>% filter(O3 >= 275)
above_275$month_name <- month(above_275$date, label = TRUE, abbr = FALSE)
table(above_275$month_name)
#> 
#>   January  February     March     April       May      June      July    August 
#>         0         8        13         8         3         0         1         0 
#> September   October  November  December 
#>         4        34        11         8

These tables suggest a tendency for ozone values to reach their lowest points in the early and middle parts of the year (January, May, June, July, December). In addition to this, the ozone values reach their peak around the middle of a six-month period (March, April, October, November).

range(ozone$O3)
#> [1] 235.908 340.408

Furthermore, it has been investigated that over a five-year period, ozone values consistently fall between 235 and 275 units. Given that the average global ozone value is 300 units, this range is considered low, suggesting a low level of ozone.

6 Modeling

We have gained insights into the data’s seasonality, identifying the periods of peak and low values. In this section, we will experiment with building models to forecast future ozone levels. We will explore various machine learning techniques, including Prophet by Meta and ARIMA. Before proceeding, we will divide our data into two parts: the training set and the testing set. Specifically, we will use data from 2019 to 2022 as the training set and data from 2023 as the testing set.

cutoff <- as.Date("2023-01-01")

train <- ozone %>% filter(date < cutoff)
test <- ozone %>% filter(date >= cutoff) 
dim(train)
#> [1] 1461    2
dim(test)
#> [1] 366   2

The model will use 1461 days of data to learn the trends and seasonality. The remaining data will be used for model evaluation.

6.1 Prophet

6.1.1 How Prophet Works

prophet is a package released by Meta, designed for handling time series data using the General Additive Model (GAM). Time series data is composed of trend, seasonality, and error.

  • Trend: captures the long-term movement of our time series data.
  • Seasonality: captures the periodic patterns in our data.
  • Error: represents random patterns that are not captured by trend and seasonality.

6.1.2 Data Preparation

As a starting point, prophet requires our data to be organized into two columns: ds and y.

  • ds: datetime details.
  • y: numerical value.

To accommodate this requirement, we will rename our columns.

colnames(train) <- c('ds','y')
colnames(test) <- c('ds','y')

6.1.3 Base Modeling

6.1.3.1 Model Object

For the modeling phase, we will create the model object using the prophet() function and feed our training data into it. By using prophet(), we declare the base model. Use the default values for each parameter as they are. With the code below, our model will learn the corresponding trends and seasonality.

m_base <- prophet(train)

6.1.3.2 Prediction

Next, we want our model to predict future values. The first thing we do is provide a list of times we want to predict. Here, we set the parameter periods to 366, indicating our intention to forecast one year ahead. The choice of 366 is customized to our needs and our data. That is, our data is in daily format. Hence, the periods is set to the same unit as our data. If we want to forecast for 6 months, then periods = 180. Furthermore, setting periods = 366 makes it easier for us to perform model evaluation.

future_base <- make_future_dataframe(m_base, periods = 366)
tail(future_base, 30)

By running the above code, we create a future_base dataframe. future_base includes the timespan of our historical data and 366 days after the last day in our training data.

Subsequently, we use the predict() function to make predictions by passing the model m_base and a list of times we want to predict in future_base.

forecast_base <- predict(m_base, future_base)
head(forecast_base)

The model returns several columns as prediction results, but we focus on the following:

  • ds: the datetime.
  • yhat: the predicted value at the specific time.
  • trend: the long-term movement of our data.
  • weekly: weekly seasonality, reflecting the cyclic pattern over 7 days.
  • yearly: yearly seasonality, indicating the cyclic pattern over 365 days.

Remember that the prophet models our data using Generalized Additive Models (GAM). In this context:

\[yhat = trend + weekly + yearly\] Our prediction comprises the trend and seasonal components.

6.1.3.3 Visualization

prophet offers the capability to visualize both the actual data and the predicted values. This can be achieved using the plot() function by passing both the model and the forecasting results to it.

# static plotting
plot(m_base, forecast_base) 

We can plot the forecast in more interactive way with the following code.

# interactive plotting
dyplot.prophet(m_base, forecast_base)

Additionally, we can plot the results by component, allowing us to visualize the trend and seasonality separately.

prophet_plot_components(m_base, forecast_base)

💡 Insight:

  1. The ozone levels showed a decreasing trend from 2019 to early 2022. Subsequently, they began to increase until the end of 2023 and into early 2024.
  2. From the weekly seasonality analysis, we know that the ozone levels from Sunday to Wednesday tend to be similar. They drastically decrease on Thursday and significantly increase on Friday, followed by a decrease again on Saturday.
  3. From the annual seasonality perspective, it is evident that the peak in ozone levels occurs twice a year, specifically in March and October. The lowest ozone levels throughout the year are observed in mid-year, namely in July.
  4. From the overall plot, we observe that the trend of ozone values from early 2019 to 2022 tends to decrease and is predicted to increase throughout 2023. This pattern is quite different from other years.

6.1.3.4 Model Performance

In the next step, we will evaluate our model’s performance when making predictions on the training set and testing set. To do this, we need to subset our forecast_base. Forecast values between the cutoff date represent the forecast values of the training set, and the remaining values are the forecast values of the testing set.

forecast_train <- filter(forecast_base, ds < cutoff)
forecast_test <- filter(forecast_base, ds >= cutoff )

For the evaluation metrics, we will use MAPE metrics due to its ease of interpretability.

MAPE(forecast_test$yhat, test$y)
#> [1] 0.06395343
MAPE(forecast_train$yhat, train$y)
#> [1] 0.01897595

💡 Insight:

  • The error rate in the training set is 1.90%, and in the testing set, it is 6.40%. This results in a 4.5% difference, suggesting that our model is too good at learning the training data but has less ability to generalize to unseen data. This is an early sign of overfitting.

For a clear representation of our actual data and the prediction, let’s plot them together.

p1 <- ggplot() + 
  # plot actual train
  geom_jitter(data=train, 
              mapping = aes(ds, y, 
                            color = 'Actual Train')) + 
  # plot prediction of training set
  geom_line(data=forecast_train, 
            mapping = aes(as.Date(ds), 
                          yhat, 
                          color = 'Prediction'), 
            size = 2) +
  # plot actual test
  geom_jitter(data=test, 
              mapping = aes(ds, y, 
                            color = 'Actual Test')) + 
  # plot prediction of testing set
  geom_line(data=forecast_test , 
            mapping = aes(as.Date(ds), yhat, 
                          color = 'Prediction'), 
            size = 2) +
  xlab("Time") + 
  ylab("O3") + 
  ggtitle("Actual and Prediction 03 Values") + 
  scale_color_manual(values = c("Actual Train" = "black", 
                                "Prediction" = "red", 
                                "Actual Test" = "darkgrey")) +
  theme_minimal()

ggplotly(p1)

As we can see from the plot, our prediction line has failed to fit the actual values of the testing set. We can also observe that our red prediction line is overly fitted to the training data, failing to capture the general pattern.

We can address this issue by reducing the model complexity.

6.1.4 Model Improvement: Hyperparameter Tuning

Prophet offers a number of parameters that can be adjusted when defining a model object. From the previous analysis, we identified an anomaly where the trend prediction significantly differs from that of previous years: the ozone values are increasing throughout the year. To address this, we will attempt to tune the parameter related to the trend component: changepoint.prior.scale. This parameter controls the flexibility of the trend. By default, changepoint.prior.scale is set to 0.05. Within the recommended range of 0.001 to 0.5, the greater the value, the more flexible the trend becomes. In other words, it better fits the training data, although this increases the probability of overfitting. In this section, we will set the parameter value to 0.001.

m_tune <- prophet(train, changepoint.prior.scale = 0.001)

The subsequent parts go as the same as before.

future_tune <- make_future_dataframe(m_tune, periods = 366)
forecast_tune <- predict(m_tune, future_tune)
plot(m_tune, forecast_tune) 

# interactive plotting
dyplot.prophet(m_tune, forecast_tune)
prophet_plot_components(m_tune, forecast_tune)

forecast_train_tune <- filter(forecast_tune, ds < cutoff)
forecast_test_tune <- filter(forecast_tune, ds >= cutoff )
MAPE(forecast_test_tune$yhat, test$y)
#> [1] 0.03773785
MAPE(forecast_train_tune$yhat, train$y)
#> [1] 0.02203781

Compared to the previous one, our tuned model’s performance is significantly better than that of the base model. The gap between the training performance and testing performance is relatively smaller than with the base model. Although there is an increase in training error, it is not so significant. This suggests that we can be more confident that our model is not overfitting.

Below is code for plotting the actual and predicted data.

p2 <- ggplot() + 
  # plot actual train
  geom_jitter(data=train, 
              mapping = aes(ds, y, color = 'Actual Train')) + 
  # plot prediction of training set
  geom_line(data=forecast_train_tune, 
            mapping = aes(as.Date(ds), yhat, color = 'Prediction'), 
            size = 2) +
  # plot actual test
  geom_jitter(data=test, 
              mapping = aes(ds, y, color = 'Actual Test')) + 
  # plot prediction of testing set
  geom_line(data=forecast_test_tune , 
            mapping = aes(as.Date(ds), yhat, color = 'Prediction'), 
            size = 2) +
  xlab("Time") + 
  ylab("O3") + 
  ggtitle("Actual and Prediction 03 Values") + 
  scale_color_manual(values = c("Actual Train" = "black", 
                                "Prediction" = "red", 
                                "Actual Test" = "darkgrey")) +
  theme_minimal()

ggplotly(p2)

6.2 Exponential Smoothing

In this section, we will explore another method: exponential smoothing. To provide a quick introduction, exponential smoothing can be categorized into three types: simple exponential smoothing, double exponential smoothing, and triple exponential smoothing.

  • Simple exponential smoothing is a technique used to model time series data that does not exhibit a trend or seasonality.
  • Double exponential smoothing, also known as Holt’s exponential smoothing, is a technique used to model time series data that has a trend but no seasonality.
  • Triple exponential smoothing, often referred to as Holt-Winters exponential smoothing, is a technique used to model time series data that exhibits both trends and seasonality.

In the previous section, we observed that our data exhibits both seasonality and trends. Therefore, it is appropriate to model the data using the latter method of exponential smoothing.

Before modeling the data with Holt-Winters, we first convert our time series data into a time series object using ts(). This function includes start and frequency parameters, which denote the starting point of the time series data and the timespan over which the pattern repetition occurs, respectively.

train_ts <- ts(data = train$y,
               start = c(2019,1,1),
               frequency  = 365) 

Once we have our time series object, we can plot them by autoplot() function.

train_ts %>% autoplot()

The plot shows no differences compared to the previous plot we created with prophet. There is a sinusoidal pattern where the peak of ozone values occurs in the middle of a 6-month period, resulting in two peaks per year. We can further analyze and verify the trend and seasonality with the following code.

train_ts %>% decompose() %>% plot()

Now, let’s create our Holt-Winters exponential smoothing model using the HoltWinters() function. We will pass our training data to the function and specify the type of seasonality.

o3_hw <- HoltWinters(train_ts, seasonal = "additive")

Let’s plot our fitted and actual training data.

plot(o3_hw)

As can be seen from the plot above, we can conclude that the Holt-Winters prediction line (in red) fits the actual data (in black) efficiently, indicating its ability to learn the general pattern. Let’s now examine the model’s performance on the training data.

accuracy(o3_hw$fitted[,1], train_ts)
#>                  ME     RMSE    MAE         MPE    MAPE      ACF1 Theil's U
#> Test set 0.07375392 7.406057 5.1959 -0.01380456 2.00436 0.3842999  1.159542

From these metrics, we will focus solely on MAPE for a fair comparison. Our model exhibits an error rate of 2.00%, which we can consider a good metric. We will also assess the model’s performance on the testing data. Since our testing data consists of 366 observations, we will use this model to forecast 366 days ahead.

o3_preds <- data.frame(forecast(o3_hw, h=366))

head(o3_preds)
MAPE(o3_preds$Point.Forecast, test$y)
#> [1] 0.03370775

On the testing set, our model achieves a MAPE score of 3.40%. This metric does not show a significant gap compared to the training metrics. Therefore, we can conclude that this model is well-fit, neither underfitting nor overfitting.

Let’s plot the predictions of our model to gain a clearer picture of whether it accurately captures the correct trend and seasonality.

plot(forecast(o3_hw, h=366))

The prediction is represented by the blue and grey parts of the graph. As we can see, for 2023, the model successfully depicts the same pattern as in previous years, where there are two peaks and one lowest point in ozone values. This consistency in capturing the peaks and troughs of ozone values suggests that the model is effectively identifying and replicating the underlying seasonal trends observed in historical data.

7 Conclusion

In this article, we have experimented with modeling the thickness of the ozone layer’s time series data in the area of South Jakarta from January 1st, 2019, to January 1st, 2024. We tried two models with the following results:

  • Prophet Model: Base Modeling
    • MAPE on training data: 1.90%.
    • MAPE on testing data: 6.40%.
  • Prophet Model: Fine Tuning
    • MAPE on training data: 2.20%.
    • MAPE on testing data: 3.80%.
  • Holt-Winters Exponential Smoothing
    • MAPE on training data: 2.00%.
    • MAPE on testing data: 3.37%.

Based on the results above, the Holt-Winters model and the fine-tuned Prophet model are the most suitable for modeling the time series of ozone layer thickness. In the future, we can experiment with more advanced techniques, such as deep learning, to model time series data.