INTRODUCTION

The main causes and effects of air pollution are always related to humans. Humans are the main and biggest cause of air pollution. Humans also feel the worst effects of air pollution. Air pollution is one of the environmental damage, in the form of a decrease in air quality due to the entry of harmful elements into the air or the earth’s atmosphere. One of the dangerous elements that enter the atmosphere is carbon monoxide (CO). Government, NGO, and other organization analyze the CO concentration so they could make better action for the future.

This is a report of my forecasting result and seasonality explanation for hourly CO concentration.

IMPORT DATA

Load The Packages

library(lubridate)
library(dplyr)
library(forecast)
library(ggplot2)
library(ggthemes)
library(nortest)
library(scales)
library(padr)
library(zoo)

Read Data

airquality <- read.csv("data/PRSA_Data_Tiantan_20130301-20170228.csv")

head(airquality)

The air-quality data are from the Beijing Municipal Environmental Monitoring Center. The meteorological data in each air-quality site are matched with the nearest weather station from the China Meteorological Administration which is Tiantan station for this dataset. The time period is from March 1st, 2013 to February 28th, 2017.

The dataset includes information about:

  • No: row number
  • year: year of data in this row
  • month: month of data in this row
  • day: day of data in this row
  • hour: hour of data in this row
  • PM2.5: PM2.5 concentration (ug/m^3)
  • PM10: PM10 concentration (ug/m^3)
  • SO2: SO2 concentration (ug/m^3)
  • NO2: NO2 concentration (ug/m^3)
  • CO: CO concentration (ug/m^3)
  • O3: O3 concentration (ug/m^3)
  • TEMP: temperature (degree Celsius)
  • PRES: pressure (hPa)
  • DEWP: dew point temperature (degree Celsius)
  • RAIN: precipitation (mm)
  • wd: wind direction
  • WSPM: wind speed (m/s)
  • station: name of the air-quality monitoring site

DATA PREPROCESSING

Data Aggregation

1. Arrange Data and Adjust Data Type

airquality <- airquality %>%
  mutate(date = make_datetime(year, month, day, hour)) %>%
  arrange(date)

2. Time Series Padding

airquality <- airquality %>%
  pad(start_val = ymd_hms("2013-03-01 00:00:00"))

3. Select Data

airquality <- airquality %>%
  select(date, CO)

4. Check and Replace NA Value

colSums(is.na(airquality))
#> date   CO 
#>    0 1126

NA value pada kasus disebabkan oleh tidak adanya transaksi, misalnya saat toko tutup. Dengan demikian, NA value dapat digantikan dengan nilai 0.

airquality <- airquality %>% 
  mutate(CO = na.fill(CO, fill = "extend"))

SEASONALITY ANALYSIS

Time Series Object

ts <- ts(
  data = airquality$CO,
  frequency = 24)

ts %>%
  decompose() %>%
  autoplot()

Decomposing a time series means separating it into it’s constituent components.

  1. The first panel from top is the original, observed time series. Note that a series with multiplicative effects can often be transformed into one with additive effect through a log transformation.
  2. The second panel plots the seasonal component, with the figure being computed by taking the average for each time unit over all periods and then centering it around the mean.
  3. The third panel plots the trend component, We see that the estimated trend component shows a pattern. This pattern in trend might be sourced from uncaptured extra seasonality from higher natural period in this case, so it can be considered as multi-seasonal data. To solve this complex seasonality, we need to convert the data into msts() object which accept multiple frequency setting.
  4. The bottom-most panel the error component, which is determined by removing the trend and seasonal figure.

To deal with such series, we will use the msts class which handles multiple seasonality time series. This allows us to specify all of the frequencies that might be relevant. It is also flexible enough to handle non-integer frequencies.

Multi-seasonality Time Series Object

msts <- msts(airquality$CO,
             seasonal.periods = c(24, 24*7))
msts %>%
  mstl() %>%
  autoplot()

There are three seasonal patterns shown, one for the time of hour (the third panel) and one for the time of day (the fourth panel).

Hourly Seasonality

hourly <- ts %>% 
  decompose()

hourly$seasonal %>% 
  matrix(ncol = 24, byrow = T) %>%
  t() %>% 
  as.data.frame() %>% 
  select(V1) %>% 
  mutate(hour = hour(head(airquality$date, 24))) %>% 
  ggplot(aes(hour, V1, fill = V1)) +
  geom_col(show.legend = F) +
  geom_hline(yintercept = 0 ) +
  labs(x = NULL, y = "seasonality", title = "Hourly Seasonality") +
  scale_fill_gradient2(low = "firebrick4", mid = "skyblue", high = "dodgerblue4") +
  theme_pander()

First, we will look at the hourly seasonality. From the graph above, we can see that positive seasonality happened during 9 pm until 11 am, while negative seasonality happened during 12 pm until 8 pm.

Daily Seasonality

daily <- airquality %>%
  ts(frequency = 24*7) %>% 
  decompose()

daily$seasonal %>% 
  matrix(ncol = 7, byrow = T) %>%
  t() %>% 
  as.data.frame() %>% 
  select(V1) %>% 
  mutate(day = unique(wday(head(airquality$date, 24*7), label = T))) %>% 
  ggplot(aes(day, V1, fill = V1)) +
  geom_col(show.legend = F) +
  geom_hline(yintercept = 0 ) +
  labs(x = NULL, y = "seasonality", title = "Daily Seasonality") +
  scale_fill_gradient2(low = "firebrick4", mid = "skyblue", high = "dodgerblue4") +
  theme_pander()

From the graph above, we can see that strong positive seasonality happened during Friday until Sunday, and there is no day has negative seasonality.

CO is a pollutant produced from incomplete combustion processes, such as burning waste, burning in household activities, motor vehicles, and industrial activities. The high concentration of CO at a certain time is caused by high human activities.

CO can cause disruption to human health if it is in high concentrations. Besides being harmful to humans, CO is also harmful to the environment.

MODEL FITTING AND EVALUATION

Cross Validation

We need to do cross-validation in choosing the model that has the best performance. msts data will be split into train and test. The amount of data test is adjusted to the amount of data to be predicted, which is 24 hours for 7 days, while the amount of data train is all data that is not used in test.

train <- head(msts, -24*7)
test <- tail(msts, 24*7)

Model Fitting

STLF Model

stlf_forecast <- train %>%
  stlf(h = 24*7)

train %>% 
  autoplot(series = "actual") +
  autolayer(stlf_forecast$fitted, series = "train") +
  labs(title = "Forecast from STL and ETS",  y = "CO concentration (ug/m^3)", x  = NULL) +
  theme_pander()

### STLM (ARIMA) Model

stlm_forecast <- train %>%
  stlm(method = "arima") %>%
  forecast(h = 24*7)

train %>% 
  autoplot(series = "actual") +
  autolayer(stlm_forecast$fitted, series = "train") +
  labs(title = "Forecast from STL and ARIMA", y = "CO concentration (ug/m^3)", x  = NULL) +
  theme_pander()

### Arima Model

arima_model <- auto.arima(train, seasonal = F)

arima_forecast <- arima_model %>%
  forecast(h = 24*7)

train %>% 
  autoplot(series = "actual") +
  autolayer(arima_forecast$fitted, series = "train") +
  labs(title = "Forecast from ARIMA", y = "CO concentration (ug/m^3)", x  = NULL) +
  theme_pander()

### TBATS Model

tbats_forecast <- train %>%
  tbats() %>%
  forecast(h = 24*7)

train %>% 
  autoplot(series = "actual") +
  autolayer(tbats_forecast$fitted, series = "train") +
  labs(title = "Forecast from TBATS", y = "CO concentration (ug/m^3)", x  = NULL) +
  theme_pander()

Comparison

accuracy(stlf_forecast, test)
#>                        ME     RMSE      MAE       MPE     MAPE      MASE
#> Training set   0.01894629 297.0241 165.3797 -2.882172 18.78098 0.1645756
#> Test set     474.75035054 966.2391 696.2147 38.768736 95.88540 0.6928294
#>                       ACF1 Theil's U
#> Training set -0.0001102239        NA
#> Test set      0.8990134747  1.246412
accuracy(stlm_forecast, test)
#>                        ME     RMSE      MAE       MPE     MAPE      MASE
#> Training set   0.03148155 295.8550 165.2600 -3.047554 18.80590 0.1644564
#> Test set     447.11350248 953.1877 688.6042 33.346423 95.39323 0.6852559
#>                       ACF1 Theil's U
#> Training set 0.00002349512        NA
#> Test set     0.89857745412  1.229984
accuracy(arima_forecast, test)
#>                        ME     RMSE      MAE         MPE      MAPE      MASE
#> Training set    0.0239463 366.8657 169.7674   -3.927403  15.89813 0.1689420
#> Test set     -366.2269049 648.5806 585.3792 -127.047964 138.37156 0.5825328
#>                      ACF1 Theil's U
#> Training set -0.006540516        NA
#> Test set      0.844857437  1.525917
accuracy(tbats_forecast, test)
#>                      ME     RMSE      MAE         MPE      MAPE      MASE
#> Training set   31.16354 363.2754 173.7102   -4.012447  16.04725 0.1728655
#> Test set     -254.24326 658.9122 570.6766 -110.077932 127.65282 0.5679017
#>                    ACF1 Theil's U
#> Training set 0.01851385        NA
#> Test set     0.87891799  1.499478

From the four models above, it can be seen that the ARIMA model has the lowest RMSE and MAE, while STLF model has the highest RMSE and MAE.

CONCLUSION

  1. No Autocorrelation
Box.test(arima_forecast$residuals, type = "Ljung-Box")
#> 
#>  Box-Ljung test
#> 
#> data:  arima_forecast$residuals
#> X-squared = 1.4929, df = 1, p-value = 0.2218

\(H_0\): residual has no-autocorrelation \(H_1\): residual has autocorrelation

“Because the p-value is greater than 0.05, this model fails to reject H0. ARIMA model does not have any autocorrelation.

  1. Normality of Residuals
ad.test(arima_forecast$residuals)$p.value
#> [1] 0.0000000000000000000000037

\(H_0\): residuals are normally distributed \(H_1\): residuals are not normally distributed

Based on the result, ARIMA model didn’t fulfill the normality assumption for the residuals. The positive mean of error signify that the model is underestimate the forecast while negative mean error means the model is overestimate. This suggest that we can improve the model further in order to get better performance.

  1. Based on seasonality, Friday-Sunday (daily seasonality) and 9 pm - 11 am (hourly seasonality) have the highest visitors.

REFERENCES

Zhang, S., Guo, B., Dong, A., He, J., Xu, Z. and Chen, S.X. (2017) Cautionary Tales on Air-Quality Improvement in Beijing. Proceedings of the Royal Society A, Volume 473, No. 2205, Pages 20170457.