The main causes and effects of air pollution are always related to humans. Humans are the main and biggest cause of air pollution. Humans also feel the worst effects of air pollution. Air pollution is one of the environmental damage, in the form of a decrease in air quality due to the entry of harmful elements into the air or the earth’s atmosphere. One of the dangerous elements that enter the atmosphere is carbon monoxide (CO). Government, NGO, and other organization analyze the CO concentration so they could make better action for the future.
This is a report of my forecasting result and seasonality explanation for hourly CO concentration.
library(lubridate)
library(dplyr)
library(forecast)
library(ggplot2)
library(ggthemes)
library(nortest)
library(scales)
library(padr)
library(zoo)airquality <- read.csv("data/PRSA_Data_Tiantan_20130301-20170228.csv")
head(airquality)The air-quality data are from the Beijing Municipal Environmental Monitoring Center. The meteorological data in each air-quality site are matched with the nearest weather station from the China Meteorological Administration which is Tiantan station for this dataset. The time period is from March 1st, 2013 to February 28th, 2017.
The dataset includes information about:
airquality <- airquality %>%
mutate(date = make_datetime(year, month, day, hour)) %>%
arrange(date)airquality <- airquality %>%
pad(start_val = ymd_hms("2013-03-01 00:00:00"))airquality <- airquality %>%
select(date, CO)colSums(is.na(airquality))#> date CO
#> 0 1126
NA value pada kasus disebabkan oleh tidak adanya transaksi, misalnya saat toko tutup. Dengan demikian, NA value dapat digantikan dengan nilai 0.
airquality <- airquality %>%
mutate(CO = na.fill(CO, fill = "extend"))ts <- ts(
data = airquality$CO,
frequency = 24)
ts %>%
decompose() %>%
autoplot() Decomposing a time series means separating it into it’s constituent components.
To deal with such series, we will use the msts class which handles multiple seasonality time series. This allows us to specify all of the frequencies that might be relevant. It is also flexible enough to handle non-integer frequencies.
msts <- msts(airquality$CO,
seasonal.periods = c(24, 24*7))
msts %>%
mstl() %>%
autoplot() There are three seasonal patterns shown, one for the time of hour (the third panel) and one for the time of day (the fourth panel).
hourly <- ts %>%
decompose()
hourly$seasonal %>%
matrix(ncol = 24, byrow = T) %>%
t() %>%
as.data.frame() %>%
select(V1) %>%
mutate(hour = hour(head(airquality$date, 24))) %>%
ggplot(aes(hour, V1, fill = V1)) +
geom_col(show.legend = F) +
geom_hline(yintercept = 0 ) +
labs(x = NULL, y = "seasonality", title = "Hourly Seasonality") +
scale_fill_gradient2(low = "firebrick4", mid = "skyblue", high = "dodgerblue4") +
theme_pander() First, we will look at the hourly seasonality. From the graph above, we can see that positive seasonality happened during 9 pm until 11 am, while negative seasonality happened during 12 pm until 8 pm.
daily <- airquality %>%
ts(frequency = 24*7) %>%
decompose()
daily$seasonal %>%
matrix(ncol = 7, byrow = T) %>%
t() %>%
as.data.frame() %>%
select(V1) %>%
mutate(day = unique(wday(head(airquality$date, 24*7), label = T))) %>%
ggplot(aes(day, V1, fill = V1)) +
geom_col(show.legend = F) +
geom_hline(yintercept = 0 ) +
labs(x = NULL, y = "seasonality", title = "Daily Seasonality") +
scale_fill_gradient2(low = "firebrick4", mid = "skyblue", high = "dodgerblue4") +
theme_pander() From the graph above, we can see that strong positive seasonality happened during Friday until Sunday, and there is no day has negative seasonality.
CO is a pollutant produced from incomplete combustion processes, such as burning waste, burning in household activities, motor vehicles, and industrial activities. The high concentration of CO at a certain time is caused by high human activities.
CO can cause disruption to human health if it is in high concentrations. Besides being harmful to humans, CO is also harmful to the environment.
We need to do cross-validation in choosing the model that has the best performance. msts data will be split into train and test. The amount of data test is adjusted to the amount of data to be predicted, which is 24 hours for 7 days, while the amount of data train is all data that is not used in test.
train <- head(msts, -24*7)
test <- tail(msts, 24*7)stlf_forecast <- train %>%
stlf(h = 24*7)
train %>%
autoplot(series = "actual") +
autolayer(stlf_forecast$fitted, series = "train") +
labs(title = "Forecast from STL and ETS", y = "CO concentration (ug/m^3)", x = NULL) +
theme_pander() ### STLM (ARIMA) Model
stlm_forecast <- train %>%
stlm(method = "arima") %>%
forecast(h = 24*7)
train %>%
autoplot(series = "actual") +
autolayer(stlm_forecast$fitted, series = "train") +
labs(title = "Forecast from STL and ARIMA", y = "CO concentration (ug/m^3)", x = NULL) +
theme_pander() ### Arima Model
arima_model <- auto.arima(train, seasonal = F)
arima_forecast <- arima_model %>%
forecast(h = 24*7)
train %>%
autoplot(series = "actual") +
autolayer(arima_forecast$fitted, series = "train") +
labs(title = "Forecast from ARIMA", y = "CO concentration (ug/m^3)", x = NULL) +
theme_pander() ### TBATS Model
tbats_forecast <- train %>%
tbats() %>%
forecast(h = 24*7)
train %>%
autoplot(series = "actual") +
autolayer(tbats_forecast$fitted, series = "train") +
labs(title = "Forecast from TBATS", y = "CO concentration (ug/m^3)", x = NULL) +
theme_pander()accuracy(stlf_forecast, test)#> ME RMSE MAE MPE MAPE MASE
#> Training set 0.01894629 297.0241 165.3797 -2.882172 18.78098 0.1645756
#> Test set 474.75035054 966.2391 696.2147 38.768736 95.88540 0.6928294
#> ACF1 Theil's U
#> Training set -0.0001102239 NA
#> Test set 0.8990134747 1.246412
accuracy(stlm_forecast, test)#> ME RMSE MAE MPE MAPE MASE
#> Training set 0.03148155 295.8550 165.2600 -3.047554 18.80590 0.1644564
#> Test set 447.11350248 953.1877 688.6042 33.346423 95.39323 0.6852559
#> ACF1 Theil's U
#> Training set 0.00002349512 NA
#> Test set 0.89857745412 1.229984
accuracy(arima_forecast, test)#> ME RMSE MAE MPE MAPE MASE
#> Training set 0.0239463 366.8657 169.7674 -3.927403 15.89813 0.1689420
#> Test set -366.2269049 648.5806 585.3792 -127.047964 138.37156 0.5825328
#> ACF1 Theil's U
#> Training set -0.006540516 NA
#> Test set 0.844857437 1.525917
accuracy(tbats_forecast, test)#> ME RMSE MAE MPE MAPE MASE
#> Training set 31.16354 363.2754 173.7102 -4.012447 16.04725 0.1728655
#> Test set -254.24326 658.9122 570.6766 -110.077932 127.65282 0.5679017
#> ACF1 Theil's U
#> Training set 0.01851385 NA
#> Test set 0.87891799 1.499478
From the four models above, it can be seen that the ARIMA model has the lowest RMSE and MAE, while STLF model has the highest RMSE and MAE.
Box.test(arima_forecast$residuals, type = "Ljung-Box")#>
#> Box-Ljung test
#>
#> data: arima_forecast$residuals
#> X-squared = 1.4929, df = 1, p-value = 0.2218
\(H_0\): residual has no-autocorrelation \(H_1\): residual has autocorrelation
“Because the p-value is greater than 0.05, this model fails to reject H0. ARIMA model does not have any autocorrelation.
ad.test(arima_forecast$residuals)$p.value#> [1] 0.0000000000000000000000037
\(H_0\): residuals are normally distributed \(H_1\): residuals are not normally distributed
Based on the result, ARIMA model didn’t fulfill the normality assumption for the residuals. The positive mean of error signify that the model is underestimate the forecast while negative mean error means the model is overestimate. This suggest that we can improve the model further in order to get better performance.
Zhang, S., Guo, B., Dong, A., He, J., Xu, Z. and Chen, S.X. (2017) Cautionary Tales on Air-Quality Improvement in Beijing. Proceedings of the Royal Society A, Volume 473, No. 2205, Pages 20170457.