I have chosen Beijing Air Quality time-series dataset for data analysis. Data can be found on https://archive.ics.uci.edu/ml/datasets/Beijing+Multi-Site+Air-Quality+Data# .
This hourly data set considers 6 main air pollutants and 6 relevant meteorological variables at multiple sites in Beijing.
changping<-read.csv("C:/Users/Hetal Sawant/Desktop/Spring sem/Forecasting/HW/PRSA_Data_Changping_20130301-20170228.csv")
attach(changping)
str(changping)
## 'data.frame': 35064 obs. of 18 variables:
## $ No : int 1 2 3 4 5 6 7 8 9 10 ...
## $ year : int 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ month : int 3 3 3 3 3 3 3 3 3 3 ...
## $ day : int 1 1 1 1 1 1 1 1 1 1 ...
## $ hour : int 0 1 2 3 4 5 6 7 8 9 ...
## $ PM2.5 : num 3 3 3 3 3 3 4 3 9 11 ...
## $ PM10 : num 6 3 3 6 3 3 6 6 25 29 ...
## $ SO2 : num 13 6 22 12 14 10 12 25 13 5 ...
## $ NO2 : num 7 6 13 8 8 17 22 39 42 18 ...
## $ CO : int 300 300 400 300 300 400 500 600 700 500 ...
## $ O3 : num 85 85 74 81 81 71 65 48 46 73 ...
## $ TEMP : num -2.3 -2.5 -3 -3.6 -3.5 -4.5 -4.5 -2.1 -0.2 0.6 ...
## $ PRES : num 1021 1021 1021 1022 1022 ...
## $ DEWP : num -19.7 -19 -19.9 -19.1 -19.4 -19.5 -19.5 -20 -20.5 -20.4 ...
## $ RAIN : num 0 0 0 0 0 0 0 0 0 0 ...
## $ wd : chr "E" "ENE" "ENE" "NNE" ...
## $ WSPM : num 0.5 0.7 0.2 1 2.1 1.7 1.8 2.5 2.8 3.8 ...
## $ station: chr "Changping" "Changping" "Changping" "Changping" ...
For this assignment, I have chosen to analyze two air pollutants- SO2 and PM10 in mid-October in two different years. I want to check how winter season impacts the air pollutant levels within a span of 24 hours.
As below graphs display, Non-stationary mean and variance is evidently observed in the pollutant levels.
As the data appears to have non-stationary variance, I am using BoxCox transformation to transform the time-series data.
Below graphs represent the transformed time-series for SO2 and PM10 pollutants after the application of BoxCox transformation.
BoxCox transformation helps best to normalize the variance in the time-series data
# Testing for non-stationarity:
As the both ADF test and KPSS test indicate mean is non-stationary, because, p-values are greater than 0.05
##
## Augmented Dickey-Fuller Test
##
## data: SO2_var_transform$SO2_box
## Dickey-Fuller = -1.4761, Lag order = 2, p-value = 0.772
## alternative hypothesis: stationary
##
## KPSS Test for Level Stationarity
##
## data: SO2_var_transform$SO2_box
## KPSS Level = 0.2925, Truncation lag parameter = 2, p-value = 0.1
##
## Augmented Dickey-Fuller Test
##
## data: PM10_var_transform$PM10_box
## Dickey-Fuller = -2.9703, Lag order = 2, p-value = 0.2027
## alternative hypothesis: stationary
##
## KPSS Test for Level Stationarity
##
## data: PM10_var_transform$PM10_box
## KPSS Level = 0.22455, Truncation lag parameter = 2, p-value = 0.1
As below graphs indicate, results of First Difference seem mean and variance stationary.
And validating the results again with ADF and KPSS testing for both air pollutants- SO2 and PM10 ….
##
## Augmented Dickey-Fuller Test
##
## data: SO2_mean_transform$SO2_box_diff[!is.na(SO2_mean_transform$SO2_box_diff)]
## Dickey-Fuller = -2.2843, Lag order = 2, p-value = 0.4641
## alternative hypothesis: stationary
##
## KPSS Test for Level Stationarity
##
## data: SO2_mean_transform$SO2_box_diff[!is.na(SO2_mean_transform$SO2_box_diff)]
## KPSS Level = 0.20602, Truncation lag parameter = 2, p-value = 0.1
##
## Augmented Dickey-Fuller Test
##
## data: PM10_mean_transform$PM10_box_diff[!is.na(PM10_mean_transform$PM10_box_diff)]
## Dickey-Fuller = -3.0144, Lag order = 2, p-value = 0.1859
## alternative hypothesis: stationary
##
## KPSS Test for Level Stationarity
##
## data: PM10_mean_transform$PM10_box_diff[!is.na(PM10_mean_transform$PM10_box_diff)]
## KPSS Level = 0.22715, Truncation lag parameter = 2, p-value = 0.1
SO2 levels time-series is likely to have ‘hourly’ seasonality. Values are higher as the day goes and they start to drop in the evening.
As the ACF plot for SO2 shows ‘dampening effect’, we can conclude that it is a 1st order Auto-Regressive process (AR1).
From the below ACF and PACF plots, we can conclude that PM10 time-series is neither an Auto-Regressive nor a Moving Average process.
In this section, I have fitted different ARIMA models to SO2 time-series to find out the best fit model.
As model_3 (arima(SO2_mean_transform$SO2_box,order=c(1,1,0))) has the lowest AIC as 100.12 and lowest BIC as 102.3935, model 3 is the best fit model.
##
## Call:
## arima(x = SO2_mean_transform$SO2_box, order = c(0, 1, 1))
##
## Coefficients:
## ma1
## 0.5284
## s.e. 0.1785
##
## sigma^2 estimated as 3.81: log likelihood = -48.18, aic = 100.37
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.01467618 1.910888 1.284452 -4.880027 24.36836 0.8190551
## ACF1
## Training set 0.0660325
##
## Call:
## arima(x = SO2_mean_transform$SO2_box, order = c(0, 1, 0))
##
##
## sigma^2 estimated as 5.127: log likelihood = -51.43, aic = 104.87
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 3.953293e-05 2.216632 1.502909 -7.571516 26.41165 0.9583585
## ACF1
## Training set 0.5116884
##
## Call:
## arima(x = SO2_mean_transform$SO2_box, order = c(1, 1, 0))
##
## Coefficients:
## ar1
## 0.4940
## s.e. 0.1736
##
## sigma^2 estimated as 3.778: log likelihood = -48.06, aic = 100.12
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.01813481 1.902786 1.224336 -2.208114 21.64098 0.7807216
## ACF1
## Training set 0.06524748
##
## Call:
## arima(x = SO2_mean_transform$SO2_box, order = c(1, 0, 0))
##
## Coefficients:
## ar1 intercept
## 0.8896 3.4022
## s.e. 0.0770 3.2147
##
## sigma^2 estimated as 4.763: log likelihood = -53.57, aic = 113.14
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 0.1464655 2.182495 1.540662 -19.65965 38.88726 0.9824323 0.5297
##
## Call:
## arima(x = SO2_mean_transform$SO2_box, order = c(2, 0, 0))
##
## Coefficients:
## ar1 ar2 intercept
## 1.3976 -0.5553 4.3970
## s.e. 0.1579 0.1594 2.1866
##
## sigma^2 estimated as 3.131: log likelihood = -48.94, aic = 105.89
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.06322245 1.769512 1.294584 -23.95343 41.37393 0.825516
## ACF1
## Training set -0.006231726
## df BIC
## arima(SO2_mean_transform$SO2_box, order = c(0, 1, 1)) 2 102.6366
## arima(SO2_mean_transform$SO2_box, order = c(0, 1, 0)) 1 106.0010
## arima(SO2_mean_transform$SO2_box, order = c(1, 1, 0)) 2 102.3935
## arima(SO2_mean_transform$SO2_box, order = c(1, 0, 0)) 3 116.6727
## arima(SO2_mean_transform$SO2_box, order = c(2, 0, 0)) 4 110.5999
Here, residuals appear to be a white noise.
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,1,0)
## Q* = 4.1495, df = 4, p-value = 0.3862
##
## Model df: 1. Total lags used: 5
Conducting Box-Ljung test for residual autocorrelation: There is no residual auto-correlation at the lag.
##
## Box-Ljung test
##
## data: mod3$residuals
## X-squared = 0.1155, df = 1, p-value = 0.734
##
## Box-Ljung test
##
## data: mod3$residuals
## X-squared = 4.1495, df = 5, p-value = 0.5281
## Series: SO2_mean_transform$SO2_box
## ARIMA(2,0,0) with zero mean
##
## Coefficients:
## ar1 ar2
## 1.4236 -0.5118
## s.e. 0.1623 0.1616
##
## sigma^2 = 3.664: log likelihood = -49.99
## AIC=105.97 AICc=107.17 BIC=109.51
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.4328338 1.832726 1.221029 8.145442 25.08879 0.7786125
## ACF1
## Training set 0.004153311
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,0,0) with zero mean
## Q* = 4.0299, df = 3, p-value = 0.2583
##
## Model df: 2. Total lags used: 5
##
## Box-Ljung test
##
## data: mod1_auto$residuals
## X-squared = 0.000468, df = 1, p-value = 0.9827
##
## Box-Ljung test
##
## data: mod1_auto$residuals
## X-squared = 4.0299, df = 5, p-value = 0.5451
RMSE is 3904.509
## [1] 3904.509
We can observe that the prediction is correct and as the late evening hours, SO2 levels drop.
## List of 2
## $ pred: Time-Series [1:5] from 25 to 29: 0.4132 0.1027 -0.0653 -0.1455 -0.1738
## $ se : Time-Series [1:5] from 25 to 29: 1.91 3.33 4.42 5.19 5.72
## Joining, by = "hour"