Time Series Dataset: Beijing Air Quality

Section 1

I have chosen the Beijing Air Quality time-series dataset for this analysis. The data can be found at https://archive.ics.uci.edu/ml/datasets/Beijing+Multi-Site+Air-Quality+Data# .

This hourly dataset contains six main air pollutants and six relevant meteorological variables recorded at multiple monitoring sites in Beijing.

# Load the hourly Changping station data (March 2013 - February 2017)
changping <- read.csv("C:/Users/Hetal Sawant/Desktop/Spring sem/Forecasting/HW/PRSA_Data_Changping_20130301-20170228.csv")
attach(changping)   # attach columns for direct access by name
str(changping)      # inspect the structure of the data
## 'data.frame':    35064 obs. of  18 variables:
##  $ No     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ year   : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month  : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ day    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ hour   : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ PM2.5  : num  3 3 3 3 3 3 4 3 9 11 ...
##  $ PM10   : num  6 3 3 6 3 3 6 6 25 29 ...
##  $ SO2    : num  13 6 22 12 14 10 12 25 13 5 ...
##  $ NO2    : num  7 6 13 8 8 17 22 39 42 18 ...
##  $ CO     : int  300 300 400 300 300 400 500 600 700 500 ...
##  $ O3     : num  85 85 74 81 81 71 65 48 46 73 ...
##  $ TEMP   : num  -2.3 -2.5 -3 -3.6 -3.5 -4.5 -4.5 -2.1 -0.2 0.6 ...
##  $ PRES   : num  1021 1021 1021 1022 1022 ...
##  $ DEWP   : num  -19.7 -19 -19.9 -19.1 -19.4 -19.5 -19.5 -20 -20.5 -20.4 ...
##  $ RAIN   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ wd     : chr  "E" "ENE" "ENE" "NNE" ...
##  $ WSPM   : num  0.5 0.7 0.2 1 2.1 1.7 1.8 2.5 2.8 3.8 ...
##  $ station: chr  "Changping" "Changping" "Changping" "Changping" ...

For this assignment, I analyze two air pollutants, SO2 and PM10, over a mid-October day in two different years. I want to check how the approaching winter season impacts the air-pollutant levels within a span of 24 hours.

As the graphs below show, the pollutant levels clearly exhibit a non-stationary mean and variance.

Because the data appear to have non-stationary variance, I apply a Box-Cox transformation to the time series.

The graphs below show the transformed SO2 and PM10 series after the Box-Cox transformation.

The Box-Cox transformation helps to stabilize (normalize) the variance of the time-series data.
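For reference, the transformation step can be sketched as below using the forecast package. The data-frame names SO2_var_transform and PM10_var_transform match those appearing in the later test output, but the particular mid-October day in the filter is an assumption, since the exact date is not shown.

library(dplyr)
library(forecast)

# Assumed 24-hour window: one mid-October day (illustrative; the exact date is not shown)
oct_day <- changping %>% filter(year == 2014, month == 10, day == 15)

# Box-Cox transform each pollutant series to stabilize the variance
SO2_var_transform <- oct_day %>% select(hour, SO2)
SO2_var_transform$SO2_box <- BoxCox(SO2_var_transform$SO2,
                                    lambda = BoxCox.lambda(SO2_var_transform$SO2))

PM10_var_transform <- oct_day %>% select(hour, PM10)
PM10_var_transform$PM10_box <- BoxCox(PM10_var_transform$PM10,
                                      lambda = BoxCox.lambda(PM10_var_transform$PM10))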

# Testing for non-stationarity:

The ADF test fails to reject its unit-root null hypothesis for both series (p-values of 0.772 and 0.2027, well above 0.05), indicating that the mean is non-stationary. The KPSS test, whose null hypothesis is level stationarity, does not reject at the 5% level, so its evidence is weaker; the ADF results are what motivate differencing the series.
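A sketch of this testing step, assuming the tseries package (whose adf.test() and kpss.test() produce output in this format) and the transformed data frames defined above:

library(tseries)

# ADF test: the null hypothesis is a unit root (non-stationary mean)
# KPSS test: the null hypothesis is level stationarity
adf.test(SO2_var_transform$SO2_box)
kpss.test(SO2_var_transform$SO2_box)

adf.test(PM10_var_transform$PM10_box)
kpss.test(PM10_var_transform$PM10_box)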

## 
##  Augmented Dickey-Fuller Test
## 
## data:  SO2_var_transform$SO2_box
## Dickey-Fuller = -1.4761, Lag order = 2, p-value = 0.772
## alternative hypothesis: stationary
## 
##  KPSS Test for Level Stationarity
## 
## data:  SO2_var_transform$SO2_box
## KPSS Level = 0.2925, Truncation lag parameter = 2, p-value = 0.1
## 
##  Augmented Dickey-Fuller Test
## 
## data:  PM10_var_transform$PM10_box
## Dickey-Fuller = -2.9703, Lag order = 2, p-value = 0.2027
## alternative hypothesis: stationary
## 
##  KPSS Test for Level Stationarity
## 
## data:  PM10_var_transform$PM10_box
## KPSS Level = 0.22455, Truncation lag parameter = 2, p-value = 0.1

Applying a first difference for mean stationarity:

As the graphs below indicate, the first-differenced series appear stationary in both mean and variance.
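A sketch of the differencing step, matching the column names SO2_box_diff and PM10_box_diff that appear in the test output below:

# First-difference the Box-Cox transformed series to stabilize the mean;
# the leading NA is dropped before re-running the tests
SO2_mean_transform <- SO2_var_transform
SO2_mean_transform$SO2_box_diff <- c(NA, diff(SO2_mean_transform$SO2_box))

PM10_mean_transform <- PM10_var_transform
PM10_mean_transform$PM10_box_diff <- c(NA, diff(PM10_mean_transform$PM10_box))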

The results are then validated again with ADF and KPSS tests for both pollutants, SO2 and PM10. Note that even after differencing the ADF p-values remain above 0.05, which may reflect the very short 24-observation sample; the KPSS tests, by contrast, do not reject level stationarity.

## 
##  Augmented Dickey-Fuller Test
## 
## data:  SO2_mean_transform$SO2_box_diff[!is.na(SO2_mean_transform$SO2_box_diff)]
## Dickey-Fuller = -2.2843, Lag order = 2, p-value = 0.4641
## alternative hypothesis: stationary
## 
##  KPSS Test for Level Stationarity
## 
## data:  SO2_mean_transform$SO2_box_diff[!is.na(SO2_mean_transform$SO2_box_diff)]
## KPSS Level = 0.20602, Truncation lag parameter = 2, p-value = 0.1
## 
##  Augmented Dickey-Fuller Test
## 
## data:  PM10_mean_transform$PM10_box_diff[!is.na(PM10_mean_transform$PM10_box_diff)]
## Dickey-Fuller = -3.0144, Lag order = 2, p-value = 0.1859
## alternative hypothesis: stationary
## 
##  KPSS Test for Level Stationarity
## 
## data:  PM10_mean_transform$PM10_box_diff[!is.na(PM10_mean_transform$PM10_box_diff)]
## KPSS Level = 0.22715, Truncation lag parameter = 2, p-value = 0.1

The SO2 series appears to have an intraday (diurnal) pattern in the hourly data: values rise as the day progresses and start to drop in the evening.

Producing ACF and PACF plots:

Because the ACF plot for SO2 shows a gradually decaying (damped) pattern rather than a sharp cutoff, the behaviour is consistent with an autoregressive process; together with the PACF, this suggests a first-order autoregressive model, AR(1).

From the ACF and PACF plots below, the PM10 series shows neither a clear autoregressive nor a clear moving-average signature.
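The plots referenced above can be produced with base R; whether the transformed or the differenced series was plotted is not shown, so the choice below is an assumption.

# ACF and PACF of the Box-Cox transformed series (assumed; the differenced
# series could be examined the same way)
acf(SO2_mean_transform$SO2_box, main = "ACF: transformed SO2")
pacf(SO2_mean_transform$SO2_box, main = "PACF: transformed SO2")

acf(PM10_mean_transform$PM10_box, main = "ACF: transformed PM10")
pacf(PM10_mean_transform$PM10_box, main = "PACF: transformed PM10")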

Section 2

In this section, I fit several ARIMA models to the SO2 time series to find the best-fitting model.

Since model_3, arima(SO2_mean_transform$SO2_box, order = c(1, 1, 0)), has both the lowest AIC (100.12) and the lowest BIC (102.3935), it is selected as the best-fitting model.
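A sketch of how these candidate models may have been fitted and compared is shown below. The object names mod1 through mod5 are hypothetical; the training-set error measures in the output are of the kind produced by accuracy() from the forecast package.

library(forecast)   # for accuracy()

# Candidate ARIMA models for the Box-Cox transformed SO2 series
mod1 <- arima(SO2_mean_transform$SO2_box, order = c(0, 1, 1))
mod2 <- arima(SO2_mean_transform$SO2_box, order = c(0, 1, 0))
mod3 <- arima(SO2_mean_transform$SO2_box, order = c(1, 1, 0))
mod4 <- arima(SO2_mean_transform$SO2_box, order = c(1, 0, 0))
mod5 <- arima(SO2_mean_transform$SO2_box, order = c(2, 0, 0))

mod3               # print the selected fit; its AIC appears in the output
accuracy(mod3)     # training-set error measures

# Compare all candidates by BIC (lower is better)
BIC(mod1, mod2, mod3, mod4, mod5)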

## 
## Call:
## arima(x = SO2_mean_transform$SO2_box, order = c(0, 1, 1))
## 
## Coefficients:
##          ma1
##       0.5284
## s.e.  0.1785
## 
## sigma^2 estimated as 3.81:  log likelihood = -48.18,  aic = 100.37
## 
## Training set error measures:
##                       ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -0.01467618 1.910888 1.284452 -4.880027 24.36836 0.8190551
##                   ACF1
## Training set 0.0660325
## 
## Call:
## arima(x = SO2_mean_transform$SO2_box, order = c(0, 1, 0))
## 
## 
## sigma^2 estimated as 5.127:  log likelihood = -51.43,  aic = 104.87
## 
## Training set error measures:
##                        ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 3.953293e-05 2.216632 1.502909 -7.571516 26.41165 0.9583585
##                   ACF1
## Training set 0.5116884
## 
## Call:
## arima(x = SO2_mean_transform$SO2_box, order = c(1, 1, 0))
## 
## Coefficients:
##          ar1
##       0.4940
## s.e.  0.1736
## 
## sigma^2 estimated as 3.778:  log likelihood = -48.06,  aic = 100.12
## 
## Training set error measures:
##                       ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -0.01813481 1.902786 1.224336 -2.208114 21.64098 0.7807216
##                    ACF1
## Training set 0.06524748
## 
## Call:
## arima(x = SO2_mean_transform$SO2_box, order = c(1, 0, 0))
## 
## Coefficients:
##          ar1  intercept
##       0.8896     3.4022
## s.e.  0.0770     3.2147
## 
## sigma^2 estimated as 4.763:  log likelihood = -53.57,  aic = 113.14
## 
## Training set error measures:
##                     ME     RMSE      MAE       MPE     MAPE      MASE   ACF1
## Training set 0.1464655 2.182495 1.540662 -19.65965 38.88726 0.9824323 0.5297
## 
## Call:
## arima(x = SO2_mean_transform$SO2_box, order = c(2, 0, 0))
## 
## Coefficients:
##          ar1      ar2  intercept
##       1.3976  -0.5553     4.3970
## s.e.  0.1579   0.1594     2.1866
## 
## sigma^2 estimated as 3.131:  log likelihood = -48.94,  aic = 105.89
## 
## Training set error measures:
##                      ME     RMSE      MAE       MPE     MAPE     MASE
## Training set 0.06322245 1.769512 1.294584 -23.95343 41.37393 0.825516
##                      ACF1
## Training set -0.006231726
##                                                       df      BIC
## arima(SO2_mean_transform$SO2_box, order = c(0, 1, 1))  2 102.6366
## arima(SO2_mean_transform$SO2_box, order = c(0, 1, 0))  1 106.0010
## arima(SO2_mean_transform$SO2_box, order = c(1, 1, 0))  2 102.3935
## arima(SO2_mean_transform$SO2_box, order = c(1, 0, 0))  3 116.6727
## arima(SO2_mean_transform$SO2_box, order = c(2, 0, 0))  4 110.5999

Here, the residuals appear to be white noise.
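This diagnostic output can be produced with checkresiduals() from the forecast package, assuming the selected fit is stored in mod3:

library(forecast)

# Residual plots and Ljung-Box portmanteau test over 5 lags for the ARIMA(1,1,0) fit
checkresiduals(mod3, lag = 5)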

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,0)
## Q* = 4.1495, df = 4, p-value = 0.3862
## 
## Model df: 1.   Total lags used: 5

Conducting the Box-Ljung test for residual autocorrelation: the p-values at lags 1 and 5 are well above 0.05, so there is no evidence of residual autocorrelation.
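A sketch of the individual Box-Ljung tests, again assuming the fit is stored in mod3:

# Test for autocorrelation in the residuals at lags 1 and 5
Box.test(mod3$residuals, lag = 1, type = "Ljung-Box")
Box.test(mod3$residuals, lag = 5, type = "Ljung-Box")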

## 
##  Box-Ljung test
## 
## data:  mod3$residuals
## X-squared = 0.1155, df = 1, p-value = 0.734
## 
##  Box-Ljung test
## 
## data:  mod3$residuals
## X-squared = 4.1495, df = 5, p-value = 0.5281

Finding the best-fitting model with auto.arima() and comparing its residual diagnostics:
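A sketch of this step, assuming the automatically selected model is stored in mod1_auto (the name that appears in the output):

library(forecast)

# Let auto.arima() search over candidate models for the transformed series
mod1_auto <- auto.arima(SO2_mean_transform$SO2_box)
summary(mod1_auto)

# Residual diagnostics for the automatically selected model
checkresiduals(mod1_auto, lag = 5)
Box.test(mod1_auto$residuals, lag = 1, type = "Ljung-Box")
Box.test(mod1_auto$residuals, lag = 5, type = "Ljung-Box")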

## Series: SO2_mean_transform$SO2_box 
## ARIMA(2,0,0) with zero mean 
## 
## Coefficients:
##          ar1      ar2
##       1.4236  -0.5118
## s.e.  0.1623   0.1616
## 
## sigma^2 = 3.664:  log likelihood = -49.99
## AIC=105.97   AICc=107.17   BIC=109.51
## 
## Training set error measures:
##                     ME     RMSE      MAE      MPE     MAPE      MASE
## Training set 0.4328338 1.832726 1.221029 8.145442 25.08879 0.7786125
##                     ACF1
## Training set 0.004153311

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,0,0) with zero mean
## Q* = 4.0299, df = 3, p-value = 0.2583
## 
## Model df: 2.   Total lags used: 5
## 
##  Box-Ljung test
## 
## data:  mod1_auto$residuals
## X-squared = 0.000468, df = 1, p-value = 0.9827
## 
##  Box-Ljung test
## 
## data:  mod1_auto$residuals
## X-squared = 4.0299, df = 5, p-value = 0.5451

In-sample root mean squared error (RMSE) of the model, computed from the residuals:

The RMSE is 3904.509.

## [1] 3904.509
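For reference, one generic way to compute an in-sample RMSE from model residuals is sketched below; how the value of 3904.509 above was obtained (in particular, on which scale the residuals were evaluated) is not shown, so this is illustrative only.

# In-sample RMSE from the residuals of the selected model (illustrative;
# the scale of the residuals determines the magnitude of the result)
sqrt(mean(mod3$residuals^2, na.rm = TRUE))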

Predicting SO2 levels for the next 5 hours:

We can observe that the forecast behaves as expected: as the late evening hours approach, SO2 levels drop.
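A sketch of the forecasting step, assuming the ARIMA(1,1,0) fit is stored in mod3; predict() returns the point forecasts and their standard errors on the Box-Cox scale, as shown below.

# Forecast the next 5 hours from the ARIMA(1,1,0) fit
SO2_pred <- predict(mod3, n.ahead = 5)
str(SO2_pred)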

## List of 2
##  $ pred: Time-Series [1:5] from 25 to 29: 0.4132 0.1027 -0.0653 -0.1455 -0.1738
##  $ se  : Time-Series [1:5] from 25 to 29: 1.91 3.33 4.42 5.19 5.72