Time Series Dataset : Beijing Air Quality

I have chosen Beijing Air Quality time-series dataset for data analysis. Data can be found on https://archive.ics.uci.edu/ml/datasets/Beijing+Multi-Site+Air-Quality+Data# .

This hourly data set considers 6 main air pollutants and 6 relevant meteorological variables at multiple sites in Beijing.

This data set includes hourly air pollutants data from 12 nationally-controlled air-quality monitoring sites. The air-quality data are from the Beijing Municipal Environmental Monitoring Center. The meteorological data in each air-quality site are matched with the nearest weather station from the China Meteorological Administration. The time period is from March 1st, 2013 to February 28th, 2017.

For this particular assignment I have taken ‘Changping’ monitoring site under study.

Attribute information is as below:

No: row number | year: year of data in this row | month: month of data in this row | day: day of data in this row | hour: hour of data in this row | PM2.5: PM2.5 concentration (ug/m^3) | PM10: PM10 concentration (ug/m^3) | SO2: SO2 concentration (ug/m^3) | NO2: NO2 concentration (ug/m^3) | CO: CO concentration (ug/m^3) | O3: O3 concentration (ug/m^3) | TEMP: temperature (degree Celsius) | PRES: pressure (hPa) | DEWP: dew point temperature (degree Celsius) | RAIN: precipitation (mm) | wd: wind direction | WSPM: wind speed (m/s) | station: name of the air-quality monitoring site

Part 1: Exploratory Data Analysis

Section 1: Reading the time-series dataset

changping<-read.csv("C:/Users/Hetal Sawant/Desktop/Spring sem/flex 1/Forecasting/HW/PRSA_Data_Changping_20130301-20170228.csv")
attach(changping)
#str(changping)
head(changping)
##   No year month day hour PM2.5 PM10 SO2 NO2  CO O3 TEMP   PRES  DEWP RAIN  wd
## 1  1 2013     3   1    0     3    6  13   7 300 85 -2.3 1020.8 -19.7    0   E
## 2  2 2013     3   1    1     3    3   6   6 300 85 -2.5 1021.3 -19.0    0 ENE
## 3  3 2013     3   1    2     3    3  22  13 400 74 -3.0 1021.3 -19.9    0 ENE
## 4  4 2013     3   1    3     3    6  12   8 300 81 -3.6 1021.8 -19.1    0 NNE
## 5  5 2013     3   1    4     3    3  14   8 300 81 -3.5 1022.3 -19.4    0   N
## 6  6 2013     3   1    5     3    3  10  17 400 71 -4.5 1022.6 -19.5    0 NNW
##   WSPM   station
## 1  0.5 Changping
## 2  0.7 Changping
## 3  0.2 Changping
## 4  1.0 Changping
## 5  2.1 Changping
## 6  1.7 Changping

Handling Missing values and nulls:

Missing data are denoted as NA in the dataset.In this section, I have omitted the rows which has NA value in them

## [1] 5166
##         num_missing
## No                0
## year              0
## month             0
## day               0
## hour              0
## PM2.5           774
## PM10            582
## SO2             628
## NO2             667
## CO             1521
## O3              604
## TEMP             53
## PRES             50
## DEWP             53
## RAIN             51
## wd              140
## WSPM             43
## station           0
##        No             year          month             day       
##  Min.   :    1   Min.   :2013   Min.   : 1.000   Min.   : 1.00  
##  1st Qu.: 9384   1st Qu.:2014   1st Qu.: 3.000   1st Qu.: 8.00  
##  Median :17910   Median :2015   Median : 7.000   Median :16.00  
##  Mean   :17877   Mean   :2015   Mean   : 6.507   Mean   :15.72  
##  3rd Qu.:26546   3rd Qu.:2016   3rd Qu.:10.000   3rd Qu.:23.00  
##  Max.   :35064   Max.   :2017   Max.   :12.000   Max.   :31.00  
##       hour           PM2.5             PM10             SO2        
##  Min.   : 0.00   Min.   :  3.00   Min.   :  2.00   Min.   :  1.00  
##  1st Qu.: 6.00   1st Qu.: 18.00   1st Qu.: 33.00   1st Qu.:  2.00  
##  Median :11.00   Median : 46.00   Median : 72.00   Median :  7.00  
##  Mean   :11.51   Mean   : 70.31   Mean   : 94.09   Mean   : 15.06  
##  3rd Qu.:18.00   3rd Qu.: 99.00   3rd Qu.:130.00   3rd Qu.: 18.00  
##  Max.   :23.00   Max.   :662.00   Max.   :992.00   Max.   :310.00  
##       NO2               CO              O3                TEMP      
##  Min.   :  2.00   Min.   :  100   Min.   :  0.2142   Min.   :-16.6  
##  1st Qu.: 22.00   1st Qu.:  500   1st Qu.: 15.0000   1st Qu.:  3.1  
##  Median : 36.00   Median :  800   Median : 46.0000   Median : 14.1  
##  Mean   : 44.32   Mean   : 1152   Mean   : 57.4245   Mean   : 13.4  
##  3rd Qu.: 61.00   3rd Qu.: 1400   3rd Qu.: 79.0000   3rd Qu.: 23.1  
##  Max.   :208.00   Max.   :10000   Max.   :429.0000   Max.   : 41.4  
##       PRES             DEWP              RAIN               wd           
##  Min.   : 982.4   Min.   :-35.100   Min.   : 0.00000   Length:32681      
##  1st Qu.: 999.5   1st Qu.:-10.600   1st Qu.: 0.00000   Class :character  
##  Median :1007.7   Median :  1.100   Median : 0.00000   Mode  :character  
##  Mean   :1008.0   Mean   :  1.135   Mean   : 0.06074                     
##  3rd Qu.:1016.3   3rd Qu.: 13.900   3rd Qu.: 0.00000                     
##  Max.   :1036.5   Max.   : 27.200   Max.   :52.10000                     
##       WSPM          station         
##  Min.   : 0.000   Length:32681      
##  1st Qu.: 1.000   Class :character  
##  Median : 1.500   Mode  :character  
##  Mean   : 1.866                     
##  3rd Qu.: 2.300                     
##  Max.   :10.000
## 'data.frame':    32681 obs. of  18 variables:
##  $ No     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ year   : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month  : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ day    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ hour   : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ PM2.5  : num  3 3 3 3 3 3 4 3 9 11 ...
##  $ PM10   : num  6 3 3 6 3 3 6 6 25 29 ...
##  $ SO2    : num  13 6 22 12 14 10 12 25 13 5 ...
##  $ NO2    : num  7 6 13 8 8 17 22 39 42 18 ...
##  $ CO     : int  300 300 400 300 300 400 500 600 700 500 ...
##  $ O3     : num  85 85 74 81 81 71 65 48 46 73 ...
##  $ TEMP   : num  -2.3 -2.5 -3 -3.6 -3.5 -4.5 -4.5 -2.1 -0.2 0.6 ...
##  $ PRES   : num  1021 1021 1021 1022 1022 ...
##  $ DEWP   : num  -19.7 -19 -19.9 -19.1 -19.4 -19.5 -19.5 -20 -20.5 -20.4 ...
##  $ RAIN   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ wd     : chr  "E" "ENE" "ENE" "NNE" ...
##  $ WSPM   : num  0.5 0.7 0.2 1 2.1 1.7 1.8 2.5 2.8 3.8 ...
##  $ station: chr  "Changping" "Changping" "Changping" "Changping" ...
##  - attr(*, "na.action")= 'omit' Named int [1:2383] 27 28 29 123 124 125 179 220 316 412 ...
##   ..- attr(*, "names")= chr [1:2383] "27" "28" "29" "123" ...
## The following objects are masked from changping:
## 
##     CO, day, DEWP, hour, month, No, NO2, O3, PM10, PM2.5, PRES, RAIN,
##     SO2, station, TEMP, wd, WSPM, year

Section 2: Visualizing the time-series dataset

#NO2

As shown in below histogram, concentration of NO2, PM2.5 and PM10 pollutants is is right skewed over the period of 5 years.

SO2

From the SO2 concentration graph shown below, we can see that there is an outlier (=265) present in the month of March

Standard deviation (Dispersion in the SO2 concentration values) is 21.05757.

## [1] 21.05757

CO

From below graphs, we can see that there are outlier values present for CO concentration in the month of February and October.

Dispersion value is 1105.647 for CO concentration.

## [1] 1105.647

O3

Outliers are not observed in Ozone layer.

Standard deviation for Ozone is 53.7926.

## [1] 53.7926

SO2 concentration over the 5 years

Below is the presentation of aggregate SO2 concentration in the air observed at 5 am and 5 pm respectively over the 5 years. Over the years, concentration has decreased. Also we can observe that SO2 is more concentrated at 5 pm than 5 am due to various factors like vehicles and amount emitted from different factories etc.

NO2 concentration over the 5 years

Below is the presentation of aggregate NO2 concentration in the air observed at 3 am and 3 pm respectively over the 5 years. We can observe that NO2 is more concentrated at 3 am than 3 pm.

Section 3:Regression model fit

## 
## Call:
## lm(formula = SO2 ~ year, data = clean_changping)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.716 -11.901  -6.994   3.284 296.099 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7887.06791  196.56323   40.12   <2e-16 ***
## year          -3.90728    0.09756  -40.05   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.56 on 32679 degrees of freedom
## Multiple R-squared:  0.04678,    Adjusted R-squared:  0.04675 
## F-statistic:  1604 on 1 and 32679 DF,  p-value: < 2.2e-16

Here, RSE is 20.56 which signifies the estimate of standard deviation of ε (error term). Also p-value is < 0.0001 which signifies that there is an association between the predictor and response i.e. As the times is passed SO2 concentration levels have increased.

Part 2: ARIMA modeling

changping<-read.csv("C:/Users/Hetal Sawant/Desktop/Spring sem/flex 1/Forecasting/HW/PRSA_Data_Changping_20130301-20170228.csv")
attach(changping)
## The following objects are masked from clean_changping:
## 
##     CO, day, DEWP, hour, month, No, NO2, O3, PM10, PM2.5, PRES, RAIN,
##     SO2, station, TEMP, wd, WSPM, year
## The following objects are masked from changping (pos = 5):
## 
##     CO, day, DEWP, hour, month, No, NO2, O3, PM10, PM2.5, PRES, RAIN,
##     SO2, station, TEMP, wd, WSPM, year
#str(changping)
head(changping)
##   No year month day hour PM2.5 PM10 SO2 NO2  CO O3 TEMP   PRES  DEWP RAIN  wd
## 1  1 2013     3   1    0     3    6  13   7 300 85 -2.3 1020.8 -19.7    0   E
## 2  2 2013     3   1    1     3    3   6   6 300 85 -2.5 1021.3 -19.0    0 ENE
## 3  3 2013     3   1    2     3    3  22  13 400 74 -3.0 1021.3 -19.9    0 ENE
## 4  4 2013     3   1    3     3    6  12   8 300 81 -3.6 1021.8 -19.1    0 NNE
## 5  5 2013     3   1    4     3    3  14   8 300 81 -3.5 1022.3 -19.4    0   N
## 6  6 2013     3   1    5     3    3  10  17 400 71 -4.5 1022.6 -19.5    0 NNW
##   WSPM   station
## 1  0.5 Changping
## 2  0.7 Changping
## 3  0.2 Changping
## 4  1.0 Changping
## 5  2.1 Changping
## 6  1.7 Changping
##   No year month day hour PM2.5 PM10 SO2 NO2  CO O3 TEMP   PRES  DEWP RAIN  wd
## 1  1 2013     3   1    0     3    6  13   7 300 85 -2.3 1020.8 -19.7    0   E
## 2  2 2013     3   1    1     3    3   6   6 300 85 -2.5 1021.3 -19.0    0 ENE
## 3  3 2013     3   1    2     3    3  22  13 400 74 -3.0 1021.3 -19.9    0 ENE
## 4  4 2013     3   1    3     3    6  12   8 300 81 -3.6 1021.8 -19.1    0 NNE
## 5  5 2013     3   1    4     3    3  14   8 300 81 -3.5 1022.3 -19.4    0   N
## 6  6 2013     3   1    5     3    3  10  17 400 71 -4.5 1022.6 -19.5    0 NNW
##   WSPM   station         y1
## 1  0.5 Changping 2013-03-01
## 2  0.7 Changping 2013-03-01
## 3  0.2 Changping 2013-03-01
## 4  1.0 Changping 2013-03-01
## 5  2.1 Changping 2013-03-01
## 6  1.7 Changping 2013-03-01

For this part, I have chosen to analyze two air pollutants- SO2 in mid-December before 2016. I want to check how winter season impacts the air pollutant levels over the span of 4 years.

As the graphs display, Non-stationary mean and variance is evidently observed in the pollutant levels.

As the data appears to have non-stationary variance, I am using BoxCox transformation to transform the time-series data.

Below graphs represent the transformed time-series for SO2 pollutant after the application of BoxCox transformation.

BoxCox transformation helps best to normalize the variance in the time-series data

Testing for non-stationarity:

As the both ADF test and KPSS test indicate mean is non-stationary, because, p-values are greater than 0.05

## 
##  Augmented Dickey-Fuller Test
## 
## data:  SO2_var_transform$SO2_box
## Dickey-Fuller = -1.7638, Lag order = 4, p-value = 0.6715
## alternative hypothesis: stationary
## 
##  KPSS Test for Level Stationarity
## 
## data:  SO2_var_transform$SO2_box
## KPSS Level = 1.0289, Truncation lag parameter = 3, p-value = 0.01

Applying First difference for mean stationarity:

As below graphs indicate, results of First Difference seem mean and variance stationary.

And validating the results again with ADF and KPSS testing for SO2 pollutant…

## 
##  Augmented Dickey-Fuller Test
## 
## data:  SO2_mean_transform$SO2_box_diff[!is.na(SO2_mean_transform$SO2_box_diff)]
## Dickey-Fuller = -3.4458, Lag order = 4, p-value = 0.05548
## alternative hypothesis: stationary
## 
##  KPSS Test for Level Stationarity
## 
## data:  SO2_mean_transform$SO2_box_diff[!is.na(SO2_mean_transform$SO2_box_diff)]
## KPSS Level = 0.021481, Truncation lag parameter = 3, p-value = 0.1

SO2 levels time-series is likely to have ‘hourly’ seasonality. Values are higher as the day goes and they start to drop in the evening. (might edit)

Producing ACF and PACF plots:

As the ACF plot for SO2 shows ‘dampening effect’, we can conclude that it is a 2nd order Auto-Regressive process (AR2).

Section 2 :

In this section, I have fitted different ARIMA models to SO2 time-series to find out the best fit model.

As model_5 (arima(SO2_mean_transform$SO2_box,order=c(2,0,0))) has the lowest AIC as 124.56 and lowest BIC as 133.66, model 3 is the best fit model.

## 
## Call:
## arima(x = SO2_mean_transform$SO2_box, order = c(0, 1, 1))
## 
## Coefficients:
##           ma1
##       -0.9198
## s.e.   0.0402
## 
## sigma^2 estimated as 0.3409:  log likelihood = -63.48,  aic = 130.96
## 
## Training set error measures:
##                       ME      RMSE       MAE       MPE     MAPE      MASE
## Training set -0.09266072 0.5798083 0.5132462 -25.89082 45.75158 0.6596577
##                    ACF1
## Training set -0.4002606
## 
## Call:
## arima(x = SO2_mean_transform$SO2_box, order = c(0, 1, 0))
## 
## 
## sigma^2 estimated as 0.858:  log likelihood = -95.31,  aic = 192.61
## 
## Training set error measures:
##                       ME      RMSE       MAE       MPE     MAPE      MASE
## Training set -0.01702582 0.9198182 0.7672727 -24.92745 60.96428 0.9861494
##                    ACF1
## Training set -0.5113093
## 
## Call:
## arima(x = SO2_mean_transform$SO2_box, order = c(1, 1, 0))
## 
## Coefficients:
##           ar1
##       -0.5038
## s.e.   0.1009
## 
## sigma^2 estimated as 0.6342:  log likelihood = -84.72,  aic = 173.45
## 
## Training set error measures:
##                       ME      RMSE       MAE       MPE     MAPE      MASE
## Training set -0.02566505 0.7907994 0.7114285 -25.48573 57.98106 0.9143747
##                    ACF1
## Training set -0.4878925
## 
## Call:
## arima(x = SO2_mean_transform$SO2_box, order = c(1, 0, 0))
## 
## Coefficients:
##           ar1  intercept
##       -0.2379     1.5194
## s.e.   0.1153     0.0546
## 
## sigma^2 estimated as 0.3274:  log likelihood = -62,  aic = 130
## 
## Training set error measures:
##                       ME      RMSE       MAE       MPE     MAPE      MASE
## Training set 0.001811511 0.5722204 0.5252722 -19.74818 44.82579 0.6751144
##                     ACF1
## Training set -0.07149676
## 
## Call:
## arima(x = SO2_mean_transform$SO2_box, order = c(2, 0, 0))
## 
## Coefficients:
##           ar1      ar2  intercept
##       -0.3269  -0.3220     1.5192
## s.e.   0.1142   0.1146     0.0391
## 
## sigma^2 estimated as 0.2944:  log likelihood = -58.28,  aic = 124.56
## 
## Training set error measures:
##                       ME      RMSE       MAE      MPE     MAPE     MASE
## Training set 0.006518512 0.5425492 0.4532191 -17.9904 39.63995 0.582507
##                   ACF1
## Training set 0.3027177
##                                                       df      BIC
## arima(SO2_mean_transform$SO2_box, order = c(0, 1, 1))  2 135.4809
## arima(SO2_mean_transform$SO2_box, order = c(0, 1, 0))  1 194.8767
## arima(SO2_mean_transform$SO2_box, order = c(1, 1, 0))  2 177.9714
## arima(SO2_mean_transform$SO2_box, order = c(1, 0, 0))  3 136.8302
## arima(SO2_mean_transform$SO2_box, order = c(2, 0, 0))  4 133.6633

Here, residuals appear to be a white noise.

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,0,0) with non-zero mean
## Q* = 174.01, df = 7, p-value < 2.2e-16
## 
## Model df: 3.   Total lags used: 10

Conducting Box-Ljung test for residual autocorrelation: There is no residual auto-correlation at the lag.

## 
##  Box-Ljung test
## 
## data:  mod5$residuals
## X-squared = 6.8767, df = 1, p-value = 0.008733
## 
##  Box-Ljung test
## 
## data:  mod5$residuals
## X-squared = 83.257, df = 5, p-value = 2.22e-16

Finding best fit model using auto.arima() and comparing for the residuals:

## Series: SO2_mean_transform$SO2_box 
## ARIMA(2,1,0) 
## 
## Coefficients:
##           ar1      ar2
##       -1.0098  -0.9781
## s.e.   0.0227   0.0166
## 
## sigma^2 = 0.02769:  log likelihood = 24.3
## AIC=-42.61   AICc=-42.25   BIC=-35.82
## 
## Training set error measures:
##                       ME      RMSE       MAE       MPE     MAPE      MASE
## Training set -0.02407056 0.1629002 0.1156149 -3.657083 10.92749 0.1485959
##                   ACF1
## Training set 0.2034958

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,1,0)
## Q* = 8.5245, df = 8, p-value = 0.384
## 
## Model df: 2.   Total lags used: 10
## 
##  Box-Ljung test
## 
## data:  mod1_auto$residuals
## X-squared = 3.1075, df = 1, p-value = 0.07793
## 
##  Box-Ljung test
## 
## data:  mod1_auto$residuals
## X-squared = 5.9029, df = 5, p-value = 0.3158

In-sample root mean squared error of the model based on the residuals

RMSE is 30.38

## [1] 30.38052

Predicting SO2 levels for next 5 hours :

We can observe that the prediction is correct and as the late evening hours, SO2 levels drop.

## List of 2
##  $ pred: Time-Series [1:5] from 73 to 77: 2.048 0.903 0.951 2.022 0.894
##  $ se  : Time-Series [1:5] from 73 to 77: 0.166 0.166 0.166 0.233 0.233
## Joining, by = "hour"

Part 3: Facebook Prophet model

Assesing the Changping SO2 levels over the 5 years of period using Facebook Prophet model:

Estimating Changepoints

We can observe the change-points at the beginning (Jan- Feb) and end (Nov - Dec) of each year.

Assesing saturation points:

It is expected to have positive values for SO2 levels. Maximum SO2 levels observed are 340 and minimum is 10. We can see the same reflecting in below graphs. We also observed predicted values slightly towards negative number.

##Seasonality : With Additive and Multiplicative

Cross-validation to assess performance of the model

##    y         ds     yhat yhat_lower yhat_upper     cutoff
## 1 27 2015-03-25 4.291868  -20.52526   30.75234 2015-03-24
## 2 27 2015-03-25 4.291868  -20.10028   29.56012 2015-03-24
## 3 27 2015-03-25 4.291868  -21.47427   31.25539 2015-03-24
## 4 27 2015-03-25 4.291868  -20.38720   30.47535 2015-03-24
## 5 27 2015-03-25 4.291868  -21.75215   28.87214 2015-03-24
## 6 27 2015-03-25 4.291868  -21.45851   30.36092 2015-03-24
## [1] "2015-03-24 GMT" "2015-04-23 GMT" "2015-05-23 GMT" "2015-06-22 GMT"
## [5] "2015-07-22 GMT" "2015-08-21 GMT" "2015-09-20 GMT" "2015-10-20 GMT"

## <ggproto object: Class ScaleDiscrete, Scale, gg>
##     aesthetics: colour
##     axis_order: function
##     break_info: function
##     break_positions: function
##     breaks: waiver
##     call: call
##     clone: function
##     dimension: function
##     drop: TRUE
##     expand: waiver
##     get_breaks: function
##     get_breaks_minor: function
##     get_labels: function
##     get_limits: function
##     guide: legend
##     is_discrete: function
##     is_empty: function
##     labels: waiver
##     limits: NULL
##     make_sec_title: function
##     make_title: function
##     map: function
##     map_df: function
##     n.breaks.cache: NULL
##     na.translate: TRUE
##     na.value: grey50
##     name: Cutoff
##     palette: function
##     palette.cache: NULL
##     position: left
##     range: <ggproto object: Class RangeDiscrete, Range, gg>
##         range: NULL
##         reset: function
##         train: function
##         super:  <ggproto object: Class RangeDiscrete, Range, gg>
##     rescale: function
##     reset: function
##     scale_name: hue
##     train: function
##     train_df: function
##     transform: function
##     transform_df: function
##     super:  <ggproto object: Class ScaleDiscrete, Scale, gg>

Part 4: Model Comparison and Validation

From the above information, we can say that Facebook Prophet model best indicates the seasonality in the dataset. It also performs better while predicting SO2 levels for out-of-sample data. Hence, Facebook Prophet model will give better results while predicting SO2 air pollutant levels over the years.