I have chosen Beijing Air Quality time-series dataset for data analysis. Data can be found on https://archive.ics.uci.edu/ml/datasets/Beijing+Multi-Site+Air-Quality+Data# .
This hourly data set considers 6 main air pollutants and 6 relevant meteorological variables at multiple sites in Beijing.
This data set includes hourly air pollutants data from 12 nationally-controlled air-quality monitoring sites. The air-quality data are from the Beijing Municipal Environmental Monitoring Center. The meteorological data in each air-quality site are matched with the nearest weather station from the China Meteorological Administration. The time period is from March 1st, 2013 to February 28th, 2017.
For this particular assignment I have taken ‘Changping’ monitoring site under study.
Attribute information is as below:
No: row number | year: year of data in this row | month: month of data in this row | day: day of data in this row | hour: hour of data in this row | PM2.5: PM2.5 concentration (ug/m^3) | PM10: PM10 concentration (ug/m^3) | SO2: SO2 concentration (ug/m^3) | NO2: NO2 concentration (ug/m^3) | CO: CO concentration (ug/m^3) | O3: O3 concentration (ug/m^3) | TEMP: temperature (degree Celsius) | PRES: pressure (hPa) | DEWP: dew point temperature (degree Celsius) | RAIN: precipitation (mm) | wd: wind direction | WSPM: wind speed (m/s) | station: name of the air-quality monitoring site
changping<-read.csv("C:/Users/Hetal Sawant/Desktop/Spring sem/flex 1/Forecasting/HW/PRSA_Data_Changping_20130301-20170228.csv")
attach(changping)
#str(changping)
head(changping)
## No year month day hour PM2.5 PM10 SO2 NO2 CO O3 TEMP PRES DEWP RAIN wd
## 1 1 2013 3 1 0 3 6 13 7 300 85 -2.3 1020.8 -19.7 0 E
## 2 2 2013 3 1 1 3 3 6 6 300 85 -2.5 1021.3 -19.0 0 ENE
## 3 3 2013 3 1 2 3 3 22 13 400 74 -3.0 1021.3 -19.9 0 ENE
## 4 4 2013 3 1 3 3 6 12 8 300 81 -3.6 1021.8 -19.1 0 NNE
## 5 5 2013 3 1 4 3 3 14 8 300 81 -3.5 1022.3 -19.4 0 N
## 6 6 2013 3 1 5 3 3 10 17 400 71 -4.5 1022.6 -19.5 0 NNW
## WSPM station
## 1 0.5 Changping
## 2 0.7 Changping
## 3 0.2 Changping
## 4 1.0 Changping
## 5 2.1 Changping
## 6 1.7 Changping
Missing data are denoted as NA in the dataset.In this section, I have omitted the rows which has NA value in them
## [1] 5166
## num_missing
## No 0
## year 0
## month 0
## day 0
## hour 0
## PM2.5 774
## PM10 582
## SO2 628
## NO2 667
## CO 1521
## O3 604
## TEMP 53
## PRES 50
## DEWP 53
## RAIN 51
## wd 140
## WSPM 43
## station 0
## No year month day
## Min. : 1 Min. :2013 Min. : 1.000 Min. : 1.00
## 1st Qu.: 9384 1st Qu.:2014 1st Qu.: 3.000 1st Qu.: 8.00
## Median :17910 Median :2015 Median : 7.000 Median :16.00
## Mean :17877 Mean :2015 Mean : 6.507 Mean :15.72
## 3rd Qu.:26546 3rd Qu.:2016 3rd Qu.:10.000 3rd Qu.:23.00
## Max. :35064 Max. :2017 Max. :12.000 Max. :31.00
## hour PM2.5 PM10 SO2
## Min. : 0.00 Min. : 3.00 Min. : 2.00 Min. : 1.00
## 1st Qu.: 6.00 1st Qu.: 18.00 1st Qu.: 33.00 1st Qu.: 2.00
## Median :11.00 Median : 46.00 Median : 72.00 Median : 7.00
## Mean :11.51 Mean : 70.31 Mean : 94.09 Mean : 15.06
## 3rd Qu.:18.00 3rd Qu.: 99.00 3rd Qu.:130.00 3rd Qu.: 18.00
## Max. :23.00 Max. :662.00 Max. :992.00 Max. :310.00
## NO2 CO O3 TEMP
## Min. : 2.00 Min. : 100 Min. : 0.2142 Min. :-16.6
## 1st Qu.: 22.00 1st Qu.: 500 1st Qu.: 15.0000 1st Qu.: 3.1
## Median : 36.00 Median : 800 Median : 46.0000 Median : 14.1
## Mean : 44.32 Mean : 1152 Mean : 57.4245 Mean : 13.4
## 3rd Qu.: 61.00 3rd Qu.: 1400 3rd Qu.: 79.0000 3rd Qu.: 23.1
## Max. :208.00 Max. :10000 Max. :429.0000 Max. : 41.4
## PRES DEWP RAIN wd
## Min. : 982.4 Min. :-35.100 Min. : 0.00000 Length:32681
## 1st Qu.: 999.5 1st Qu.:-10.600 1st Qu.: 0.00000 Class :character
## Median :1007.7 Median : 1.100 Median : 0.00000 Mode :character
## Mean :1008.0 Mean : 1.135 Mean : 0.06074
## 3rd Qu.:1016.3 3rd Qu.: 13.900 3rd Qu.: 0.00000
## Max. :1036.5 Max. : 27.200 Max. :52.10000
## WSPM station
## Min. : 0.000 Length:32681
## 1st Qu.: 1.000 Class :character
## Median : 1.500 Mode :character
## Mean : 1.866
## 3rd Qu.: 2.300
## Max. :10.000
## 'data.frame': 32681 obs. of 18 variables:
## $ No : int 1 2 3 4 5 6 7 8 9 10 ...
## $ year : int 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ month : int 3 3 3 3 3 3 3 3 3 3 ...
## $ day : int 1 1 1 1 1 1 1 1 1 1 ...
## $ hour : int 0 1 2 3 4 5 6 7 8 9 ...
## $ PM2.5 : num 3 3 3 3 3 3 4 3 9 11 ...
## $ PM10 : num 6 3 3 6 3 3 6 6 25 29 ...
## $ SO2 : num 13 6 22 12 14 10 12 25 13 5 ...
## $ NO2 : num 7 6 13 8 8 17 22 39 42 18 ...
## $ CO : int 300 300 400 300 300 400 500 600 700 500 ...
## $ O3 : num 85 85 74 81 81 71 65 48 46 73 ...
## $ TEMP : num -2.3 -2.5 -3 -3.6 -3.5 -4.5 -4.5 -2.1 -0.2 0.6 ...
## $ PRES : num 1021 1021 1021 1022 1022 ...
## $ DEWP : num -19.7 -19 -19.9 -19.1 -19.4 -19.5 -19.5 -20 -20.5 -20.4 ...
## $ RAIN : num 0 0 0 0 0 0 0 0 0 0 ...
## $ wd : chr "E" "ENE" "ENE" "NNE" ...
## $ WSPM : num 0.5 0.7 0.2 1 2.1 1.7 1.8 2.5 2.8 3.8 ...
## $ station: chr "Changping" "Changping" "Changping" "Changping" ...
## - attr(*, "na.action")= 'omit' Named int [1:2383] 27 28 29 123 124 125 179 220 316 412 ...
## ..- attr(*, "names")= chr [1:2383] "27" "28" "29" "123" ...
## The following objects are masked from changping:
##
## CO, day, DEWP, hour, month, No, NO2, O3, PM10, PM2.5, PRES, RAIN,
## SO2, station, TEMP, wd, WSPM, year
#NO2
As shown in below histogram, concentration of NO2, PM2.5 and PM10 pollutants is is right skewed over the period of 5 years.
From the SO2 concentration graph shown below, we can see that there is an outlier (=265) present in the month of March
Standard deviation (Dispersion in the SO2 concentration values) is 21.05757.
## [1] 21.05757
From below graphs, we can see that there are outlier values present for CO concentration in the month of February and October.
Dispersion value is 1105.647 for CO concentration.
## [1] 1105.647
Outliers are not observed in Ozone layer.
Standard deviation for Ozone is 53.7926.
## [1] 53.7926
Below is the presentation of aggregate SO2 concentration in the air observed at 5 am and 5 pm respectively over the 5 years. Over the years, concentration has decreased. Also we can observe that SO2 is more concentrated at 5 pm than 5 am due to various factors like vehicles and amount emitted from different factories etc.
Below is the presentation of aggregate NO2 concentration in the air observed at 3 am and 3 pm respectively over the 5 years. We can observe that NO2 is more concentrated at 3 am than 3 pm.
##
## Call:
## lm(formula = SO2 ~ year, data = clean_changping)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.716 -11.901 -6.994 3.284 296.099
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7887.06791 196.56323 40.12 <2e-16 ***
## year -3.90728 0.09756 -40.05 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.56 on 32679 degrees of freedom
## Multiple R-squared: 0.04678, Adjusted R-squared: 0.04675
## F-statistic: 1604 on 1 and 32679 DF, p-value: < 2.2e-16
Here, RSE is 20.56 which signifies the estimate of standard deviation of ε (error term). Also p-value is < 0.0001 which signifies that there is an association between the predictor and response i.e. As the times is passed SO2 concentration levels have increased.
changping<-read.csv("C:/Users/Hetal Sawant/Desktop/Spring sem/flex 1/Forecasting/HW/PRSA_Data_Changping_20130301-20170228.csv")
attach(changping)
## The following objects are masked from clean_changping:
##
## CO, day, DEWP, hour, month, No, NO2, O3, PM10, PM2.5, PRES, RAIN,
## SO2, station, TEMP, wd, WSPM, year
## The following objects are masked from changping (pos = 5):
##
## CO, day, DEWP, hour, month, No, NO2, O3, PM10, PM2.5, PRES, RAIN,
## SO2, station, TEMP, wd, WSPM, year
#str(changping)
head(changping)
## No year month day hour PM2.5 PM10 SO2 NO2 CO O3 TEMP PRES DEWP RAIN wd
## 1 1 2013 3 1 0 3 6 13 7 300 85 -2.3 1020.8 -19.7 0 E
## 2 2 2013 3 1 1 3 3 6 6 300 85 -2.5 1021.3 -19.0 0 ENE
## 3 3 2013 3 1 2 3 3 22 13 400 74 -3.0 1021.3 -19.9 0 ENE
## 4 4 2013 3 1 3 3 6 12 8 300 81 -3.6 1021.8 -19.1 0 NNE
## 5 5 2013 3 1 4 3 3 14 8 300 81 -3.5 1022.3 -19.4 0 N
## 6 6 2013 3 1 5 3 3 10 17 400 71 -4.5 1022.6 -19.5 0 NNW
## WSPM station
## 1 0.5 Changping
## 2 0.7 Changping
## 3 0.2 Changping
## 4 1.0 Changping
## 5 2.1 Changping
## 6 1.7 Changping
## No year month day hour PM2.5 PM10 SO2 NO2 CO O3 TEMP PRES DEWP RAIN wd
## 1 1 2013 3 1 0 3 6 13 7 300 85 -2.3 1020.8 -19.7 0 E
## 2 2 2013 3 1 1 3 3 6 6 300 85 -2.5 1021.3 -19.0 0 ENE
## 3 3 2013 3 1 2 3 3 22 13 400 74 -3.0 1021.3 -19.9 0 ENE
## 4 4 2013 3 1 3 3 6 12 8 300 81 -3.6 1021.8 -19.1 0 NNE
## 5 5 2013 3 1 4 3 3 14 8 300 81 -3.5 1022.3 -19.4 0 N
## 6 6 2013 3 1 5 3 3 10 17 400 71 -4.5 1022.6 -19.5 0 NNW
## WSPM station y1
## 1 0.5 Changping 2013-03-01
## 2 0.7 Changping 2013-03-01
## 3 0.2 Changping 2013-03-01
## 4 1.0 Changping 2013-03-01
## 5 2.1 Changping 2013-03-01
## 6 1.7 Changping 2013-03-01
For this part, I have chosen to analyze two air pollutants- SO2 in mid-December before 2016. I want to check how winter season impacts the air pollutant levels over the span of 4 years.
As the graphs display, Non-stationary mean and variance is evidently observed in the pollutant levels.
As the data appears to have non-stationary variance, I am using BoxCox transformation to transform the time-series data.
Below graphs represent the transformed time-series for SO2 pollutant after the application of BoxCox transformation.
BoxCox transformation helps best to normalize the variance in the time-series data
As the both ADF test and KPSS test indicate mean is non-stationary, because, p-values are greater than 0.05
##
## Augmented Dickey-Fuller Test
##
## data: SO2_var_transform$SO2_box
## Dickey-Fuller = -1.7638, Lag order = 4, p-value = 0.6715
## alternative hypothesis: stationary
##
## KPSS Test for Level Stationarity
##
## data: SO2_var_transform$SO2_box
## KPSS Level = 1.0289, Truncation lag parameter = 3, p-value = 0.01
As below graphs indicate, results of First Difference seem mean and variance stationary.
And validating the results again with ADF and KPSS testing for SO2 pollutant…
##
## Augmented Dickey-Fuller Test
##
## data: SO2_mean_transform$SO2_box_diff[!is.na(SO2_mean_transform$SO2_box_diff)]
## Dickey-Fuller = -3.4458, Lag order = 4, p-value = 0.05548
## alternative hypothesis: stationary
##
## KPSS Test for Level Stationarity
##
## data: SO2_mean_transform$SO2_box_diff[!is.na(SO2_mean_transform$SO2_box_diff)]
## KPSS Level = 0.021481, Truncation lag parameter = 3, p-value = 0.1
SO2 levels time-series is likely to have ‘hourly’ seasonality. Values are higher as the day goes and they start to drop in the evening. (might edit)
As the ACF plot for SO2 shows ‘dampening effect’, we can conclude that it is a 2nd order Auto-Regressive process (AR2).
In this section, I have fitted different ARIMA models to SO2 time-series to find out the best fit model.
As model_5 (arima(SO2_mean_transform$SO2_box,order=c(2,0,0))) has the lowest AIC as 124.56 and lowest BIC as 133.66, model 3 is the best fit model.
##
## Call:
## arima(x = SO2_mean_transform$SO2_box, order = c(0, 1, 1))
##
## Coefficients:
## ma1
## -0.9198
## s.e. 0.0402
##
## sigma^2 estimated as 0.3409: log likelihood = -63.48, aic = 130.96
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.09266072 0.5798083 0.5132462 -25.89082 45.75158 0.6596577
## ACF1
## Training set -0.4002606
##
## Call:
## arima(x = SO2_mean_transform$SO2_box, order = c(0, 1, 0))
##
##
## sigma^2 estimated as 0.858: log likelihood = -95.31, aic = 192.61
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.01702582 0.9198182 0.7672727 -24.92745 60.96428 0.9861494
## ACF1
## Training set -0.5113093
##
## Call:
## arima(x = SO2_mean_transform$SO2_box, order = c(1, 1, 0))
##
## Coefficients:
## ar1
## -0.5038
## s.e. 0.1009
##
## sigma^2 estimated as 0.6342: log likelihood = -84.72, aic = 173.45
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.02566505 0.7907994 0.7114285 -25.48573 57.98106 0.9143747
## ACF1
## Training set -0.4878925
##
## Call:
## arima(x = SO2_mean_transform$SO2_box, order = c(1, 0, 0))
##
## Coefficients:
## ar1 intercept
## -0.2379 1.5194
## s.e. 0.1153 0.0546
##
## sigma^2 estimated as 0.3274: log likelihood = -62, aic = 130
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.001811511 0.5722204 0.5252722 -19.74818 44.82579 0.6751144
## ACF1
## Training set -0.07149676
##
## Call:
## arima(x = SO2_mean_transform$SO2_box, order = c(2, 0, 0))
##
## Coefficients:
## ar1 ar2 intercept
## -0.3269 -0.3220 1.5192
## s.e. 0.1142 0.1146 0.0391
##
## sigma^2 estimated as 0.2944: log likelihood = -58.28, aic = 124.56
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.006518512 0.5425492 0.4532191 -17.9904 39.63995 0.582507
## ACF1
## Training set 0.3027177
## df BIC
## arima(SO2_mean_transform$SO2_box, order = c(0, 1, 1)) 2 135.4809
## arima(SO2_mean_transform$SO2_box, order = c(0, 1, 0)) 1 194.8767
## arima(SO2_mean_transform$SO2_box, order = c(1, 1, 0)) 2 177.9714
## arima(SO2_mean_transform$SO2_box, order = c(1, 0, 0)) 3 136.8302
## arima(SO2_mean_transform$SO2_box, order = c(2, 0, 0)) 4 133.6633
Here, residuals appear to be a white noise.
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,0,0) with non-zero mean
## Q* = 174.01, df = 7, p-value < 2.2e-16
##
## Model df: 3. Total lags used: 10
Conducting Box-Ljung test for residual autocorrelation: There is no residual auto-correlation at the lag.
##
## Box-Ljung test
##
## data: mod5$residuals
## X-squared = 6.8767, df = 1, p-value = 0.008733
##
## Box-Ljung test
##
## data: mod5$residuals
## X-squared = 83.257, df = 5, p-value = 2.22e-16
## Series: SO2_mean_transform$SO2_box
## ARIMA(2,1,0)
##
## Coefficients:
## ar1 ar2
## -1.0098 -0.9781
## s.e. 0.0227 0.0166
##
## sigma^2 = 0.02769: log likelihood = 24.3
## AIC=-42.61 AICc=-42.25 BIC=-35.82
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.02407056 0.1629002 0.1156149 -3.657083 10.92749 0.1485959
## ACF1
## Training set 0.2034958
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,1,0)
## Q* = 8.5245, df = 8, p-value = 0.384
##
## Model df: 2. Total lags used: 10
##
## Box-Ljung test
##
## data: mod1_auto$residuals
## X-squared = 3.1075, df = 1, p-value = 0.07793
##
## Box-Ljung test
##
## data: mod1_auto$residuals
## X-squared = 5.9029, df = 5, p-value = 0.3158
RMSE is 30.38
## [1] 30.38052
We can observe that the prediction is correct and as the late evening hours, SO2 levels drop.
## List of 2
## $ pred: Time-Series [1:5] from 73 to 77: 2.048 0.903 0.951 2.022 0.894
## $ se : Time-Series [1:5] from 73 to 77: 0.166 0.166 0.166 0.233 0.233
## Joining, by = "hour"
Assesing the Changping SO2 levels over the 5 years of period using Facebook Prophet model:
We can observe the change-points at the beginning (Jan- Feb) and end (Nov - Dec) of each year.
As observed below, we can see that SO2 levels drop around Wednesday in weekly seasonality. While considering monthly seasonality we can see that SO2 levels drop between April to October.
We also observed SO2 levels impacting around New year’s day (1st Jan) and Chinese new year (1st Feb).
It is expected to have positive values for SO2 levels. Maximum SO2 levels observed are 340 and minimum is 10. We can see the same reflecting in below graphs. We also observed predicted values slightly towards negative number.
##Seasonality : With Additive and Multiplicative
## y ds yhat yhat_lower yhat_upper cutoff
## 1 27 2015-03-25 4.291868 -20.52526 30.75234 2015-03-24
## 2 27 2015-03-25 4.291868 -20.10028 29.56012 2015-03-24
## 3 27 2015-03-25 4.291868 -21.47427 31.25539 2015-03-24
## 4 27 2015-03-25 4.291868 -20.38720 30.47535 2015-03-24
## 5 27 2015-03-25 4.291868 -21.75215 28.87214 2015-03-24
## 6 27 2015-03-25 4.291868 -21.45851 30.36092 2015-03-24
## [1] "2015-03-24 GMT" "2015-04-23 GMT" "2015-05-23 GMT" "2015-06-22 GMT"
## [5] "2015-07-22 GMT" "2015-08-21 GMT" "2015-09-20 GMT" "2015-10-20 GMT"
## <ggproto object: Class ScaleDiscrete, Scale, gg>
## aesthetics: colour
## axis_order: function
## break_info: function
## break_positions: function
## breaks: waiver
## call: call
## clone: function
## dimension: function
## drop: TRUE
## expand: waiver
## get_breaks: function
## get_breaks_minor: function
## get_labels: function
## get_limits: function
## guide: legend
## is_discrete: function
## is_empty: function
## labels: waiver
## limits: NULL
## make_sec_title: function
## make_title: function
## map: function
## map_df: function
## n.breaks.cache: NULL
## na.translate: TRUE
## na.value: grey50
## name: Cutoff
## palette: function
## palette.cache: NULL
## position: left
## range: <ggproto object: Class RangeDiscrete, Range, gg>
## range: NULL
## reset: function
## train: function
## super: <ggproto object: Class RangeDiscrete, Range, gg>
## rescale: function
## reset: function
## scale_name: hue
## train: function
## train_df: function
## transform: function
## transform_df: function
## super: <ggproto object: Class ScaleDiscrete, Scale, gg>
From the above information, we can say that Facebook Prophet model best indicates the seasonality in the dataset. It also performs better while predicting SO2 levels for out-of-sample data. Hence, Facebook Prophet model will give better results while predicting SO2 air pollutant levels over the years.