Time Series Dataset: Beijing Air Quality

I have chosen the Beijing Air Quality time-series dataset for data analysis. The data can be found in the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/Beijing+Multi-Site+Air-Quality+Data# .

This hourly data set considers 6 main air pollutants and 6 relevant meteorological variables at multiple sites in Beijing.

This data set includes hourly air pollutants data from 12 nationally-controlled air-quality monitoring sites. The air-quality data are from the Beijing Municipal Environmental Monitoring Center. The meteorological data in each air-quality site are matched with the nearest weather station from the China Meteorological Administration. The time period is from March 1st, 2013 to February 28th, 2017.

For this particular assignment, I study the ‘Changping’ monitoring site.

Attribute information is as below:

No: row number
year: year of data in this row
month: month of data in this row
day: day of data in this row
hour: hour of data in this row
PM2.5: PM2.5 concentration (ug/m^3)
PM10: PM10 concentration (ug/m^3)
SO2: SO2 concentration (ug/m^3)
NO2: NO2 concentration (ug/m^3)
CO: CO concentration (ug/m^3)
O3: O3 concentration (ug/m^3)
TEMP: temperature (degree Celsius)
PRES: pressure (hPa)
DEWP: dew point temperature (degree Celsius)
RAIN: precipitation (mm)
wd: wind direction
WSPM: wind speed (m/s)
station: name of the air-quality monitoring site

Section 1: Reading the time-series dataset

changping<-read.csv("C:/Users/Hetal Sawant/Desktop/Spring sem/Forecasting/week1/datasets/Beijing air quality data/PRSA_Data_20130301-20170228/PRSA_Data_Changping_20130301-20170228.csv")
head(changping)

Handling Missing values and nulls:

Missing data are denoted as NA in the dataset. In this section, I have omitted the rows that contain an NA value.

## [1] 5166
##        No             year          month             day       
##  Min.   :    1   Min.   :2013   Min.   : 1.000   Min.   : 1.00  
##  1st Qu.: 9384   1st Qu.:2014   1st Qu.: 3.000   1st Qu.: 8.00  
##  Median :17910   Median :2015   Median : 7.000   Median :16.00  
##  Mean   :17877   Mean   :2015   Mean   : 6.507   Mean   :15.72  
##  3rd Qu.:26546   3rd Qu.:2016   3rd Qu.:10.000   3rd Qu.:23.00  
##  Max.   :35064   Max.   :2017   Max.   :12.000   Max.   :31.00  
##       hour           PM2.5             PM10             SO2        
##  Min.   : 0.00   Min.   :  3.00   Min.   :  2.00   Min.   :  1.00  
##  1st Qu.: 6.00   1st Qu.: 18.00   1st Qu.: 33.00   1st Qu.:  2.00  
##  Median :11.00   Median : 46.00   Median : 72.00   Median :  7.00  
##  Mean   :11.51   Mean   : 70.31   Mean   : 94.09   Mean   : 15.06  
##  3rd Qu.:18.00   3rd Qu.: 99.00   3rd Qu.:130.00   3rd Qu.: 18.00  
##  Max.   :23.00   Max.   :662.00   Max.   :992.00   Max.   :310.00  
##       NO2               CO              O3                TEMP      
##  Min.   :  2.00   Min.   :  100   Min.   :  0.2142   Min.   :-16.6  
##  1st Qu.: 22.00   1st Qu.:  500   1st Qu.: 15.0000   1st Qu.:  3.1  
##  Median : 36.00   Median :  800   Median : 46.0000   Median : 14.1  
##  Mean   : 44.32   Mean   : 1152   Mean   : 57.4245   Mean   : 13.4  
##  3rd Qu.: 61.00   3rd Qu.: 1400   3rd Qu.: 79.0000   3rd Qu.: 23.1  
##  Max.   :208.00   Max.   :10000   Max.   :429.0000   Max.   : 41.4  
##       PRES             DEWP              RAIN               wd           
##  Min.   : 982.4   Min.   :-35.100   Min.   : 0.00000   Length:32681      
##  1st Qu.: 999.5   1st Qu.:-10.600   1st Qu.: 0.00000   Class :character  
##  Median :1007.7   Median :  1.100   Median : 0.00000   Mode  :character  
##  Mean   :1008.0   Mean   :  1.135   Mean   : 0.06074                     
##  3rd Qu.:1016.3   3rd Qu.: 13.900   3rd Qu.: 0.00000                     
##  Max.   :1036.5   Max.   : 27.200   Max.   :52.10000                     
##       WSPM          station         
##  Min.   : 0.000   Length:32681      
##  1st Qu.: 1.000   Class :character  
##  Median : 1.500   Mode  :character  
##  Mean   : 1.866                     
##  3rd Qu.: 2.300                     
##  Max.   :10.000
## 'data.frame':    32681 obs. of  18 variables:
##  $ No     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ year   : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month  : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ day    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ hour   : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ PM2.5  : num  3 3 3 3 3 3 4 3 9 11 ...
##  $ PM10   : num  6 3 3 6 3 3 6 6 25 29 ...
##  $ SO2    : num  13 6 22 12 14 10 12 25 13 5 ...
##  $ NO2    : num  7 6 13 8 8 17 22 39 42 18 ...
##  $ CO     : int  300 300 400 300 300 400 500 600 700 500 ...
##  $ O3     : num  85 85 74 81 81 71 65 48 46 73 ...
##  $ TEMP   : num  -2.3 -2.5 -3 -3.6 -3.5 -4.5 -4.5 -2.1 -0.2 0.6 ...
##  $ PRES   : num  1021 1021 1021 1022 1022 ...
##  $ DEWP   : num  -19.7 -19 -19.9 -19.1 -19.4 -19.5 -19.5 -20 -20.5 -20.4 ...
##  $ RAIN   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ wd     : chr  "E" "ENE" "ENE" "NNE" ...
##  $ WSPM   : num  0.5 0.7 0.2 1 2.1 1.7 1.8 2.5 2.8 3.8 ...
##  $ station: chr  "Changping" "Changping" "Changping" "Changping" ...
##  - attr(*, "na.action")= 'omit' Named int [1:2383] 27 28 29 123 124 125 179 220 316 412 ...
##   ..- attr(*, "names")= chr [1:2383] "27" "28" "29" "123" ...
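
The code that produced the output above is not shown. The `na.action` attribute of class 'omit' in the `str()` output indicates that `na.omit()` was used, and the name `clean_changping` appears in the regression call later on. The following is only a minimal sketch of that kind of workflow on a made-up data frame; `toy`, `clean_toy` and `n_missing` are hypothetical names, and counting NA cells with `sum(is.na(...))` is an assumption about how a figure such as `5166` might be obtained.

```r
# Toy stand-in for the Changping data; the values are made up for illustration.
toy <- data.frame(
  hour = 0:5,
  SO2  = c(13, NA, 22, 12, NA, 10),
  NO2  = c(7, 6, NA, 8, 8, 17)
)

# Total number of NA cells across the whole data frame.
n_missing <- sum(is.na(toy))

# Drop every row that contains at least one NA.
clean_toy <- na.omit(toy)

n_missing          # count of NA cells
nrow(clean_toy)    # number of rows that survive
summary(clean_toy) # per-column summary statistics
str(clean_toy)     # structure, including the na.action attribute
```

Note that `na.omit()` removes a whole row even if only one column is missing, so hours with a single missing pollutant reading are lost for every variable.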

Section 2: Visualizing the time-series dataset

NO2

As shown in the histograms below, the concentrations of the NO2, PM2.5 and PM10 pollutants are right-skewed over the 5 calendar years covered (2013 to 2017).
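
Right skew can also be checked numerically: in the summary output above, the mean exceeds the median for NO2 (44.32 vs 36), PM2.5 (70.31 vs 46) and PM10 (94.09 vs 72). The sketch below illustrates both the histogram and this check on simulated data; `no2_toy` is a hypothetical vector drawn from a log-normal distribution, not the real measurements.

```r
set.seed(1)
# Hypothetical right-skewed concentrations (log-normal); not the real data.
no2_toy <- rlnorm(1000, meanlog = 3.5, sdlog = 0.6)

# Histogram: a long tail to the right indicates right skew.
hist(no2_toy, breaks = 30,
     main = "NO2 (toy data)",
     xlab = "NO2 concentration (ug/m^3)")

# Numeric check: in a right-skewed sample the mean typically exceeds the median.
mean(no2_toy) > median(no2_toy)
```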

SO2

From the SO2 concentration graph shown below, we can see that there is an outlier (= 265) in the month of March.

The standard deviation (dispersion of the SO2 concentration values) is 21.05757.

## [1] 21.05757
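
As a rough illustration of how dispersion and outliers can be computed, the sketch below uses `sd()` and `boxplot.stats()` on a made-up vector. `so2_toy` is hypothetical, and the 1.5 * IQR rule used by `boxplot.stats()` is an assumption: the criterion used to identify outliers in the graphs is not stated in this report.

```r
# Hypothetical SO2 readings with one extreme value; not the real data.
so2_toy <- c(2, 3, 5, 7, 7, 9, 12, 18, 20, 265)

# Sample standard deviation (n - 1 denominator), as computed by sd().
so2_sd <- sd(so2_toy)

# boxplot.stats() flags points lying more than 1.5 * IQR beyond the hinges.
so2_out <- boxplot.stats(so2_toy)$out

so2_sd
so2_out
```

A single extreme value inflates the standard deviation considerably, which is worth keeping in mind when comparing dispersion across pollutants.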

CO

From the graphs below, we can see that there are outlier values for CO concentration in the months of February and October.

The dispersion (standard deviation) of the CO concentration is 1105.647.

## [1] 1105.647

O3

No outliers are observed in the O3 (ozone) concentration.

The standard deviation of the O3 concentration is 53.7926.

## [1] 53.7926

SO2 concentration over the 5 years

Below is the aggregate SO2 concentration in the air observed at 5 am and 5 pm, respectively, over the 5 years. Over the years, the concentration has decreased. We can also observe that SO2 is more concentrated at 5 pm than at 5 am, likely due to factors such as vehicle traffic and factory emissions.
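
A comparison of this kind can be sketched with `aggregate()`. The example below is only an illustration on made-up values: `toy` and the helper `so2_by_year()` are hypothetical, and using the mean as the aggregate is an assumption, since the report does not say which summary function was used.

```r
# Hypothetical hourly records; the values are made up for illustration.
toy <- data.frame(
  year = rep(c(2013, 2014), each = 4),
  hour = rep(c(5, 17), times = 4),
  SO2  = c(10, 20, 12, 22, 6, 14, 8, 16)
)

# Mean SO2 per year for a single hour of the day (hypothetical helper).
so2_by_year <- function(df, h) {
  aggregate(SO2 ~ year, data = df[df$hour == h, ], FUN = mean)
}

am <- so2_by_year(toy, 5)   # 5 am
pm <- so2_by_year(toy, 17)  # 5 pm

# Side-by-side comparison of the two hours, one row per year.
merge(am, pm, by = "year", suffixes = c("_5am", "_5pm"))
```

The same helper could be adapted to another pollutant or another pair of hours by changing the formula and the `h` argument.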

NO2 concentration over the 5 years

Below is the aggregate NO2 concentration in the air observed at 3 am and 3 pm, respectively, over the 5 years. We can observe that NO2 is more concentrated at 3 am than at 3 pm.

Section 3: Regression model fit

## 
## Call:
## lm(formula = SO2 ~ year, data = clean_changping)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.716 -11.901  -6.994   3.284 296.099 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7887.06791  196.56323   40.12   <2e-16 ***
## year          -3.90728    0.09756  -40.05   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.56 on 32679 degrees of freedom
## Multiple R-squared:  0.04678,    Adjusted R-squared:  0.04675 
## F-statistic:  1604 on 1 and 32679 DF,  p-value: < 2.2e-16

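Here, the RSE is 20.56, which is the estimate of the standard deviation of the error term ε. The p-value for year is < 2.2e-16, which indicates a statistically significant association between the predictor and the response. The coefficient on year is negative (-3.907), so SO2 concentration levels have decreased as time has passed, by about 3.9 ug/m^3 per year on average. The R-squared of 0.047 shows that year alone explains only about 4.7% of the variance in SO2.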
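
The quantities discussed above (slope, residual standard error, R-squared and p-value) can be extracted directly from a fitted `lm` object. The sketch below fits the same kind of model, `SO2 ~ year`, on simulated data; `toy` and its built-in trend are hypothetical and are not the Changping measurements.

```r
set.seed(42)
# Hypothetical data with a built-in downward trend; not the real measurements.
toy <- data.frame(year = rep(2013:2017, each = 50))
toy$SO2 <- 30 - 4 * (toy$year - 2013) + rnorm(nrow(toy), sd = 5)

fit <- lm(SO2 ~ year, data = toy)

slope <- coef(fit)["year"]                               # change in SO2 per year
rse   <- sigma(fit)                                      # residual standard error
r2    <- summary(fit)$r.squared                          # share of variance explained
pval  <- summary(fit)$coefficients["year", "Pr(>|t|)"]   # p-value for the slope

slope
rse
r2
pval
```

The sign of the slope gives the direction of the trend, while the p-value only indicates whether the association is distinguishable from zero.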