I have chosen the Beijing Multi-Site Air-Quality time-series dataset for this analysis. The data are available at https://archive.ics.uci.edu/ml/datasets/Beijing+Multi-Site+Air-Quality+Data# .
This hourly data set considers 6 main air pollutants and 6 relevant meteorological variables at multiple sites in Beijing.
This data set includes hourly air pollutants data from 12 nationally-controlled air-quality monitoring sites. The air-quality data are from the Beijing Municipal Environmental Monitoring Center. The meteorological data in each air-quality site are matched with the nearest weather station from the China Meteorological Administration. The time period is from March 1st, 2013 to February 28th, 2017.
For this assignment, I study the ‘Changping’ monitoring site.
The attributes are as follows:
No: row number
year: year of data in this row
month: month of data in this row
day: day of data in this row
hour: hour of data in this row
PM2.5: PM2.5 concentration (ug/m^3)
PM10: PM10 concentration (ug/m^3)
SO2: SO2 concentration (ug/m^3)
NO2: NO2 concentration (ug/m^3)
CO: CO concentration (ug/m^3)
O3: O3 concentration (ug/m^3)
TEMP: temperature (degree Celsius)
PRES: pressure (hPa)
DEWP: dew point temperature (degree Celsius)
RAIN: precipitation (mm)
wd: wind direction
WSPM: wind speed (m/s)
station: name of the air-quality monitoring site
changping<-read.csv("C:/Users/Hetal Sawant/Desktop/Spring sem/Forecasting/week1/datasets/Beijing air quality data/PRSA_Data_20130301-20170228/PRSA_Data_Changping_20130301-20170228.csv")
head(changping)
Missing data are denoted as NA in the dataset. In this section, I omit every row that contains at least one NA value, which removes 2,383 rows and leaves 32,681 observations.
## [1] 5166
## No year month day
## Min. : 1 Min. :2013 Min. : 1.000 Min. : 1.00
## 1st Qu.: 9384 1st Qu.:2014 1st Qu.: 3.000 1st Qu.: 8.00
## Median :17910 Median :2015 Median : 7.000 Median :16.00
## Mean :17877 Mean :2015 Mean : 6.507 Mean :15.72
## 3rd Qu.:26546 3rd Qu.:2016 3rd Qu.:10.000 3rd Qu.:23.00
## Max. :35064 Max. :2017 Max. :12.000 Max. :31.00
## hour PM2.5 PM10 SO2
## Min. : 0.00 Min. : 3.00 Min. : 2.00 Min. : 1.00
## 1st Qu.: 6.00 1st Qu.: 18.00 1st Qu.: 33.00 1st Qu.: 2.00
## Median :11.00 Median : 46.00 Median : 72.00 Median : 7.00
## Mean :11.51 Mean : 70.31 Mean : 94.09 Mean : 15.06
## 3rd Qu.:18.00 3rd Qu.: 99.00 3rd Qu.:130.00 3rd Qu.: 18.00
## Max. :23.00 Max. :662.00 Max. :992.00 Max. :310.00
## NO2 CO O3 TEMP
## Min. : 2.00 Min. : 100 Min. : 0.2142 Min. :-16.6
## 1st Qu.: 22.00 1st Qu.: 500 1st Qu.: 15.0000 1st Qu.: 3.1
## Median : 36.00 Median : 800 Median : 46.0000 Median : 14.1
## Mean : 44.32 Mean : 1152 Mean : 57.4245 Mean : 13.4
## 3rd Qu.: 61.00 3rd Qu.: 1400 3rd Qu.: 79.0000 3rd Qu.: 23.1
## Max. :208.00 Max. :10000 Max. :429.0000 Max. : 41.4
## PRES DEWP RAIN wd
## Min. : 982.4 Min. :-35.100 Min. : 0.00000 Length:32681
## 1st Qu.: 999.5 1st Qu.:-10.600 1st Qu.: 0.00000 Class :character
## Median :1007.7 Median : 1.100 Median : 0.00000 Mode :character
## Mean :1008.0 Mean : 1.135 Mean : 0.06074
## 3rd Qu.:1016.3 3rd Qu.: 13.900 3rd Qu.: 0.00000
## Max. :1036.5 Max. : 27.200 Max. :52.10000
## WSPM station
## Min. : 0.000 Length:32681
## 1st Qu.: 1.000 Class :character
## Median : 1.500 Mode :character
## Mean : 1.866
## 3rd Qu.: 2.300
## Max. :10.000
## 'data.frame': 32681 obs. of 18 variables:
## $ No : int 1 2 3 4 5 6 7 8 9 10 ...
## $ year : int 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ month : int 3 3 3 3 3 3 3 3 3 3 ...
## $ day : int 1 1 1 1 1 1 1 1 1 1 ...
## $ hour : int 0 1 2 3 4 5 6 7 8 9 ...
## $ PM2.5 : num 3 3 3 3 3 3 4 3 9 11 ...
## $ PM10 : num 6 3 3 6 3 3 6 6 25 29 ...
## $ SO2 : num 13 6 22 12 14 10 12 25 13 5 ...
## $ NO2 : num 7 6 13 8 8 17 22 39 42 18 ...
## $ CO : int 300 300 400 300 300 400 500 600 700 500 ...
## $ O3 : num 85 85 74 81 81 71 65 48 46 73 ...
## $ TEMP : num -2.3 -2.5 -3 -3.6 -3.5 -4.5 -4.5 -2.1 -0.2 0.6 ...
## $ PRES : num 1021 1021 1021 1022 1022 ...
## $ DEWP : num -19.7 -19 -19.9 -19.1 -19.4 -19.5 -19.5 -20 -20.5 -20.4 ...
## $ RAIN : num 0 0 0 0 0 0 0 0 0 0 ...
## $ wd : chr "E" "ENE" "ENE" "NNE" ...
## $ WSPM : num 0.5 0.7 0.2 1 2.1 1.7 1.8 2.5 2.8 3.8 ...
## $ station: chr "Changping" "Changping" "Changping" "Changping" ...
## - attr(*, "na.action")= 'omit' Named int [1:2383] 27 28 29 123 124 125 179 220 316 412 ...
## ..- attr(*, "names")= chr [1:2383] "27" "28" "29" "123" ...
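The NA handling described above can be sketched on a small stand-in data frame. This is a minimal illustration, not the code used for the report: the data frame `toy` and its values are made up, and I am assuming `na.omit()` was the function used, which the `na.action` attribute in the `str()` output suggests.

```r
# Toy data frame standing in for the Changping CSV (values are made up).
toy <- data.frame(
  hour  = 0:5,
  PM2.5 = c(3, NA, 3, 4, 9, 11),
  SO2   = c(13, 6, NA, 12, 14, NA),
  wd    = c("E", "ENE", "ENE", NA, "N", "NW")
)

# Total number of NA cells across the whole data frame.
n_missing <- sum(is.na(toy))

# Drop every row that contains at least one NA.
clean_toy <- na.omit(toy)

# The indices of the dropped rows are stored in the "na.action" attribute.
dropped <- attr(clean_toy, "na.action")

print(n_missing)
print(nrow(clean_toy))
print(length(dropped))
```

Note that `na.omit()` drops a row even if only one column is missing, so a variable with many gaps can remove otherwise usable readings of the other variables.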
#NO2
As shown in the histograms below, the concentrations of NO2, PM2.5 and PM10 are right-skewed over the study period (March 2013 to February 2017).
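The right skew can also be checked numerically. In the summary output, the mean exceeds the median for all three pollutants (PM2.5: 70.31 vs 46, PM10: 94.09 vs 72, NO2: 44.32 vs 36), which is the usual sign of a long upper tail. The sketch below shows the same check on a synthetic log-normal sample; the vector `x` is an assumption for illustration and does not contain the real readings.

```r
# Synthetic right-skewed sample (log-normal draw), NOT the real NO2 readings.
set.seed(1)
x <- rlnorm(1000, meanlog = 3.5, sdlog = 0.8)

# In a right-skewed distribution the long upper tail pulls the mean above the median.
print(c(mean = mean(x), median = median(x)))

# Moment-based sample skewness; a positive value indicates right skew.
skewness <- mean((x - mean(x))^3) / sd(x)^3
print(skewness)

# Histogram bin counts without drawing the plot; drop plot = FALSE to see the figure.
h <- hist(x, breaks = 30, plot = FALSE)
print(h$counts)
```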
The SO2 concentration plot below shows an outlier (265) in March.
The standard deviation (the dispersion of the SO2 concentration values) is 21.05757.
## [1] 21.05757
The plots below show outlier values of CO concentration in February and October.
The standard deviation of CO concentration is 1105.647.
## [1] 1105.647
No outliers are observed in the ozone (O3) concentration.
The standard deviation of O3 is 53.7926.
## [1] 53.7926
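The dispersion and outlier checks above can be sketched as follows. This is only an illustration under assumptions: the data frame `toy` is made up (with 265 planted as an extreme value), and I am assuming the usual boxplot rule of 1.5 times the interquartile range, which may differ from how the outliers were read off the original plots.

```r
# Made-up concentrations for two months; 265 is planted as an extreme value.
toy <- data.frame(
  month = rep(c(2, 3), each = 6),
  SO2   = c(5, 7, 6, 8, 9, 7, 10, 12, 11, 9, 13, 265)
)

# Standard deviation as the dispersion measure.
so2_sd <- sd(toy$SO2)
print(so2_sd)

# Values lying more than 1.5 * IQR beyond the hinges, per month (the boxplot rule).
outliers_by_month <- lapply(
  split(toy$SO2, toy$month),
  function(v) boxplot.stats(v)$out
)
print(outliers_by_month)
```

A single extreme value inflates the standard deviation considerably, so a robust measure such as `IQR()` or `mad()` may be worth reporting next to it.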
Below is the aggregate SO2 concentration observed at 5 am and 5 pm over the study period. The concentration has decreased over the years. SO2 is also higher at 5 pm than at 5 am, plausibly because of daytime sources such as vehicle traffic and factory emissions.
Below is the aggregate NO2 concentration observed at 3 am and 3 pm over the study period. NO2 is higher at 3 am than at 3 pm.
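One way to build such an hour-of-day comparison is sketched below. The data frame `toy` and its values are made up, and using the mean as the aggregate is an assumption; a sum or median would follow the same pattern with a different `FUN`.

```r
# Made-up readings: two years, two hours of the day, two observations each.
toy <- data.frame(
  year = rep(c(2013, 2014), each = 4),
  hour = rep(c(5, 17), times = 4),
  SO2  = c(20, 30, 22, 34, 10, 18, 12, 16)
)

# Keep only the two hours being compared (5 am and 5 pm).
sub <- subset(toy, hour %in% c(5, 17))

# Mean SO2 for every combination of year and hour.
agg <- aggregate(SO2 ~ year + hour, data = sub, FUN = mean)
print(agg)
```

Plotting `agg$SO2` against `agg$year` separately for each hour then shows both the year-to-year trend and the gap between the two times of day.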
##
## Call:
## lm(formula = SO2 ~ year, data = clean_changping)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.716 -11.901 -6.994 3.284 296.099
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7887.06791 196.56323 40.12 <2e-16 ***
## year -3.90728 0.09756 -40.05 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.56 on 32679 degrees of freedom
## Multiple R-squared: 0.04678, Adjusted R-squared: 0.04675
## F-statistic: 1604 on 1 and 32679 DF, p-value: < 2.2e-16
Here, the RSE is 20.56, which is the estimate of the standard deviation of ε (the error term). The p-value is < 2e-16, which indicates an association between the predictor and the response. The slope is negative (-3.907), so SO2 concentration decreased by about 3.9 ug/m^3 per year on average. The R-squared of 0.04678 shows that year alone explains only about 4.7% of the variance in SO2.
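The quantities discussed above can be pulled out of a fitted model directly instead of being read from the printed summary. The sketch below uses synthetic data with a built-in downward trend; the data frame `toy` and its coefficients are assumptions for illustration and are not the Changping measurements.

```r
set.seed(42)

# Synthetic data with a built-in slope of -4 per year; NOT the real SO2 readings.
toy <- data.frame(year = rep(2013:2017, each = 50))
toy$SO2 <- 25 - 4 * (toy$year - 2013) + rnorm(nrow(toy), sd = 5)

# Simple linear regression of concentration on year.
fit <- lm(SO2 ~ year, data = toy)
s <- summary(fit)

slope   <- coef(fit)[["year"]]                    # sign gives the direction of the trend
rse     <- s$sigma                                # residual standard error
r2      <- s$r.squared                            # share of variance explained
p_value <- s$coefficients["year", "Pr(>|t|)"]     # test of slope = 0

print(c(slope = slope, rse = rse, r2 = r2, p_value = p_value))
```

The p-value only says that the slope differs from zero; the direction of the relationship comes from the sign of `slope`, and its practical strength from `r2`.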