Beijing Air Quality: Estimating particulate load from concentration of common pollutants.

a multiple regression analysis done in the tidyverse

by Bonnie Cooper

Small and fine particulates have become a more recent measure of air quality. The following code performs a regression analysis to model particulate load by other pollutants typically used to measure air quality such as \(O_3\) (ozone), CO (carbon monoxide), \(NO_2\) (Nitrogen Dioxide) and \(SO_2\) (Sulfer Dioxide). Air quality data from 2 districts of Beijing will be used in the analysis. Data from Guanyan the Xicheng district will be used to train the data while data from the Dongsi in the neighboring district of Dongcheng will be used to test the model. Additionally, we will deploy the model to predict particulates loads for Huairou, a Beijing district with a much lower population density than the city center. The data is part of a larger set hosted on the UCI Machine Learning Repository

District	Pop. Density (/km^2)
XiCheng 西城区	26,731
DongCheng 东城区	22,635
Huairou 怀柔区	146

library( tidyverse ) #tidyverse to model our data & other functions
library( ggplot2 ) #basic plotting of our lin. model fit + data
library( ggfortify ) #plot diagnostics of our lin. model
library( tidymodels ) #lm model & more
library( relaimpo ) #calculating relative importance
library(reshape2)
library( hrbrthemes )

load the .csv data for 3 Beijing districts into the r environment as data.frames

guanyan_url <- 'https://raw.githubusercontent.com/SmilodonCub/DATA605/master/Guanyuan_dat.csv'
guanyan_df <- read.csv( guanyan_url ) #read data in as an r dataframe
dongsi_url <- 'https://raw.githubusercontent.com/SmilodonCub/DATA605/master/Dongsi_dat.csv'
dongsi_df <- read.csv( dongsi_url ) #read data in as an r dataframe
huairou_url <- 'https://raw.githubusercontent.com/SmilodonCub/DATA605/master/Huairou_dat.csv'
huairou_df <- read.csv( huairou_url ) #read data in as an r dataframe

consolidate the data into a single dataframe:

beijing_3d_df <- rbind( guanyan_df, dongsi_df, huairou_df ) %>%
  rename( PM2_5 = PM2.5 ) #some packages have issues with '.' in the label
glimpse( beijing_3d_df )

## Rows: 105,192
## Columns: 18
## $ No      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
## $ year    <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month   <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…
## $ day     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ hour    <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ PM2_5   <dbl> 4, 4, 3, 3, 3, 3, 6, 3, 3, 7, 3, 3, 7, 12, 14, 12, 13, 9, 10,…
## $ PM10    <dbl> 4, 4, 3, 6, 6, 6, 6, 3, 6, 11, 8, 6, 14, 14, 16, 21, 22, 28, …
## $ SO2     <dbl> 14, 13, 10, 7, 5, 6, 6, 7, 9, 9, 11, 8, 6, 4, 4, 3, 5, 6, 8, …
## $ NO2     <dbl> 20, 17, 19, 24, 14, 14, 20, 26, 37, 30, 26, 23, 23, 20, 21, 2…
## $ CO      <int> 300, 300, 300, 400, 400, 400, 400, 400, 500, 400, 400, 300, 3…
## $ O3      <dbl> 69, 72, 69, 62, 71, 71, 66, 61, 50, 58, 65, 70, 73, 78, 79, 7…
## $ TEMP    <dbl> -0.7, -1.1, -1.1, -1.4, -2.0, -2.2, -2.6, -1.6, 0.1, 1.2, 1.9…
## $ PRES    <dbl> 1023.0, 1023.2, 1023.5, 1024.5, 1025.2, 1025.6, 1026.5, 1027.…
## $ DEWP    <dbl> -18.8, -18.2, -18.2, -19.4, -19.5, -19.6, -19.1, -19.1, -19.2…
## $ RAIN    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ wd      <chr> "NNW", "N", "NNW", "NW", "N", "N", "NNE", "NNW", "NNW", "N", …
## $ WSPM    <dbl> 4.4, 4.7, 5.6, 3.1, 2.0, 3.7, 2.5, 3.8, 4.1, 2.6, 3.6, 3.7, 5…
## $ station <chr> "Guanyuan", "Guanyuan", "Guanyuan", "Guanyuan", "Guanyuan", "…

The features in the data that give the values for air particulates are all different components used to measure air polution. We can visualize the correlations between these attributes:

polutants <- beijing_3d_df %>% 
  dplyr::select( PM2_5:O3 ) %>%
  drop_na()

polutants_cov <- round( cor( polutants ), 2 ) 
polcov_melted <- melt( polutants_cov )

ggplot( data = polcov_melted, aes(x=Var1, y=Var2, fill=value), col = cm.colors() ) + 
  geom_tile() +
  scale_fill_distiller(palette = "RdBu") +
  theme_ipsum() +
  ggtitle( 'Correlation Heat Map' ) +
  geom_text( aes( Var2, Var1, label = value ), color = 'black', size=4 )

PM2.5 (renamed here as PM2_5) will now be modelled by the other attributes in the dataset as a linear multiple regression. PM2.5 refers to the measure of microscopic particles less than or equal to 2.5 microns. These very small particulates have been shown to have impacts on health and to cause disease and are now a common measure of air polution.

Use step-wise regression methods to develop a linear multiple regression model using the data from Guanyan

guanyan_filt <- beijing_3d_df %>% 
  filter( station == 'Guanyuan' ) %>%
  dplyr::select( !c( station, No ) ) %>%
  drop_na()

int_only <- lm( PM2_5 ~ 1, data = guanyan_filt )
all_features <- lm( PM2_5 ~ . -wd, data = guanyan_filt )
stepwise <- stats::step( int_only, scope = list( lower = int_only, upper = all_features ),
                  direction = 'both', trace = 0, steps = 1000 )
sigcoeffs <- names( unlist( stepwise[1] ) )
sigcoeffs

##  [1] "coefficients.(Intercept)" "coefficients.PM10"       
##  [3] "coefficients.CO"          "coefficients.DEWP"       
##  [5] "coefficients.TEMP"        "coefficients.O3"         
##  [7] "coefficients.NO2"         "coefficients.PRES"       
##  [9] "coefficients.month"       "coefficients.day"        
## [11] "coefficients.SO2"         "coefficients.year"

summary( stepwise )

## 
## Call:
## lm(formula = PM2_5 ~ PM10 + CO + DEWP + TEMP + O3 + NO2 + PRES + 
##     month + day + SO2 + year, data = guanyan_filt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -483.30  -13.35    1.15   13.96  230.35 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.505e+03  3.107e+02  -4.842 1.29e-06 ***
## PM10         5.608e-01  2.839e-03 197.548  < 2e-16 ***
## CO           1.920e-02  2.547e-04  75.387  < 2e-16 ***
## DEWP         1.446e+00  2.656e-02  54.443  < 2e-16 ***
## TEMP        -1.051e+00  3.928e-02 -26.744  < 2e-16 ***
## O3           1.147e-01  4.410e-03  26.017  < 2e-16 ***
## NO2          1.549e-01  8.558e-03  18.100  < 2e-16 ***
## PRES         5.741e-01  3.203e-02  17.926  < 2e-16 ***
## month       -4.056e-01  5.685e-02  -7.135 9.87e-13 ***
## day         -1.443e-01  1.835e-02  -7.864 3.83e-15 ***
## SO2          4.710e-02  9.919e-03   4.748 2.06e-06 ***
## year         4.565e-01  1.566e-01   2.915  0.00356 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29.03 on 32251 degrees of freedom
## Multiple R-squared:  0.873,  Adjusted R-squared:  0.8729 
## F-statistic: 2.015e+04 on 11 and 32251 DF,  p-value: < 2.2e-16

The above code used both forward and backward step-wise regression approaches to our data; this arrives at the best fit of the data by either adding features (forward) or removing features (backward) from the model and evaluating until the best result if found. Many of the features have coefficients that evaluate to be statistically relevant to the model. The model has a high R-squared value suggesting that these features can be weighted in a linear regression to explain 87.2% of the variance in the data. The p-value is small suggesting statistical significance to the relationship of the attributes to the value of small paticulates. However, the F-statistic is large suggesting a statistical significance.

The following code will calculate the relative importance of the attributes. perhaps the model can be stream-lined by removing features that do not contribute much to the overall fit of the data.

all_features <- lm( PM2_5 ~ . -wd, data = guanyan_filt )
relimportance <- calc.relimp( all_features, type = 'lmg', rela = TRUE )
sort( relimportance$lmg, decreasing = TRUE )

##         PM10           CO          NO2          SO2         DEWP         WSPM 
## 0.4231317405 0.2567354343 0.1646685450 0.0735510770 0.0241546579 0.0204228327 
##         TEMP           O3         PRES        month         year         hour 
## 0.0153722919 0.0126547462 0.0048726327 0.0020848674 0.0011778152 0.0008030963 
##          day         RAIN 
## 0.0002041870 0.0001660760

From the output above, some features do not contribute much to the model, so they will be discarded in the next fit. Using an arbitrary threshold of 1%, we will simplify the number of parameters in the model. This leaves us with PM10, CO, NO2, SO2, DEWP, WSPM, TEMP and O3. Additionally, the categorical variable wd for wind direction will be incorporated.

guanyan_filt <- beijing_3d_df %>% 
  filter( station == 'Guanyuan' ) %>%
  dplyr::select( !c( station, No, day, year, month, RAIN, hour, PRES ) ) %>%
  drop_na()
int_only <- lm( PM2_5 ~ 1, data = guanyan_filt )
all_features <- lm( PM2_5 ~ . , data = guanyan_filt )
stepwise <- stats::step( int_only, scope = list( lower = int_only, upper = all_features ),
                  direction = 'both', trace = 0, steps = 1000 )
sigcoeffs <- names( unlist( stepwise[1] ) )
sigcoeffs

## [1] "coefficients.(Intercept)" "coefficients.PM10"       
## [3] "coefficients.CO"          "coefficients.DEWP"       
## [5] "coefficients.TEMP"        "coefficients.O3"         
## [7] "coefficients.NO2"         "coefficients.SO2"

summary( stepwise )

## 
## Call:
## lm(formula = PM2_5 ~ PM10 + CO + DEWP + TEMP + O3 + NO2 + SO2, 
##     data = guanyan_filt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -486.14  -13.42    1.35   14.02  223.64 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.5585922  0.6108826  -5.825 5.75e-09 ***
## PM10         0.5565078  0.0028333 196.414  < 2e-16 ***
## CO           0.0194165  0.0002511  77.325  < 2e-16 ***
## DEWP         1.2673156  0.0248113  51.078  < 2e-16 ***
## TEMP        -1.3589243  0.0353805 -38.409  < 2e-16 ***
## O3           0.1190134  0.0044086  26.996  < 2e-16 ***
## NO2          0.1501640  0.0086094  17.442  < 2e-16 ***
## SO2          0.0375082  0.0092268   4.065 4.81e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29.22 on 32255 degrees of freedom
## Multiple R-squared:  0.8713, Adjusted R-squared:  0.8713 
## F-statistic: 3.12e+04 on 7 and 32255 DF,  p-value: < 2.2e-16

The simplified model loses explanation of a fraction of a percent of variance in the data. However, there is a gain in the f-statistic and hence the statistical significance of the model.

Next, the model predictions will be tested for Dongsi station in the neighboring district of Dongcheng.

dongsi_filt <- beijing_3d_df %>% 
  filter( station == 'Dongsi' ) %>%
  dplyr::select( !c( station, No, day, year, month, RAIN, hour, PRES ) ) %>%
  drop_na()
dongsi_predict <- stepwise %>% predict( dongsi_filt )
test_rsq <- rsq_vec( dongsi_filt$PM2_5, dongsi_predict )
test_rsq

## [1] 0.8618616

The model developed using air quality data from Guanyuan Station in Xicheng was able to explain a comparable amount of the variance in the data for Dongsi station in neighboring DongCheng district of central Beijing. However, these districts are very similar in that they both have very high population density. Will the model perform well for an area with less poeple? Huairou station is located north of the city center and has the lowest population density of all Beijing Districts. The following code will evaluate the Guanyuan based model’s predictions for Huairou:

huairou_filt <- beijing_3d_df %>% 
  filter( station == 'Huairou' ) %>%
  dplyr::select( !c( station, No, day, year, month, RAIN, hour, PRES ) ) %>%
  drop_na()
huairou_predict <- stepwise %>% predict( huairou_filt )
test_rsq <- rsq_vec( huairou_filt$PM2_5, huairou_predict )
test_rsq

## [1] 0.8473711

Even though Huairou has a much lower population density than Xicheng, the model does approximately as well at estimating the variance.