Small and fine particulates have become a more recent measure of air quality. The following code performs a regression analysis to model particulate load by other pollutants typically used to measure air quality such as \(O_3\) (ozone), CO (carbon monoxide), \(NO_2\) (Nitrogen Dioxide) and \(SO_2\) (Sulfer Dioxide). Air quality data from 2 districts of Beijing will be used in the analysis. Data from Guanyan the Xicheng district will be used to train the data while data from the Dongsi in the neighboring district of Dongcheng will be used to test the model. Additionally, we will deploy the model to predict particulates loads for Huairou, a Beijing district with a much lower population density than the city center. The data is part of a larger set hosted on the UCI Machine Learning Repository
| District | Pop. Density (/km^2) |
|---|---|
| XiCheng 西城区 | 26,731 |
| DongCheng 东城区 | 22,635 |
| Huairou 怀柔区 | 146 |
library( tidyverse ) #tidyverse to model our data & other functions
library( ggplot2 ) #basic plotting of our lin. model fit + data
library( ggfortify ) #plot diagnostics of our lin. model
library( tidymodels ) #lm model & more
library( relaimpo ) #calculating relative importance
library(reshape2)
library( hrbrthemes ) load the .csv data for 3 Beijing districts into the r environment as data.frames
guanyan_url <- 'https://raw.githubusercontent.com/SmilodonCub/DATA605/master/Guanyuan_dat.csv'
guanyan_df <- read.csv( guanyan_url ) #read data in as an r dataframe
dongsi_url <- 'https://raw.githubusercontent.com/SmilodonCub/DATA605/master/Dongsi_dat.csv'
dongsi_df <- read.csv( dongsi_url ) #read data in as an r dataframe
huairou_url <- 'https://raw.githubusercontent.com/SmilodonCub/DATA605/master/Huairou_dat.csv'
huairou_df <- read.csv( huairou_url ) #read data in as an r dataframeconsolidate the data into a single dataframe:
beijing_3d_df <- rbind( guanyan_df, dongsi_df, huairou_df ) %>%
rename( PM2_5 = PM2.5 ) #some packages have issues with '.' in the label
glimpse( beijing_3d_df )## Rows: 105,192
## Columns: 18
## $ No <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ hour <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ PM2_5 <dbl> 4, 4, 3, 3, 3, 3, 6, 3, 3, 7, 3, 3, 7, 12, 14, 12, 13, 9, 10,…
## $ PM10 <dbl> 4, 4, 3, 6, 6, 6, 6, 3, 6, 11, 8, 6, 14, 14, 16, 21, 22, 28, …
## $ SO2 <dbl> 14, 13, 10, 7, 5, 6, 6, 7, 9, 9, 11, 8, 6, 4, 4, 3, 5, 6, 8, …
## $ NO2 <dbl> 20, 17, 19, 24, 14, 14, 20, 26, 37, 30, 26, 23, 23, 20, 21, 2…
## $ CO <int> 300, 300, 300, 400, 400, 400, 400, 400, 500, 400, 400, 300, 3…
## $ O3 <dbl> 69, 72, 69, 62, 71, 71, 66, 61, 50, 58, 65, 70, 73, 78, 79, 7…
## $ TEMP <dbl> -0.7, -1.1, -1.1, -1.4, -2.0, -2.2, -2.6, -1.6, 0.1, 1.2, 1.9…
## $ PRES <dbl> 1023.0, 1023.2, 1023.5, 1024.5, 1025.2, 1025.6, 1026.5, 1027.…
## $ DEWP <dbl> -18.8, -18.2, -18.2, -19.4, -19.5, -19.6, -19.1, -19.1, -19.2…
## $ RAIN <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ wd <chr> "NNW", "N", "NNW", "NW", "N", "N", "NNE", "NNW", "NNW", "N", …
## $ WSPM <dbl> 4.4, 4.7, 5.6, 3.1, 2.0, 3.7, 2.5, 3.8, 4.1, 2.6, 3.6, 3.7, 5…
## $ station <chr> "Guanyuan", "Guanyuan", "Guanyuan", "Guanyuan", "Guanyuan", "…
The features in the data that give the values for air particulates are all different components used to measure air polution. We can visualize the correlations between these attributes:
polutants <- beijing_3d_df %>%
dplyr::select( PM2_5:O3 ) %>%
drop_na()
polutants_cov <- round( cor( polutants ), 2 )
polcov_melted <- melt( polutants_cov )
ggplot( data = polcov_melted, aes(x=Var1, y=Var2, fill=value), col = cm.colors() ) +
geom_tile() +
scale_fill_distiller(palette = "RdBu") +
theme_ipsum() +
ggtitle( 'Correlation Heat Map' ) +
geom_text( aes( Var2, Var1, label = value ), color = 'black', size=4 ) PM2.5 (renamed here as PM2_5) will now be modelled by the other attributes in the dataset as a linear multiple regression. PM2.5 refers to the measure of microscopic particles less than or equal to 2.5 microns. These very small particulates have been shown to have impacts on health and to cause disease and are now a common measure of air polution.
Use step-wise regression methods to develop a linear multiple regression model using the data from Guanyan
guanyan_filt <- beijing_3d_df %>%
filter( station == 'Guanyuan' ) %>%
dplyr::select( !c( station, No ) ) %>%
drop_na()
int_only <- lm( PM2_5 ~ 1, data = guanyan_filt )
all_features <- lm( PM2_5 ~ . -wd, data = guanyan_filt )
stepwise <- stats::step( int_only, scope = list( lower = int_only, upper = all_features ),
direction = 'both', trace = 0, steps = 1000 )
sigcoeffs <- names( unlist( stepwise[1] ) )
sigcoeffs## [1] "coefficients.(Intercept)" "coefficients.PM10"
## [3] "coefficients.CO" "coefficients.DEWP"
## [5] "coefficients.TEMP" "coefficients.O3"
## [7] "coefficients.NO2" "coefficients.PRES"
## [9] "coefficients.month" "coefficients.day"
## [11] "coefficients.SO2" "coefficients.year"
##
## Call:
## lm(formula = PM2_5 ~ PM10 + CO + DEWP + TEMP + O3 + NO2 + PRES +
## month + day + SO2 + year, data = guanyan_filt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -483.30 -13.35 1.15 13.96 230.35
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.505e+03 3.107e+02 -4.842 1.29e-06 ***
## PM10 5.608e-01 2.839e-03 197.548 < 2e-16 ***
## CO 1.920e-02 2.547e-04 75.387 < 2e-16 ***
## DEWP 1.446e+00 2.656e-02 54.443 < 2e-16 ***
## TEMP -1.051e+00 3.928e-02 -26.744 < 2e-16 ***
## O3 1.147e-01 4.410e-03 26.017 < 2e-16 ***
## NO2 1.549e-01 8.558e-03 18.100 < 2e-16 ***
## PRES 5.741e-01 3.203e-02 17.926 < 2e-16 ***
## month -4.056e-01 5.685e-02 -7.135 9.87e-13 ***
## day -1.443e-01 1.835e-02 -7.864 3.83e-15 ***
## SO2 4.710e-02 9.919e-03 4.748 2.06e-06 ***
## year 4.565e-01 1.566e-01 2.915 0.00356 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29.03 on 32251 degrees of freedom
## Multiple R-squared: 0.873, Adjusted R-squared: 0.8729
## F-statistic: 2.015e+04 on 11 and 32251 DF, p-value: < 2.2e-16
The above code used both forward and backward step-wise regression approaches to our data; this arrives at the best fit of the data by either adding features (forward) or removing features (backward) from the model and evaluating until the best result if found. Many of the features have coefficients that evaluate to be statistically relevant to the model. The model has a high R-squared value suggesting that these features can be weighted in a linear regression to explain 87.2% of the variance in the data. The p-value is small suggesting statistical significance to the relationship of the attributes to the value of small paticulates. However, the F-statistic is large suggesting a statistical significance.
The following code will calculate the relative importance of the attributes. perhaps the model can be stream-lined by removing features that do not contribute much to the overall fit of the data.
all_features <- lm( PM2_5 ~ . -wd, data = guanyan_filt )
relimportance <- calc.relimp( all_features, type = 'lmg', rela = TRUE )
sort( relimportance$lmg, decreasing = TRUE )## PM10 CO NO2 SO2 DEWP WSPM
## 0.4231317405 0.2567354343 0.1646685450 0.0735510770 0.0241546579 0.0204228327
## TEMP O3 PRES month year hour
## 0.0153722919 0.0126547462 0.0048726327 0.0020848674 0.0011778152 0.0008030963
## day RAIN
## 0.0002041870 0.0001660760
From the output above, some features do not contribute much to the model, so they will be discarded in the next fit. Using an arbitrary threshold of 1%, we will simplify the number of parameters in the model. This leaves us with PM10, CO, NO2, SO2, DEWP, WSPM, TEMP and O3. Additionally, the categorical variable wd for wind direction will be incorporated.
guanyan_filt <- beijing_3d_df %>%
filter( station == 'Guanyuan' ) %>%
dplyr::select( !c( station, No, day, year, month, RAIN, hour, PRES ) ) %>%
drop_na()
int_only <- lm( PM2_5 ~ 1, data = guanyan_filt )
all_features <- lm( PM2_5 ~ . , data = guanyan_filt )
stepwise <- stats::step( int_only, scope = list( lower = int_only, upper = all_features ),
direction = 'both', trace = 0, steps = 1000 )
sigcoeffs <- names( unlist( stepwise[1] ) )
sigcoeffs## [1] "coefficients.(Intercept)" "coefficients.PM10"
## [3] "coefficients.CO" "coefficients.DEWP"
## [5] "coefficients.TEMP" "coefficients.O3"
## [7] "coefficients.NO2" "coefficients.SO2"
##
## Call:
## lm(formula = PM2_5 ~ PM10 + CO + DEWP + TEMP + O3 + NO2 + SO2,
## data = guanyan_filt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -486.14 -13.42 1.35 14.02 223.64
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.5585922 0.6108826 -5.825 5.75e-09 ***
## PM10 0.5565078 0.0028333 196.414 < 2e-16 ***
## CO 0.0194165 0.0002511 77.325 < 2e-16 ***
## DEWP 1.2673156 0.0248113 51.078 < 2e-16 ***
## TEMP -1.3589243 0.0353805 -38.409 < 2e-16 ***
## O3 0.1190134 0.0044086 26.996 < 2e-16 ***
## NO2 0.1501640 0.0086094 17.442 < 2e-16 ***
## SO2 0.0375082 0.0092268 4.065 4.81e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29.22 on 32255 degrees of freedom
## Multiple R-squared: 0.8713, Adjusted R-squared: 0.8713
## F-statistic: 3.12e+04 on 7 and 32255 DF, p-value: < 2.2e-16
The simplified model loses explanation of a fraction of a percent of variance in the data. However, there is a gain in the f-statistic and hence the statistical significance of the model.
Next, the model predictions will be tested for Dongsi station in the neighboring district of Dongcheng.
dongsi_filt <- beijing_3d_df %>%
filter( station == 'Dongsi' ) %>%
dplyr::select( !c( station, No, day, year, month, RAIN, hour, PRES ) ) %>%
drop_na()
dongsi_predict <- stepwise %>% predict( dongsi_filt )
test_rsq <- rsq_vec( dongsi_filt$PM2_5, dongsi_predict )
test_rsq## [1] 0.8618616
The model developed using air quality data from Guanyuan Station in Xicheng was able to explain a comparable amount of the variance in the data for Dongsi station in neighboring DongCheng district of central Beijing. However, these districts are very similar in that they both have very high population density. Will the model perform well for an area with less poeple? Huairou station is located north of the city center and has the lowest population density of all Beijing Districts. The following code will evaluate the Guanyuan based model’s predictions for Huairou:
huairou_filt <- beijing_3d_df %>%
filter( station == 'Huairou' ) %>%
dplyr::select( !c( station, No, day, year, month, RAIN, hour, PRES ) ) %>%
drop_na()
huairou_predict <- stepwise %>% predict( huairou_filt )
test_rsq <- rsq_vec( huairou_filt$PM2_5, huairou_predict )
test_rsq## [1] 0.8473711
Even though Huairou has a much lower population density than Xicheng, the model does approximately as well at estimating the variance.