Federal authorities wish to use data on Boston suburbs to predict pollution levels in other urban and suburban areas, using socioeconomic indicators as predictors and nitrogen oxides (nox) concentration, in parts per 10 million, as the response variable.

library(MASS)
data(Boston)
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

Variable Definitions

From the MASS documentation for the Boston data set:

crim: per capita crime rate by town
zn: proportion of residential land zoned for lots over 25,000 sq.ft.
indus: proportion of non-retail business acres per town
chas: Charles River dummy variable (1 if tract bounds the river, 0 otherwise)
nox: nitrogen oxides concentration in parts per 10 million
rm: average number of rooms per dwelling
age: proportion of owner-occupied units built prior to 1940
dis: weighted mean of distances to five Boston employment centres
rad: index of accessibility to radial highways
tax: full-value property-tax rate per 10,000 USD
ptratio: pupil-teacher ratio by town
black: 1000(Bk - 0.63)^2 where Bk is the proportion of black citizens by town
lstat: percentage of the population classified as 'lower status'
medv: median value of owner-occupied homes in 1000s USD

Variable selection

For the model to be useful nationwide, the predictors specific to Boston (chas, the Charles River indicator, and dis, the weighted distance to Boston employment centres) are removed.

library(dplyr)
data <- Boston %>%
          select(-chas, -dis)

Checking for NA values in the data set

sum(is.na(data))
## [1] 0

There are no missing values in the data set.

Splitting data into training and test sets

set.seed(1)
trainid <- sample(1:nrow(data), nrow(data)/2, replace = FALSE)

train <- data[trainid,]
test <- data[-trainid,]

dim(data)
## [1] 506  12
dim(train)
## [1] 253  12
dim(test)
## [1] 253  12

The data set has been split in half: one half for training the models and one half for testing.

Least Squares Multiple Regression with all variables

lm.fit <- lm(nox~ ., data = train)
summary(lm.fit)
## 
## Call:
## lm(formula = nox ~ ., data = train)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.124753 -0.037328 -0.003285  0.031815  0.172038 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.383e-01  7.174e-02  10.291  < 2e-16 ***
## crim        -5.733e-04  5.426e-04  -1.057 0.291736    
## zn          -6.601e-04  2.226e-04  -2.965 0.003326 ** 
## indus        6.307e-03  1.003e-03   6.290 1.48e-09 ***
## rm          -1.030e-02  7.641e-03  -1.348 0.179010    
## age          1.458e-03  2.116e-04   6.889 4.86e-11 ***
## rad          4.425e-03  1.158e-03   3.822 0.000169 ***
## tax          5.110e-06  6.590e-05   0.078 0.938250    
## ptratio     -1.742e-02  2.274e-03  -7.657 4.59e-13 ***
## black        1.730e-05  4.608e-05   0.375 0.707662    
## lstat        6.816e-04  1.030e-03   0.662 0.508855    
## medv        -8.689e-04  7.509e-04  -1.157 0.248386    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05767 on 241 degrees of freedom
## Multiple R-squared:  0.7774, Adjusted R-squared:  0.7673 
## F-statistic: 76.53 on 11 and 241 DF,  p-value: < 2.2e-16

The multiple regression model with no feature selection identifies ptratio, age, indus, rad and zn as the most significant predictors.

lm.pred <- predict(lm.fit, newdata = test, type = "response")
MSE.lm <- mean((lm.pred - test$nox)^2)
MSE.lm
## [1] 0.003672847

The linear model's test mean squared error is 0.003673. Taking the square root puts this back on the scale of the response: a typical prediction error of roughly 0.06 parts per 10 million.
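
A quick conversion from MSE back to the units of nox (not shown in the original output):

# Root mean squared error, on the same scale as nox (parts per 10 million)
sqrt(MSE.lm)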

A model that uses only the most significant predictors may improve prediction accuracy.

lm.fit2 <- lm(nox ~ ptratio + rad + age + indus + zn, data = train)
summary(lm.fit2)
## 
## Call:
## lm(formula = nox ~ ptratio + rad + age + indus + zn, data = train)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.131256 -0.034661  0.001782  0.029558  0.173266 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.6151550  0.0408598  15.055  < 2e-16 ***
## ptratio     -0.0150688  0.0021283  -7.080 1.49e-11 ***
## rad          0.0041092  0.0005924   6.937 3.50e-11 ***
## age          0.0015047  0.0001954   7.701 3.25e-13 ***
## indus        0.0072576  0.0008544   8.494 1.89e-15 ***
## zn          -0.0006891  0.0002205  -3.125  0.00199 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0586 on 247 degrees of freedom
## Multiple R-squared:  0.7645, Adjusted R-squared:  0.7597 
## F-statistic: 160.4 on 5 and 247 DF,  p-value: < 2.2e-16

Removing the terms with p-values above 0.05 lowers the percentage of variance in nox explained by the model (adjusted R²) from 76.7% to 76.0%, so I will keep them in.
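
As an added sanity check (not part of the original analysis), a partial F-test compares the two nested fits directly; a small p-value would confirm that the dropped terms carry some signal:

# Partial F-test of the reduced model against the full model
anova(lm.fit2, lm.fit)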

Best Subset Selection

library(leaps)
regfit.full <- regsubsets(nox~., data=train, nvmax = 12)
regfit.summary <- summary(regfit.full)
regfit.summary
## Subset selection object
## Call: regsubsets.formula(nox ~ ., data = train, nvmax = 12)
## 11 Variables  (and intercept)
##         Forced in Forced out
## crim        FALSE      FALSE
## zn          FALSE      FALSE
## indus       FALSE      FALSE
## rm          FALSE      FALSE
## age         FALSE      FALSE
## rad         FALSE      FALSE
## tax         FALSE      FALSE
## ptratio     FALSE      FALSE
## black       FALSE      FALSE
## lstat       FALSE      FALSE
## medv        FALSE      FALSE
## 1 subsets of each size up to 11
## Selection Algorithm: exhaustive
##           crim zn  indus rm  age rad tax ptratio black lstat medv
## 1  ( 1 )  " "  " " "*"   " " " " " " " " " "     " "   " "   " " 
## 2  ( 1 )  " "  " " "*"   " " "*" " " " " " "     " "   " "   " " 
## 3  ( 1 )  " "  " " "*"   " " "*" "*" " " " "     " "   " "   " " 
## 4  ( 1 )  " "  " " "*"   " " "*" "*" " " "*"     " "   " "   " " 
## 5  ( 1 )  " "  " " "*"   "*" "*" "*" " " "*"     " "   " "   " " 
## 6  ( 1 )  " "  "*" "*"   "*" "*" "*" " " "*"     " "   " "   " " 
## 7  ( 1 )  " "  "*" "*"   "*" "*" "*" " " "*"     " "   " "   "*" 
## 8  ( 1 )  "*"  "*" "*"   "*" "*" "*" " " "*"     " "   " "   "*" 
## 9  ( 1 )  "*"  "*" "*"   "*" "*" "*" " " "*"     " "   "*"   "*" 
## 10  ( 1 ) "*"  "*" "*"   "*" "*" "*" " " "*"     "*"   "*"   "*" 
## 11  ( 1 ) "*"  "*" "*"   "*" "*" "*" "*" "*"     "*"   "*"   "*"

Best subset selection chooses slightly different variables from the five singled out by p-value in the full least squares model: the 5-variable model with the smallest RSS contains rm instead of zn.
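
This can be confirmed directly (output not shown in the original):

# Coefficients of the best 5-variable model; rm appears in place of zn
coef(regfit.full, 5)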

Feature Selection

par(mfrow=c(3,1))

plot(regfit.summary$cp, type="l", xlab="Number Of Variables", ylab="Best Subset Cp")
min.cp <- which.min(regfit.summary$cp)
points(min.cp, regfit.summary$cp[min.cp], pch=16, col="red")

plot(regfit.summary$bic, type="l", xlab="Number Of Variables", ylab="Best Subset BIC")
min.bic <- which.min(regfit.summary$bic)
points(min.bic, regfit.summary$bic[min.bic], pch=16, col="red")

plot(regfit.summary$adjr2, type="l", xlab="Number Of Variables", ylab="Best Subset Adj R2")
max.adjr2 <- which.max(regfit.summary$adjr2)
points(max.adjr2, regfit.summary$adjr2[max.adjr2], pch=16, col="red")

BIC, Cp and adjusted R² each select a different model size, so I will test the models with 6, 7 and 8 variables.
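
The sizes picked by each criterion can be printed directly (the original output is not shown; per the text they are 6, 7 and 8 in some order):

# Model size selected by each criterion
c(Cp = min.cp, BIC = min.bic, AdjR2 = max.adjr2)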

Calculating Test MSE

sixvar.coefs <- coef(regfit.full, 6)

modmatrix <- model.matrix(nox~., data=test)
pred <- modmatrix[,names(sixvar.coefs)] %*% sixvar.coefs
MSE.bestsubset.6coefs <- mean((pred - test$nox)^2)

MSE.bestsubset.6coefs
## [1] 0.003716281
sevenvar.coefs <- coef(regfit.full, 7)

modmatrix <- model.matrix(nox~., data=test)
pred <- modmatrix[,names(sevenvar.coefs)] %*% sevenvar.coefs
MSE.bestsubset.7coefs <- mean((pred - test$nox)^2)

MSE.bestsubset.7coefs
## [1] 0.003631656
eightvar.coefs <- coef(regfit.full, 8)

modmatrix <- model.matrix(nox~., data=test)
pred <- modmatrix[,names(eightvar.coefs)] %*% eightvar.coefs
MSE.bestsubset.8coefs <- mean((pred - test$nox)^2)

MSE.bestsubset.8coefs
## [1] 0.003638238

The best subset selection model with 7 variables returns the smallest test MSE, so this one will be tested against the other modelling methods.

MSE.bestsubset <- MSE.bestsubset.7coefs
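
Since leaps provides no predict() method for regsubsets objects, the model-matrix bookkeeping above was repeated for each model size. A small helper, sketched here as the usual workaround (the function is my addition, not part of leaps), does the work once:

predict.regsubsets <- function(object, newdata, id, ...) {
  # Rebuild the design matrix from the formula stored in the regsubsets call
  form <- as.formula(object$call[[2]])
  mat <- model.matrix(form, newdata)
  # Coefficients of the best id-variable model
  coefi <- coef(object, id = id)
  mat[, names(coefi)] %*% coefi
}

# Equivalent to the manual calculation above, e.g. for the 7-variable model
pred7 <- predict(regfit.full, test, id = 7)
mean((pred7 - test$nox)^2)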

Ridge Regression

library(glmnet)

Defining Model Matrix

xmat.train <- model.matrix(nox~., data=train)[,-1]
y.train <- train$nox
xmat.test <- model.matrix(nox~., data=test)[,-1]
y.test <- test$nox

ridge.fit <- cv.glmnet(xmat.train, y.train, alpha=0)

best.lambda <- ridge.fit$lambda.min
best.lambda
## [1] 0.01024846
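
cv.glmnet also reports lambda.1se, the largest lambda whose cross-validated error is within one standard error of the minimum; it gives a more heavily regularised fit and is noted here only as an optional alternative, not used below:

# More conservative shrinkage choice reported by cv.glmnet
ridge.fit$lambda.1se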

Calculating Test MSE

ridge.pred <- predict(ridge.fit, newx = xmat.test, s=best.lambda)
MSE.ridge <- mean((ridge.pred-y.test)^2)
MSE.ridge
## [1] 0.003496171

The ridge regression test MSE is 0.003496, a root-mean-squared prediction error of roughly 0.059 parts per 10 million.

Lasso Fit

lasso.fit <- cv.glmnet(xmat.train, y.train, alpha=1)

best.lambda <- lasso.fit$lambda.min
best.lambda
## [1] 0.0008121731

Calculating Test MSE

lasso.pred <- predict(lasso.fit, s=best.lambda, newx = xmat.test)
MSE.lasso <- mean((lasso.pred-y.test)^2)
MSE.lasso
## [1] 0.003579346

The lasso test MSE is 0.003579, a root-mean-squared prediction error of roughly 0.060 parts per 10 million. The selected lambda is very small, so the fit is close to unpenalised least squares; the check below shows whether any coefficients are dropped.
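
Coefficients at the chosen lambda (an added check; entries printed as "." have been shrunk exactly to zero):

# Inspect which, if any, coefficients the lasso sets to zero
coef(lasso.fit, s = best.lambda)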

Principal Components Regression

library(pls)
pcr.fit <- pcr(nox~., data=train, validation="CV", scale=TRUE)

validationplot(pcr.fit, val.type = "MSEP", main = "Modelling nox")

summary(pcr.fit)
## Data:    X dimension: 253 11 
##  Y dimension: 253 1
## Fit method: svdpc
## Number of components considered: 11
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV          0.1198  0.07871  0.07908  0.07384  0.06978  0.06412  0.06001
## adjCV       0.1198  0.07867  0.07906  0.07371  0.06950  0.06398  0.05987
##        7 comps  8 comps  9 comps  10 comps  11 comps
## CV     0.05987  0.05997  0.06039   0.06017   0.05979
## adjCV  0.05974  0.05984  0.06024   0.05999   0.05960
## 
## TRAINING: % variance explained
##      1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X      50.52    62.14    71.21    78.49    84.95    89.83    93.46
## nox    57.13    57.42    63.98    68.96    73.48    76.61    76.81
##      8 comps  9 comps  10 comps  11 comps
## X      95.77    97.92     99.46    100.00
## nox    76.82    76.87     77.23     77.74

The models with 7 components and 11 components yield very similar cross-validated RMSEP, so I will test them both.
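
pls also offers an automated choice via selectNcomp(); shown as an optional cross-check, its result was not part of the original analysis:

# One-sigma heuristic: the smallest number of components whose CV error is
# within one standard error of the minimum
selectNcomp(pcr.fit, method = "onesigma")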

pcr.pred7 <- predict(pcr.fit, newdata = test, ncomp = 7)
pcr.pred11 <- predict(pcr.fit, newdata = test, ncomp = 11)

MSE.pcr7 <- mean((pcr.pred7-y.test)^2)
MSE.pcr11 <- mean((pcr.pred11-y.test)^2)

MSE.pcr7
## [1] 0.003729978
MSE.pcr11
## [1] 0.003672847

The PCR model with 11 components yields the smallest test MSE, 0.003673 (a root-mean-squared prediction error of roughly 0.061 parts per 10 million). With all 11 components retained, PCR spans the full predictor space and reproduces the least squares fit, which is why its test MSE matches MSE.lm exactly.
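
This equivalence can be verified directly:

# PCR with all components is the full least squares fit
all.equal(MSE.pcr11, MSE.lm)
## [1] TRUE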

errors <- data.frame(method = c("Least Squares", "Best Subset", "Ridge","Lasso","PCR"), 
                     MeanSquaredError = c(MSE.lm, MSE.bestsubset, MSE.ridge, MSE.lasso, MSE.pcr11))
library(ggplot2)
library(ggthemes)
ggplot(errors, aes(x = method, y = MeanSquaredError)) +
        geom_bar(stat = "identity") +
        labs(title = "Modelling Method Test Mean Squared Errors",
             y = "Mean Squared Errors", x = "Method") +
        coord_cartesian(ylim = c(0.002, 0.004)) +
        theme_fivethirtyeight()

Conclusion

Ridge regression yields the smallest test mean squared error, so I propose that this model be used to predict nox, with the following formula:

\[ Nitrogen~Oxide~Concentration~(parts~per~10~million) = \\ 0.6728 \\ - 0.0002757 \times per~capita~crime~rate \\ - 0.0006834 \times proportion~of~residential~land~zoned~for~lots~over~25,000~sq.ft. \\ + 0.0054385 \times proportion~of~non-retail~business~acres \\ - 0.0097889 \times average~number~of~rooms~per~dwelling \\ + 0.0013280 \times proportion~of~owner-occupied~units~built~prior~to~1940 \\ + 0.0028582 \times index~of~accessibility~to~radial~highways \\ + 0.0000754 \times full~value~property~tax~rate~per~10,000~USD \\ - 0.0144613 \times pupil-teacher~ratio \\ + 0.0000016 \times 1000(Bk - 0.63)^2~where~Bk~is~the~proportion~of~black~citizens \\ + 0.0012969 \times percentage~of~population~classified~as~'low~status' \\ - 0.0004152 \times median~value~of~owner-occupied~homes~in~1000s~USD \]
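
For reference, these coefficients can be read directly off the fitted ridge object:

# Ridge coefficients at the selected lambda, as quoted in the formula above
coef(ridge.fit, s = best.lambda)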

The findings show that the largest contributing factors to nitrogen oxide levels in a given town are the proportion of business acres that are non-retail, the town's accessibility to radial highways, and the proportion of owner-occupied units built prior to 1940. A higher incidence of these features is strongly associated with increased levels of pollution.

The largest mitigating factors are pupil-teacher ratio, average number of rooms per dwelling and the proportion of residential land zoned for lots over 25,000 sq.ft. Towns with large pupil-teacher ratios and large residential dwellings are associated with lower levels of pollution.