Federal authorities wish to use Boston suburb data to predict pollution levels in other urban and suburban areas using socioeconomic indicators as predictors and nitrogen oxides concentration in parts per 10 million as the response variable.
library(MASS)
data(Boston)
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
For the model to be useful nationwide, predictors that are specific to Boston will be removed
library(dplyr)
data <- Boston %>%
select(-chas, -dis)
sum(is.na(data))
## [1] 0
There are no missing values in the data set
set.seed(1)
trainid <- sample(1:nrow(data), nrow(data)/2, replace=F)
train <- data[trainid,]
test <- data[-trainid,]
dim(data)
## [1] 506 12
dim(train)
## [1] 253 12
dim(test)
## [1] 253 12
The data set has been split using half for training the models and half for testing
lm.fit <- lm(nox~ ., data = train)
summary(lm.fit)
##
## Call:
## lm(formula = nox ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.124753 -0.037328 -0.003285 0.031815 0.172038
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.383e-01 7.174e-02 10.291 < 2e-16 ***
## crim -5.733e-04 5.426e-04 -1.057 0.291736
## zn -6.601e-04 2.226e-04 -2.965 0.003326 **
## indus 6.307e-03 1.003e-03 6.290 1.48e-09 ***
## rm -1.030e-02 7.641e-03 -1.348 0.179010
## age 1.458e-03 2.116e-04 6.889 4.86e-11 ***
## rad 4.425e-03 1.158e-03 3.822 0.000169 ***
## tax 5.110e-06 6.590e-05 0.078 0.938250
## ptratio -1.742e-02 2.274e-03 -7.657 4.59e-13 ***
## black 1.730e-05 4.608e-05 0.375 0.707662
## lstat 6.816e-04 1.030e-03 0.662 0.508855
## medv -8.689e-04 7.509e-04 -1.157 0.248386
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05767 on 241 degrees of freedom
## Multiple R-squared: 0.7774, Adjusted R-squared: 0.7673
## F-statistic: 76.53 on 11 and 241 DF, p-value: < 2.2e-16
The multiple regression model with no feature selection identifies ptratio, age, indus, rad and zn as the most significant predictors.
lm.pred <- predict(lm.fit, newdata = test, type = "response")
MSE.lm <- mean((lm.pred - test$nox)^2)
MSE.lm
## [1] 0.003672847
The linear model predictions differ from the recorded pollution levels by on average 0.003673 parts per 10 million
A model that uses only the most significant predictors may improve prediction accuracy
lm.fit2 <- lm(nox ~ ptratio + rad + age + indus + zn, data = train)
summary(lm.fit2)
##
## Call:
## lm(formula = nox ~ ptratio + rad + age + indus + zn, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.131256 -0.034661 0.001782 0.029558 0.173266
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.6151550 0.0408598 15.055 < 2e-16 ***
## ptratio -0.0150688 0.0021283 -7.080 1.49e-11 ***
## rad 0.0041092 0.0005924 6.937 3.50e-11 ***
## age 0.0015047 0.0001954 7.701 3.25e-13 ***
## indus 0.0072576 0.0008544 8.494 1.89e-15 ***
## zn -0.0006891 0.0002205 -3.125 0.00199 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0586 on 247 degrees of freedom
## Multiple R-squared: 0.7645, Adjusted R-squared: 0.7597
## F-statistic: 160.4 on 5 and 247 DF, p-value: < 2.2e-16
The percentage of variance in nox explained by the model has decreased from 76.7% to 76% when the terms with p-values >0.05 are removed so I will leave them in.
library(leaps)
regfit.full <- regsubsets(nox~., data=train, nvmax = 12)
regfit.summary <- summary(regfit.full)
regfit.summary
## Subset selection object
## Call: regsubsets.formula(nox ~ ., data = train, nvmax = 12)
## 11 Variables (and intercept)
## Forced in Forced out
## crim FALSE FALSE
## zn FALSE FALSE
## indus FALSE FALSE
## rm FALSE FALSE
## age FALSE FALSE
## rad FALSE FALSE
## tax FALSE FALSE
## ptratio FALSE FALSE
## black FALSE FALSE
## lstat FALSE FALSE
## medv FALSE FALSE
## 1 subsets of each size up to 11
## Selection Algorithm: exhaustive
## crim zn indus rm age rad tax ptratio black lstat medv
## 1 ( 1 ) " " " " "*" " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " "*" " " "*" " " " " " " " " " " " "
## 3 ( 1 ) " " " " "*" " " "*" "*" " " " " " " " " " "
## 4 ( 1 ) " " " " "*" " " "*" "*" " " "*" " " " " " "
## 5 ( 1 ) " " " " "*" "*" "*" "*" " " "*" " " " " " "
## 6 ( 1 ) " " "*" "*" "*" "*" "*" " " "*" " " " " " "
## 7 ( 1 ) " " "*" "*" "*" "*" "*" " " "*" " " " " "*"
## 8 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" " " " " "*"
## 9 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" " " "*" "*"
## 10 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
Best subset selection chooses slightly different variables to the full multiple least squares model. Here the 5-variable model with the smallest RSS contains the variable rm instead of zn
par(mfrow=c(3,1))
plot(regfit.summary$cp, type="l", xlab="Number Of Variables", ylab="Best Subset Cp")
min.cp <- which.min(regfit.summary$cp)
points(min.cp, regfit.summary$cp[min.cp], pch=16, col="red")
plot(regfit.summary$bic, type="l", xlab="Number Of Variables", ylab="Best Subset BIC")
min.bic <- which.min(regfit.summary$bic)
points(min.bic, regfit.summary$bic[min.bic], pch=16, col="red")
plot(regfit.summary$adjr2, type="l", xlab="Number Of Variables", ylab="Best Subset Adj R2")
max.adjr2 <- which.max(regfit.summary$adjr2)
points(max.adjr2, regfit.summary$adjr2[max.adjr2], pch=16, col="red")
BIC, Cp and Adjusted R2 each select a different model size so I will test the models with 6, 7 and 8 coefficients
sixvar.coefs <- coef(regfit.full, 6)
modmatrix <- model.matrix(nox~., data=test)
pred <- modmatrix[,names(sixvar.coefs)] %*% sixvar.coefs
MSE.bestsubset.6coefs <- mean((pred - test$nox)^2)
MSE.bestsubset.6coefs
## [1] 0.003716281
sevenvar.coefs <- coef(regfit.full, 7)
modmatrix <- model.matrix(nox~., data=test)
pred <- modmatrix[,names(sevenvar.coefs)] %*% sevenvar.coefs
MSE.bestsubset.7coefs <- mean((pred - test$nox)^2)
MSE.bestsubset.7coefs
## [1] 0.003631656
eightvar.coefs <- coef(regfit.full, 8)
modmatrix <- model.matrix(nox~., data=test)
pred <- modmatrix[,names(eightvar.coefs)] %*% eightvar.coefs
MSE.bestsubset.8coefs <- mean((pred - test$nox)^2)
MSE.bestsubset.8coefs
## [1] 0.003638238
The Best Subset Selection model with 7 coefficients returns the smallest test MSE so this one will be tested against the other modelling methods
MSE.bestsubset <- MSE.bestsubset.7coefs
library(glmnet)
xmat.train <- model.matrix(nox~., data=train)[,-1]
y.train <- train$nox
xmat.test <- model.matrix(nox~., data=test) [,-1]
y.test <- test$nox
ridge.fit <- cv.glmnet(xmat.train, y.train, alpha=0)
best.lambda <- ridge.fit$lambda.min
best.lambda
## [1] 0.01024846
ridge.pred <- predict(ridge.fit, newx = xmat.test, s=best.lambda)
MSE.ridge <- mean((ridge.pred-y.test)^2)
MSE.ridge
## [1] 0.003496171
The Ridge Regression model predictions differ from the true value for nox by 0.003496 on average
lasso.fit <- cv.glmnet(xmat.train, y.train, alpha=1)
best.lambda <- lasso.fit$lambda.min
best.lambda
## [1] 0.0008121731
lasso.pred <- predict(lasso.fit, s=best.lambda, newx = xmat.test)
MSE.lasso <- mean((lasso.pred-y.test)^2)
MSE.lasso
## [1] 0.003579346
The Lasso model predictions differ from the true value for nox by 0.003579 on average
library(pls)
pcr.fit <- pcr(nox~., data=train, validation="CV", scale=TRUE)
validationplot(pcr.fit, val.type = "MSEP", main = "Modelling nox")
summary(pcr.fit)
## Data: X dimension: 253 11
## Y dimension: 253 1
## Fit method: svdpc
## Number of components considered: 11
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 0.1198 0.07871 0.07908 0.07384 0.06978 0.06412 0.06001
## adjCV 0.1198 0.07867 0.07906 0.07371 0.06950 0.06398 0.05987
## 7 comps 8 comps 9 comps 10 comps 11 comps
## CV 0.05987 0.05997 0.06039 0.06017 0.05979
## adjCV 0.05974 0.05984 0.06024 0.05999 0.05960
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 50.52 62.14 71.21 78.49 84.95 89.83 93.46
## nox 57.13 57.42 63.98 68.96 73.48 76.61 76.81
## 8 comps 9 comps 10 comps 11 comps
## X 95.77 97.92 99.46 100.00
## nox 76.82 76.87 77.23 77.74
The models with 7 components and 11 components yield very similar cross-validated training MSE so I will test them both
pcr.pred7 <- predict(pcr.fit, newdata = test, ncomp = 7)
pcr.pred11 <- predict(pcr.fit, newdata = test, ncomp = 11)
MSE.pcr7 <- mean((pcr.pred7-y.test)^2)
MSE.pcr11 <- mean((pcr.pred11-y.test)^2)
MSE.pcr7
## [1] 0.003729978
MSE.pcr11
## [1] 0.003672847
PCR model with 11 components yields the smallest test error rate. Its predictions differ from the true value for nox by 0.003673 on average
errors <- data.frame(method = c("Least Squares", "Best Subset", "Ridge","Lasso","PCR"),
MeanSquaredError = c(MSE.lm, MSE.bestsubset, MSE.ridge, MSE.lasso, MSE.pcr11))
library(ggplot2)
library(ggthemes)
ggplot(errors, aes(x=method, y=MeanSquaredError)) + geom_bar(stat="identity") +
labs(title = "Modelling Method Test Mean Squared Errors", y="Mean Squared Errors", x="Method") + coord_cartesian(ylim = c(0.002, 0.004)) + theme_fivethirtyeight()
Ridge regression yields the smallest Test Mean Squared Error so I would propose that this model be used to predict nox with the following formula:
\[ Nitrogen~Oxide~Concentration~(parts~ per~10~million) = \\ 0.6728 \\ - 0.0002757 \times per~capita~crime~rate \\ - 0.0006834 \times proportion~of~residential~land~zoned~for~lots~over~25,000~sq.ft. \\ + 0.0054385 \times proportion~of~non-retail~business~acres \\ - 0.0097889 \times average~number~of~rooms~per~dwelling \\ + 0.0013280 \times proportion~of~owner-occupied~units~built~prior~to~1940 \\ + 0.0028582 \times index~of~accessibility~to~radial~highways \\ + 0.0000754 \times full~value~property~tax~rate~per~10,000~USD \\ - 0.0144613 \times pupil-teacher~ratio \\ + 0.0000016 \times 1000(Bk - 0.63)^2~where~Bk~is~the~proportion~of~black~citizens \\ + 0.0012969 \times percentage~of~population~classified~as~'low~status' \\ - 0.0004152 \times median~value~of~owner-occupied~homes~in~1000s~USD \]
The findings show that the largest contributing factors to Nitrogen Oxide levels in a given town are the proprtion of business acres that are non-retail, the proximity of the town to radial highways and the proportion of owner-occupied units built prior to 1940. A higher incidence of these features is strongly associated with increased levels of pollution
The largest mitigating factors are pupil-teacher ratio, average number of rooms per dwelling and the proportion of residential land zoned for lots over 25,000 sq.ft. Towns with large pupil-teacher ratios and large residential dwellings are associated with lower levels of pollution.