1 ABSTRACT
The main purpose of this project is to fit a MARS (Multivariate Adaptive Regression Splines) model to the Boston housing data.
A multiple linear regression model was initially fit, which gave an adj-Rsq value of around 0.74. However, residual diagnostics revealed that the residuals had a misspecified mean structure against almost all predictor variables, such as rm and lstat.
To remedy this, a transformation would be required for every predictor variable against which the residuals showed a misspecified mean structure. This is an iterative process that demands considerable expertise and effort, and in any case the linear regression fit is inappropriate because it fails its residual diagnostics.
Another way to deal with a misspecified mean structure is to use non-parametric methods like Regression Trees or MARS. Hence, I decided to use MARS on the Boston housing data.
2 Data Loading and Introduction
# for boston housing data
data(boston, package = "pdp")
# rows x columns
dim(boston)
## [1] 506 16
# looking at first few observations
head(boston,5)
## lon lat cmedv crim zn indus chas nox rm age dis rad tax
## 1 -70.955 42.2550 24.0 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296
## 2 -70.950 42.2875 21.6 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242
## 3 -70.936 42.2830 34.7 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242
## 4 -70.928 42.2930 33.4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222
## 5 -70.922 42.2980 36.2 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222
## ptratio b lstat
## 1 15.3 396.90 4.98
## 2 17.8 396.90 9.14
## 3 17.8 392.83 4.03
## 4 18.7 394.63 2.94
## 5 18.7 396.90 5.33
Let us look at the data dictionary for Boston housing data.
| Column | Description |
|---|---|
| lon | Longitude of census tract |
| lat | Latitude of census tract |
| cmedv | Corrected median value of owner-occupied homes in USD 1000’s |
| crim | Per capita crime rate by town |
| zn | Proportion of residential land zoned for lots over 25,000 sq.ft |
| indus | Proportion of non-retail business acres per town |
| chas | Charles River dummy variable (= 1 if tract bounds river; 0 otherwise) |
| nox | Nitric oxides concentration (parts per 10 million) |
| rm | Average number of rooms per dwelling |
| age | Proportion of owner-occupied units built prior to 1940 |
| dis | Weighted distances to five Boston employment centers |
| rad | Index of accessibility to radial highways |
| tax | Full-value property-tax rate per USD 10,000 |
| ptratio | Pupil-teacher ratio by town |
| b | \(1000(B - 0.63)^2\), where B is the proportion of blacks by town |
| lstat | Percentage of lower status of the population |
3 A Linear Model
Let us begin with a full linear model fit and observe the summary.
# fit an MLR model
boston_linear_full <- lm(cmedv ~ ., data = boston)
# in-sample RMSE (residual standard error)
(In_sample_RMSE <- sigma(boston_linear_full))
## [1] 4.700192
#summary
summary(boston_linear_full)
##
## Call:
## lm(formula = cmedv ~ ., data = boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.5831 -2.7643 -0.5994 1.7482 26.0822
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.350e+02 3.032e+02 -1.435 0.152029
## lon -3.935e+00 3.372e+00 -1.167 0.243770
## lat 4.495e+00 3.669e+00 1.225 0.221055
## crim -1.045e-01 3.261e-02 -3.206 0.001436 **
## zn 4.657e-02 1.374e-02 3.390 0.000755 ***
## indus 1.524e-02 6.175e-02 0.247 0.805106
## chas1 2.578e+00 8.650e-01 2.980 0.003024 **
## nox -1.582e+01 4.005e+00 -3.951 8.93e-05 ***
## rm 3.754e+00 4.166e-01 9.011 < 2e-16 ***
## age 2.468e-03 1.335e-02 0.185 0.853440
## dis -1.400e+00 2.088e-01 -6.704 5.61e-11 ***
## rad 3.067e-01 6.658e-02 4.607 5.23e-06 ***
## tax -1.289e-02 3.727e-03 -3.458 0.000592 ***
## ptratio -8.771e-01 1.363e-01 -6.436 2.92e-10 ***
## b 9.176e-03 2.663e-03 3.446 0.000618 ***
## lstat -5.374e-01 5.042e-02 -10.660 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.7 on 490 degrees of freedom
## Multiple R-squared: 0.7458, Adjusted R-squared: 0.738
## F-statistic: 95.82 on 15 and 490 DF, p-value: < 2.2e-16
INTERPRETATION:
The RMSE (residual standard error) of 4.70 indicates that, on average, the model's estimate of the median housing price for a new observation is within about 4.70 (in USD 1000's) of the actual value; that is, a predicted price typically deviates from the actual price by roughly $4,700 or less.
The adjusted Rsq value of 0.738 indicates a moderately good fit, and the overall model is significant as per the F-test p-value of < 2.2e-16.
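As a quick sanity check (a minimal sketch; boston_linear_full is the model fitted above), the value reported by sigma() can be reproduced from the residual sum of squares and the residual degrees of freedom:
# RMSE (residual standard error) = sqrt(RSS / residual degrees of freedom)
rss <- sum(residuals(boston_linear_full)^2)
sqrt(rss / df.residual(boston_linear_full)) # matches sigma(boston_linear_full)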
However, let us perform residual diagnostics to find out whether we actually satisfy the assumptions for fitting a linear model. The main check is whether the residuals show a misspecified mean structure against any of the important predictor variables.
# residual diagnostics
library(broom)
library(dplyr)
# metrics useful for regression diagnostics
residual_diagnostics <- boston_linear_full %>% broom::augment() %>% mutate(row_num = 1:n())
head(residual_diagnostics)
## # A tibble: 6 x 23
## cmedv lon lat crim zn indus chas nox rm age dis rad
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <int>
## 1 24 -71.0 42.3 0.00632 18 2.31 0 0.538 6.58 65.2 4.09 1
## 2 21.6 -71.0 42.3 0.0273 0 7.07 0 0.469 6.42 78.9 4.97 2
## 3 34.7 -70.9 42.3 0.0273 0 7.07 0 0.469 7.18 61.1 4.97 2
## 4 33.4 -70.9 42.3 0.0324 0 2.18 0 0.458 7.00 45.8 6.06 3
## 5 36.2 -70.9 42.3 0.0690 0 2.18 0 0.458 7.15 54.2 6.06 3
## 6 28.7 -70.9 42.3 0.0298 0 2.18 0 0.458 6.43 58.7 6.06 3
## # ... with 11 more variables: tax <int>, ptratio <dbl>, b <dbl>, lstat <dbl>,
## # .fitted <dbl>, .resid <dbl>, .hat <dbl>, .sigma <dbl>, .cooksd <dbl>,
## # .std.resid <dbl>, row_num <int>
# for visualization
library(ggplot2)
library(ggpubr) #for multiple ggplots
# misspecified mean structure: lstat
P1 <- ggplot(data = residual_diagnostics) +
geom_point(aes(y = .std.resid ,x = lstat)) +
geom_smooth(aes(y = .std.resid ,x = lstat), method = "loess", se = FALSE) +
geom_hline(yintercept = c(-2,2)) +
theme_bw()
# misspecified mean structure: rm
P2 <- ggplot(data = residual_diagnostics) +
geom_point(aes(y = .std.resid, x = rm)) +
geom_smooth(aes(y = .std.resid, x = rm), method = "loess", se = FALSE) +
geom_hline(yintercept = c(-2,2)) +
theme_bw()
# 2 figures arranged in 1 row and 2 columns
annotate_figure(ggarrange(P1, P2, ncol = 2, nrow = 1),
top = text_grob("Misspecified Mean Structure"))
Looking at the above plots of .std.resid vs lstat and .std.resid vs rm, we can say that the regression function is nonlinear. We would want the blue loess line to stay almost constant around zero, i.e. the residuals should have a mean value of 0 across the range of each predictor.
Hence, due to this violation of the linear regression assumptions, we cannot deploy the model to production. To remedy the problem, we can either apply transformations to predictor variables like rm and lstat (one such remedy is sketched below) or use alternative non-parametric modelling techniques like Regression Trees, MARS, etc.
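For illustration only (a hedged sketch, not a model pursued here), a transformation-based remedy might look like the following; log(lstat) and a quadratic term in rm are guesses suggested by the curvature in the residual plots:
# one possible (untested) transformation-based remedy:
# replace lstat with log(lstat) and allow curvature in rm
boston_linear_trans <- lm(cmedv ~ . - lstat - rm + log(lstat) + poly(rm, 2),
                          data = boston)
# the residual diagnostics would then need to be re-checked, and the
# process repeated until no misspecified mean structure remains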
For this project, let us choose MARS (Multivariate Adaptive Regression Splines) as our modelling technique.
4 THE MARS MODEL
4.1 Initial fit
Let us begin by fitting a simple MARS model with the default parameters (degree = 1) and observe the summary.
# let us use MARS (Multivariate Adaptive Regression Spline)
library(earth) # for MARS
library(pdp) # for partial dependence plots
library(vip) # for variable importance plots
# a simple fit
boston_mars <- earth(cmedv ~ ., data = boston,
degree = 1 # tuning parameter
)
# MARS summary
summary(boston_mars)
## Call: earth(formula=cmedv~., data=boston, degree=1)
##
## coefficients
## (Intercept) -2.743460
## h(-70.99-lon) 23.712668
## h(lon- -70.99) 38.145330
## h(lat-42.2835) 43.008986
## h(4.42228-crim) -0.572711
## h(crim-4.42228) -0.409119
## h(crim-18.811) 0.330218
## h(0.488-nox) -24.939405
## h(nox-0.488) -19.484251
## h(rm-6.431) 5.998860
## h(rm-7.313) 8.569941
## h(rm-7.929) -18.441106
## h(dis-1.5106) 34.885916
## h(2.4358-dis) 36.199897
## h(dis-2.4358) -35.834647
## h(300-tax) 0.029591
## h(14.7-ptratio) 2.136646
## h(ptratio-14.7) -0.458695
## h(386.75-b) -0.005791
## h(b-386.75) -0.116258
## h(6.12-lstat) 2.105861
## h(lstat-6.12) -0.609421
## h(lstat-23.79) 0.505551
##
## Selected 23 of 26 terms, and 10 of 15 predictors
## Termination condition: Reached nk 31
## Importance: rm, lstat, ptratio, dis, crim, nox, tax, lon, b, lat, ...
## Number of terms at each degree of interaction: 1 22 (additive model)
## GCV 11.46153 RSS 4813.864 GRSq 0.8643274 RSq 0.8869394
OBSERVATIONS:
The Rsq of the best model under the default parameters is 0.887, which is an improvement over the linear model.
Unlike linear regression, the MARS summary does not provide p-values or confidence intervals for the coefficients.
MARS has a forward pass, in which the variable + knot combination that gives the greatest improvement to the current model is added, and a pruning pass, in which basis functions are removed one at a time. This is why the summary lists terms such as h(crim-4.42228) along with their coefficients.
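Each term of the form h(x - c) is a hinge (basis) function, h(u) = max(0, u), with knot c. A minimal sketch of how such a term contributes to the prediction, using the crim knot and coefficient from the summary above:
# hinge function used by MARS: h(u) = max(0, u)
h <- function(u) pmax(0, u)
# contribution of the h(crim-4.42228) term for a few crim values,
# using its fitted coefficient from the summary above
crim_values <- c(0.5, 4.42228, 10)
-0.409119 * h(crim_values - 4.42228) # zero below the knot, linear above it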
4.2 Tuning the parameter: degree
Let us tune the parameter degree to obtain its optimal value for modelling. For this, I shall use the caret package. The parameter is optimized using Rsquared as the metric (the higher the better) with 5-fold cross validation repeated three times.
# tuning parameters
library(caret)
getModelInfo("earth")$earth$parameters
## parameter class label
## 1 nprune numeric #Terms
## 2 degree numeric Product Degree
# Tune a MARS model. Setting nprune max as 100
set.seed(103) # for reproducibility
boston_mars_tune <- train(
x = subset(boston, select = -cmedv),
y = boston$cmedv,
method = "earth",
metric = "Rsquared",
trControl = trainControl(method = "repeatedcv",
number = 5, repeats = 3),
tuneGrid = expand.grid(degree = 1:5, nprune = 100) # tuning only degree and not nprune
)
# Print model tuning summary
print(boston_mars_tune)
## Multivariate Adaptive Regression Spline
##
## 506 samples
## 15 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 406, 404, 404, 405, 405, 405, ...
## Resampling results across tuning parameters:
##
## degree RMSE Rsquared MAE
## 1 3.660745 0.8398051 2.514046
## 2 3.657539 0.8497680 2.318028
## 3 3.755125 0.8414312 2.370777
## 4 3.764644 0.8402414 2.381942
## 5 3.764644 0.8402414 2.381942
##
## Tuning parameter 'nprune' was held constant at a value of 100
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were nprune = 100 and degree = 2.
# Plot model tuning summary
ggplot(boston_mars_tune) + theme_light()
The above plot indicates that the highest Rsquared was obtained for degree equal to 2. The corresponding RMSE and MAE values are low as well.
4.3 The best model
We know that the best degree parameter is 2. Let us now find the best model using pmethod = "cv". The number of terms and predictors is then selected by cross validation (10 folds repeated three times) rather than by the default GCV statistic.
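For reference, the GCV criterion that earth minimizes by default has the form
\[ \operatorname{GCV}(M) = \frac{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{f}_M(x_i)\right)^2}{\left(1 - C(M)/n\right)^2} \]
where \(C(M)\) is the effective number of parameters, which grows with the number of terms and knots. With pmethod = "cv", the number of terms is instead chosen by out-of-fold performance.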
# the best parameter: degree = 2
set.seed(103)
fit1 <- earth(cmedv ~ ., data = boston, ncross = 3, nfold = 10,
              degree = 2, pmethod = "cv", keepxy = TRUE)
summary(fit1)
## Call: earth(formula=cmedv~., data=boston, pmethod="cv", keepxy=TRUE, degree=2,
## nfold=10, ncross=3)
##
## coefficients
## (Intercept) 28.96957
## h(42.2248-lat) -18.68634
## h(lat-42.2248) -15.76396
## h(crim-4.42228) -0.11324
## h(6.431-rm) -0.45963
## h(rm-6.431) 10.30938
## h(age-18.5) -0.07599
## h(1.3567-dis) 941.03891
## h(dis-1.3567) -0.50888
## h(lstat-6.12) -0.79467
## h(lstat-22.11) 0.62181
## h(1.3567-dis) * b -2.34986
## h(-70.965-lon) * h(age-18.5) 0.31074
## h(lon- -70.965) * h(age-18.5) 1.32895
## h(4.42228-crim) * h(tax-224) -0.00158
## h(4.42228-crim) * h(224-tax) 0.01795
## h(0.624-nox) * h(rm-6.431) -5.26121
## h(nox-0.624) * h(rm-6.431) -88.36153
## h(0.713-nox) * h(lstat-6.12) 2.05404
## h(nox-0.713) * h(lstat-6.12) 1.02961
## h(6.431-rm) * h(dis-1.8209) -0.49396
## h(rm-6.431) * h(ptratio-18.6) -3.05364
## h(rm-6.431) * h(18.6-ptratio) 0.62770
## h(307-tax) * h(6.12-lstat) 0.02077
## h(tax-307) * h(6.12-lstat) 0.02198
##
## Selected 25 of 29 terms, and 11 of 15 predictors (pmethod="cv")
## Termination condition: Reached nk 31
## Importance: rm, lstat, nox, tax, dis, ptratio, b, crim, lon, age, lat, ...
## Number of terms at each degree of interaction: 1 10 14
## GRSq 0.9096501 RSq 0.929844 mean.oof.RSq 0.8352213 (sd 0.0827)
##
## pmethod="backward" would have selected:
## 22 terms 11 preds, GRSq 0.9113226 RSq 0.928802 mean.oof.RSq 0.8326279
par(mfrow = c(1, 2))
# model selection: R-squared curves from cross validation
plot(fit1, which = 1,
     col.mean.infold.rsq = "blue", col.infold.rsq = "lightblue",
     col.grsq = 0, col.rsq = 0, col.vline = 0, col.oof.vline = 0)
# model selection: number of terms chosen
plotres(fit1, which = 1, info = TRUE)
INTERPRETATION:
The plot on the left shows training and testing performance (Rsq on the Y axis) obtained from the 10-fold cross validation performed three times. Performance on the training data (blue curve) keeps increasing as model complexity increases, whereas performance on independent data (pink curve) peaks and then decreases.
The plot on the right shows the selected model (25 of 29 terms, 11 of 15 predictors using pmethod = "cv"), with the green line indicating 25 as the optimal number of terms.
5 Residual Analysis
5.1 Residual vs Fitted values
# residuals vs fitted
plot(fit1, which = 3) # which = 3 for residuals vs fitted
The Residuals vs Fitted graph shows the residual for each fitted value of the response. The red line is a lowess fit; here it is almost constant and lies around 0, so the mean of the residuals is approximately 0.
Visually, the variance of the residuals is roughly constant and does not fan out across the fitted values, so the homoscedasticity assumption appears to be satisfied. In earth, however, constant residual variance is not as important as it is in linear models.
The plot does flag some potential outliers: the residuals labelled 372, 427 and 407 indicate extreme values, or possibly leverage / influential points.
5.2 Cumulative Distribution of residuals
plotres(fit1, which = 2, info = TRUE) # which = 2 for cumulative distribution
The Cumulative Distribution graph shows the cumulative distribution of the absolute values of residuals. What we would ideally like to see is a graph that starts at 0 and shoots up quickly to 1.
We see that 95% of the absolute values of residuals are less than about 4.61 (look at the vertical gray line for 95%). So in the training data, 95% of the time the predicted value is within 4.61 units of the observed value.
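This 95% figure can also be read directly off the residuals (a small sketch; fit1 retains its residuals because it was fitted with keepxy = TRUE):
# 95th percentile of the absolute residuals (should be close to 4.61)
quantile(abs(residuals(fit1)), 0.95)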
5.3 Residual QQ plot
plotres(fit1, which = 4, info = TRUE) # which = 4 for residual qqplot
The QQ (quantile-quantile) plot compares the distribution of the residuals to a normal distribution. If the residuals are distributed normally they will lie on the line.
Here, the distribution is approximately normal (look at the actual vs normal information at the bottom), though with somewhat heavy tails. Normality of the residuals often isn't too important for earth models, but the graph is useful for discovering outlying residuals and other anomalies.
As seen earlier, observations 427, 407 and 372 do appear to be outliers or leverage / influential points. Ideally, they should be removed from the data before modelling or investigated further. For simplicity, we set them aside and continue, but it is easy to at least glance at them first (see the sketch below).
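A quick look at the flagged observations, using the row numbers printed on the plots:
# inspect the observations flagged in the residual plots
boston[c(372, 407, 427), c("cmedv", "rm", "lstat", "crim")]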
The best model passes the residual analysis checks, so we can proceed to communicate our results to the stakeholders.
6 Stakeholder communication
We decide to present the following to our client:
- Cross validation performance of the best model on the entire dataset
- Important variables as per the earth fit
- Association of the top variables with the response variable cmedv
6.1 Cross Validation performance of the best model
A simple 10-fold cross validation has been performed on the data, and the following metrics were calculated to assess the performance of the best model.
cv <- modelr::crossv_kfold(boston, 10)
# note: fit1 was trained on the full dataset, so the test folds below are
# not truly held out; the same fitted model is evaluated on each fold
models <- purrr::map(cv$train, ~ fit1)
#RMSE
error_RMSE <- purrr::map2_dbl(models, cv$test, modelr::rmse)
#MAE
error_MAE <- purrr::map2_dbl(models, cv$test, modelr::mae)
#MAPE
error_MAPE <- purrr::map2_dbl(models, cv$test, modelr::mape)
# summary of errors
tibble("RMSE" = mean(error_RMSE),
"MAE" = mean(error_MAE),
"MAPE" = mean(error_MAPE))
## # A tibble: 1 x 3
## RMSE MAE MAPE
## <dbl> <dbl> <dbl>
## 1 2.41 1.79 0.0958
The above are the metrics for the best model.
- RMSE interpretation: predictions are, on average, about 2.41 (in USD 1000's) away from the actual value.
- MAPE interpretation: typical errors are around 9.6% of the actual value.
Ideally, one would not perform hyperparameter tuning, model selection and cross validation on the same dataset. However, given the small number of observations, I have performed all three on the same data, which may lead to optimistically biased results.
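With more data, a safer workflow would hold out a test set before any tuning, roughly as sketched below (an illustrative outline, not what was done for this report):
# illustrative outline of a safer workflow (not run for this report):
set.seed(103)
train_idx <- caret::createDataPartition(boston$cmedv, p = 0.8, list = FALSE)
boston_train <- boston[train_idx, ]
boston_test <- boston[-train_idx, ]
# tune and select the model on boston_train only,
# then report performance once on boston_test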
6.2 Important Variables
The following plot shows us the relative importance as per the best model fit by earth.
# Variable importance plot
vip(
fit1,
num_features = 15
)
To properly understand the association of these variables with cmedv, let us look at the partial dependence plots.
6.3 Association of the top three variables (rm, lstat and nox) on cmedv
The following plots show association of predictor variables one at a time with cmedv.
# Partial dependence of cmedv on rm
p1 <- fit1 %>%
pdp::partial(pred.var = "rm") %>%
autoplot(color = "red2", size = 1) +
geom_point(data = boston, aes(x = rm, y = cmedv), alpha = 0.1) +
theme_light()
# Partial dependence of cmedv on lstat
p2 <- fit1 %>%
pdp::partial(pred.var = "lstat") %>%
autoplot(color = "red2", size = 1) +
geom_point(data = boston, aes(x = lstat, y = cmedv), alpha = 0.1) +
theme_light()
# Partial dependence of cmedv on nox
p3 <- fit1 %>%
pdp::partial(pred.var = "nox") %>%
autoplot(color = "red2", size = 1) +
geom_point(data = boston, aes(x = nox, y = cmedv), alpha = 0.1) +
theme_light()
# 3 figures arranged in 2 rows and 2 columns
annotate_figure(ggarrange(p1, p2, p3, ncol = 2, nrow = 2),
top = text_grob("Partial dependence plots"))
- cmedv increases as rm increases; the two variables are positively associated. This is logically coherent with the fact that as the average number of rooms increases, the housing price increases.
- The opposite holds for lstat, which is negatively associated with cmedv: the higher the percentage of lower-status population in an area, the lower the housing price.
- nox and cmedv are negatively associated. This is logically coherent with the fact that housing prices are higher (and buyers are willing to pay more) where emissions (nitric oxide concentrations due to pollution) are lower.
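As a reminder of what these curves represent, here is a minimal sketch of the partial dependence computation for one grid value of rm (the value 6.5 is arbitrary):
# partial dependence of cmedv on rm at rm = 6.5, computed by hand:
# fix rm for every observation and average the model's predictions
tmp <- boston
tmp$rm <- 6.5
mean(predict(fit1, newdata = tmp)) # one point on the rm partial dependence curve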
Let us look at the combined effect of two variables (like rm and lstat on cmedv).
# Partial dependence of cmedv on rm and lstat
fit1 %>%
pdp::partial(pred.var = c("rm", "lstat"), chull = TRUE) %>%
autoplot() +
theme_light()
This 2-D plot gives us an idea of how two variables at a time are associated with cmedv. On the X axis we have rm and on the Y axis we have lstat; the variation in house prices (cmedv) is indicated by the color scale (yellow indicating high prices and dark violet indicating low prices).
The following 3-D plot is a good visualization tool for understanding the association, but it is harder to interpret.
# Partial dependence of cmedv on both rm and nox
pd <- pdp::partial(
fit1,
pred.var = c("rm", "nox"),
chull = TRUE
)
# Interactive 3-D plot
plotly::plot_ly(
x = ~rm,
y = ~nox,
z = ~yhat,
data = pd,
type = "mesh3d"
)
7 CONCLUSION:
With this exercise we can draw the following conclusions:
MARS is one of the best nonlinear modelling techniques, as it automatically handles:
- Variable selection
- Nonlinear relationships
- Variable interactions
- Variable importance
Also, it has the following features which make it a competent modelling technique:
- Competitive predictive performance
- Easier to interpret
- Faster training times
- Incredibly easy to tune
- Much easier to productionize!!
- Can naturally handle many types of response variables (not just Gaussian)